Inside a Tropical GEMM Kernel

Tiles, shared memory, and an accumulator loop with no FMA

A tiled (min, +) matrix multiply has exactly the loop structure of ordinary GEMM — watch it execute, one shared-memory tile at a time.

Same loops, different primitive: acc ← min(acc, a + b). Each inner step is an add followed by a compare, with no fused multiply-add to lean on.
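As a point of reference, the update above can be sketched as a plain triple loop in Python. This is a minimal illustration, not the kernel itself: the function name `minplus_matmul` and the 3×3 matrix `W` are made up for the example, and `math.inf` is assumed as the starting value of the accumulator because it is the identity for min.

```python
import math

INF = math.inf  # identity element for min: min(INF, x) == x


def minplus_matmul(A, B):
    """Naive (min, +) product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[INF] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = INF  # ordinary GEMM would start at 0 and accumulate with +=
            for k in range(m):
                # The tropical inner step: one add, then one compare.
                acc = min(acc, A[i][k] + B[k][j])
            C[i][j] = acc
    return C


# Hypothetical 3x3 input; INF marks an entry with no finite value.
W = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]

W2 = minplus_matmul(W, W)
print(W2)  # [[0, 4, 5], [3, 0, 1], [2, 6, 0]]
```

If `W` is read as a table of edge weights, `W2[i][j]` is the cheapest way to get from i to j using at most two edges, which is one common reason to compute this product.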

One thread block at work (12×12 matrices, 4×4 tiles)

Why this matters for hardware

Tropical GEMM reuses the loop structure and blocking ideas of ordinary matrix multiplication: tiled access, shared-memory staging, independent output tiles. It does not reuse the arithmetic primitive. The inner step is an addition followed by a comparison; no FMA instruction fuses the two, and tensor cores do not execute (min, +). Tropical kernels therefore run on the GPU's general-purpose arithmetic units: still massively parallel, but without the dedicated matrix hardware behind the headline throughput figures for ordinary GEMM. Measurements on a laptop GPU accompany the paper (Section 9).
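The blocking scheme described above can be mimicked on the CPU to make the data movement concrete. The sketch below is a sequential Python stand-in, not GPU code: the names `tiled_minplus`, `a_tile`, `b_tile`, and `acc` are illustrative, the copied sub-lists only play the role of shared-memory tiles, and the matrix size is assumed to be a multiple of the tile size (as with the 12×12 matrices and 4×4 tiles in the figure).

```python
import math
import random

INF = math.inf
N, TILE = 12, 4  # sizes taken from the figure: 12x12 matrices, 4x4 tiles


def tiled_minplus(A, B, n=N, t=TILE):
    """Tiled (min, +) product; assumes n is a multiple of t."""
    C = [[INF] * n for _ in range(n)]
    # Each (bi, bj) pair owns one output tile, like one thread block.
    # Output tiles are independent, so these two loops could run in parallel.
    for bi in range(0, n, t):
        for bj in range(0, n, t):
            acc = [[INF] * t for _ in range(t)]  # one accumulator per output element
            for bk in range(0, n, t):  # walk the shared dimension one tile at a time
                # "Stage" one tile of A and one tile of B (stand-in for shared memory).
                a_tile = [row[bk:bk + t] for row in A[bi:bi + t]]
                b_tile = [row[bj:bj + t] for row in B[bk:bk + t]]
                for i in range(t):
                    for j in range(t):
                        for k in range(t):
                            # Add then compare; there is no multiply to fuse.
                            acc[i][j] = min(acc[i][j], a_tile[i][k] + b_tile[k][j])
            # Write the finished output tile back.
            for i in range(t):
                C[bi + i][bj:bj + t] = acc[i]
    return C


# Illustrative input: small random integer matrices with a fixed seed.
rng = random.Random(0)
A = [[rng.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[rng.randint(0, 9) for _ in range(N)] for _ in range(N)]
C = tiled_minplus(A, B)
```

Because min is associative and commutative, processing the shared dimension tile by tile gives the same result as a single pass over all k, which is why the blocking carries over from ordinary GEMM unchanged.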