Tiles, shared memory, and an accumulator loop with no FMA
A tiled (min, +) matrix multiply has exactly the loop structure of ordinary GEMM — watch it execute, one shared-memory tile at a time.
Same loops, different primitive. acc ← min(acc, a + b) — an add and a compare, with no fused multiply-add to lean on.
Tropical GEMM reuses the loop structure and blocking ideas of ordinary matrix multiplication: tiled access, shared-memory staging, independent output tiles. It does not reuse the arithmetic primitive. The inner step is an addition followed by a comparison, acc ← min(acc, a + b); no FMA instruction fuses the two, and tensor cores do not execute (min, +). Semiring kernels therefore run on the GPU's general-purpose compute units: still massively parallel, but without the dedicated matrix hardware behind ordinary GEMM's headline throughput. Measurements on a laptop GPU accompany the paper (Section 9).
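To make the "same loops, different primitive" point concrete, here is a minimal CPU-side sketch in plain Python. It is not the paper's kernel: the function names `minplus_tiled` and `minplus_naive`, the `tile` parameter, and the list slices standing in for shared-memory tiles are all assumptions made for illustration. It shows only the blocking structure and the add-then-compare accumulator step.

```python
import math

INF = math.inf  # identity for min: min(x, INF) == x


def minplus_tiled(A, B, tile=2):
    """C[i][j] = min over k of (A[i][k] + B[k][j]), computed tile by tile.

    Hypothetical illustration: the slices below stand in for the
    shared-memory tiles a GPU kernel would stage.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[INF] * p for _ in range(n)]
    for i0 in range(0, n, tile):          # each (i0, j0) output tile is independent
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):  # march along the shared dimension
                # "Stage" one tile of A and one tile of B.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = C[i0 + i][j0 + j]
                        for k, a in enumerate(a_row):
                            # The whole primitive: one add, one compare.
                            acc = min(acc, a + b_tile[k][j])
                        C[i0 + i][j0 + j] = acc
    return C


def minplus_naive(A, B):
    """Untiled reference used to check the tiled version."""
    return [[min(a + b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

Replacing `min(acc, a + b)` with `acc + a * b` (and `INF` with `0`) turns the same three nested tile loops back into ordinary GEMM, which is the sense in which only the primitive changes. In a GPU version one would expect each (i0, j0) output tile to map to a thread block and the staged slices to live in shared memory, but that mapping is not shown here.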