Tiling and Memory Hierarchy
August 21, 2025
Moving from naive loops to kernels that actually saturate CPUs and GPUs.
Who is this for?
Engineers who know how to write C = A @ B but want to understand why the naive triple loop is so much slower than a tiled one.
TL;DR
- Memory is the bottleneck, not compute.
- Tiling (a.k.a. blocking) reuses cache lines and shared memory instead of reloading from main memory.
- Good tiling → more locality → higher arithmetic intensity.
- ML compilers (TVM, XLA, Triton) automate this, but the principles are simple and general.
The Naive Matrix Multiply
```python
import numpy as np
import time

def matmul_naive(A, B):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

M = N = K = 512
A, B = np.random.rand(M, K), np.random.rand(K, N)

start = time.time()
C = matmul_naive(A, B)
print("Naive:", time.time() - start, "sec")
```
This works, but it is far slower than np.dot, because every access to A[i, k] and B[k, j] can miss the cache and go all the way to DRAM.
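For comparison, here is a quick timing and sanity check against the BLAS-backed path, reusing A, B, C, and the imports from the snippet above:

```python
# Time the optimized matmul on the same inputs and verify the naive result.
start = time.time()
C_ref = A @ B  # dispatches to a tiled, vectorized BLAS kernel
print("np.dot:", time.time() - start, "sec")
print("Max abs difference vs naive:", np.abs(C - C_ref).max())
```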
Memory Hierarchy
Modern CPUs/GPUs have layers of memory:
- Registers (fastest, few values)
- L1 cache (~32KB per core, ~1 ns latency)
- L2/L3 cache (MBs, ~10–30 ns)
- Main memory (DRAM) (GBs, ~100 ns)
- GPU high-bandwidth memory (HBM) (very high bandwidth, but still ~100s of ns latency)
Naive loops jump around in memory so much that data is evicted before it’s reused.
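You can see the cost of cache-unfriendly access directly in NumPy. This is an illustrative micro-benchmark (exact numbers depend on your machine), contrasting contiguous row traversal with strided column traversal of a C-ordered array:

```python
import numpy as np
import time

X = np.random.rand(4096, 4096)  # C-ordered: rows are contiguous in memory

# Row-major traversal: consecutive elements share cache lines.
start = time.time()
row_sums = [X[i, :].sum() for i in range(X.shape[0])]
print("Row-wise:   ", time.time() - start, "sec")

# Column traversal: each access strides 4096 * 8 bytes, wasting most of each cache line.
start = time.time()
col_sums = [X[:, j].sum() for j in range(X.shape[1])]
print("Column-wise:", time.time() - start, "sec")
```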
Tiling
Instead of computing one element of C at a time, we work on blocks of A, B, and C that fit into cache.
```python
def matmul_tiled(A, B, tile=32):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Compute one tile of C
                i_max, j_max, k_max = min(i + tile, M), min(j + tile, N), min(k + tile, K)
                for ii in range(i, i_max):
                    for jj in range(j, j_max):
                        for kk in range(k, k_max):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C
```
By keeping a tile x tile chunk hot in cache, each loaded value is reused many times before eviction.
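A quick check that the tiled version matches the naive one, reusing A, B, and C from above. Note that in pure Python the interpreter overhead dominates, so don't expect a dramatic speedup here; the payoff of this access pattern shows up in compiled kernels:

```python
start = time.time()
C_tiled = matmul_tiled(A, B)
print("Tiled:", time.time() - start, "sec")
print("Max abs difference vs naive:", np.abs(C_tiled - C).max())
```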
Cache Reuse
Imagine multiplying 512×512 matrices:
- Naive: each element of A and B is reloaded ~512 times.
- Tiled (tile=32): each element is loaded once per block of C that needs it, i.e. ~16 times → ~32× fewer memory loads.
```
Naive: A[i,k]        → used once, evicted
Tiled: A[i,k..k+31]  → reused 32 times inside cache
```
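A back-of-the-envelope count of scalar loads makes the reduction concrete. This ignores cache lines and hardware prefetching, so treat it as a simplified model rather than a measurement:

```python
M = N = K = 512
tile = 32

# Naive: every A[i,k] is read once per j, every B[k,j] once per i.
naive_loads = M * K * N + K * N * M            # loads of A + loads of B

# Tiled: each tile of A is read once per j-block, each tile of B once per i-block.
tiled_loads = M * K * (N // tile) + K * N * (M // tile)

print("Naive loads:", naive_loads)
print("Tiled loads:", tiled_loads)
print("Reduction:  ", naive_loads / tiled_loads, "x")   # ~32x for these sizes
```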
On GPUs: Shared Memory as a Cache
GPUs don't have large per-core caches like CPUs. Instead, kernels use shared memory (a programmer-managed slice of on-chip L1) for tiling.
- Threads load tiles of A and B into shared memory.
- Synchronize.
- Multiply + accumulate inside registers.
- Move to next tile.
This is the classic CUDA “tiled matmul” kernel you’ll see in every tutorial.
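Here is the same structure sketched in NumPy, with comments mapping each step to the GPU version. It is a mental model, not CUDA: block/thread indices, __syncthreads(), and boundary checks are omitted, and it assumes M, N, K are multiples of the tile size:

```python
def matmul_gpu_style(A, B, tile=32):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):               # each (i, j) pair = one thread block
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile))      # accumulator lives in registers
            for k in range(0, K, tile):
                A_sh = A[i:i+tile, k:k+tile]  # cooperative load into shared memory
                B_sh = B[k:k+tile, j:j+tile]  # (followed by a barrier/sync)
                acc += A_sh @ B_sh            # multiply-accumulate from shared memory
            C[i:i+tile, j:j+tile] = acc       # write the finished tile of C back
    return C
```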
Arithmetic Intensity
Performance depends on how much compute you do per byte moved. Arithmetic intensity is the ratio of floating-point operations to bytes of memory traffic (FLOPs per byte).
- Naive matmul: low intensity (lots of memory traffic).
- Tiled matmul: high intensity (reuse data many times).
- cuBLAS: adds vectorization, prefetching, register blocking → squeezes every drop.
Arithmetic intensity is the key metric in the roofline performance model: low-intensity kernels are bandwidth-bound, high-intensity kernels are compute-bound.
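A rough estimate for the 512×512 example, using double precision and the same simplified load counting as in the cache-reuse section:

```python
M = N = K = 512
tile = 32
bytes_per_elem = 8  # float64

flops = 2 * M * N * K  # one multiply + one add per (i, j, k)

# Bytes moved under the simplified load-count model above.
naive_bytes = (M * K * N + K * N * M) * bytes_per_elem
tiled_bytes = (M * K * (N // tile) + K * N * (M // tile)) * bytes_per_elem

print("Naive intensity:", flops / naive_bytes, "FLOPs/byte")   # ~0.125
print("Tiled intensity:", flops / tiled_bytes, "FLOPs/byte")   # ~4
```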
Tiling in ML Compilers
When you write:
```python
y = model(x)
```
your framework eventually lowers to kernels where tiling decisions matter:
- TVM / Halide search different tile sizes (“schedules”).
- Triton lets you pick block sizes explicitly.
- cuDNN/cuBLAS have hand-tuned tiling strategies per GPU.
The principle is always the same: move data less, reuse more.
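For example, a minimal Triton matmul kernel makes the tile sizes explicit as block-size constants. This is a sketch along the lines of the standard Triton tutorial, assuming M, N, K are multiples of the block sizes; boundary masking is left out for brevity:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)           # stage a tile of A on-chip
        b = tl.load(b_ptrs)           # stage a tile of B on-chip
        acc += tl.dot(a, b)           # multiply-accumulate in registers
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```

Tuning BLOCK_M, BLOCK_N, and BLOCK_K is exactly the tile-size decision above; Triton handles staging the tiles through shared memory and synchronizing threads for you.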
Takeaways
- Memory > FLOPs: Performance is usually about memory movement, not raw compute.
- Tiling = locality: Work on small blocks that fit into fast memory.
- Frameworks automate it: but understanding it helps you reason about why matmul scales the way it does.