Tiling and Memory Hierarchy
August 21, 2025
Moving from naive loops to kernels that actually saturate CPUs and GPUs.
Who is this for?
Engineers who know how to write C = A @ B but want to understand why the naive triple loop is so much slower than a tiled one.
TL;DR
- Memory is the bottleneck, not compute.
- Tiling (a.k.a. blocking) reuses cache lines and shared memory instead of reloading from main memory.
- Good tiling → more locality → higher arithmetic intensity.
- ML compilers (TVM, XLA, Triton) automate this, but the principles are simple and general.
The Naive Matrix Multiply
```python
import numpy as np
import time

def matmul_naive(A, B):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

M = N = K = 512
A, B = np.random.rand(M, K), np.random.rand(K, N)

start = time.time()
C = matmul_naive(A, B)
print("Naive:", time.time() - start, "sec")
```
This works, but it is far slower than np.dot, because every access to A[i, k] and B[k, j] can miss the cache and go all the way to DRAM.
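For comparison, here is a quick timing and sanity check against the BLAS-backed path, reusing A, B, C, and the imports from the snippet above:

```python
# Time the optimized matmul on the same inputs and verify the naive result.
start = time.time()
C_ref = A @ B  # dispatches to a tiled, vectorized BLAS kernel
print("np.dot:", time.time() - start, "sec")
print("Max abs difference vs naive:", np.abs(C - C_ref).max())
```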
Memory Hierarchy
Modern CPUs/GPUs have layers of memory:
- Registers (fastest, few values)
- L1 cache (~32KB per core, ~1 ns latency)
- L2/L3 cache (MBs, ~10–30 ns)
- Main memory (DRAM) (GBs, ~100 ns)
- GPU high-bandwidth memory (HBM) (very high bandwidth, but still ~100s of ns latency)
Naive loops jump around in memory so much that data is evicted before it’s reused.
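You can see the cost of cache-unfriendly access directly in NumPy. This is an illustrative micro-benchmark (exact numbers depend on your machine), contrasting contiguous row traversal with strided column traversal of a C-ordered array:

```python
import numpy as np
import time

X = np.random.rand(4096, 4096)  # C-ordered: rows are contiguous in memory

# Row-major traversal: consecutive elements share cache lines.
start = time.time()
row_sums = [X[i, :].sum() for i in range(X.shape[0])]
print("Row-wise:   ", time.time() - start, "sec")

# Column traversal: each access strides 4096 * 8 bytes, wasting most of each cache line.
start = time.time()
col_sums = [X[:, j].sum() for j in range(X.shape[1])]
print("Column-wise:", time.time() - start, "sec")
```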
Tiling
Instead of computing one element of C at a time, we work on blocks of A, B, and C that fit into cache.
```python
def matmul_tiled(A, B, tile=32):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Compute one tile of C
                i_max, j_max, k_max = min(i + tile, M), min(j + tile, N), min(k + tile, K)
                for ii in range(i, i_max):
                    for jj in range(j, j_max):
                        for kk in range(k, k_max):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C
```
By keeping a tile x tile chunk hot in cache, each loaded value is reused many times before eviction.
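A quick check that the tiled version matches the naive one, reusing A, B, and C from above. Note that in pure Python the interpreter overhead dominates, so don't expect a dramatic speedup here; the payoff of this access pattern shows up in compiled kernels:

```python
start = time.time()
C_tiled = matmul_tiled(A, B)
print("Tiled:", time.time() - start, "sec")
print("Max abs difference vs naive:", np.abs(C_tiled - C).max())
```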
Cache Reuse
Imagine multiplying 512×512 matrices:
- Naive: each element of A and B is reloaded ~512 times.
- Tiled (tile=32): each element is loaded once per block of C that needs it, i.e. ~16 times → ~32× fewer memory loads.
```
Naive: A[i,k]        → used once, evicted
Tiled: A[i,k..k+31]  → reused 32 times inside cache
```
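A back-of-the-envelope count of scalar loads makes the reduction concrete. This ignores cache lines and hardware prefetching, so treat it as a simplified model rather than a measurement:

```python
M = N = K = 512
tile = 32

# Naive: every A[i,k] is read once per j, every B[k,j] once per i.
naive_loads = M * K * N + K * N * M            # loads of A + loads of B

# Tiled: each tile of A is read once per j-block, each tile of B once per i-block.
tiled_loads = M * K * (N // tile) + K * N * (M // tile)

print("Naive loads:", naive_loads)
print("Tiled loads:", tiled_loads)
print("Reduction:  ", naive_loads / tiled_loads, "x")   # ~32x for these sizes
```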
On GPUs: Shared Memory as a Cache
GPUs don't have large per-core caches like CPUs. Instead, kernels use shared memory (a programmer-managed slice of on-chip L1) for tiling.
- Threads load tiles of A and B into shared memory.
- Synchronize.
- Multiply + accumulate inside registers.
- Move to next tile.
This is the classic CUDA “tiled matmul” kernel you’ll see in every tutorial.
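Here is the same structure sketched in NumPy, with comments mapping each step to the GPU version. It is a mental model, not CUDA: block/thread indices, __syncthreads(), and boundary checks are omitted, and it assumes M, N, K are multiples of the tile size:

```python
def matmul_gpu_style(A, B, tile=32):
    M, K = A.shape
    K, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):               # each (i, j) pair = one thread block
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile))      # accumulator lives in registers
            for k in range(0, K, tile):
                A_sh = A[i:i+tile, k:k+tile]  # cooperative load into shared memory
                B_sh = B[k:k+tile, j:j+tile]  # (followed by a barrier/sync)
                acc += A_sh @ B_sh            # multiply-accumulate from shared memory
            C[i:i+tile, j:j+tile] = acc       # write the finished tile of C back
    return C
```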
Arithmetic Intensity
Performance depends on how much compute you do per byte moved. Arithmetic intensity is the ratio of floating-point operations to bytes of memory traffic (FLOPs per byte).
- Naive matmul: low intensity (lots of memory traffic).
- Tiled matmul: high intensity (reuse data many times).
- cuBLAS: adds vectorization, prefetching, register blocking → squeezes every drop.
Arithmetic intensity is the key metric in the roofline performance model: low-intensity kernels are bandwidth-bound, high-intensity kernels are compute-bound.
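A rough estimate for the 512×512 example, using double precision and the same simplified load counting as in the cache-reuse section:

```python
M = N = K = 512
tile = 32
bytes_per_elem = 8  # float64

flops = 2 * M * N * K  # one multiply + one add per (i, j, k)

# Bytes moved under the simplified load-count model above.
naive_bytes = (M * K * N + K * N * M) * bytes_per_elem
tiled_bytes = (M * K * (N // tile) + K * N * (M // tile)) * bytes_per_elem

print("Naive intensity:", flops / naive_bytes, "FLOPs/byte")   # ~0.125
print("Tiled intensity:", flops / tiled_bytes, "FLOPs/byte")   # ~4
```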
Tiling in ML Compilers
When you write:
```python
y = model(x)
```
your framework eventually lowers to kernels where tiling decisions matter:
- TVM / Halide search different tile sizes (“schedules”).
- Triton lets you pick block sizes explicitly.
- cuDNN/cuBLAS have hand-tuned tiling strategies per GPU.
The principle is always the same: move data less, reuse more.
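For example, a minimal Triton matmul kernel makes the tile sizes explicit as block-size constants. This is a sketch along the lines of the standard Triton tutorial, assuming M, N, K are multiples of the block sizes; boundary masking is left out for brevity:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)           # stage a tile of A on-chip
        b = tl.load(b_ptrs)           # stage a tile of B on-chip
        acc += tl.dot(a, b)           # multiply-accumulate in registers
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```

Tuning BLOCK_M, BLOCK_N, and BLOCK_K is exactly the tile-size decision above; Triton handles staging the tiles through shared memory and synchronizing threads for you.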
Takeaways
- Memory > FLOPs: Performance is usually about memory movement, not raw compute.
- Tiling = locality: Work on small blocks that fit into fast memory.
- Frameworks automate it: but understanding it helps you reason about why matmul scales the way it does.