Blocking/Tiling is a well-known way to make program faster by leveraging the properties of memory hierarchy.

Take matrix multiplication as an example, multiplying two NxN matrices with the following code (assuming row-major):

```
for i := 1 to N do
for k := 1 to N do
r := A[i,k];
for j := 1 to N do
C[i,j] += r * B[k,j];
```

This code is performing good for several reasons:

- The innermost loop accesses consecutive data in
`B`

and`C`

, which utilizes the cache prefetch mechanism. - The same row of
`C`

is reused in each iteration of the middle loop

And if the matrices are small, `B`

can store entirely in the cache, then the
code can be fast. If the matrices are large, so that the cache cannot even hold
a row of data (`C`

), the cache is never reused. In this case (worst case), there
would be $2N^3+N^2$ elements read in $N^3$ iterations.

The blocked code is like the following:

```
for kk := 1 to N step b do {* N/b blocks *}
for jj := 1 to N step b do {* N/b blocks *}
for i := 1 to N do {* each col in C *}
for k := kk to min(kk+b-1, N) do {* computation on a block in B *}
r := A[i,k]; {* loop over a vector in A *}
for j := jj to min(jj+b-1,N) do
C[i,j] += r * B[k,j]; {* vector in C = vector in A x block in B *}
```

This code would be efficient if the block of `B`

of size bxb can reside inside the
cache, so that it can be reused throughout the loop on `i`

. Then the memory read
would be $2N^3/b+N^2$, reduced by a factor of $b$ at most. The optimal value (by
Hong & Kung, 1981) of blocking factor $b$ is roughly square root of $c$, where $c$ is
the size of the cache.

In practice of running these algorithms, the performance is not as good as expected because of the interference misses in the cache — because the cache is set associative.

The cache misses is higher than expected because there can be interferences:
there could be *cross interference* when two different variables conflicts each
other in cache, or *self interference* when elements of the same array variable
conflicts with itself in cache.

The paper elaborated that, the intrinsic cache miss is $N^3(2/b+4b/c)$, which is the best performance that can be achieved. So minimizing this yields blocking factor of $b=\sqrt{c/2}$. In case of a-way set associative cache is used, the optimal block size would be (heuristic) $b=\sqrt{c(a-1)/2a}$

## Bibliographic data

```
@inproceedings{
title = "The Cache Performance and Optimization of Blocked Algorithms",
author = "Monica S. Lam and Edward E. Rothberg and Michael E. Wolf",
booktitle = "Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems",
pages = "63--74",
howpublished = "ASPLOS",
year = "1991",
}
```