2025-04-10

Cache Efficient Algorithms

1. Algorithmic complexity: Memory Transfers

Generally, when designing algorithms, we care about the time and space complexity of the algorithm. But there is another important aspect to look at: the number of memory transfers.

We know that CPUs can directly access data only from the L1 cache. So the number of times we have to transfer data from main memory to the L1 cache affects the runtime of our algorithms. We can express this using the asymptotic memory transfer complexity of our algorithm.

For example, to sum all the elements of an array, we have to transfer \(N/B\) blocks from memory, plus possibly one extra block depending on the array's alignment, where \(B\) is the cache line size or block size, i.e. the number of words transferred at a time from memory. So the memory transfer complexity is

\begin{align*} MT(N) = O(N/B) \end{align*}

We can use this measure to compare how efficiently different algorithms use the cache.
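
For instance, here is a minimal C++ sketch of such a scan. The comment assumes a typical 64-byte cache line; the exact value of \(B\) is machine-dependent:

    #include <vector>

    // One sequential pass: consecutive elements share cache lines, so the
    // scan transfers about N/B lines (plus one, depending on alignment),
    // each loaded into L1 at most once. With a typical 64-byte line and
    // 8-byte doubles, B = 8 words per transfer.
    double sum(const std::vector<double>& a) {
        double s = 0.0;
        for (double x : a) s += x;
        return s;
    }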

2. An Example

Let's look at a concrete example.

[Figure: matrix_order.png]

I have created a matrix where each row has a few elements, so that a row fits in one cache block, but with so many rows that the whole matrix is much bigger than my L1 cache. The task is to find the sum of all the elements of the matrix.

Now, we can find the sum in row order, i.e. going through the rows one by one. This way, when the first element of a row is loaded into the L1 cache, the rest of the elements of that row are loaded along with it. This is the cache-optimal way, and it takes about 0.8 seconds for this particular example.

In contrast, if we do this in column order, then when a row is loaded into the cache, only one of its elements is used. Later, when going through the next column, the same row has to be loaded again, because our matrix is too big and that row will already have been evicted by a capacity miss. This is very bad in terms of cache efficiency: it takes about 2.8 seconds for the same example, roughly 3.5 times slower.
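
Here is a C++ sketch of the two traversals (my timings came from a similar experiment; exact numbers depend on the machine):

    #include <cstddef>
    #include <vector>

    // Row order: consecutive elements of a row share cache lines, so each
    // line is transferred once, about N/B transfers in total.
    double sum_row_order(const std::vector<std::vector<double>>& m) {
        double s = 0.0;
        for (const auto& row : m)
            for (double x : row) s += x;
        return s;
    }

    // Column order: consecutive accesses land in different rows, and by the
    // time we return to a row its cache line has been evicted (capacity
    // miss), so almost every access costs a transfer, about N in total.
    // Assumes a rectangular matrix.
    double sum_col_order(const std::vector<std::vector<double>>& m) {
        double s = 0.0;
        const std::size_t cols = m.empty() ? 0 : m[0].size();
        for (std::size_t j = 0; j < cols; ++j)
            for (const auto& row : m) s += row[j];
        return s;
    }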

Note that both cases have the same asymptotic time complexity, but their execution times differ by 3.5x because of the way they use the cache. Thus, along with time complexity, memory transfer complexity is a crucial aspect of designing fast practical algorithms.

3. Definition

Algorithms that take into account that there is a memory hierarchy and try to reduce memory transfers are called cache-efficient algorithms.

There are two major categories:

Cache Aware Algorithms
Algorithms optimized for the specific cache of a machine, i.e. for its specific block size and total cache size.
Cache (size) Oblivious Algorithms
Algorithms that don't depend on the specific parameters of the cache. They are more portable, but may not be as optimized as cache-aware algorithms.

In either case, cache-efficient algorithms improve execution time and also save energy.

Theorem: If an algorithm is cache oblivious and is efficient on two levels of the memory hierarchy, then it is efficient on any number of levels.

This means that if you prove a cache-oblivious algorithm efficient with just main memory and an L1 cache, it is automatically efficient when L2/L3 caches exist between main memory and the L1 cache.

4. Some Cache Oblivious Algorithms

Let's look at some cache-efficient algorithms, some of which we might already be familiar with:

  • Array reduction is cache optimal.

    E.g. summing all the elements of an array is cache optimal if we go through the elements in consecutive order.

  • Similarly, reversing an array using two pointers that move inward from the front and the back is also cache optimal (sketched after this list).
  • In fact, any algorithm that scans the array using a fixed number of pointers is cache optimal. In other words, a constant number of parallel scans is cache optimal.
  • Matrix Multiplication done using divide and conquer.

    If we do matrix-matrix multiplication in the naive way, it won't be cache optimal. By the naive way, I mean computing each element \(C_{ij}\) of the product one by one using the formula:

    \begin{align*} C_{ij} = \sum_k A_{ik} B_{kj} \end{align*}

    But if we do matrix multiplication using a divide-and-conquer algorithm like Strassen's algorithm, where the matrix is divided into blocks and the matrix product is expressed in terms of products of those blocks, we get better cache efficiency. This happens because after some division steps the blocks become so small that the matrices being multiplied, and their product, all fit inside the cache.

    And when the multiplication of the small blocks is cache optimal, the product of the bigger matrix, obtained by composing such small products, is also cache efficient. (A sketch of such a recursion follows after this list.)

  • Funnel Sort [2]: a cache-efficient variant of Merge Sort. See Cache-Oblivious Algorithms - Harald Prokop - Section 4.
  • Buffer Tree [1]: a cache-efficient way to maintain priority queues. See Cache-Oblivious Algorithms and Data Structures - Section 5.3.
  • Static Search Tree:

    Binary search takes \(O(\lg N)\) steps, and each step of the recursion costs one memory transfer, except for the last few steps, whose elements all fit within one block. So the number of memory transfers is \(O(\lg N - \lg B) = O(\lg(N/B))\).

    A B-Tree (commonly used in databases) can do this with just \(O(\lg N / \lg B) = O(\log_B N)\) transfers. We can also achieve this optimal bound for comparison search by laying out the elements differently.

    So instead of laying out the elements in sorted order in an array, if we lay them out in the van Emde Boas layout, we get a cache-optimal comparison search. This is called a Static Search Tree (a related layout is sketched after this list).

    See rcho.me for a simple exposition, or see Cache-Oblivious Algorithms and Data Structures [1, Section 4.1] for details.

  • Flash Attention: At the core of LLMs is the Transformer architecture, with an attention block that computes (\(\sigma\) being the row-wise softmax)

    \begin{align*} \sigma \left( \frac{QK^T}{\sqrt{d_k}} \right) V \end{align*}

    [Figure: flash_attention.png (Flash Attention vs. standard attention performance)]

    Attention consists of a matrix-matrix multiplication, a softmax transformation, and then a multiplication with \(V\). And all practical uses of attention use some variant of Flash Attention, which is designed in a cache-efficient way: the matrix operations are done in blocks sized so that they fit in the GPU's SRAM, and multiple steps are fused together so that reads and writes are reduced. As the benchmark graph shows, this gives a significant performance boost. (The core trick is sketched after this list.)
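
A minimal C++ sketch of the two-pointer reversal from the list above:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Two parallel scans, one from each end; every cache line of the array
    // is transferred exactly once, so this is O(N/B) memory transfers.
    void reverse_array(std::vector<int>& a) {
        for (std::size_t i = 0, j = a.size(); i + 1 < j; ++i, --j)
            std::swap(a[i], a[j - 1]);
    }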
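
For the divide-and-conquer matrix multiplication, here is a sketch of the plain eight-multiply recursion (not Strassen's seven-multiply scheme). It assumes square row-major matrices with power-of-two size, and the cutoff of 32 is an arbitrary practical choice; a strictly cache-oblivious version would recurse all the way down:

    #include <cstddef>

    // C += A * B for n x n row-major matrices (n a power of two), where each
    // pointer addresses the top-left element of its block and `stride` is
    // the row stride of the full top-level matrices.
    void matmul_rec(const double* A, const double* B, double* C,
                    std::size_t n, std::size_t stride) {
        if (n <= 32) {  // base case: these blocks fit comfortably in cache
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t k = 0; k < n; ++k)
                    for (std::size_t j = 0; j < n; ++j)
                        C[i * stride + j] += A[i * stride + k] * B[k * stride + j];
            return;
        }
        const std::size_t h = n / 2;
        auto at = [stride](std::size_t i, std::size_t j) { return i * stride + j; };
        // Each quadrant of C needs two block products, e.g.
        // C11 += A11*B11 + A12*B21:
        matmul_rec(A,            B,            C,            h, stride);
        matmul_rec(A + at(0, h), B + at(h, 0), C,            h, stride);
        matmul_rec(A,            B + at(0, h), C + at(0, h), h, stride);  // C12
        matmul_rec(A + at(0, h), B + at(h, h), C + at(0, h), h, stride);
        matmul_rec(A + at(h, 0), B,            C + at(h, 0), h, stride);  // C21
        matmul_rec(A + at(h, h), B + at(h, 0), C + at(h, 0), h, stride);
        matmul_rec(A + at(h, 0), B + at(0, h), C + at(h, h), h, stride);  // C22
        matmul_rec(A + at(h, h), B + at(h, h), C + at(h, h), h, stride);
    }
    // Top-level call for n x n matrices: matmul_rec(A, B, C, n, n);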
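
For the static search tree, here is a sketch of a related cache-friendly layout, the Eytzinger (BFS) layout. It is simpler to implement than the van Emde Boas layout described above, though it improves locality rather than achieving the full \(O(\log_B N)\) bound:

    #include <cstddef>
    #include <vector>

    // Static search tree in the Eytzinger (BFS) layout: b[1..n] holds the
    // sorted elements rearranged so that the children of node k sit at
    // indices 2k and 2k+1, packing the first levels of the tree into a
    // few cache lines.
    struct EytzingerTree {
        std::vector<int> b;  // b[0] is unused
        std::size_t n;

        explicit EytzingerTree(const std::vector<int>& sorted)
            : b(sorted.size() + 1), n(sorted.size()) {
            std::size_t i = 0;
            build(sorted, i, 1);
        }

        // In-order recursion: the left subtree gets the smaller elements.
        void build(const std::vector<int>& sorted, std::size_t& i, std::size_t k) {
            if (k <= n) {
                build(sorted, i, 2 * k);
                b[k] = sorted[i++];
                build(sorted, i, 2 * k + 1);
            }
        }

        // Index in b of the first element >= x, or 0 if all elements are < x.
        std::size_t lower_bound(int x) const {
            std::size_t k = 1;
            while (k <= n) k = 2 * k + (b[k] < x);  // branch-free descent
            // Undo the trailing "right" turns (GCC/Clang builtin):
            return k >> __builtin_ffsll((long long)~k);
        }
    };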
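
And at the heart of Flash Attention's fusion is a one-pass ("online") softmax, which lets the weighted sum over \(V\) be accumulated block by block without materializing the full attention matrix. Here is a sketch for a single query row (a simplification; the real algorithm also tiles over queries and keys and runs on the GPU):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One-pass softmax-weighted sum over the value rows V for one query.
    // scores[i] = q . k_i / sqrt(d_k). The running maximum m and running
    // normalizer l let us rescale the accumulator on the fly, so V can be
    // streamed through in blocks with no N x N attention matrix in memory.
    std::vector<float> attend_one_query(const std::vector<float>& scores,
                                        const std::vector<std::vector<float>>& V) {
        const std::size_t d = V.empty() ? 0 : V[0].size();
        std::vector<float> acc(d, 0.0f);
        float m = -INFINITY, l = 0.0f;
        for (std::size_t i = 0; i < scores.size(); ++i) {
            const float m_new = std::max(m, scores[i]);
            const float scale = std::exp(m - m_new);      // rescale old partial sums
            const float w = std::exp(scores[i] - m_new);  // weight of this key
            l = l * scale + w;
            for (std::size_t j = 0; j < d; ++j)
                acc[j] = acc[j] * scale + w * V[i][j];
            m = m_new;
        }
        for (float& a : acc) a /= l;  // normalize once, at the end
        return acc;
    }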

5. Techniques for writing Cache Efficient Code

There are some generic techniques for writing cache efficient code.

  • Divide and Conquer: Generally, divide and conquer results in cache-efficient code, because the data gets recursively divided in such a way that after some divisions the whole subproblem fits in the cache (as in the recursive matrix multiply sketched above).
  • Another strategy is using a structure of arrays (SoA) instead of an array of structures (AoS), as sketched after this list.

    For example, in a game there are entities with data like position, velocity, name, health, stats, etc. It can be better to keep all the positions in a single array, because during a physics update all the position data is updated together while fields like health and stats are untouched; fetching each entity's whole struct would waste cache space.

    [Figure: soa-vs-aos.png (structure-of-arrays vs. array-of-structures memory layout)]

  • Measure your code's performance:
    • use perf on Linux
    • use the Instruments app on macOS
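
Here is a sketch of the two layouts for the game example (the field names are hypothetical):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Array of structures: one physics pass drags name, health, etc. through
    // the cache even though only positions and velocities are needed.
    struct EntityAoS {
        float px, py, vx, vy;
        std::string name;
        int health;
    };

    // Structure of arrays: the physics pass streams through four dense float
    // arrays, so every cache line fetched is fully used.
    struct EntitiesSoA {
        std::vector<float> px, py, vx, vy;
        std::vector<std::string> name;
        std::vector<int> health;

        void physics_update(float dt) {
            for (std::size_t i = 0; i < px.size(); ++i) {
                px[i] += vx[i] * dt;
                py[i] += vy[i] * dt;
            }
        }
    };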

6. Other Resources

7. Bibliography

[1] Erik D. Demaine, Cache-Oblivious Algorithms and Data Structures, Lecture Notes from the EEF Summer School on Massive Data Sets, 2002.

[2] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran, Cache-Oblivious Algorithms, Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), 1999.


You can send your feedback and queries here.