What Every Programmer Should Know About Memory

2. Commodity Hardware Today

  • Skimmed this because the details aren’t too relevant
  • CAS latency: the delay between issuing the column address (Column Address Strobe) and the data actually being available
  • It’s much more effective to transmit multiple consecutive words at once from the same, already-open row, so the RAS and CAS signals only have to be sent once per burst.
  • A typical Intel 2-core processor (for the time) has a clock 11 times faster than the FSB (the bus connecting the processor to the memory controller, and through it to RAM). So every cycle the processor has to wait on the FSB costs it 11 of its own cycles.

3. CPU Caches

  • L1 caches are split into two, for data (L1d) and instructions (L1i)
  • Hyperthreads on a core share all caches (each has its own register set, though)
  • Each CPU core has its own L1 cache but shares L2+ with other cores
  • Multiple processors don’t share any caches
  • Caches are a map: data keyed by its location in main memory. If the cache granularity were too small, the key (tag) overhead would grow too large. That, combined with the fact that RAM is more efficient when multiple consecutive words are fetched at once, leads to “lines” of data being stored in the cache, normally 64 bytes each (there’s a small sketch of this decomposition at the end of this section).
  • A cache can’t hold partial lines. When memory is modified, the cache line is marked dirty.
  • AMD uses exclusive caches, with no overlap between each cache level. This improves space efficiency, but an L1 eviction might require an L2 and L3 eviction as well if they’re all full. Intel uses inclusive caches, where a cache line in L1 is also in L2, so an L1 eviction doesn’t have to move that line down to L2.
  • Cache coherency: a dirty cache line only lives in one CPU’s cache at a time; before another CPU can read or modify that line, the up-to-date contents have to be transferred or written back first.
  • Pentium M cache miss costs: /IMG_37BF96C6EBAC-1.jpeg
  • This graphs a workload that reads a bunch of memory in a loop; the amount of memory involved (the working set) increases along the X-axis. /IMG_2C87CB234D0B-1.jpeg
    • There are three plateaus, starting at $2^{10}$, $2^{13}$, and $2^{21}$.
    • The latter two probably correspond to the sizes of the L1d and L2 caches, because it’s at that point that the processor has to start fetching from a slower store, pushing the cycle count up to a higher plateau.
    • A dataset that fits in L1 makes for 50x fewer cycles!
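
To make the “cache as a map” idea concrete, here’s a tiny sketch of how an address decomposes with 64-byte lines. The cache geometry used (32 kB, 8-way set associative) is a made-up example, not a figure from the paper:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical L1d: 32 kB, 8-way set associative, 64-byte lines
   => 32768 / 64 / 8 = 64 sets. Example numbers only. */
#define LINE_SIZE 64
#define NUM_SETS  64

int main(void) {
  uintptr_t addr   = 0x1234abcd;                     /* some data address    */
  uintptr_t offset = addr % LINE_SIZE;               /* byte within the line */
  uintptr_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* which set it maps to */
  uintptr_t tag    = addr / LINE_SIZE / NUM_SETS;    /* key compared on lookup */
  printf("offset=%lu set=%lu tag=%#lx\n",
         (unsigned long)offset, (unsigned long)set, (unsigned long)tag);
  return 0;
}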

4. Virtual Memory

  • Virtual memory is implemented by the processor’s MMU; the OS just has to “fill in the page table data structures”. I didn’t realize this was mainly hardware driven!

    /880px-Virtual_address_space_and_physical_address_space_relationship.svg.png

  • The simplest design is to use the first N bits of a virtual address as an index to find a particular page, and use the remaining bits as the offset of a specific address within that page. For 4kB pages you only need 12 bits for the offset, leaving 20-52 bits for the index. (There’s a code sketch of this split at the end of this section.)

    • You then store a mapping (called a page directory) in main memory from this index to the physical address of that page. This directory is per-process, though, so a flat single-level table wastes a lot of memory. /IMG_6AC90B3CE524-1.jpeg
    • Use multiple levels of indexes instead. Any sub-directory with no entries in use doesn’t have to be allocated, making a sparse representation a lot more likely. Current CPUs use 4 levels. /IMG_2DD897145FCB-1.jpeg
    • This is still pretty slow because resolving an address can take up to 4 memory accesses. Storing the entire directory tree in L1 can be wasteful, and this would still take 4 L1 accesses, which is not ideal.
      • This is mitigated with a TLB (translation lookaside buffer) that caches the fully resolved physical address for a virtual address. Once cached, no more directory tree walks are required.
      • There are multiple levels of TLBs, analogous to the data caches: a small, fast one backed by a larger, slower one.
      • The TLB is generally only useful in the context of a process, so what happens when a context switch happens? The easiest solution is to flush the TLB, but this is wasteful. A better solution is to allocate a few bits of the virtual address to a process identifier. This isn’t guaranteed to be unique, but can help reduce the number of TLB flushes.
    • x86 can use 2MB or 4MB pages if required, although the default is 4kB. It can be very tricky finding room to allocate large pages like this, which is why Linux requires that you boot with a flag enabling large pages, and a bunch of them are allocated before userspace is loaded.
      • Large pages mean each TLB entry covers much more memory (and the page-table walk has one fewer level), so fewer lookups and fewer misses are needed overall.
    • Modern processors support nested page tables (Intel’s EPT, AMD’s NPT), which let a guest OS maintain its own page tables without invoking the VMM for every change; the older software technique of VMM-maintained “shadow page tables” required that. The translation becomes: virtual address (inside VM) -> guest physical address (inside VM) -> real physical address (outside VM).
    • The cost of cache misses is higher in virtualized environments.
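
To make the multi-level split concrete: on x86-64 with 4kB pages, the 12 lowest bits of a virtual address are the page offset and each of the 4 directory levels gets 9 index bits. The code below is just my illustration of that split:

#include <stdint.h>
#include <stdio.h>

/* x86-64, 4kB pages, 4 paging levels: 12-bit offset + 4 x 9-bit indexes. */
static void split_address(uint64_t vaddr) {
  uint64_t offset = vaddr & 0xfff;          /* byte within the 4kB page      */
  uint64_t l1     = (vaddr >> 12) & 0x1ff;  /* lowest-level page table index */
  uint64_t l2     = (vaddr >> 21) & 0x1ff;
  uint64_t l3     = (vaddr >> 30) & 0x1ff;
  uint64_t l4     = (vaddr >> 39) & 0x1ff;  /* top-level directory index     */
  printf("L4=%lu L3=%lu L2=%lu L1=%lu offset=%lu\n",
         (unsigned long)l4, (unsigned long)l3, (unsigned long)l2,
         (unsigned long)l1, (unsigned long)offset);
}

int main(void) {
  split_address(0x00007f1234567890ULL);
  return 0;
}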

6. What Programmers Can Do

Write-heavy workloads where you aren’t going to read your writes back can pollute the cache unnecessarily, because a normal write first pulls the whole cache line into the cache (write-allocate). CPUs provide non-temporal write instructions for writing directly to memory without caching. Optimizations such as batching the writes so a full cache line goes out at once (write-combining) may be applied.
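
A rough sketch of what this looks like with SSE2 intrinsics (the buffer-filling function is my own example; _mm_stream_si32 and _mm_sfence are the real intrinsics):

#include <emmintrin.h>  /* SSE2: _mm_stream_si32 */
#include <xmmintrin.h>  /* _mm_sfence */

/* Fill a buffer with non-temporal stores: the data goes (via the CPU's
   write-combining buffers) straight to memory without landing in the cache.
   Only worth it if we won't read buf again any time soon. */
static void fill_nocache(int *buf, int n, int value) {
  for (int i = 0; i < n; ++i)
    _mm_stream_si32(&buf[i], value);  /* movnti: non-temporal store */
  _mm_sfence();                       /* make the stores globally visible */
}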

There’s also memory that is uncacheable by nature, such as memory-mapped IO.

Why is memory-mapped IO uncacheable? 🤔 Presumably because reads and writes to device registers have side effects, so every access has to actually reach the device rather than be satisfied from (or buffered in) a cache.

Non-temporal reads are possible in more modern CPUs, where you can stream a bunch of sequential cache lines without putting them in the caches if you know you aren’t going to read them again or dirty them.
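
A sketch using the SSE4.1 streaming-load intrinsic (MOVNTDQA). One caveat: it only truly bypasses the cache hierarchy for write-combining memory; on ordinary write-back memory it behaves like a normal load, so treat this as an illustration of the API rather than a guaranteed win:

#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128, _mm_extract_epi32 */

/* Sum a 16-byte-aligned int buffer using streaming loads, hinting that the
   data shouldn't displace what's already in the caches. */
static long long sum_streaming(int *buf, int n) {
  long long total = 0;
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128i v = _mm_stream_load_si128((__m128i *)(buf + i));  /* movntdqa */
    total += _mm_extract_epi32(v, 0) + _mm_extract_epi32(v, 1)
           + _mm_extract_epi32(v, 2) + _mm_extract_epi32(v, 3);
  }
  for (; i < n; ++i)  /* leftover elements */
    total += buf[i];
  return total;
}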

Sequential memory access can provide a big speedup. This is a matrix multiplication routine:

/* Naive version: mul1 is walked row-by-row (sequentially in memory), but
   mul2 is walked column-by-column, jumping a whole row ahead on every step
   of the inner loop. */
for (i = 0; i < N; ++i)
	for (j = 0; j < N; ++j)
		for (k = 0; k < N; ++k)
			res[i][j] += mul1[i][k] * mul2[k][j];

Iteration over mul1 happens sequentially, which is to say row-by-row. mul2 is processed column-by-column, though. One way to fix this is to transpose mul2 beforehand so it can be processed row-by-row. The transposition process is not free, but hopefully results in overall savings. And it does (for 1000x1000 matrices)! The transposed version uses ~75% fewer cycles.

/IMG_3F56A3E53E33-1.jpeg
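
The transposed version might look roughly like this (my reconstruction, assuming the matrices hold doubles; tmp is the transpose of mul2, so every operand ends up being walked row-by-row):

/* One-off transposition of mul2 (for large N you'd allocate tmp on the heap). */
double tmp[N][N];
for (i = 0; i < N; ++i)
	for (j = 0; j < N; ++j)
		tmp[i][j] = mul2[j][i];

/* Now every matrix is read sequentially, row by row. */
for (i = 0; i < N; ++i)
	for (j = 0; j < N; ++j)
		for (k = 0; k < N; ++k)
			res[i][j] += mul1[i][k] * tmp[j][k];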

There’s another “sub-matrix” optimization described that I didn’t understand at all. 😬 (It appears to be cache blocking: multiplying the matrices tile-by-tile, with tiles sized to fit in L1d, so each loaded cache line gets reused many times.) Using SIMD instructions makes the program use 91% fewer cycles than the original!
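
Very roughly, a blocked version looks something like this (my sketch, not the paper’s exact code; it assumes N is a multiple of the tile size SM):

/* Cache blocking: handle the multiplication in SM x SM tiles so the tiles
   touched by the inner three loops all fit in L1d together. SM = 8 is an
   arbitrary example value; the paper derives it from the cache-line size. */
#define SM 8
for (i = 0; i < N; i += SM)
  for (j = 0; j < N; j += SM)
    for (k = 0; k < N; k += SM)
      for (int i2 = i; i2 < i + SM; ++i2)
        for (int k2 = k; k2 < k + SM; ++k2)
          for (int j2 = j; j2 < j + SM; ++j2)
            /* with j2 innermost, res and mul2 rows are walked sequentially */
            res[i2][j2] += mul1[i2][k2] * mul2[k2][j2];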


A struct like this:

struct foo {
  int a;         // 4 bytes
  long fill[7];  // 8 * 7 bytes
  int b;         // 4 bytes
};

should weigh exactly 64 bytes, and therefore should fit in a cache line. But it doesn’t, because longs need to start at 8-byte boundaries (on a typical 64-bit ABI). So it’s actually laid out as “4 bytes, <4-byte hole>, 56 bytes, 4 bytes”, plus trailing padding to keep the struct size a multiple of 8 (72 bytes in total), overflowing the cache line. Reordering it so that int b comes before fill fixes this.
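
A quick way to check the reordering (a sketch, assuming an LP64 platform where long is 8 bytes):

#include <assert.h>  /* static_assert (C11) */

struct foo_reordered {
  int a;         // 4 bytes
  int b;         // 4 bytes, moved up to fill what used to be the hole
  long fill[7];  // 8 * 7 bytes
};

/* 4 + 4 + 56 = 64: exactly one cache line, no padding needed. */
static_assert(sizeof(struct foo_reordered) == 64,
              "expected to fit exactly in one 64-byte cache line");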

A cache line isn’t loaded all at once but in several smaller blocks, so words earlier in the cache line are generally available sooner. This implies:

  • Put critical parts of your structure at the beginning
  • If you’re accessing parts of your structure in a particular order, make the memory layout match this order

This only works when your structure as a whole is aligned to the beginning of a cache line, which is not guaranteed. posix_memalign (or the older memalign) gives you that for heap allocations; for variables and types there’s alignas / __attribute__((aligned(64))). Alignment can provide speedups of 50-400% for various working set sizes, so optimizing this could be very effective.
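
For example (a sketch; the struct and sizes are invented, and it assumes 8-byte longs), allocating an array so that every element starts on a cache-line boundary:

#include <stdint.h>
#include <stdlib.h>   /* posix_memalign, free */

#define CACHE_LINE 64

struct hot {
  long counter;   /* most frequently accessed member first */
  long data[7];   /* pads the struct out to exactly one 64-byte line (LP64) */
};

int main(void) {
  struct hot *arr;
  /* 1024 elements, with arr (and hence every element) starting on a
     cache-line boundary. */
  if (posix_memalign((void **)&arr, CACHE_LINE, 1024 * sizeof(struct hot)) != 0)
    return 1;
  arr[0].counter = 42;
  free(arr);
  return 0;
}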


Split large structures into multiple structures based on frequency of access. This makes it easier to cache data that’s accessed/dirtied often.
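
For instance (field names invented for illustration), a connection object whose fast path only touches a couple of fields might be split like this:

/* Hot fields, touched on every request: kept small so many of them fit
   in the cache at the same time. */
struct conn_hot {
  int fd;
  unsigned state;
  struct conn_cold *cold;   /* rarely-needed details live elsewhere */
};

/* Cold fields, only needed on setup/teardown or for diagnostics. */
struct conn_cold {
  char peer_name[256];
  long bytes_sent;
  long connected_at;
};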
