What Every Programmer Should Know About SSDs

https://databasearchitects.blogspot.com/2021/06/what-every-programmer-should-know-about.html?m=1

SSDs are more complicated than magnetic disks, and their performance behavior can appear quite mysterious if one simply thinks of them as fast disks.

  • SSD reads are ~100x faster than disk reads (~100µs vs. ~10ms); writes appear even faster (~10µs), though see the write-cache caveat below

  • The SSD storage medium is flash; an SSD contains dozens of flash chips (FCs), which can serve reads concurrently

  • Writes are split across FCs, and a hardware prefetcher keeps multiple FCs busy during a sequential read

    • How is the data split across FCs? Striped, or simply divided in order? (A common interleaving scheme is sketched just below.)
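The post doesn't say; a common design interleaves consecutive logical pages round-robin across channels/flash chips so that a sequential scan touches many chips at once. A toy illustration of such a mapping (the chip count and the scheme are my own assumptions, not a statement about any particular drive):

```c
/* Toy round-robin striping of logical pages across flash chips.
 * An illustrative guess at a common scheme, not how any specific SSD works. */
#include <stdio.h>

#define NUM_CHIPS 8            /* hypothetical number of flash chips */

/* Map a logical page number to (chip, page-within-chip). */
static void map_page(unsigned long logical_page,
                     unsigned *chip, unsigned long *chip_page) {
    *chip      = logical_page % NUM_CHIPS;   /* consecutive pages land on different chips */
    *chip_page = logical_page / NUM_CHIPS;
}

int main(void) {
    for (unsigned long lp = 0; lp < 16; lp++) {
        unsigned chip; unsigned long cp;
        map_page(lp, &chip, &cp);
        printf("logical page %2lu -> chip %u, page %lu\n", lp, chip, cp);
    }
    return 0;
}
```

With a layout like this, a sequential read of 16 pages keeps all 8 chips busy (which is what the hardware prefetcher exploits), while a single random read still hits only one chip.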
  • Multiple FCs must be kept busy “manually” during random reads by issuing as many concurrent random reads as there are FCs (see the io_uring sketch after this item)

    • The post implies that libaio and io_uring handle this automatically, but:
    • Is this support customized based on the number of FCs on a specific SSD?
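As far as I can tell, libaio and io_uring are not tuned per drive; they simply make it cheap to keep many reads in flight, and the application chooses the queue depth itself (ideally at least the number of FCs). A minimal liburing sketch, assuming liburing is installed; the file name, queue depth, and offsets are arbitrary:

```c
/* Keep many random reads in flight so the SSD's internal parallelism is used.
 * Sketch only: minimal error handling, buffers are leaked, and QUEUE_DEPTH is
 * an arbitrary guess at "enough requests to cover all flash chips". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

#define QUEUE_DEPTH 64
#define BLOCK_SIZE  4096

int main(void) {
    /* Hypothetical large file; O_DIRECT bypasses the OS page cache. */
    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    /* Queue QUEUE_DEPTH random 4 KiB reads, then submit them all at once. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        void *buf;
        posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE);   /* O_DIRECT needs aligned buffers */
        off_t offset = (off_t)(rand() % (1 << 20)) * BLOCK_SIZE;  /* random aligned offset */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, offset);
    }
    io_uring_submit(&ring);

    /* Reap all completions. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0) fprintf(stderr, "read %d failed: %d\n", i, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

With a queue depth of 1 this degenerates to one chip at a time; pushing the depth up toward (and beyond) the FC count is what gets random-read throughput anywhere near sequential-read throughput.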
  • SSDs have a volatile write cache that makes writes appear fast, but an actual persistent write is as slow as 1ms

    • Server-grade SSDs provide battery persistence for the write cache so a flush isn’t strictly required
    • Is Linux smart about this, or do writes go to RAM first before the on-SSD cache? (See the durability sketch below.)
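As far as I know, a buffered write() lands in the OS page cache (RAM) first; only O_DIRECT bypasses it. Either way, durability requires an explicit fsync()/fdatasync(), which pushes the data to the device and asks it to flush its volatile write cache (the kernel skips that flush for devices that report a write-through or protected cache), and that flush is where the ~1ms persistent-write cost shows up. A minimal sketch (file name is arbitrary):

```c
/* Durability on Linux: write() alone is not persistent; the data may still sit
 * in the OS page cache and then in the SSD's volatile write cache. fdatasync()
 * forces it to stable storage, which is the slow (~1 ms) step on drives without
 * a power-loss-protected cache. Sketch only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);  /* hypothetical log file */
    if (fd < 0) { perror("open"); return 1; }

    const char record[] = "commit 42\n";
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* Fast so far: the record is (at best) in RAM. Only after fdatasync()
     * returns may we claim it is durable. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```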
  • Writes can be parallelized across FCs to keep overall throughput high, but writes don’t parallelize as well as reads: because a write occupies a flash chip 10 times longer than a read, writes cause significant tail latencies for reads to the same flash chip.

  • Pages cannot be overwritten; new pages are appended to blocks that were erased. Blocks contain hundreds of pages, so it isn’t tenable to erase an entire block whenever a page needs to be rewritten.

    • SSDs handle overwrites/rewrites by writing a new version of the page to a new location/block, and storing a mapping from logical address -> physical address (the flash translation layer, FTL)
    • A garbage collector erases blocks when no erased blocks remain; orphaned (stale) pages in the block being erased simply disappear, and the still-live pages are rewritten to the beginning of the freshly erased block, (ideally) leaving space for more pages to be appended.
    • The post illustrates this with before/after figures: the block layout before garbage collection, and the layout after Block 0 is erased to make room for new pages (figures not copied here; a toy FTL/GC sketch follows below)
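To make the relocation step concrete, here is a toy in-memory FTL sketch (block/page counts, the greedy victim choice, and the workload are all made up for illustration, not real firmware): every overwrite goes to a fresh physical page, and when space runs out the garbage collector erases the block with the fewest live pages and copies those live pages back to its start.

```c
/* Toy flash translation layer: out-of-place writes + garbage collection.
 * Everything here is a made-up illustration, not how real SSD firmware works. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_BLOCKS      4
#define PAGES_PER_BLOCK 4
#define NUM_PAGES       (NUM_BLOCKS * PAGES_PER_BLOCK)
#define INVALID         -1

static int  l2p[NUM_PAGES];        /* logical page  -> current physical page      */
static int  owner[NUM_PAGES];      /* physical page -> logical page written there */
static int  write_ptr[NUM_BLOCKS]; /* next free slot in each block                */
static long physical_writes, logical_writes;

static void init(void) {
    for (int i = 0; i < NUM_PAGES; i++) l2p[i] = owner[i] = INVALID;
}

/* A physical page is live if it still holds the current copy of its logical page. */
static int is_live(int phys) {
    return owner[phys] != INVALID && l2p[owner[phys]] == phys;
}

/* Append into any block with free slots; returns the physical page or INVALID. */
static int append(int logical) {
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (write_ptr[b] < PAGES_PER_BLOCK) {
            int phys = b * PAGES_PER_BLOCK + write_ptr[b]++;
            owner[phys] = logical;
            l2p[logical] = phys;
            physical_writes++;
            return phys;
        }
    }
    return INVALID;
}

/* Erase the block with the fewest live pages; relocating those live pages
 * (back to the start of the freshly erased block) is the write amplification. */
static void gc(void) {
    int victim = 0, best = PAGES_PER_BLOCK + 1;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        int live = 0;
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            live += is_live(b * PAGES_PER_BLOCK + p);
        if (live < best) { best = live; victim = b; }
    }
    int live_logical[PAGES_PER_BLOCK], n = 0;
    for (int p = 0; p < PAGES_PER_BLOCK; p++) {
        int phys = victim * PAGES_PER_BLOCK + p;
        if (is_live(phys)) live_logical[n++] = owner[phys];
        owner[phys] = INVALID;                 /* erase */
    }
    write_ptr[victim] = 0;
    for (int i = 0; i < n; i++) append(live_logical[i]);   /* relocate live pages */
}

/* Host-visible write: always goes to a new physical page ("out of place"). */
static void host_write(int logical) {
    logical_writes++;
    if (append(logical) == INVALID) {          /* no erased space left */
        gc();
        append(logical);
    }
}

int main(void) {
    init();
    for (int i = 0; i < 200; i++)
        host_write(rand() % 12);               /* random overwrites of 12 logical pages */
    printf("write amplification = %.2f\n",
           (double)physical_writes / (double)logical_writes);
    return 0;
}
```

The printed ratio ends up above 1.0 precisely because still-live pages get copied during GC; a purely sequential overwrite pattern (each block's pages invalidated together, so victims are fully stale) would keep it near 1.0, which is what the bullets below are getting at.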
  • Ergo, write amplification: a still-live page (P0 in the post’s figure) is physically rewritten even though there isn’t a logical reason to do so

    • Sequential writes have lower amplification (and are presumably faster in high percentiles) than random writes
    • When a block is erased, there is no write amplification if all the pages in it can be deleted; the amplification (WA) grows as the fraction f of pages that can’t be deleted grows (the post shows this in a figure, not copied here; a rough derivation is sketched below)
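A back-of-the-envelope derivation of that relationship (my own sketch; the post’s figure may use a different model): suppose garbage collection always picks blocks in which a fraction f of the N pages are still live. Each erase then relocates f·N pages and frees (1 - f)·N slots for new host writes, so the device performs N physical page writes for every (1 - f)·N logical ones:

    WA ≈ N / ((1 - f)·N) = 1 / (1 - f)

So WA = 1 (no amplification) at f = 0 and blows up as f approaches 1, which is why over-provisioned free space and sequential write patterns (which tend to invalidate whole blocks at once, driving f toward 0 for the victim blocks) keep write amplification low.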