LSM Trees

From Designing Data-Intensive Applications, Ch. 3 (Storage):

  • Writes are added to an in-memory sorted structure (memtable).
    • Write this out to disk as an SSTable when it gets too big and start with an empty memtable.
    • Compact & merge the SSTables offline.
  • Reads check the memtable first, then the (stack of) SSTables in recency order (newest first).
  • Reading an absent key has to traverse the entire tree (reading each SSTable from disk)
    • Implementations may use Bloom Filters to fast-exit in this case.
  • Need to write to a WAL before a write is successful to guard against crashes before the memtable is flushed.
  • Used in LevelDB/RocksDB.
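The write/read path above can be sketched as a toy in-memory model (all names and the tiny flush threshold are illustrative; a plain dict stands in for the sorted memtable, and real SSTables would of course live on disk):

```python
MEMTABLE_LIMIT = 2  # tiny flush threshold, for illustration only

class LSM:
    def __init__(self):
        self.wal = []       # write-ahead log: replayed on crash recovery
        self.memtable = {}  # in-memory structure (dict stands in for a sorted map)
        self.sstables = []  # immutable "on-disk" SSTables, newest first

    def put(self, key, value):
        self.wal.append((key, value))  # must be durable before the write succeeds
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # flush: write the memtable out as a sorted, immutable SSTable
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.wal = []  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:     # 1. memtable first
            return self.memtable[key]
        for table in self.sstables:  # 2. SSTables, newest to oldest
            if key in table:         # a Bloom filter would let us skip
                return table[key]    #    tables that can't contain the key
        return None                  # absent key: every table was probed

db = LSM()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    db.put(k, v)
db.get("a")  # served from the SSTable created when {"a", "b"} was flushed
```

The absent-key case (`get("z")`) shows why Bloom filters matter: without them, every SSTable is probed before returning not-found.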

Designing Data-Intensive Applications hand-waves away the actual implementation details of data layout and compaction strategies (what makes an LSM-Tree a Tree, essentially); here’s more info from the RocksDB wiki:

Compaction Strategies

https://github.com/facebook/rocksdb/wiki/Compaction

  • Leveled

    • SSTables are stored in a number of levels.
    • Each level (except level 0) is a single sorted run that is split into many files.
    • Each level is many times larger than the previous level.
    • Data from the memtable is flushed to level 0; this level necessarily contains SSTables with overlapping keys.
    • When level-0 becomes larger than a threshold, all its SSTables are merged into all the SSTables from level 1 that contain overlapping keys.
    • The new SSTables thus formed along with SSTables from level 1 that weren’t part of the merge become the new level 1.
    • This may happen recursively, so write amplification is a concern.
  • Leveled-N

    • It allows more than one sorted run per level.
    • Compaction merges all sorted runs from level $L_{n-1}$ into one sorted run in $L_n$, which is leveled.
    • More read amplification, lower write amplification.
  • Tiered

    • Compaction merges all sorted runs in one level to create a new sorted run in the next level.
    • Each level has $N$ sorted runs. Each sorted run is $N$ times larger than sorted runs in the previous level.
    • Compaction does not read/rewrite the sorted runs already in $L_n$ when merging into $L_n$.
    • If I understand correctly, this differs from Leveled-N in that destination levels only ever get new SSTables appended; existing ones are never rewritten.
    • Tiered compaction minimizes write amplification at the cost of read and space amplification.
    • This is an interesting point that I need to think about some more:

      A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

  • Tiered+Leveled

    • Tiered at the smaller levels (closer to the memtable), leveled at the larger levels.
    • Smaller levels compact most frequently, so tiering them keeps write amplification down; larger levels hold most of the data, so leveling them keeps read and space amplification down.
    • The smaller levels hold little data (and Bloom filters help skip their extra sorted runs), so the read-amplification cost of tiering them is limited.
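The write-amplification contrast between leveled and tiered can be made concrete with a toy simulation. This is a rough model, not RocksDB's actual accounting: the fanout and flush counts are arbitrary assumptions, and level 0 is simplified to a single sorted run. It counts bytes written to "disk" as memtables of size 1 are flushed.

```python
FANOUT = 4    # each level is ~4x the previous (assumed)
FLUSHES = 64  # number of memtable flushes to simulate

def leveled_writes():
    """Leveled: merging into Ln rewrites Ln's single sorted run each time."""
    levels = [0]  # bytes per level; one sorted run per level (L0 simplified)
    written = 0
    for _ in range(FLUSHES):
        written += 1  # flushing the memtable itself writes bytes
        incoming, n = 1, 0
        while True:
            levels[n] += incoming
            written += levels[n]  # merge rewrites the whole destination run
            if levels[n] <= FANOUT ** (n + 1):  # level within its size target
                break
            incoming, levels[n] = levels[n], 0  # spill the run down a level
            n += 1
            if n == len(levels):
                levels.append(0)
    return written

def tiered_writes():
    """Tiered: runs accumulate; merging into Ln never rewrites Ln's runs."""
    levels = [[] for _ in range(10)]  # each level is a list of run sizes
    written = 0
    for _ in range(FLUSHES):
        written += 1
        levels[0].append(1)
        for n in range(len(levels) - 1):
            if len(levels[n]) >= FANOUT:       # too many runs in this level
                merged = sum(levels[n])
                written += merged              # write ONE new run into Ln+1
                levels[n] = []
                levels[n + 1].append(merged)   # existing runs untouched
    return written

print(leveled_writes(), tiered_writes())  # leveled writes more bytes overall
```

Under these assumptions, leveled rewrites roughly twice the bytes of tiered for the same logical writes; the gap widens with more levels, which is the trade-off the notes above describe (tiered pays it back in read and space amplification instead).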

Links/Resources
