LSM Trees

From Designing Data-Intensive Applications, Ch. 3 (Storage):

  • Writes are added to an in-memory sorted structure (memtable).
    • Write this out to disk as an SSTable when it gets too big and start with an empty memtable.
    • Compact & merge the SSTables offline.
  • Reads check the memtable first, then the (stack of) SSTables in recency order (newest first).
  • Reading an absent key has to traverse the entire tree (reading each SSTable from disk)
    • Implementations may use Bloom Filters to fast-exit in this case.
  • Need to write to a WAL before a write is successful to guard against crashes before the memtable is flushed.
  • Used in LevelDB/RocksDB.
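The write/read path above can be sketched as a toy in-memory model (all names and the tiny flush threshold are illustrative; a plain dict stands in for the sorted memtable, and real SSTables would of course live on disk):

```python
MEMTABLE_LIMIT = 2  # tiny flush threshold, for illustration only

class LSM:
    def __init__(self):
        self.wal = []       # write-ahead log: replayed on crash recovery
        self.memtable = {}  # in-memory structure (dict stands in for a sorted map)
        self.sstables = []  # immutable "on-disk" SSTables, newest first

    def put(self, key, value):
        self.wal.append((key, value))  # must be durable before the write succeeds
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # flush: write the memtable out as a sorted, immutable SSTable
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.wal = []  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:     # 1. memtable first
            return self.memtable[key]
        for table in self.sstables:  # 2. SSTables, newest to oldest
            if key in table:         # a Bloom filter would let us skip
                return table[key]    #    tables that can't contain the key
        return None                  # absent key: every table was probed

db = LSM()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    db.put(k, v)
db.get("a")  # served from the SSTable created when {"a", "b"} was flushed
```

The absent-key case (`get("z")`) shows why Bloom filters matter: without them, every SSTable is probed before returning not-found.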

Designing Data-Intensive Applications hand-waves away the actual implementation details of data layout and compaction strategies (what makes an LSM-Tree a Tree, essentially); here’s more info from the RocksDB wiki:

Compaction Strategies

https://github.com/facebook/rocksdb/wiki/Compaction

  • Leveled

    • SSTables are stored in a number of levels.
    • Each level (except level 0) is a single sorted run that is split into many files.
    • Each level is many times larger than the previous level.
    • Data from the memtable is flushed to level 0; this level necessarily contains SSTables with overlapping keys.
    • When level-0 becomes larger than a threshold, all its SSTables are merged into all the SSTables from level 1 that contain overlapping keys.
    • The new SSTables thus formed along with SSTables from level 1 that weren’t part of the merge become the new level 1.
    • This may happen recursively, so write amplification is a concern.
  • Leveled-N

    • It allows more than one sorted run per level.
    • Compaction merges all sorted runs from level $L_{n-1}$ into one sorted run in $L_n$, which is leveled.
    • More read amplification, lower write amplification.
  • Tiered

    • Compaction merges all sorted runs in one level to create a new sorted run in the next level.
    • Each level has $N$ sorted runs. Each sorted run is $N$ times larger than sorted runs in the previous level.
    • Compaction does not read/rewrite the sorted runs already in $L_n$ when merging into $L_n$.
    • If I understand correctly, this differs from Leveled-N in that destination levels only ever get new SSTables appended; existing ones are never rewritten.
    • Tiered compaction minimizes write amplification at the cost of read and space amplification.
    • This is an interesting point that I need to think about some more:

      A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

  • Tiered+Leveled

    • Tiered at the smaller levels (closer to the memtable), leveled at the larger levels.
    • Smaller levels compact most frequently, so tiering them keeps write amplification down; larger levels hold most of the data, so leveling them keeps read and space amplification down.
    • The smaller levels hold little data (and Bloom filters help skip their extra sorted runs), so the read-amplification cost of tiering them is limited.
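The write-amplification contrast between leveled and tiered can be made concrete with a toy simulation. This is a rough model, not RocksDB's actual accounting: the fanout and flush counts are arbitrary assumptions, and level 0 is simplified to a single sorted run. It counts bytes written to "disk" as memtables of size 1 are flushed.

```python
FANOUT = 4    # each level is ~4x the previous (assumed)
FLUSHES = 64  # number of memtable flushes to simulate

def leveled_writes():
    """Leveled: merging into Ln rewrites Ln's single sorted run each time."""
    levels = [0]  # bytes per level; one sorted run per level (L0 simplified)
    written = 0
    for _ in range(FLUSHES):
        written += 1  # flushing the memtable itself writes bytes
        incoming, n = 1, 0
        while True:
            levels[n] += incoming
            written += levels[n]  # merge rewrites the whole destination run
            if levels[n] <= FANOUT ** (n + 1):  # level within its size target
                break
            incoming, levels[n] = levels[n], 0  # spill the run down a level
            n += 1
            if n == len(levels):
                levels.append(0)
    return written

def tiered_writes():
    """Tiered: runs accumulate; merging into Ln never rewrites Ln's runs."""
    levels = [[] for _ in range(10)]  # each level is a list of run sizes
    written = 0
    for _ in range(FLUSHES):
        written += 1
        levels[0].append(1)
        for n in range(len(levels) - 1):
            if len(levels[n]) >= FANOUT:       # too many runs in this level
                merged = sum(levels[n])
                written += merged              # write ONE new run into Ln+1
                levels[n] = []
                levels[n + 1].append(merged)   # existing runs untouched
    return written

print(leveled_writes(), tiered_writes())  # leveled writes more bytes overall
```

Under these assumptions, leveled rewrites roughly twice the bytes of tiered for the same logical writes; the gap widens with more levels, which is the trade-off the notes above describe (tiered pays it back in read and space amplification instead).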

Links/Resources
