Amazon DynamoDB Under the Hood


  • DynamoDB evolved from the Dynamo paper (Really? DDIA says they’re completely different)
  • Uses Paxos instead of quorum reads & writes.
  • Single leader per partition. One replica per partition per AZ.
  • “The leader is always up to date” (?)
  • GetItem: the “request router” performs auth, then reads from a storage node that holds data for that partition. By default this is an “eventually consistent read”.
  • PutItem: the router writes to the leader for the partition, which replicates to the two other replicas and waits for at least one ack. (API sketch below.)
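A minimal boto3 sketch of the two calls above. The table name `notes` and key attribute `pk` are hypothetical; the call shapes are the real low-level boto3 API, and `ConsistentRead` switches GetItem between the default eventually consistent read and a strongly consistent one.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# PutItem: per the notes, the router sends this to the partition's leader,
# which replicates it to the two other storage nodes.
dynamodb.put_item(
    TableName="notes",  # hypothetical table
    Item={"pk": {"S": "user#123"}, "body": {"S": "hello"}},
)

# GetItem: ConsistentRead=False (the default) is the eventually consistent
# read described above; True requests a strongly consistent read instead.
resp = dynamodb.get_item(
    TableName="notes",
    Key={"pk": {"S": "user#123"}},
    ConsistentRead=False,
)
print(resp.get("Item"))
```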
  • Thousands of request routers, thousands of storage nodes. Multi-AZ.
  • Request routers are stateless.
  • Request routers call out to the “Partition Metadata System” to find the right leader for a partition.
  • Each DynamoDB table has a primary hash key. The hash space is range-partitioned, so each partition owns a contiguous range of hash values. (Toy routing sketch below.)
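A toy sketch of the routing idea in the last two items: hash the partition key, then look the hash up in range-partitioned metadata to find the partition's leader. The hash function, ranges, and node names here are all made up; DynamoDB's real scheme is internal.

```python
import bisect
import hashlib

HASH_SPACE = 2**64

def key_hash(partition_key: str) -> int:
    # Stand-in hash; the real function is internal to DynamoDB.
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

# Pretend "partition metadata system": each partition owns a contiguous
# range of the hash space (upper bound exclusive) and has a leader node.
PARTITIONS = [
    {"upper": HASH_SPACE // 3, "leader": "storage-node-a"},
    {"upper": 2 * HASH_SPACE // 3, "leader": "storage-node-b"},
    {"upper": HASH_SPACE, "leader": "storage-node-c"},
]

def route(partition_key: str) -> dict:
    """What a request router asks the metadata system: which partition
    (and leader) owns this key's hash?"""
    h = key_hash(partition_key)
    idx = bisect.bisect_right([p["upper"] for p in PARTITIONS], h)
    return PARTITIONS[idx]

print(route("user#123")["leader"])
```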
  • Storage nodes have a B-tree with a write-ahead log (WAL). (Toy write-path sketch below.)
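A minimal sketch of the write path that item implies: append the mutation to the WAL first, then apply it to the index structure. A plain dict stands in for the B-tree, and the record format is invented.

```python
import json

class ToyStorageNode:
    """Durability comes from the WAL; a dict stands in for the on-disk B-tree."""

    def __init__(self, wal_path: str):
        self.wal = open(wal_path, "a", encoding="utf-8")
        self.btree = {}

    def apply_put(self, key: str, item: dict) -> None:
        # 1. Log the mutation before acknowledging it.
        self.wal.write(json.dumps({"op": "put", "key": key, "item": item}) + "\n")
        self.wal.flush()
        # 2. Apply it to the index.
        self.btree[key] = item
```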
  • “Auto Admin” is a control plane service that:
    • Detects node failure and transfers partitions from that node onto new nodes
    • Updates the “partition metadata system”
  • Secondary indexes are partitioned, but separately from the base table.
    • A “Log Propagator” watches the WAL on each node and updates secondary indexes. A change to the attribute that a secondary index is keyed on can move the index entry to a different index partition.
    • Write amplification: 5 secondary indexes are allowed, so in the worst case a single PutItem can touch 33 storage nodes: (1 write to the base table + 5 indexes × 2 index-partition writes each, deleting the old entry and inserting the new one) × 3 replicas = 33. (Worked out below.)
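Working the 33-node worst case out explicitly (the index limit and replication factor come from the notes above; the rest is arithmetic):

```python
replication_factor = 3        # one replica per AZ across three AZs
max_secondary_indexes = 5
writes_per_index_change = 2   # delete the old index entry + insert the new one,
                              # potentially on two different index partitions

logical_writes = 1 + max_secondary_indexes * writes_per_index_change  # 1 + 10 = 11
nodes_touched = logical_writes * replication_factor                   # 11 * 3 = 33
print(nodes_touched)  # 33
```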
  • Dynamo tables can have provisioned table capacity; RCUs and WCUs (per second)
    • Read capacity units are sized at 4 kB (write units at 1 kB), so operations on larger items may need more than one unit per request
    • Dynamo splits up the capacity units evenly between partitions
    • Token bucket sized at 300x the RCU/WCU value → the bucket fills up in 5 minutes when idle
    • Adaptive Capacity helps balance uneven load between partitions: a PID controller dynamically sets a multiplier on the rate at which each partition’s bucket is refilled. Doesn’t change the amount you’re billed. (Toy bucket sketch after this list.)
    • Auto Scaling handles sustained load changes instead, by adjusting provisioned capacity, so it does change your bill; CloudWatch alarms trigger the changes.
    • Didn’t go into what this translates to under the hood. Is a table replicated onto more nodes when more RCUs are assigned to it? How are writes scaled? A larger number of smaller partitions?
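A rough sketch of the token-bucket shape described in this list: tokens refill at the provisioned per-second rate, the bucket holds 300× that (about 5 minutes of idle accumulation), and Adaptive Capacity is modeled as a refill multiplier. The multiplier update itself is omitted here; the real system uses a PID controller.

```python
import time

class CapacityBucket:
    """Toy per-partition bucket: refill at `rate` units/sec, burst up to 300 * rate."""

    def __init__(self, provisioned_units_per_sec: float):
        self.rate = provisioned_units_per_sec
        self.capacity = 300 * provisioned_units_per_sec  # ~5 minutes of idle refill
        self.tokens = self.capacity
        self.refill_multiplier = 1.0  # Adaptive Capacity would tune this dynamically
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.rate * self.refill_multiplier,
        )
        self.last = now

    def try_consume(self, units: float) -> bool:
        """True if the request fits in the bucket, False if it should be throttled."""
        self._refill()
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False

# A partition provisioned at 100 RCU/s can serve a burst well above 100 reads
# in one second, as long as the 300x bucket has accumulated enough tokens.
bucket = CapacityBucket(provisioned_units_per_sec=100)
print(bucket.try_consume(units=250))  # True: served from accumulated burst capacity
```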
  • Table restore
    • Backups are stored in S3
    • WALs are written out to S3 directly (and then reclaimed) when they get too large
    • The B-tree is serialized out as well at a slower interval, but this is done “online”, so this isn’t strictly consistent (and isn’t a snapshot)
    • To restore to a point in time, go back to the latest B-tree “snapshot” before that time, then replay WAL records up to that point (sketch after this list). How does this get around inconsistency within the B-tree backup?
    • Auto-admin splits up partitions when they become too large, which makes restores more complicated. Solved by filtering by key-range per split partition.
    • On-demand backups are similar, but this first triggers a request to all storage nodes to upload their WALs; the backup is taken once all WALs for that instant (and before) have made it to S3.
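A sketch of the point-in-time restore logic described above, assuming the backup store hands us timestamped B-tree snapshots and WAL records (the shapes here are invented; how the online, non-snapshot B-tree copy is reconciled remains the open question in the notes).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WalRecord:
    timestamp: float
    key: str
    value: Optional[dict]  # None models a delete

def restore_to_point_in_time(snapshots, wal_records, target_time):
    """snapshots: list of (snapshot_time, {key: item}) tuples.
    wal_records: WalRecord list sorted by timestamp.
    Returns the reconstructed key/item state as of target_time."""
    # 1. Start from the newest B-tree "snapshot" taken at or before target_time.
    base_time, base_state = max(
        (s for s in snapshots if s[0] <= target_time), key=lambda s: s[0]
    )
    state = dict(base_state)
    # 2. Replay WAL records written after that snapshot, up to target_time.
    for rec in wal_records:
        if base_time < rec.timestamp <= target_time:
            if rec.value is None:
                state.pop(rec.key, None)
            else:
                state[rec.key] = rec.value
    return state
```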
  • Streams
    • All mutations to a table are put in a queue (SQS?) with no duplicates
    • Operations for a given key are always in order (per-partition ordering)
    • Storage nodes write asynchronously, directly to a stream shard. There’s a notion of a successful write, so storage nodes can wait for the stream infrastructure to fail over if necessary and then catch up. (Consumer sketch after this list.)
    • Average latency is tens of ms.
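A sketch of reading a table's change stream with the real boto3 `dynamodbstreams` client. The stream ARN is hypothetical (a real one comes from DescribeTable's LatestStreamArn on a table with streams enabled), and error and pagination handling are omitted.

```python
import boto3

streams = boto3.client("dynamodbstreams", region_name="us-east-1")

# Hypothetical ARN for illustration only.
stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/notes/stream/2021-07-13T00:00:00.000"

desc = streams.describe_stream(StreamArn=stream_arn)
for shard in desc["StreamDescription"]["Shards"]:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
    )["ShardIterator"]
    records = streams.get_records(ShardIterator=iterator)["Records"]
    # Within a shard, records for a given key appear in the order the writes
    # were applied on the base table's partition.
    for rec in records:
        print(rec["eventName"], rec["dynamodb"].get("Keys"))
```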
  • Global Tables
    • Multi-master, cross-region
    • This service is external to DynamoDB for the most part
    • Each region uses Streams to ship changes across the wire to all other regions
    • A bit more complexity here to handle changing numbers of shards in the stream when partition counts change
    • Conflict resolution
      • LWW (last writer wins; sketch after this list)
      • Timestamps are millisecond-resolution, with an extra three digits that correspond to an operation number for that millisecond. This implies that 1000 operations / ms is a hard limit.
      • The region ID is used to break ties
      • How is timestamp drift dealt with?
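A sketch of the last-writer-wins rule as described: compare (millisecond timestamp, per-millisecond operation counter, region ID) lexicographically and keep the larger one. The field names and the exact comparison order are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    timestamp_ms: int  # wall-clock milliseconds
    op_counter: int    # 0..999, operation number within that millisecond
    region_id: str     # tie-breaker between regions

    def sort_key(self):
        return (self.timestamp_ms, self.op_counter, self.region_id)

@dataclass
class ReplicatedItem:
    value: dict
    version: Version

def resolve(local: ReplicatedItem, remote: ReplicatedItem) -> ReplicatedItem:
    """Last writer wins: keep whichever write has the larger version tuple."""
    return max(local, remote, key=lambda item: item.version.sort_key())

a = ReplicatedItem({"body": "from us-east-1"}, Version(1626200000123, 7, "us-east-1"))
b = ReplicatedItem({"body": "from eu-west-1"}, Version(1626200000123, 7, "eu-west-1"))
print(resolve(a, b).value)  # same millisecond and counter, so the region ID breaks the tie
```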

Questions

  • What is the actual replication factor? How many acks?
  • During a write, what happens if the ack that the leader didn’t wait for is lost? How does the remaining node catch up?
  • How are the “Auto Admin”, “Metadata Service”, and “Log Propagator” services made fault-tolerant?