Single leader per partition. One replica per partition per AZ.
“The leader is always up to date” (?)
GetItem: The “request router” performs auth and then reads from a storage node that contains data for that partition; this is an “eventually consistent read”.
PutItem: The router writes to the leader (for the partition), which replicates to two other nodes, waiting for at least one ack.
Thousands of request routers, thousands of storage nodes. Multi-AZ.
Request routers are stateless.
Request routers call out to the “Partition Metadata System” to find the right leader for a partition.
Each Dynamo table has a primary hash key; the hash space is split into contiguous ranges, one per partition (rough routing sketch below).
Storage nodes have a B-tree with a WAL.
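To make the read/write path above concrete for myself, here’s a rough sketch of a request router resolving a key to a partition and its nodes. Everything here (PartitionMetadata, StorageNode, the MD5-based hashing, inlined replication) is my own simplification, not DynamoDB’s actual implementation.

```python
import bisect
import hashlib
import random

class StorageNode:
    """Stand-in for a storage node; a real node holds a B-tree plus a WAL."""
    def __init__(self, name):
        self.name, self.data = name, {}
    def read(self, table, key):
        return self.data.get((table, key))
    def write(self, table, key, value):
        self.data[(table, key)] = value
        return True

class PartitionMetadata:
    """Maps contiguous hash ranges to each partition's leader and replicas
    (the role the "Partition Metadata System" plays for the routers)."""
    def __init__(self, partitions):
        # partitions: [(range_start_hash, leader, [replicas])], sorted by range start
        self.starts = [p[0] for p in partitions]
        self.partitions = partitions
    def lookup(self, table, key):
        h = int.from_bytes(hashlib.md5(f"{table}:{key}".encode()).digest()[:8], "big")
        i = max(0, bisect.bisect_right(self.starts, h) - 1)
        _, leader, replicas = self.partitions[i]
        return leader, replicas

def get_item(meta, table, key):
    # Eventually consistent read: any storage node holding the partition will do.
    leader, replicas = meta.lookup(table, key)
    return random.choice([leader] + replicas).read(table, key)

def put_item(meta, table, key, value):
    # Writes go through the partition's leader; replication is inlined here,
    # whereas the real leader replicates and waits for acks (see notes above).
    leader, replicas = meta.lookup(table, key)
    for node in [leader] + replicas:
        node.write(table, key, value)
    return True

nodes = [StorageNode(f"node-{i}") for i in range(3)]
meta = PartitionMetadata([(0, nodes[0], nodes[1:])])  # one partition, 3 replicas
put_item(meta, "users", "alice", {"plan": "pro"})
print(get_item(meta, "users", "alice"))
```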
“Auto Admin” is a control plane service that:
Detects node failure and transfers partitions from that node onto new nodes
Updates the “Partition Metadata System”
Secondary indexes are partitioned, but separately from the base table.
A “Log Propagator” watches the WAL on each node and updates the secondary indexes. A change to the attribute an index is keyed on can move that item’s index entry to a different index partition (a delete in the old partition plus an insert in the new one).
Write amplification: 5 secondary indexes are allowed, so in the worst case a single PutItem can touch 33 different storage nodes: (1 partition for the base-table put + 2 partitions per index move × 5 indexes) × 3 replicas = 33.
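Quick sanity check on that 33, under the stated assumptions (3-way replication, each index-key change touching two index partitions):

```python
base_put = 1                    # base-table partition touched by the PutItem
indexes = 5                     # maximum number of secondary indexes
partitions_per_index_move = 2   # delete from old index partition, insert into new
replicas = 3                    # copies of every partition

print((base_put + indexes * partitions_per_index_move) * replicas)  # 33
```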
Dynamo tables can have provisioned capacity: read capacity units (RCUs) and write capacity units (WCUs), both per second.
Capacity units are sized at 4 kB, so an operation on a larger item may need more than one unit per request.
Dynamo splits up the capacity units evenly between partitions
Token bucket sized at 300x the RCU/WCU value → the bucket fills up in 5 minutes when idle
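A minimal token-bucket sketch of how I read this: capacity is 300x the per-second rate, and the refill rate carries a multiplier hook for the adaptive-capacity note that follows. Names and structure are mine, not DynamoDB’s.

```python
import time

class PartitionTokenBucket:
    """Per-partition throttle: holds up to 300x the provisioned per-second
    rate, so an idle bucket refills fully in 300 s (5 minutes)."""
    BURST_SECONDS = 300

    def __init__(self, provisioned_per_second, refill_multiplier=1.0):
        self.rate = provisioned_per_second
        self.multiplier = refill_multiplier      # adaptive capacity can raise this
        self.capacity = provisioned_per_second * self.BURST_SECONDS
        self.tokens = self.capacity
        self.last = time.monotonic()

    def try_consume(self, units):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate * self.multiplier)
        self.last = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False                             # request would be throttled

bucket = PartitionTokenBucket(provisioned_per_second=100)   # e.g. 100 RCUs
print(bucket.try_consume(2))   # a read of ~8 kB costs 2 units
```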
Adaptive Capacity helps balance uneven load between partitions: a PID controller dynamically sets a multiplier on the rate at which each partition’s bucket is refilled. Doesn’t change the amount you’re billed.
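The talk didn’t describe the controller itself, so this is just a generic PID loop to illustrate the idea; the error signal (observed throttle rate vs. a target of zero), the gains, and the clamping are all made up.

```python
class RefillMultiplierPID:
    """Generic PID loop that nudges a partition's refill multiplier."""
    def __init__(self, kp=2.0, ki=0.5, kd=0.1, target_throttle_rate=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target = target_throttle_rate
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_throttle_rate, dt):
        # Positive error = the partition is throttled more than we want,
        # so the output pushes the refill multiplier up.
        error = observed_throttle_rate - self.target
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = RefillMultiplierPID()
multiplier = 1.0
multiplier = max(1.0, multiplier + pid.update(observed_throttle_rate=0.2, dt=1.0))
```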
Auto Scaling does something similar but targets sustained spikes, and it does change what you’re billed; CloudWatch alarms are used to trigger the capacity changes.
Didn’t go into what this translates to under the hood. Is a table replicated onto more nodes when more RCUs are assigned to it? How are writes scaled? A larger number of smaller partitions?
Table restore
Backups are stored in S3
WALs are written out to S3 directly (and then reclaimed locally) when they get too large.
The B-tree is serialized out as well at a slower interval, but this is done “online” while writes continue, so it isn’t strictly consistent (and isn’t a point-in-time snapshot).
To restore to a point in time, you go back to the latest B-tree “snapshot” before that time and then replay WALs up to that point. How does this get around inconsistency within the B-tree backup?
Auto Admin splits partitions when they become too large, which makes restores more complicated; this is solved by filtering each split partition’s restored data by key range (see the restore sketch below).
On-demand backups are similar, but this first triggers a request to all storage nodes to upload their WALs; the backup is taken once all WALs for that instant (and before) have made it to S3.
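Rough sketch of the point-in-time restore logic for one (possibly split) partition, as I understood it: start from the latest serialized B-tree before the target time, replay WAL records up to it, and keep only keys in the partition’s hash range. The data layout, timestamps, and key_hash function are my assumptions; I’ve also assumed replay starts at the B-tree copy’s timestamp, which leaves the consistency question above open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class WalRecord:
    timestamp: float
    key: str
    value: Optional[object]        # None means the key was deleted

def restore_partition(snapshot: Dict[str, object],
                      snapshot_time: float,
                      wal_segments: List[List[WalRecord]],
                      restore_time: float,
                      key_range: Tuple[int, int],
                      key_hash: Callable[[str], int]) -> Dict[str, object]:
    lo, hi = key_range
    # Start from the online B-tree copy, filtered to this partition's key range
    # (handles partitions that Auto Admin split after the backup was taken).
    items = {k: v for k, v in snapshot.items() if lo <= key_hash(k) < hi}
    # Replay WAL records between the copy time and the restore point.
    for segment in wal_segments:
        for rec in segment:
            if not (snapshot_time <= rec.timestamp <= restore_time):
                continue
            if not (lo <= key_hash(rec.key) < hi):
                continue
            if rec.value is None:
                items.pop(rec.key, None)
            else:
                items[rec.key] = rec.value
    return items
```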
Streams
All mutations to a table are put in a queue (SQS?) with no duplicates
Operations for a given key are always in order (per-partition ordering)
Storage nodes write asynchronously, directly to a stream shard. Writes to the stream are acknowledged, so if the stream infrastructure needs to fail over, a storage node can wait for it and then catch up.
Average latency is tens of ms.
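How I imagine the storage-node side of this (the shard API, sequence numbers, and dedup scheme are all my guesses): records go out asynchronously in WAL order, so per-key ordering falls out of per-partition ordering, and an increasing sequence number lets the shard drop duplicates after retries or a failover.

```python
import itertools

class StreamShard:
    """Stand-in for a stream shard that stores records and dedupes by sequence number."""
    def __init__(self):
        self.records = {}
        self.available = True
    def append(self, seq, record):
        if not self.available:                  # e.g. mid-failover
            return False
        self.records.setdefault(seq, record)    # duplicate seqs are ignored
        return True

class StreamPublisher:
    """A storage node's async publisher for its partition's stream shard."""
    def __init__(self, shard):
        self.shard = shard
        self.seq = itertools.count(1)
        self.pending = []                       # records not yet acked by the shard

    def on_wal_record(self, record):
        self.pending.append((next(self.seq), record))
        self.flush()

    def flush(self):
        # Retries resend the same (seq, record) pairs; once the shard (or its
        # replacement after failover) acks them, they're dropped from pending.
        while self.pending and self.shard.append(*self.pending[0]):
            self.pending.pop(0)

shard = StreamShard()
pub = StreamPublisher(shard)
pub.on_wal_record({"op": "PUT", "key": "alice"})
```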
Global Tables
Multi-master, cross-region
This service is external to DynamoDB for the most part
Each region uses streams to ship its changes across the wire to all other regions.
There’s a bit more complexity here to handle the stream’s shard count changing when partition counts change.
Conflict resolution
Last-writer-wins (LWW)
Timestamps are millisecond-resolution, with an extra three digits that correspond to an operation number for that millisecond. This implies that 1000 operations / ms is a hard limit.
The region ID is used to break ties
How is timestamp drift dealt with?
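A tiny sketch of the comparator implied by the notes above (field names and the exact tuple ordering are my assumptions): compare the millisecond timestamp, then the per-millisecond operation number, then break ties with the region ID.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteVersion:
    timestamp_ms: int
    op_number: int       # 0-999, hence the ~1000 ops/ms ceiling noted above
    region_id: int       # tiebreaker

    def sort_key(self):
        return (self.timestamp_ms, self.op_number, self.region_id)

def resolve(local, remote):
    """Last-writer-wins: keep whichever (value, version) pair has the larger version."""
    return local if local[1].sort_key() >= remote[1].sort_key() else remote

a = ("v-from-us-east-1", WriteVersion(1_700_000_000_000, 7, region_id=1))
b = ("v-from-eu-west-1", WriteVersion(1_700_000_000_000, 7, region_id=2))
assert resolve(a, b)[0] == "v-from-eu-west-1"   # same ms + op number, region breaks the tie
```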
Questions
What is the actual replication factor? How many acks?
During a write, what happens if the ack that the leader didn’t wait for is lost? How does the remaining node catch up?
How are the “Auto Admin”, “Metadata Service”, and “Log Propagator” services made fault-tolerant?