6.824 Lecture 7 - Fault Tolerance with Raft - Part 2
- Video: https://www.youtube.com/watch?v=4r8Mz3MMivY
- FAQ: http://nil.lcs.mit.edu/6.824/2020/papers/raft2-faq.txt
-
Election: the leader isn’t necessarily the server with the longest log
-
Election Restriction
RequestVote
only succeeds if the candidate’s log is at leasst as up to date as the voting server’s log- In addition to the other restrictions (already voted, wrong term)
-
Rollback doesn’t have to happen one entry at a time
- Followers can send enough info to the leader so it can roll back more precisely
- XTerm - term with conflicting entries
- XIndex - first index with term XTerm
- XLen - length of the (follower’s) log
- With those two pieces of data, leaders can roll things back a term at a time
- Other strategies are possible too (binary search, etc.), but this is the only one the paper hints at
- Here’s an example (assume S2 is the leader and S1 is catching up):
- Case 1: S1 sends XTerm = 5. S2 doesn’t know this term, and so backs up to the final entry for the preceding term (4).
- Case 2: S1 sends XTerm = 4. S2 has this term but not at that index, so backs up to the most recent 4 it shares with S1.
- Case 3: S1 has nothing at the given index, so S2 backs up to S1’s final index.
- Followers can send enough info to the leader so it can roll back more precisely
-
A rebooted server rebuilds its state by replaying the log from the beginning. Snapshots make this better, but the core assumption is that a server reboot causes associated application state to be rebuilt from scratch.
-
Persistence
-
Likely to be the perf bottleneck
-
HD > SSD > Battery-backed DRAM
-
write
isn’t sufficient (page cache),write
+fsync
-
commitIndex
andlastApplied
can be volatile because they can be determined with the persisted state. Persisting them hurts perf with no real improvement to correctness. *commitIndex
is volatile because Raft can figure out a correct value for it after a reboot using just the persistent state. Once a leader successfully gets a new log entry committed, it knows everything before that point is also committed. A follower that crashes and comes back up will be told about the right commitIndex whenever the current leader sends it an AE.-
lastApplied
starts at zero after a reboot because the basic Raft algorithm assumes the service (e.g., a key/value database) doesn’t keep any persistent state. Thus its state needs to be completely recreated by replaying all log entries. This is rather inefficient, of course, so many optimization ideas are possible.
-
-
-
Compaction / snapshots
- Application state is typically a lot smaller than the associated log
- All Raft nodes periodically ask the application for a snapshot of its state as of a certain point in the log
- Is this
commitIndex
? - Does Raft enforce any structure or is the snapshot just a byte stream?
- Is this
- Each node then persists this snapshot and throws away the log up to that point
- Followers that are behind the leader’s snapshot point receive
InstallSnapshot
RPCs instead ofAppendEntries
to get them up to date - This affects modularity, because applications:
- Need to generate snapshots on-demand
- Need to replace their state entirely and absorb a snapshot when a new
InstallSnapshot
comes in
-
Correctness == Linearizability in the context of this lab
- A total order exists for all events in a linearizable system
- If two events are non-overlapping, the earlier one must come first in the order
- If a read sees the value from a write, the write must come first in the order
- Every read sees the most recent write in the ordering
- Here’s the relevant section from BookDesigning Data-Intensive Applications > Ordering
- A total order exists for all events in a linearizable system