6.824 Lecture 7 - Fault Tolerance with Raft - Part 2

Election: the leader isn’t necessarily the server with the longest log
- The important thing is to not throw away committed entries
- The server with the longest log might contain a large tail of uncommitted entries
- Servers with shorter logs might contain committed entries that the long server doesn’t
- Here S1 can’t be allowed to become leader because 8 is committed
Election Restriction
- RequestVote only succeeds if the candidate’s log is at leasst as up to date as the voting server’s log
- In addition to the other restrictions (already voted, wrong term)
Rollback doesn’t have to happen one entry at a time
- Followers can send enough info to the leader so it can roll back more precisely
  - XTerm - term with conflicting entries
  - XIndex - first index with term XTerm
  - XLen - length of the (follower’s) log
- With those two pieces of data, leaders can roll things back a term at a time
- Other strategies are possible too (binary search, etc.), but this is the only one the paper hints at
- Here’s an example (assume S2 is the leader and S1 is catching up):
  - Case 1: S1 sends XTerm = 5. S2 doesn’t know this term, and so backs up to the final entry for the preceding term (4).
  - Case 2: S1 sends XTerm = 4. S2 has this term but not at that index, so backs up to the most recent 4 it shares with S1.
  - Case 3: S1 has nothing at the given index, so S2 backs up to S1’s final index.
A rebooted server rebuilds its state by replaying the log from the beginning. Snapshots make this better, but the core assumption is that a server reboot causes associated application state to be rebuilt from scratch.
Persistence
- Likely to be the perf bottleneck
- HD > SSD > Battery-backed DRAM
- write isn’t sufficient (page cache), write + fsync
- commitIndex and lastApplied can be volatile because they can be determined with the persisted state. Persisting them hurts perf with no real improvement to correctness. *
  
  commitIndex is volatile because Raft can figure out a correct value for it after a reboot using just the persistent state. Once a leader successfully gets a new log entry committed, it knows everything before that point is also committed. A follower that crashes and comes back up will be told about the right commitIndex whenever the current leader sends it an AE.
  - lastApplied starts at zero after a reboot because the basic Raft algorithm assumes the service (e.g., a key/value database) doesn’t keep any persistent state. Thus its state needs to be completely recreated by replaying all log entries. This is rather inefficient, of course, so many optimization ideas are possible.
Compaction / snapshots
- Application state is typically a lot smaller than the associated log
- All Raft nodes periodically ask the application for a snapshot of its state as of a certain point in the log
  - Is this commitIndex?
  - Does Raft enforce any structure or is the snapshot just a byte stream?
- Each node then persists this snapshot and throws away the log up to that point
- Followers that are behind the leader’s snapshot point receive InstallSnapshot RPCs instead of AppendEntries to get them up to date
- This affects modularity, because applications:
  - Need to generate snapshots on-demand
  - Need to replace their state entirely and absorb a snapshot when a new InstallSnapshot comes in
Correctness == Linearizability in the context of this lab
- A total order exists for all events in a linearizable system
  - If two events are non-overlapping, the earlier one must come first in the order
  - If a read sees the value from a write, the write must come first in the order
  - Every read sees the most recent write in the ordering
- Here’s the relevant section from
  Book
  Designing Data-Intensive Applications > Ordering