6.824 Lab 2 - Raft
- Paper: Raft - In Search of an Understandable Consensus Algorithm
- http://nil.lcs.mit.edu/6.824/2020/labs/lab-raft.html
2B: leader and follower code to append new log entries
- Here’s a quick overview of things I need to add:
  - `Start` appends the entry to its local log and waits until the entry is committed
  - During regular heartbeats (batching for free), send all uncommitted entries to each follower
  - The leader increments `commitIndex` when a majority of followers have acknowledged an entry
  - Followers accept uncommitted entries in `AppendEntries` and append them to their local logs, incrementing `commitIndex` based on `leaderCommit`
  - Apply log entries to state machines when `commitIndex` > `lastApplied`
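The commit rules in the overview above can be sketched as follows. This is a minimal illustration, not the lab's code: `LogEntry`, `majorityMatchIndex`, `advanceCommitIndex`, and `followerCommitIndex` are hypothetical names, `matchIndex` is assumed to include the leader's own last log index, and `entries[0]` is assumed to be a sentinel so that real indices start at 1. The check that the entry belongs to the current term follows the rule in Figure 2 of the Raft paper.

```go
package main

import (
	"fmt"
	"sort"
)

// LogEntry is a hypothetical log record; the lab's real struct may differ.
type LogEntry struct {
	Term    int
	Command interface{}
}

// majorityMatchIndex returns the highest log index stored on a majority of
// servers. matchIndex must be non-empty and hold one value per server,
// including the leader's own last log index.
func majorityMatchIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// In ascending order, the value at position (n-1)/2 is matched or
	// exceeded by at least floor(n/2)+1 servers.
	return sorted[(len(sorted)-1)/2]
}

// advanceCommitIndex returns the leader's new commitIndex. An index only
// counts if its entry was created in currentTerm.
func advanceCommitIndex(entries []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
	n := majorityMatchIndex(matchIndex)
	if n > commitIndex && n < len(entries) && entries[n].Term == currentTerm {
		return n
	}
	return commitIndex
}

// followerCommitIndex applies the follower rule:
// commitIndex = min(leaderCommit, index of the last new entry).
func followerCommitIndex(commitIndex, leaderCommit, lastNewIndex int) int {
	if leaderCommit <= commitIndex {
		return commitIndex
	}
	if leaderCommit < lastNewIndex {
		return leaderCommit
	}
	return lastNewIndex
}

func main() {
	entries := []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}, {Term: 2}}
	fmt.Println(advanceCommitIndex(entries, []int{3, 3, 1}, 1, 2))
	fmt.Println(followerCommitIndex(1, 5, 3))
}
```

Sorting a copy of `matchIndex` keeps the helper free of side effects, at the cost of an allocation per call.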
2A: leader election and heartbeats ✅
- The main implementation was fairly simple, but the edge cases are very hard to discover, debug, or even reason about.
- As of this commit the 2A tests are passing, but only most of the time. 😬
- I found a couple of issues with that implementation:
  - When an `AppendEntries` receiver sees a newer term and converts itself to a follower, it needs to set `votedFor` to -1 because it hasn’t voted for anyone in this newer term.
  - A `RequestVote` receiver must convert itself to a follower if it sees a newer term. This applies to leaders that were offline for a while and are coming back online.
  - Don’t perform any blocking operations when holding a lock (unless absolutely sure the code is deadlock-free).
    - I was performing a blocking channel enqueue while holding a lock.
    - It was possible for the channel receiver to be waiting to acquire that same lock before reading off the channel.
- Fixed these issues in this commit
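The term-handling and locking issues listed above can be illustrated together. This is a sketch under stated assumptions rather than the code from the commit: the `Raft` struct here carries only the fields the example needs, and `observeTerm` and `stepDownCh` are hypothetical names. The channel is assumed to be buffered with capacity 1, and the `select` with a `default` case makes the send non-blocking so it is safe to perform while holding the lock.

```go
package main

import (
	"fmt"
	"sync"
)

type role int

const (
	follower role = iota
	candidate
	leader
)

// Raft holds only the fields this sketch needs; names are hypothetical.
type Raft struct {
	mu          sync.Mutex
	currentTerm int
	votedFor    int // -1 means no vote cast in currentTerm
	role        role
	stepDownCh  chan struct{} // assumed buffered with capacity 1
}

// observeTerm is meant to be called with rf.mu held by every RPC handler
// (AppendEntries and RequestVote alike). It reports whether the server
// stepped down because it saw a newer term.
func (rf *Raft) observeTerm(term int) bool {
	if term <= rf.currentTerm {
		return false
	}
	rf.currentTerm = term
	rf.votedFor = -1 // no vote has been cast in the new term yet
	rf.role = follower
	// Non-blocking send: if a signal is already pending, drop this one
	// instead of blocking while rf.mu is held.
	select {
	case rf.stepDownCh <- struct{}{}:
	default:
	}
	return true
}

func main() {
	rf := &Raft{currentTerm: 3, votedFor: 2, role: leader, stepDownCh: make(chan struct{}, 1)}
	rf.mu.Lock()
	stepped := rf.observeTerm(5)
	rf.mu.Unlock()
	fmt.Println(stepped, rf.currentTerm, rf.votedFor, rf.role == follower)
}
```

Dropping a duplicate signal is acceptable here only because the receiver is assumed to re-read the shared state under the lock after waking up, so one pending signal carries as much information as several.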
- Seeing another (sporadic) failure with `TestReElection2A`, where it seems like a leader is all alone and doesn’t (yet) know it isn’t the leader anymore, so it gives the test the wrong answer. The paper doesn’t address this directly as far as I can tell, but this is possibly a hint: “Second, a leader must check whether it has been deposed before processing a read-only request”
  - “Am I a leader?” could be a read-only request here, and the leader sends off a round of heartbeats before answering the question
- One other issue I ran into: acquiring a lock at the beginning of a function and releasing it partway through (not at the end, so `defer` isn’t applicable). In this situation, any early-return conditionals within the locked section need to release the lock as well.
- Some failures are rare - the one that this fixes only showed up after 73 runs!
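One way to sidestep the early-return locking issue described above is to move the locked section into its own small function, where `defer` applies again and covers every return path. The sketch below is an assumption about how that could look, not the fix from the commit; `server`, `snapshotForHeartbeat`, and `sendHeartbeats` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// server holds only the fields this sketch needs; names are hypothetical.
type server struct {
	mu          sync.Mutex
	currentTerm int
	isLeader    bool
}

// snapshotForHeartbeat is the locked section pulled out into its own
// function. The deferred Unlock runs on every return, early or not.
func (s *server) snapshotForHeartbeat() (term int, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.isLeader {
		return 0, false // early return, lock still released
	}
	return s.currentTerm, true
}

// sendHeartbeats does the slow part only after the lock has been released.
func (s *server) sendHeartbeats(send func(term int)) bool {
	term, ok := s.snapshotForHeartbeat()
	if !ok {
		return false
	}
	send(term) // stands in for an RPC; runs without holding s.mu
	return true
}

func main() {
	s := &server{currentTerm: 4, isLeader: true}
	sent := s.sendHeartbeats(func(term int) {
		fmt.Println("heartbeat for term", term)
	})
	fmt.Println("sent:", sent)
}
```

The trade-off is that the state read under the lock may be stale by the time the slow work runs, so replies still need to be checked against the current term.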
- Seeing one sporadic failure (showed up after ~350 runs) where two leaders are elected in the same term:

      Test (2A): multiple elections ...
      2021/08/04 15:45:58 1: Election timer expired; becoming a candidate; old status: follower
      2021/08/04 15:45:58 3: Election timer expired; becoming a candidate; old status: follower
      2021/08/04 15:45:58 1: Election succeeded; I'm a leader now
      2021/08/04 15:45:58 3: Election succeeded; I'm a leader now
      --- FAIL: TestManyElections2A (0.47s)
          config.go:389: term 2 has 2 (>1) leaders
- Different failure this time:

      Test (2A): multiple elections ...
      2021/08/05 17:07:48 3: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:48 3: Election succeeded; I'm a leader now (term: 1)
      DISCONNECTING 5 2 0
      2021/08/05 17:07:48 0: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:49 6: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:49 6: Election succeeded; I'm a leader now (term: 2)
      2021/08/05 17:07:49 2: Election timer expired; becoming a candidate; old status: follower
      DISCONNECTING 5 3 1
      2021/08/05 17:07:49 5: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:49 0: Election timed out; staying a candidate
      2021/08/05 17:07:49 0: Election succeeded; I'm a leader now (term: 3)
      2021/08/05 17:07:49 2: Election timed out; staying a candidate
      DISCONNECTING 1 0 4
      2021/08/05 17:07:49 5: Election timed out; staying a candidate
      2021/08/05 17:07:49 1: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:50 2: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:50 2: Election succeeded; I'm a leader now (term: 4)
      2021/08/05 17:07:50 5: Election timed out; staying a candidate
      2021/08/05 17:07:50 4: Election timer expired; becoming a candidate; old status: follower
      DISCONNECTING 2 0 5
      2021/08/05 17:07:50 1: Election timed out; staying a candidate
      2021/08/05 17:07:50 4: Election timed out; staying a candidate
      2021/08/05 17:07:50 4: Election succeeded; I'm a leader now (term: 5)
      2021/08/05 17:07:50 5: Election timer expired; becoming a candidate; old status: follower
      DISCONNECTING 3 3 1
      2021/08/05 17:07:50 1: Election timed out; staying a candidate
      2021/08/05 17:07:51 5: Election timed out; staying a candidate
      2021/08/05 17:07:51 5: Election succeeded; I'm a leader now (term: 6)
      DISCONNECTING 1 0 5
      2021/08/05 17:07:51 1: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:51 1: Election timed out; staying a candidate
      2021/08/05 17:07:52 1: Election timed out; staying a candidate
      2021/08/05 17:07:52 1: Election timed out; staying a candidate
      2021/08/05 17:07:53 1: Election timed out; staying a candidate
      2021/08/05 17:07:53 1: Election timed out; staying a candidate
      2021/08/05 17:07:54 1: Election timed out; staying a candidate
      2021/08/05 17:07:54 1: Election timed out; staying a candidate
      2021/08/05 17:07:55 1: Election timed out; staying a candidate
      2021/08/05 17:07:55 0: Election timer expired; becoming a candidate; old status: follower
      2021/08/05 17:07:55 1: Election timed out; staying a candidate
      2021/08/05 17:07:56 0: Election timed out; staying a candidate
      --- FAIL: TestManyElections2A (8.09s)
          config.go:400: expected one leader, got none

  - If I had to guess, previous leaders are stuck in a loop for older terms without stepping down voluntarily
  - Newer leaders have been disconnected before `AppendEntries` messages can go out
  - Attempting to fix this by having leaders step down if they detect partitions during heartbeats.
- This seems to have fixed it!

      Total runs: 1000
      Successes: 1000
      Failures: 0
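The idea of a leader stepping down when its heartbeats reveal a partition could be sketched like this. Everything here is an assumption for illustration: `hasQuorum`, `leaderState`, `afterHeartbeatRound`, and the `maxMissed` tolerance are hypothetical and do not come from the notes or the lab code. The leader counts itself as one acknowledgement.

```go
package main

import "fmt"

// hasQuorum reports whether the peer acks plus the leader's own implicit
// ack form a strict majority of the cluster.
func hasQuorum(acks, clusterSize int) bool {
	return acks+1 > clusterSize/2
}

// leaderState tracks consecutive heartbeat rounds that missed a majority.
type leaderState struct {
	isLeader     bool
	missedRounds int
}

// afterHeartbeatRound updates the state after one round of heartbeats.
// maxMissed is a hypothetical tolerance: the leader steps down once that
// many consecutive rounds have failed to reach a majority.
func (l *leaderState) afterHeartbeatRound(acks, clusterSize, maxMissed int) {
	if !l.isLeader {
		return
	}
	if hasQuorum(acks, clusterSize) {
		l.missedRounds = 0
		return
	}
	l.missedRounds++
	if l.missedRounds >= maxMissed {
		l.isLeader = false // likely partitioned: stop claiming leadership
	}
}

func main() {
	l := &leaderState{isLeader: true}
	for round := 1; round <= 3; round++ {
		l.afterHeartbeatRound(0, 3, 2) // no peer replied this round
		fmt.Println("round", round, "still leader:", l.isLeader)
	}
}
```

Tolerating more than one missed round trades slower detection of a real partition for fewer spurious step-downs when a single round of replies is merely delayed.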
Appendix
- I’m running tests repeatedly until they fail with `loop`:

      loop --summary --num 1000 --until-fail -- "go test -run 2A"