6.824 Lab 2 - Raft

2B: leader and follower code to append new log entries

  • Here’s a quick overview of things I need to add:
    • Start appends the entry to its local log and waits until the entry is committed
    • During regular heartbeats (batching for free), send all uncommitted entries to each follower
    • The leader increments commitIndex when a majority of servers (counting itself) have acknowledged an entry
    • Followers accept uncommitted entries in AppendEntries and append them to their local logs, incrementing commitIndex based on leaderCommit
    • Apply log entries to state machines when commitIndex > lastApplied
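
A rough sketch of the commit and apply rules listed above, under some assumptions: the log has a dummy entry at index 0, matchIndex includes the leader itself, and the names LogEntry, advanceCommitIndex, followerCommitIndex and entriesToApply are hypothetical helpers rather than anything from the lab skeleton. The leader-side check also includes the Raft paper's rule that a leader only directly commits entries from its own term.

```go
package main

import (
	"fmt"
	"sort"
)

// LogEntry is a hypothetical log record; Term is the term in which it was created.
type LogEntry struct {
	Term    int
	Command interface{}
}

// advanceCommitIndex returns the leader's new commitIndex.
// matchIndex holds, for every server (leader included), the highest log index
// known to be stored there. Assumes log[0] is a dummy entry.
func advanceCommitIndex(log []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// In ascending order, element (n-1)/2 is stored on a majority of servers.
	n := sorted[(len(sorted)-1)/2]
	// Only entries from the leader's current term are committed by counting.
	if n > commitIndex && log[n].Term == currentTerm {
		return n
	}
	return commitIndex
}

// followerCommitIndex applies the follower rule:
// commitIndex = min(leaderCommit, index of last new entry), never moving backwards.
func followerCommitIndex(leaderCommit, lastNewIndex, commitIndex int) int {
	if leaderCommit <= commitIndex {
		return commitIndex
	}
	if leaderCommit < lastNewIndex {
		return leaderCommit
	}
	return lastNewIndex
}

// entriesToApply returns the entries in (lastApplied, commitIndex] and the new lastApplied.
func entriesToApply(log []LogEntry, commitIndex, lastApplied int) ([]LogEntry, int) {
	if commitIndex <= lastApplied {
		return nil, lastApplied
	}
	return log[lastApplied+1 : commitIndex+1], commitIndex
}

func main() {
	log := []LogEntry{{0, nil}, {1, "a"}, {1, "b"}, {2, "c"}}
	commit := advanceCommitIndex(log, []int{3, 3, 3, 1, 0}, 0, 2)
	pending, applied := entriesToApply(log, commit, 0)
	fmt.Println(commit, len(pending), applied)
}
```

In a real implementation these would run with the Raft mutex held, and the entries returned by entriesToApply would be sent on the apply channel after the lock is released.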

2A: leader election and heartbeats ✅

  • The main implementation was fairly simple, but the edge cases are very hard to discover, debug, or even reason about.

  • As of this commit the 2A tests are passing, but only most of the time. 😬

  • I found a couple of issues with that implementation:

    • When an AppendEntries receiver sees a newer term and converts itself to a follower, it needs to set votedFor to -1 because it hasn’t voted for anyone in this newer term.
    • A RequestVote receiver must convert itself to a follower if it sees a newer term - this applies to leaders that were offline for a while and are coming back online.
    • Don’t perform any blocking operations when holding a lock (unless absolutely sure the code is deadlock-free).
      • I was performing a blocking channel enqueue while holding a lock
      • It was possible for the channel receiver to be waiting to acquire that same lock before reading off the channel.
  • Fixed these issues in this commit
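
The first two fixes above boil down to one rule that both RPC handlers can share. A minimal sketch, assuming a hypothetical observeTerm helper and simplified Raft fields (the real struct in the lab has more state, and the caller would hold the mutex):

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

const noVote = -1

// Raft holds only the fields this sketch needs.
type Raft struct {
	currentTerm int
	votedFor    int
	role        Role
}

// observeTerm is called by both the AppendEntries and RequestVote handlers
// with the term carried in the request. If the term is newer, the server
// adopts it, forgets its vote (it has not voted in the new term yet), and
// becomes a follower. It reports whether a step-down happened.
func (rf *Raft) observeTerm(term int) bool {
	if term <= rf.currentTerm {
		return false
	}
	rf.currentTerm = term
	rf.votedFor = noVote
	rf.role = Follower
	return true
}

func main() {
	// A stale leader coming back online after being partitioned.
	rf := &Raft{currentTerm: 3, votedFor: 1, role: Leader}
	steppedDown := rf.observeTerm(5)
	fmt.Println(steppedDown, rf.currentTerm, rf.votedFor, rf.role == Follower)
}
```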

  • Seeing another (sporadic) failure with TestReElection2A where it seems like a leader is all alone, doesn’t (yet) know it isn’t the leader anymore, and gives the test the wrong answer. The paper doesn’t address this directly as far as I can tell, but this is possibly a hint:

    Second, a leader must check whether it has been deposed before processing a read-only request

    • “Am I a leader?” could be a read-only request here, and the leader sends off a round of heartbeats before answering the question
  • One other issue I ran into: acquiring a lock at the beginning of a function and releasing it partway through (not at the end, so defer isn’t applicable). In this situation every early return inside the locked section needs to release the lock as well.
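
One way to sidestep both locking pitfalls noted above is to move the locked section into a small helper so that defer applies again, and to do the blocking channel send only after the helper has returned. This is a sketch, not the code from the commits; snapshotIfLeader, notify and the field names are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Raft holds only the fields this sketch needs.
type Raft struct {
	mu          sync.Mutex
	currentTerm int
	isLeader    bool
	applyCh     chan int
}

// snapshotIfLeader is the "locked section" pulled out into its own function.
// Because the lock's scope is now the whole function, defer covers every
// return path, including the early one.
func (rf *Raft) snapshotIfLeader() (term int, ok bool) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if !rf.isLeader {
		return 0, false // early return, lock still released by defer
	}
	return rf.currentTerm, true
}

// notify copies what it needs under the lock, then performs the blocking
// send with the lock released, so a receiver that needs the same lock
// cannot deadlock against it.
func (rf *Raft) notify() bool {
	term, ok := rf.snapshotIfLeader()
	if !ok {
		return false
	}
	rf.applyCh <- term // may block, but rf.mu is not held here
	return true
}

func main() {
	rf := &Raft{currentTerm: 4, isLeader: true, applyCh: make(chan int)}
	done := make(chan int)
	go func() {
		// The receiver takes the same lock before reading off the channel.
		rf.mu.Lock()
		_ = rf.currentTerm
		rf.mu.Unlock()
		done <- <-rf.applyCh
	}()
	sent := rf.notify()
	got := <-done
	fmt.Println(sent, got)
}
```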

  • Some failures are rare - the one that this fixes only showed up after 73 runs!

  • Seeing one sporadic failure (showed up after ~350 runs) where two leaders are elected in the same term:

    Test (2A): multiple elections ...
    2021/08/04 15:45:58 1: Election timer expired; becoming a candidate; old status: follower
    2021/08/04 15:45:58 3: Election timer expired; becoming a candidate; old status: follower
    2021/08/04 15:45:58 1: Election succeeded; I'm a leader now
    2021/08/04 15:45:58 3: Election succeeded; I'm a leader now
    --- FAIL: TestManyElections2A (0.47s)
        config.go:389: term 2 has 2 (>1) leaders
    
    • Can’t find any issues with the implementation; the only potential cause is this - time will tell!
    • Nope, it’s still happening 😞
    • Why does this say “term 2”? Shouldn’t it be term 1? 🤔
    • I think I have it - let’s see if this works
  • Different failure this time:

    Test (2A): multiple elections ...
    2021/08/05 17:07:48 3: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:48 3: Election succeeded; I'm a leader now (term: 1)
    DISCONNECTING  5 2 0
    2021/08/05 17:07:48 0: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:49 6: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:49 6: Election succeeded; I'm a leader now (term: 2)
    2021/08/05 17:07:49 2: Election timer expired; becoming a candidate; old status: follower
    DISCONNECTING  5 3 1
    2021/08/05 17:07:49 5: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:49 0: Election timed out; staying a candidate
    2021/08/05 17:07:49 0: Election succeeded; I'm a leader now (term: 3)
    2021/08/05 17:07:49 2: Election timed out; staying a candidate
    DISCONNECTING  1 0 4
    2021/08/05 17:07:49 5: Election timed out; staying a candidate
    2021/08/05 17:07:49 1: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:50 2: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:50 2: Election succeeded; I'm a leader now (term: 4)
    2021/08/05 17:07:50 5: Election timed out; staying a candidate
    2021/08/05 17:07:50 4: Election timer expired; becoming a candidate; old status: follower
    DISCONNECTING  2 0 5
    2021/08/05 17:07:50 1: Election timed out; staying a candidate
    2021/08/05 17:07:50 4: Election timed out; staying a candidate
    2021/08/05 17:07:50 4: Election succeeded; I'm a leader now (term: 5)
    2021/08/05 17:07:50 5: Election timer expired; becoming a candidate; old status: follower
    DISCONNECTING  3 3 1
    2021/08/05 17:07:50 1: Election timed out; staying a candidate
    2021/08/05 17:07:51 5: Election timed out; staying a candidate
    2021/08/05 17:07:51 5: Election succeeded; I'm a leader now (term: 6)
    DISCONNECTING  1 0 5
    2021/08/05 17:07:51 1: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:51 1: Election timed out; staying a candidate
    2021/08/05 17:07:52 1: Election timed out; staying a candidate
    2021/08/05 17:07:52 1: Election timed out; staying a candidate
    2021/08/05 17:07:53 1: Election timed out; staying a candidate
    2021/08/05 17:07:53 1: Election timed out; staying a candidate
    2021/08/05 17:07:54 1: Election timed out; staying a candidate
    2021/08/05 17:07:54 1: Election timed out; staying a candidate
    2021/08/05 17:07:55 1: Election timed out; staying a candidate
    2021/08/05 17:07:55 0: Election timer expired; becoming a candidate; old status: follower
    2021/08/05 17:07:55 1: Election timed out; staying a candidate
    2021/08/05 17:07:56 0: Election timed out; staying a candidate
    --- FAIL: TestManyElections2A (8.09s)
        config.go:400: expected one leader, got none
    
    • If I had to guess, previous leaders are stuck in a loop for older terms without stepping down voluntarily
    • Newer leaders have been disconnected before AppendEntries messages can go out
    • Attempting to fix this by having leaders step down if they detect partitions during heartbeats.
  • This seems to have fixed it!

    Total runs:	1000
    Successes:	1000
    Failures:	0
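
For reference, a sketch of what "step down if a partition is detected during heartbeats" could look like. It assumes the leader counts successful replies per heartbeat round, which may differ from the actual fix; afterHeartbeatRound and the peers field are hypothetical names.

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Raft holds only the fields this sketch needs.
type Raft struct {
	role  Role
	peers int // total number of servers, including this one
}

// afterHeartbeatRound is called once a heartbeat round has finished or timed
// out. acks is the number of followers that replied. If the leader plus the
// repliers do not form a majority, the leader assumes it is partitioned and
// steps down. The caller is assumed to hold the Raft mutex.
func (rf *Raft) afterHeartbeatRound(acks int) {
	if rf.role != Leader {
		return
	}
	reachable := acks + 1 // the leader counts itself
	if reachable*2 <= rf.peers {
		rf.role = Follower
	}
}

func main() {
	rf := &Raft{role: Leader, peers: 5}
	rf.afterHeartbeatRound(2) // 3 of 5 reachable
	fmt.Println(rf.role == Leader)
	rf.afterHeartbeatRound(1) // 2 of 5 reachable
	fmt.Println(rf.role == Leader)
}
```

A single slow round can trigger a spurious step-down with this rule; requiring several consecutive failed rounds would be a more conservative variant.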
    

Appendix

  • I’m running tests repeatedly until they fail with loop:
    loop --summary --num 1000 --until-fail -- "go test -run 2A"
    