SWIM — Scalable Weakly-consistent Infection-style Process Group Membership Protocol
- https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
- Questions
- How does Consul achieve (linearizable) consensus using an eventually consistent protocol?
- It doesn’t. It uses SWIM to send messages across the cluster and for membership, but it implements Raft for the actual consensus
1-2: Introduction + Prior Work
- Weakly-consistent cluster membership at scale
- Traditional heartbeat-based systems are unscalable
- Network load grows quadratically if every heartbeat is sent to every other node
- Which leads to more dropped packets
- Which leads to a spike in false positives and/or time to first detection of membership changes
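- As a rough illustration (my numbers, not the paper’s): with $n$ nodes each heartbeating every other node once per period, the network carries $n(n-1) \approx n^2$ messages per period, so growing from 100 to 1,000 nodes multiplies cluster-wide traffic by ~100x while each node’s per-period receive load grows 10x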
- Similar issues arise when heartbeats are sent to a set of centralized servers (hotspots) or along a ring (unpredictable time-to-first-detection; multiple simultaneous failures can cause lost updates)
- SWIM is a peer-to-peer system where both the per-member network load and the expected time to first detection of a membership change are constant with respect to cluster size
- The main tradeoff is that SWIM is eventually consistent
- It would run into the scalability issues above if it were turned into a linearizable system
- Traditional all-to-all heartbeat systems conflate failure detection and membership change propagation
- If we could reliably detect failures without involving all nodes, and only multicast to all nodes when a failure (or other membership change) occurs, this could keep network load from growing quadratically
- With that model, failure detection happens in expected constant time at some node, while the time for every node to learn of the failure grows with cluster size (with a deterministic bound)
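- A back-of-the-envelope check on “expected constant time at some node” (my arithmetic, consistent with the paper’s analysis): if each of the $n-1$ non-faulty members picks its probe target uniformly at random each period, a failed node is probed in a given period with probability $1 - (1 - \frac{1}{n-1})^{n-1} \approx 1 - e^{-1}$, so the expected number of periods until first detection is about $\frac{e}{e-1} \approx 1.58$, independent of $n$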
- From the paper, SWIM provides a membership system that:
- imposes a constant message load per group member
- detects a process failure in an (expected) constant time at some non-faulty process in the group;
- provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
- propagates membership updates, including information about failures, in infection-style (also gossip-style or epidemic-style); the dissemination latency in the group grows slowly (logarithmically) with the number of members;
- provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.
- 1 & 2 come from the authors’ earlier failure-detector work that SWIM builds on
- 3, 4, and 5 are unique to SWIM
- Failure-detection dimensions:
- Strong Completeness: crash-failure of any group member is detected by all non-faulty members
- Speed of failure detection: the time interval between a member failure and its detection by some non-faulty group member
- Accuracy: the rate of false positives of failure detection
- Network Message Load, in bytes per second generated by the protocol
- It’s impossible to have a strongly complete failure detector that also has no false positives over an asynchronous network; the typical approach is to guarantee strong completeness while trying to minimize the false-positive rate
3: The Basic SWIM Approach
- Two components: failure detector & disseminator
Failure Detection
- High-level operation (a consolidated code sketch follows at the end of this section)
- Divide time (local to each node) into protocol periods
- Every node knows about all other nodes in the cluster (membership list)
- During a protocol period, a given node ($M_i$ in the image) randomly picks one other node ($M_j$) and sends it a ping
- The pinged node immediately sends back an ack
- If the ack isn’t received before $M_i$ hits a timeout (which must be smaller than the protocol period), $M_i$ then asks a randomly-chosen set of $k$ other nodes (called a failure detection subgroup) about $M_j$ using a ping-req message. This is called an indirect probe
- These other nodes ping $M_j$ and forward acks (if any) back to $M_i$
- At the end of the protocol period, $M_i$ checks whether it received any acks (either direct or indirect)
- $M_j$ is marked failed if no acks were received during the protocol period
- The direct probe timeout is dynamically chosen based on RTT, and the protocol period duration must be at least 3x this number
- Each protocol period has a unique sequence number, and all messages during a protocol period are tagged with its sequence number
- Using indirect probing instead of something like backoff guards against false positives because of a partition or congestion between $M_i$ and $M_j$
- Smaller protocol periods -> faster detection (at the cost of more network load?)
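The following is a minimal Go sketch of one protocol period from $M_i$’s point of view, under my own assumptions: the `Transport` interface, all names, and the sequential indirect-probe loop are illustrative (real implementations issue the $k$ indirect probes concurrently and tag every message with the period’s sequence number).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Transport abstracts the network; a real implementation would use UDP.
type Transport interface {
	// Ping sends a direct probe and waits up to timeout for an ack.
	Ping(target string, timeout time.Duration) error
	// PingReq asks `via` to probe target on our behalf (indirect probe).
	PingReq(via, target string, timeout time.Duration) error
}

// probeOnce runs the failure-detection half of one protocol period,
// returning the chosen target and whether any ack (direct or indirect)
// arrived before the period's budget ran out.
func probeOnce(t Transport, members []string, self string, k int,
	rttTimeout, period time.Duration) (string, bool) {

	// Shuffle a copy of the membership list, drop ourselves, and take
	// the first entry as this period's random probe target.
	others := make([]string, 0, len(members))
	for _, m := range members {
		if m != self {
			others = append(others, m)
		}
	}
	rand.Shuffle(len(others), func(i, j int) { others[i], others[j] = others[j], others[i] })
	target := others[0]

	// Direct probe: the ack must arrive within the RTT-based timeout,
	// which must be smaller than the protocol period.
	if t.Ping(target, rttTimeout) == nil {
		return target, true
	}

	// Indirect probe: ask up to k other members to ping the target for
	// us, spending whatever remains of the protocol period.
	deadline := time.Now().Add(period - rttTimeout)
	asked := 0
	for _, via := range others[1:] {
		if asked == k || !time.Now().Before(deadline) {
			break
		}
		asked++
		if t.PingReq(via, target, time.Until(deadline)) == nil {
			return target, true
		}
	}

	// No acks by the end of the period: basic SWIM marks the target
	// failed; Section 4's refinement marks it suspected instead.
	return target, false
}

// fakeTransport treats one member as unreachable, for demonstration.
type fakeTransport struct{ dead string }

func (f fakeTransport) Ping(target string, _ time.Duration) error {
	if target == f.dead {
		return fmt.Errorf("timed out waiting for ack from %s", target)
	}
	return nil
}

func (f fakeTransport) PingReq(_, target string, d time.Duration) error {
	return f.Ping(target, d)
}

func main() {
	members := []string{"A", "B", "C", "D"}
	t := fakeTransport{dead: "C"}
	target, alive := probeOnce(t, members, "A", 2, 200*time.Millisecond, time.Second)
	fmt.Printf("probed %s, alive=%v\n", target, alive)
}
```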
Dissemination
- When a node is marked failed, this information is multicasted to the entire cluster as a failed message
- All recipients delete the failed node from their membership lists
- A node joining the cluster either notifies a central coordinator or broadcasts a join message that existing nodes either (randomly) respond to or ignore
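A tiny sketch of the broadcast-join variant; the function name and the 5% response probability are my assumptions (the note above only says nodes randomly respond or ignore, which keeps the joiner from being flooded by a reply from every member):

```go
package main

import (
	"fmt"
	"math/rand"
)

// handleJoin decides whether this member replies to a broadcast join
// message with its membership list; replying with probability p keeps
// the joiner from receiving a response from every existing member.
func handleJoin(p float64) bool {
	return rand.Float64() < p
}

func main() {
	responders := 0
	for i := 0; i < 100; i++ { // simulate 100 existing members
		if handleJoin(0.05) {
			responders++
		}
	}
	fmt.Printf("%d of 100 members replied to the join\n", responders)
}
```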
4: A More Robust and Efficient SWIM
- The protocols above aren’t without their faults:
- A node which can send messages but can’t receive messages (full recv buffer, for example) will continually mark alive nodes as failed
- Multicast for dissemination is unreliable; a more-expensive broadcast must be used if multicast isn’t available
Infection-Style Dissemination
- Eschew the need for multicast altogether by piggybacking dissemination messages (joined, failed, etc.) on failure-detection messages (ping, ping-req, ack)
- Membership information makes its way around the cluster over multiple protocol periods, as overlapping subgroups are chosen during each one
- This is an epidemic process that spreads exponentially fast in the group
- Each node maintains a queue of membership change messages to piggyback on outgoing failure-detection messages
- If more updates are queued than fit in one message, updates that have been piggybacked fewer times (i.e., fresher ones) are preferred
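A minimal Go sketch of such a piggyback queue; `Update`, `Buffer`, and the retransmit limit are my assumptions, but the two behaviors it illustrates come from the paper: updates ride on existing failure-detection messages, and when more are queued than fit, the ones piggybacked fewer times win. The paper’s analysis suggests each update only needs to be piggybacked on the order of $\log n$ times for the epidemic to reach everyone with high probability.

```go
package main

import (
	"fmt"
	"sort"
)

// Update is one membership change (join, suspect, alive, failed) that is
// waiting to be piggybacked on outgoing failure-detection messages.
type Update struct {
	Node     string
	Kind     string // "joined", "suspected", "alive", "failed"
	Gossiped int    // how many times we've already piggybacked it
}

// Buffer queues updates for infection-style dissemination.
type Buffer struct {
	updates  []*Update
	maxSends int // retransmit limit, on the order of log(cluster size)
}

// Enqueue records a fresh membership update to gossip.
func (b *Buffer) Enqueue(node, kind string) {
	b.updates = append(b.updates, &Update{Node: node, Kind: kind})
}

// SelectForMessage picks up to limit updates for the next outgoing ping,
// ping-req, or ack, preferring updates gossiped fewer times, and retires
// updates that have already been piggybacked maxSends times.
func (b *Buffer) SelectForMessage(limit int) []*Update {
	sort.Slice(b.updates, func(i, j int) bool {
		return b.updates[i].Gossiped < b.updates[j].Gossiped
	})
	if limit > len(b.updates) {
		limit = len(b.updates)
	}
	picked := make([]*Update, limit)
	copy(picked, b.updates[:limit])
	for _, u := range picked {
		u.Gossiped++
	}
	kept := b.updates[:0]
	for _, u := range b.updates {
		if u.Gossiped < b.maxSends {
			kept = append(kept, u)
		}
	}
	b.updates = kept
	return picked
}

func main() {
	b := &Buffer{maxSends: 8}
	b.Enqueue("C", "suspected")
	b.Enqueue("E", "joined")
	for _, u := range b.SelectForMessage(1) {
		fmt.Printf("piggybacking: %s %s\n", u.Node, u.Kind)
	}
}
```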
Suspicion
- Marking a node failed the first time a different node is unable to reach it is too severe
- Instead, use a suspicion mechanism. Instead of going from Healthy -> Failed, go from Healthy -> Suspected -> Failed
- Say $M_i$ is sending probes out against $M_j$
- When both the direct and indirect probes fail, $M_i$ marks $M_j$ Suspected by sending a “$M_i$ suspects $M_j$” message to the cluster
- All nodes that receive this message locally mark $M_j$ as suspected
- If any node that suspects $M_j$ then successfully pings it, that node marks $M_j$ healthy again and sends out a “$M_j$ is healthy” message to the cluster
- If $M_j$ itself receives news that it’s been suspected, it can send out an “$M_j$ is healthy” message immediately
- If $M_j$ stays suspected for a certain amount of time, it’s marked failed and an “$M_j$ has failed” message is sent out
- Message precedence is decided with per-node incarnation numbers (only $M_j$ may increment its own, which it does when refuting a suspicion): an alive message overrides a suspect message with a lower incarnation number, a suspect message wins ties at equal incarnation, and a failed message overrides both regardless of incarnation
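A sketch of that precedence logic in Go; the type and method names are mine, but the comparison rules follow the paper’s suspicion mechanism: a fresher alive overrides suspect, suspect wins ties at equal incarnation, and failed (the paper’s confirm message) overrides everything.

```go
package main

import "fmt"

// State is the local view of another member's liveness.
type State int

const (
	Alive State = iota
	Suspected
	Failed
)

// Member is one entry in the local membership list.
type Member struct {
	State       State
	Incarnation int
}

// Apply merges a received gossip message (state + incarnation number)
// into the local view of this member, returning true if it changed.
func (m *Member) Apply(msgState State, msgInc int) bool {
	switch {
	case m.State == Failed:
		return false // failed is terminal; nothing overrides it
	case msgState == Failed:
		// "M_j has failed" overrides alive and suspect alike.
		m.State, m.Incarnation = Failed, msgInc
	case msgState == Alive && msgInc > m.Incarnation:
		// A fresher alive (e.g. M_j refuting suspicion after bumping
		// its own incarnation number) overrides an older suspect.
		m.State, m.Incarnation = Alive, msgInc
	case msgState == Suspected &&
		(msgInc > m.Incarnation || (msgInc == m.Incarnation && m.State == Alive)):
		// Suspect wins ties against alive at the same incarnation.
		m.State, m.Incarnation = Suspected, msgInc
	default:
		return false // stale message; ignore it
	}
	return true
}

func main() {
	mj := &Member{State: Alive, Incarnation: 0}
	mj.Apply(Suspected, 0)             // "M_i suspects M_j" arrives
	fmt.Println(mj.State == Suspected) // true
	mj.Apply(Alive, 1)                 // M_j refutes with incarnation 1
	fmt.Println(mj.State == Alive)     // true
}
```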