SWIM — Scalable Weakly-consistent Infection-style Process Group Membership Protocol
- https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
- Questions
- How does Consul achieve (linearizable) consensus using an eventually consistent protocol?
- It doesn’t. It uses SWIM to send messages across the cluster and for membership, but it implements Raft for the actual consensus
1-2: Introduction + Prior Work
- Weakly-consistent cluster membership at scale
- Traditional heartbeat-based systems are unscalable
- Network load grows quadratically if every heartbeat is sent to every other node
- Which leads to more dropped packets
- Which leads to a spike in false positives and/or time to first detection of membership changes
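- As a rough illustration (my numbers, not the paper’s): with $n$ nodes each heartbeating every other node once per period, the network carries $n(n-1) \approx n^2$ messages per period, so growing from 100 to 1,000 nodes multiplies cluster-wide traffic by ~100x while each node’s per-period receive load grows 10x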
- Similar issues arise when heartbeats are sent to a set of centralized servers (hotspots) or along a ring (unpredictable time-to-first-detection; multiple simultaneous failures can cause lost updates)
- SWIM is a peer-to-peer system where both the per-member network load and the expected time to first detection of a membership change are constant with respect to cluster size
- The main tradeoff is that SWIM is eventually consistent
- It would run into the scalability issues above if it were turned into a linearizable system
- Traditional all-to-all heartbeat systems conflate failure detection and membership change propagation
- If we could reliably detect failures without involving all nodes, and only multicast to all nodes when a failure (or other membership change) occurs, this could keep network load from growing quadratically
- With that model, failure detection happens in expected constant time at some node, while the time for every node to learn of the failure grows with cluster size (with a deterministic bound)
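- A back-of-the-envelope check on “expected constant time at some node” (my arithmetic, consistent with the paper’s analysis): if each of the $n-1$ non-faulty members picks its probe target uniformly at random each period, a failed node is probed in a given period with probability $1 - (1 - \frac{1}{n-1})^{n-1} \approx 1 - e^{-1}$, so the expected number of periods until first detection is about $\frac{e}{e-1} \approx 1.58$, independent of $n$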
- From the paper, SWIM provides a membership system that:
- imposes a constant message load per group member
- detects a process failure in an (expected) constant time at some non-faulty process in the group;
- provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
- propagates membership updates, including information about failures, in infection-style (also gossip-style or epidemic-style); the dissemination latency in the group grows slowly (logarithmically) with the number of members;
- provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.
- 1 & 2 come from the authors’ earlier failure-detector work that SWIM builds on
- 3, 4, and 5 are unique to SWIM
- Failure-detection dimensions:
- Strong Completeness: crash-failure of any group member is detected by all non-faulty members
- Speed of failure detection: the time interval between a member failure and its detection by some non-faulty group member
- Accuracy: the rate of false positives of failure detection
- Network Message Load, in bytes per second generated by the protocol
- It’s impossible to have a strongly complete failure detector that also has no false positives over an asynchronous network; the typical approach is to guarantee strong completeness while trying to minimize the false-positive rate
3: The Basic SWIM Approach
- Two components: failure detector & disseminator
Failure Detection
- High-level operation (a consolidated code sketch follows at the end of this section)
- Divide time (local to each node) into protocol periods
- Every node knows about all other nodes in the cluster (membership list)
- During a protocol period, a given node ($M_i$ in the image) randomly picks one other node ($M_j$) and sends it a ping
- The pinged node immediately sends back an ack
- If the ack isn’t received before $M_i$ hits a timeout (which must be smaller than the protocol period), $M_i$ then asks a randomly-chosen set of $k$ other nodes (called a failure detection subgroup) about $M_j$ using a ping-req message. This is called an indirect probe
- These other nodes ping $M_j$ and forward acks (if any) back to $M_i$
- At the end of the protocol period, $M_i$ checks whether it received any acks (either direct or indirect)
- $M_j$ is marked failed if no acks were received during the protocol period
- The direct probe timeout is dynamically chosen based on RTT, and the protocol period duration must be at least 3x this number
- Each protocol period has a unique sequence number, and all messages during a protocol period are tagged with its sequence number
- Using indirect probing instead of something like backoff guards against false positives because of a partition or congestion between $M_i$ and $M_j$
- Smaller protocol periods -> faster detection (at the cost of more network load?)
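The following is a minimal Go sketch of one protocol period from $M_i$’s point of view, under my own assumptions: the `Transport` interface, all names, and the sequential indirect-probe loop are illustrative (real implementations issue the $k$ indirect probes concurrently and tag every message with the period’s sequence number).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Transport abstracts the network; a real implementation would use UDP.
type Transport interface {
	// Ping sends a direct probe and waits up to timeout for an ack.
	Ping(target string, timeout time.Duration) error
	// PingReq asks `via` to probe target on our behalf (indirect probe).
	PingReq(via, target string, timeout time.Duration) error
}

// probeOnce runs the failure-detection half of one protocol period,
// returning the chosen target and whether any ack (direct or indirect)
// arrived before the period's budget ran out.
func probeOnce(t Transport, members []string, self string, k int,
	rttTimeout, period time.Duration) (string, bool) {

	// Shuffle a copy of the membership list, drop ourselves, and take
	// the first entry as this period's random probe target.
	others := make([]string, 0, len(members))
	for _, m := range members {
		if m != self {
			others = append(others, m)
		}
	}
	rand.Shuffle(len(others), func(i, j int) { others[i], others[j] = others[j], others[i] })
	target := others[0]

	// Direct probe: the ack must arrive within the RTT-based timeout,
	// which must be smaller than the protocol period.
	if t.Ping(target, rttTimeout) == nil {
		return target, true
	}

	// Indirect probe: ask up to k other members to ping the target for
	// us, spending whatever remains of the protocol period.
	deadline := time.Now().Add(period - rttTimeout)
	asked := 0
	for _, via := range others[1:] {
		if asked == k || !time.Now().Before(deadline) {
			break
		}
		asked++
		if t.PingReq(via, target, time.Until(deadline)) == nil {
			return target, true
		}
	}

	// No acks by the end of the period: basic SWIM marks the target
	// failed; Section 4's refinement marks it suspected instead.
	return target, false
}

// fakeTransport treats one member as unreachable, for demonstration.
type fakeTransport struct{ dead string }

func (f fakeTransport) Ping(target string, _ time.Duration) error {
	if target == f.dead {
		return fmt.Errorf("timed out waiting for ack from %s", target)
	}
	return nil
}

func (f fakeTransport) PingReq(_, target string, d time.Duration) error {
	return f.Ping(target, d)
}

func main() {
	members := []string{"A", "B", "C", "D"}
	t := fakeTransport{dead: "C"}
	target, alive := probeOnce(t, members, "A", 2, 200*time.Millisecond, time.Second)
	fmt.Printf("probed %s, alive=%v\n", target, alive)
}
```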
Dissemination
- When a node is marked failed, this information is multicasted to the entire cluster as a failed message
- All recipients delete the failed node from their membership lists
- A node joining the cluster either notifies a central coordinator or broadcasts a join message that existing nodes either (randomly) respond to or ignore
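A tiny sketch of the broadcast-join variant; the function name and the 5% response probability are my assumptions (the note above only says nodes randomly respond or ignore, which keeps the joiner from being flooded by a reply from every member):

```go
package main

import (
	"fmt"
	"math/rand"
)

// handleJoin decides whether this member replies to a broadcast join
// message with its membership list; replying with probability p keeps
// the joiner from receiving a response from every existing member.
func handleJoin(p float64) bool {
	return rand.Float64() < p
}

func main() {
	responders := 0
	for i := 0; i < 100; i++ { // simulate 100 existing members
		if handleJoin(0.05) {
			responders++
		}
	}
	fmt.Printf("%d of 100 members replied to the join\n", responders)
}
```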
4: A More Robust and Efficient SWIM
- The protocols above aren’t without their faults:
- A node which can send messages but can’t receive messages (full recv buffer, for example) will continually mark alive nodes as failed
- Multicast for dissemination is unreliable; a more-expensive broadcast must be used if multicast isn’t available
Infection-Style Dissemination
- Eschew the need for multicast altogether by piggybacking dissemination messages (joined, failed, etc.) on failure-detection messages (ping, ping-req, ack)
- Membership information makes its way around the cluster over multiple protocol periods, as overlapping subgroups are chosen during each one
- This is an epidemic process that spreads exponentially fast in the group
- Each node maintains a queue of membership change messages to piggyback on outgoing failure-detection messages
- If more updates are queued than fit in one message, updates that have been piggybacked fewer times (i.e., fresher ones) are preferred
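A minimal Go sketch of such a piggyback queue; `Update`, `Buffer`, and the retransmit limit are my assumptions, but the two behaviors it illustrates come from the paper: updates ride on existing failure-detection messages, and when more are queued than fit, the ones piggybacked fewer times win. The paper’s analysis suggests each update only needs to be piggybacked on the order of $\log n$ times for the epidemic to reach everyone with high probability.

```go
package main

import (
	"fmt"
	"sort"
)

// Update is one membership change (join, suspect, alive, failed) that is
// waiting to be piggybacked on outgoing failure-detection messages.
type Update struct {
	Node     string
	Kind     string // "joined", "suspected", "alive", "failed"
	Gossiped int    // how many times we've already piggybacked it
}

// Buffer queues updates for infection-style dissemination.
type Buffer struct {
	updates  []*Update
	maxSends int // retransmit limit, on the order of log(cluster size)
}

// Enqueue records a fresh membership update to gossip.
func (b *Buffer) Enqueue(node, kind string) {
	b.updates = append(b.updates, &Update{Node: node, Kind: kind})
}

// SelectForMessage picks up to limit updates for the next outgoing ping,
// ping-req, or ack, preferring updates gossiped fewer times, and retires
// updates that have already been piggybacked maxSends times.
func (b *Buffer) SelectForMessage(limit int) []*Update {
	sort.Slice(b.updates, func(i, j int) bool {
		return b.updates[i].Gossiped < b.updates[j].Gossiped
	})
	if limit > len(b.updates) {
		limit = len(b.updates)
	}
	picked := make([]*Update, limit)
	copy(picked, b.updates[:limit])
	for _, u := range picked {
		u.Gossiped++
	}
	kept := b.updates[:0]
	for _, u := range b.updates {
		if u.Gossiped < b.maxSends {
			kept = append(kept, u)
		}
	}
	b.updates = kept
	return picked
}

func main() {
	b := &Buffer{maxSends: 8}
	b.Enqueue("C", "suspected")
	b.Enqueue("E", "joined")
	for _, u := range b.SelectForMessage(1) {
		fmt.Printf("piggybacking: %s %s\n", u.Node, u.Kind)
	}
}
```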
Suspicion
- Marking a node failed the first time a different node is unable to reach it is too severe
- Instead, use a suspicion mechanism. Instead of going from Healthy -> Failed, go from Healthy -> Suspected -> Failed
- Say $M_i$ is sending probes out against $M_j$
- When both the direct and indirect probes fail, $M_i$ marks $M_j$ Suspected by sending a “$M_i$ suspects $M_j$” message to the cluster
- All nodes that receive this message locally mark $M_j$ as suspected
- If any node that suspects $M_j$ then successfully pings it, that node marks $M_j$ healthy again and sends out a “$M_j$ is healthy” message to the cluster
- If $M_j$ itself receives news that it’s been suspected, it can send out an “$M_j$ is healthy” message immediately
- If $M_j$ stays suspected for a certain amount of time, it’s marked failed and an “$M_j$ has failed” message is sent out
- Message precedence is decided with per-node incarnation numbers (only $M_j$ may increment its own, which it does when refuting a suspicion): an alive message overrides a suspect message with a lower incarnation number, a suspect message wins ties at equal incarnation, and a failed message overrides both regardless of incarnation
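A sketch of that precedence logic in Go; the type and method names are mine, but the comparison rules follow the paper’s suspicion mechanism: a fresher alive overrides suspect, suspect wins ties at equal incarnation, and failed (the paper’s confirm message) overrides everything.

```go
package main

import "fmt"

// State is the local view of another member's liveness.
type State int

const (
	Alive State = iota
	Suspected
	Failed
)

// Member is one entry in the local membership list.
type Member struct {
	State       State
	Incarnation int
}

// Apply merges a received gossip message (state + incarnation number)
// into the local view of this member, returning true if it changed.
func (m *Member) Apply(msgState State, msgInc int) bool {
	switch {
	case m.State == Failed:
		return false // failed is terminal; nothing overrides it
	case msgState == Failed:
		// "M_j has failed" overrides alive and suspect alike.
		m.State, m.Incarnation = Failed, msgInc
	case msgState == Alive && msgInc > m.Incarnation:
		// A fresher alive (e.g. M_j refuting suspicion after bumping
		// its own incarnation number) overrides an older suspect.
		m.State, m.Incarnation = Alive, msgInc
	case msgState == Suspected &&
		(msgInc > m.Incarnation || (msgInc == m.Incarnation && m.State == Alive)):
		// Suspect wins ties against alive at the same incarnation.
		m.State, m.Incarnation = Suspected, msgInc
	default:
		return false // stale message; ignore it
	}
	return true
}

func main() {
	mj := &Member{State: Alive, Incarnation: 0}
	mj.Apply(Suspected, 0)             // "M_i suspects M_j" arrives
	fmt.Println(mj.State == Suspected) // true
	mj.Apply(Alive, 1)                 // M_j refutes with incarnation 1
	fmt.Println(mj.State == Alive)     // true
}
```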