SWIM — Scalable Weakly-consistent Infection-style Process Group Membership Protocol

1-2: Introduction + Prior Work

  • Weakly-consistent cluster membership at scale

  • Traditional heartbeat-based systems are unscalable

    • Network load grows quadratically if every member heartbeats every other member (a quick worked comparison follows this bullet group)
    • More traffic means more dropped packets
    • Dropped heartbeats drive up the false-positive rate and/or the time to first detection of membership changes
  • Similar issues when heartbeats are sent to a set of centralized servers (hotspots) or along a ring (unpredictable time-to-first-detection, multiple failures can lead to lost updates)
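
To make the quadratic-load claim concrete, here is a quick worked comparison. The symbols $n$ (group size), $T$ (heartbeat/protocol period), and $k$ (SWIM’s indirect-probe fan-out) are my shorthand, not the paper’s notation, and the example numbers are purely illustrative.

$$
L_{\text{all-to-all}} = \frac{n(n-1)}{T} \in O(n^2)
\qquad\qquad
L_{\text{SWIM, per member}} \le \frac{1 + k + \text{(acks and forwarded pings)}}{T} \in O(1)
$$

For example, with $n = 1000$ and $T = 1\text{ s}$, all-to-all heartbeating puts roughly $10^6$ messages per second on the network, while each SWIM member sends only a handful of messages per period regardless of $n$.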

  • SWIM is a peer-to-peer system where both the per-member network load and the expected time to first detection of a membership change are constant with respect to cluster size

    • The main tradeoff is that SWIM is eventually consistent
    • Making it strongly consistent (e.g. linearizable) would reintroduce the scalability issues above
  • Traditional all-to-all heartbeat systems conflate failure detection and membership change propagation

    • If we could reliably detect failures without involving all nodes, and only multicast to all nodes when a failure (or other membership change) occurs, this could keep network load from growing quadratically
    • With that model, some non-faulty node detects the failure in expected constant time, while the time for every node to learn of it grows slowly with cluster size
  • From the paper, SWIM provides a membership system that:

    1. imposes a constant message load per group member
    2. detects a process failure in an (expected) constant time at some non-faulty process in the group;
    3. provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
    4. propagates membership updates, including information about failures, in infection-style (also gossip-style or epidemic-style); the dissemination latency in the group grows slowly (logarithmically) with the number of members;
    5. provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.
    • 1 & 2 come from the authors’ earlier work on scalable failure detection
    • 3, 4, and 5 are new in SWIM (this paper)
  • Failure-detection dimensions:

    1. Strong Completeness: crash-failure of any group member is detected by all non-faulty members
    2. Speed of failure detection: the time interval between a member failure and its detection by some non-faulty group member
    3. Accuracy: the rate of false positives of failure detection
    4. Network Message Load, in bytes per second generated by the protocol
  • Over an asynchronous, lossy network it’s impossible to guarantee both strong completeness and perfect accuracy (zero false positives); the typical approach is to guarantee strong completeness while trying to minimize false positives

3: The Basic SWIM Approach

  • Two components: failure detector & disseminator

Failure Detection

  • (Figure from the paper: one protocol period at $M_i$, showing the direct ping to $M_j$ and the indirect ping-req/ack path through $k$ other members)
  • High-level operation (a Go sketch of one protocol period follows this section)
    • Divide time (local to each node) into protocol periods
    • Every node knows about all other nodes in the cluster (membership list)
    • During a protocol period, a given node ($M_i$ in the image) randomly picks one other node ($M_j$) and sends it a ping
    • The pinged node immediately sends back an ack
    • If the ack isn’t received before $M_i$ hits a timeout (which must be smaller than the protocol period), $M_i$ then asks a randomly-chosen set of $k$ other nodes (called a failure detection subgroup) about $M_j$ using a ping-req message. This is called an indirect probe
    • These other nodes ping $M_j$ and forward acks (if any) back to $M_i$
    • At the end of the protocol period, $M_i$ checks whether it received any acks (either direct or indirect)
    • $M_j$ is marked failed if no acks were received during the protocol period
  • The direct probe timeout is dynamically chosen based on RTT, and the protocol period duration must be at least 3x this number
  • Each protocol period has a unique sequence number, and all messages during a protocol period are tagged with its sequence number
  • Using indirect probing instead of something like backoff guards against false positives because of a partition or congestion between $M_i$ and $M_j$
  • Smaller protocol periods -> faster detection, at the cost of more messages per second (the per-period message count is constant, so shortening the period raises the message rate proportionally)
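
Putting the bullets above together, here is a minimal Go sketch of one protocol period from $M_i$’s point of view. The `Transport` interface, function names, and parameters are my own illustration of the steps described, not an API from the paper.

```go
// Minimal sketch of one SWIM protocol period at node M_i.
package swim

import (
	"math/rand"
	"time"
)

type NodeID string

type Transport interface {
	// Ping sends a direct ping to target and reports whether an ack
	// arrived within timeout.
	Ping(target NodeID, timeout time.Duration) bool
	// PingReq asks `via` to ping target on our behalf and reports whether
	// an ack was forwarded back within timeout.
	PingReq(via, target NodeID, timeout time.Duration) bool
}

// probe runs one protocol period's check of target. members is the local
// membership list (excluding ourselves), k is the failure-detection
// subgroup size, and rttTimeout < period (the paper suggests a period of
// at least 3x the RTT estimate).
func probe(t Transport, target NodeID, members []NodeID, k int,
	rttTimeout, period time.Duration) bool {

	deadline := time.Now().Add(period)

	// Step 1: direct probe, bounded by the RTT-based timeout.
	if t.Ping(target, rttTimeout) {
		return true
	}

	// Step 2: indirect probe via k randomly chosen other members.
	vias := pickK(members, target, k)
	acks := make(chan bool, len(vias))
	for _, via := range vias {
		go func(via NodeID) {
			acks <- t.PingReq(via, target, time.Until(deadline))
		}(via)
	}

	// Step 3: accept any forwarded ack until the protocol period expires.
	for range vias {
		select {
		case ok := <-acks:
			if ok {
				return true
			}
		case <-time.After(time.Until(deadline)):
			return false
		}
	}
	return false
}

// pickK returns up to k members chosen uniformly at random, excluding target.
func pickK(members []NodeID, target NodeID, k int) []NodeID {
	candidates := make([]NodeID, 0, len(members))
	for _, m := range members {
		if m != target {
			candidates = append(candidates, m)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > k {
		candidates = candidates[:k]
	}
	return candidates
}
```

If `probe` returns false, the caller treats $M_j$ as unreachable for this period; with the basic protocol that means marking it failed, and with section 4’s refinement it means marking it Suspected.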

Dissemination

  • When a node is marked failed, this information is multicast to the entire cluster as a failed message
  • All recipients delete the failed node from their membership lists
  • A node joining the cluster either notifies a central coordinator, or broadcasts a join message that nodes either (randomly) respond to or ignore
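
These handlers are simple enough to sketch. The function names, the reply callback, and the probability parameter `p` below are my own illustration, not from the paper.

```go
// Minimal sketch of the basic (multicast-based) dissemination handlers.
package swim

import "math/rand"

type Member = string

// onFailedMulticast: every recipient of a "Mj failed" multicast simply
// drops Mj from its local membership list.
func onFailedMulticast(members map[Member]bool, failed Member) {
	delete(members, failed)
}

// onJoinBroadcast sketches the coordinator-less join path: each existing
// member that hears the join broadcast records the joiner, but replies
// only with probability p so the joiner isn't flooded with near-identical
// responses. (Assumption: the reply carries the responder's membership
// list so the joiner can bootstrap its own.)
func onJoinBroadcast(members map[Member]bool, joiner Member, p float64,
	reply func(to Member, membership map[Member]bool)) {
	members[joiner] = true
	if rand.Float64() < p {
		reply(joiner, members)
	}
}
```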

4: A More Robust and Efficient SWIM

  • The protocols above aren’t without their faults:
    • A node which can send messages but can’t receive messages (full recv buffer, for example) will continually mark alive nodes as failed
    • Multicast for dissemination is unreliable; a more-expensive broadcast must be used if multicast isn’t available

Infection-Style Dissemination

  • Eschew the need for multicast altogether by piggybacking dissemination messages (joined, failed, etc.) on failure-detection messages (ping, ping-req, ack)
  • Membership information makes its way around the cluster over multiple protocol periods, as overlapping subgroups are chosen during each one
  • This is an epidemic process that spreads exponentially fast in the group
  • Each node queues membership change messages that are to be sent over failure-detection messages
  • If more updates are queued than fit on one message, preference is given to updates that have been piggybacked fewer times (in practice, the fresher ones)
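
A minimal sketch of such a piggyback buffer, assuming each queued update carries a transmit counter (the field names, `maxPerMsg`, and `maxTransmits` are my choices; the paper bounds retransmissions to roughly a logarithmic number of piggybacks per update):

```go
// Piggyback buffer for infection-style dissemination: every outgoing
// ping / ping-req / ack attaches up to maxPerMsg recent membership updates.
package swim

import "sort"

type Update struct {
	Node      string
	Kind      string // "joined", "failed", "suspected", "alive"
	transmits int    // how many times this update has been piggybacked so far
}

type PiggybackBuffer struct {
	updates      []*Update
	maxPerMsg    int // max updates attached to one message
	maxTransmits int // retire an update after this many piggybacks
}

// Enqueue records a membership change to be spread on future messages.
func (b *PiggybackBuffer) Enqueue(u *Update) {
	b.updates = append(b.updates, u)
}

// Next picks the updates to attach to the next outgoing failure-detection
// message, preferring the least-gossiped ones, and retires exhausted ones.
func (b *PiggybackBuffer) Next() []*Update {
	sort.Slice(b.updates, func(i, j int) bool {
		return b.updates[i].transmits < b.updates[j].transmits
	})
	out := []*Update{}
	keep := b.updates[:0] // filter in place
	for _, u := range b.updates {
		if len(out) < b.maxPerMsg {
			out = append(out, u)
			u.transmits++
		}
		if u.transmits < b.maxTransmits {
			keep = append(keep, u)
		}
	}
	b.updates = keep
	return out
}
```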

Suspicion

  • Marking a node failed the first time a different node is unable to reach it is too severe
  • Instead, use a suspicion mechanism. Instead of going from Healthy -> Failed, go from Healthy -> Suspected -> Failed
  • Say $M_i$ is sending probes out against $M_j$
    • When both the direct and indirect probes fail, $M_i$ marks $M_j$ as Suspected and sends a “$M_i$ suspects $M_j$” message to the cluster
    • All nodes that receive this message locally mark $M_j$ as suspected
    • If any node that suspects $M_j$ then successfully pings it, it marks $M_j$ healthy again and sends out a “$M_j$ is healthy” message to the cluster
    • If $M_j$ itself receives news that it’s been suspected, it can send out an “$M_j$ is healthy” message immediately
    • If $M_j$ stays suspected for a certain amount of time, it’s marked failed and an “$M_j$ has failed” message is sent out
    • Alive messages override suspect messages, and failed messages override both; the paper orders these messages with per-node incarnation numbers (which only $M_j$ itself increments, when refuting a suspicion) so that stale alive messages can’t cancel a fresh suspicion
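
A small sketch of the per-member state machine these rules imply. The names, message-kind strings, and `suspicionTimeout` are my own; the paper’s incarnation numbers are omitted to keep it short.

```go
// Per-member suspicion state machine: Alive clears Suspected, Failed
// overrides everything, and an unrefuted suspicion is promoted to Failed.
package swim

import "time"

type State int

const (
	Alive State = iota
	Suspected
	Failed
)

type memberState struct {
	state       State
	suspectedAt time.Time
}

// onMessage applies a disseminated "alive" / "suspect" / "failed" message
// about this member.
func (m *memberState) onMessage(kind string, now time.Time) {
	switch {
	case m.state == Failed:
		// Failed is terminal: it overrides both alive and suspect.
	case kind == "failed":
		m.state = Failed
	case kind == "suspect" && m.state == Alive:
		m.state = Suspected
		m.suspectedAt = now
	case kind == "alive" && m.state == Suspected:
		// The suspicion was refuted (by M_j itself or a successful ping).
		m.state = Alive
	}
}

// tick promotes a suspicion that has lasted past suspicionTimeout to Failed
// and reports whether an "M_j has failed" message should now be sent out.
func (m *memberState) tick(now time.Time, suspicionTimeout time.Duration) bool {
	if m.state == Suspected && now.Sub(m.suspectedAt) >= suspicionTimeout {
		m.state = Failed
		return true
	}
	return false
}
```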