aphyr-distsys-class

https://github.com/aphyr/distsys-class

This is full of extremely good advice. Notes below:

  • TCP
    • It works, use it.
    • You can go faster, but usually not necessary.
    • Ordering across multiple connections is not guaranteed; you’ll have to layer your own sequence numbers on top of TCP to get it (sketch below)
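
    A minimal sketch of the sequence-number idea, assuming one global counter on the sender and a reorder buffer on the receiver (names are mine, purely illustrative):

    ```python
    import itertools

    class SequencedSender:
        """Tags each message with a global sequence number before it is written
        to whichever TCP connection happens to carry it."""
        def __init__(self):
            self._seq = itertools.count()

        def wrap(self, payload):
            return {"seq": next(self._seq), "payload": payload}

    class SequencedReceiver:
        """Reorders messages that arrived over different connections."""
        def __init__(self):
            self._next = 0
            self._pending = {}  # seq -> payload, held until the gap closes

        def receive(self, msg):
            """Returns the payloads that are now deliverable, in order."""
            self._pending[msg["seq"]] = msg["payload"]
            ready = []
            while self._next in self._pending:
                ready.append(self._pending.pop(self._next))
                self._next += 1
            return ready
    ```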
  • UDP data might be dropped, but also can be delivered more than once, which is problematic for things like metrics.
  • Clocks
    • “POSIX time is not monotonic by definition” (this affected Cloudflare)
    • Suspension at various levels: threads, runtimes, OSes, hardware
    • OS monotonic clocks aren’t always monotonic 😱 (rust link)
    • GPS/atomic clocks are better than NTP. TrueTime is the only real production implementation right now (and Google isn’t sharing it)
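
    To make the monotonic-clock point concrete, a small Python illustration (my example): wall-clock time can be stepped backwards by NTP, so durations and timeouts should come from the monotonic clock.

    ```python
    import time

    # Wall clock: fine for timestamps shown to humans, but NTP can step it
    # backwards or forwards, so never subtract two readings to get a duration.
    started_at = time.time()

    # Monotonic clock: meaningless as an absolute value, but never goes
    # backwards within a process; use it for durations and timeouts.
    # (As noted above, even OS monotonic clocks have had bugs.)
    t0 = time.monotonic()
    sum(range(100_000))  # stand-in for real work
    elapsed = time.monotonic() - t0
    print(f"took {elapsed:.6f}s")
    ```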
  • Apdex is a standard score for measuring availability/SLA based on request latencies (sketch below)
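
  For reference, the Apdex score is computed against a target latency T: requests under T count as satisfied, under 4T as tolerating, the rest as frustrated. A tiny sketch (my own, with made-up numbers):

  ```python
  def apdex(latencies_ms, target_ms):
      """Apdex = (satisfied + tolerating / 2) / total."""
      satisfied = sum(1 for l in latencies_ms if l <= target_ms)
      tolerating = sum(1 for l in latencies_ms if target_ms < l <= 4 * target_ms)
      return (satisfied + tolerating / 2) / len(latencies_ms)

  print(apdex([80, 120, 350, 900, 2500], target_ms=300))  # -> 0.6
  ```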
  • “Big data is usually less important, small data is usually critical”
  • CALM Conjecture
    • Consistency As Logical Monotonicity: “If you can prove a system is logically monotonic, it is coordination free”
    • Monotonicity, informally, means being retraction-free: once we learn something is true, no information arriving later can refute it.
    • Avoid consensus/coordination where possible; accidental coordination vs. essential coordination
    • “Try to phrase your problem such that you only add new facts to the system. Likely tradeoff: incomplete reads.” (See the grow-only set sketch after this list.)
    • Datalog
    • The original paper (link, the morning paper) seems to advocate for this style not necessarily in a distributed environment, so I’m not sure I understand this concretely yet.
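
    To make “only add new facts” concrete, here’s a minimal sketch of a grow-only set, the textbook monotonic structure (my example, not from the class): replicas only ever add elements, so merges converge in any order without coordination, and the tradeoff is exactly the incomplete reads mentioned above.

    ```python
    class GSet:
        """Grow-only set: the only operation is add, so the set of known facts
        grows monotonically and no later message can retract one."""
        def __init__(self, elements=()):
            self.elements = set(elements)

        def add(self, element):
            self.elements.add(element)

        def merge(self, other):
            # Union is commutative, associative, and idempotent, so replicas can
            # exchange state in any order, any number of times.
            return GSet(self.elements | other.elements)

        def __contains__(self, element):
            # Reads may be incomplete: an element another replica has added
            # might not have reached this replica yet.
            return element in self.elements

    a, b = GSet({"x"}), GSet({"y"})
    print(sorted(a.merge(b).elements))  # ['x', 'y'] regardless of merge order
    ```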
  • Gossip
    • A couple of mechanisms: global broadcast (messages go to every other node), mesh networks (messages go to neighbors), spanning trees (messages go to neighbors along a spanning tree instead of a mesh, which is less noisy). See the sketch after this list.

    • Useful for cluster management, service discovery, health sensors, CDNs, etc

      • How exactly is this used for a CDN? I think I understand the others but not this one. 🤔
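
    A rough sketch of one push-gossip round under the mesh model above (illustrative only; peer selection, fanout, and anti-entropy details vary widely): each node pushes everything it knows to a few random peers, and a rumour reaches everyone in roughly O(log n) rounds.

    ```python
    import random

    def gossip_round(state, peers_per_node=3):
        """One synchronous push-gossip round.
        state: dict mapping node -> set of rumours that node has heard."""
        nodes = list(state)
        updates = {n: set() for n in nodes}
        for node in nodes:
            peers = random.sample([n for n in nodes if n != node],
                                  k=min(peers_per_node, len(nodes) - 1))
            for peer in peers:
                updates[peer] |= state[node]  # push everything we know
        for node in nodes:
            state[node] |= updates[node]

    # Start with one rumour on node 0 and watch it spread.
    state = {i: set() for i in range(20)}
    state[0].add("node 7 looks unhealthy")
    rounds = 0
    while not all(state.values()):
        gossip_round(state)
        rounds += 1
    print(f"everyone heard the rumour after {rounds} rounds")
    ```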
  • Highly Available Transactions (HATs) are less a single protocol than a family of weaker transactional guarantees that can be provided without giving up availability. Low latency, so suitable for planet-scale systems.
  • Consensus
    • The consensus problem is equivalent to: lock services, totally ordered logs, replicated state machines

    • The FLP result says deterministic consensus is impossible in async networks with even one faulty process, but this model makes a stricter set of assumptions than are practical. For example: no clocks and not even a random number generator. Nothing non-deterministic at all!

    • “Paxos is really more of a family of algorithms than a well-described single entity”

      • Used in Chubby, Cassandra, Riak, FoundationDB
    • ZAB/ZK

      • Not strictly linearizable; reads with sync can be linearizable most of the time, but the guarantee is “lagging ordered reads” even with the use of sync.
      • (Speculation) The ordering guarantee should allow for linearizability. Write a value, wait for the value to appear in the log, and then perform the read at that point. I think ZK does pretty much this with “watches”. From the docs:

      ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.

    • Chain replication: a version of synchronous replication that is supposed to be better in some way; I’m not really seeing it though. How is this better than regular sync replication if one of these servers goes down? Maybe the point is just to reduce the write load on the primary (each node accepts one request and sends one request, instead of the primary sending N requests)? (diagram: fig1.png)
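
    A toy sketch of the write path (my reading of the basic idea, ignoring failure handling and chain reconfiguration, which is where the real machinery lives): writes enter at the head and each node forwards to exactly one successor; the tail acks, and reads go to the tail, which only ever sees fully replicated writes.

    ```python
    class ChainNode:
        """One link in a replication chain. Each node applies a write locally and
        forwards it to exactly one successor, so no single node fans out to N
        replicas the way a primary does in primary/backup replication."""
        def __init__(self, name):
            self.name = name
            self.store = {}
            self.successor = None  # None means this node is the tail

        def write(self, key, value):
            self.store[key] = value
            if self.successor is not None:
                return self.successor.write(key, value)
            return "ack"  # the tail acks once the write is fully replicated

        def read(self, key):
            # Reads are served by the tail, which has every acknowledged write.
            return self.store.get(key)

    head, mid, tail = ChainNode("head"), ChainNode("mid"), ChainNode("tail")
    head.successor, mid.successor = mid, tail

    assert head.write("x", 1) == "ack"
    print(tail.read("x"))  # 1
    ```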

    • Raft

      • There’s a Coq proof of the core algorithm
      • Examples: RethinkDB, etcd, Consul
    • (Distributed) Transactions

      • Strong-1SR == Strong 1-copy Serializability == Linearizability + serializability
      • Single-writer: readers use a snapshot, and all writes are ordered through a single node/queue. This ensures that no two write txns are concurrent. (Toy sketch after this list.)
      • Multiple writers: Have several shards, each with a consensus system for linearizability. Needs a protocol for cross-shard transactions.
      • Disjoint shards: Disallow cross-shard transactions
      • Percolator uses snapshot isolation over linearizable shards
      • Spanner uses two-phase commit over Paxos groups, with a latency floor (TrueTime commit wait) to ensure timestamp monotonicity.
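
      A toy sketch of the single-writer idea above (my illustration, not any particular system): every write transaction is pushed through one queue consumed by a single thread, so no two writes are ever concurrent, while readers work off immutable snapshots.

      ```python
      import queue
      import threading

      store = {}                   # authoritative state, touched only by the writer
      write_queue = queue.Queue()  # all write txns are ordered through this queue
      snapshot = {}                # readers only ever see published snapshots

      def writer_loop():
          """The single writer: applies transactions one at a time, then publishes
          a fresh immutable snapshot for readers (a stand-in for real MVCC)."""
          global snapshot
          while True:
              txn = write_queue.get()
              if txn is None:
                  break
              txn(store)              # no other write can be concurrent with this
              snapshot = dict(store)  # publish a new read snapshot

      writer = threading.Thread(target=writer_loop, daemon=True)
      writer.start()

      write_queue.put(lambda s: s.update(balance=100))
      write_queue.put(lambda s: s.update(balance=s["balance"] - 30))
      write_queue.put(None)  # shut the writer down for this demo
      writer.join()
      print(snapshot["balance"])  # 70
      ```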
    • Latency

      • Never zero
      • Multicore systems (especially NUMA, where each CPU has its own memory: local accesses are cheaper, non-local accesses are more expensive) are like a distributed system themselves. FENCE instructions (typically ~100 cycles) bring things back to a consistent state.
      • Avoid coordination between cores wherever possible, processor pinning can help.
      • Context switches are expensive (thread, process, kernel)
      • Latency across a typical Ethernet LAN can be ~100µs, but expect 1000s of µs on AWS 😭
      • “Sometimes, packets could be delayed by five minutes”
      • Network access can be faster than disk on AWS if you’re using EBS
    • Geo replication

      • “Entire Amazon regions can and will fail – yes, regions, not AZs”
      • Minimum of 1 round-trip for consensus, possibly even 4, so be very careful of cross-datacenter linearizability
      • “CRDTs can always give you safe local writes” ← client-server multi-master architecture
      • Keep things disjoint (by continent or even country), and use consensus only to migrate users between datacenters.
    • Common distributed systems

      • “Outsourced heaps”: Redis, memcached
      • KV: Riak, CouchDB, Mongo, Cassandra, HDFS
      • SQL: Postgres, MySQL, Percona XtraDB, Oracle, MSSQL, VoltDB, CockroachDB
      • Search: Elasticsearch, SolrCloud (typically weak consistency)
      • Coordination: ZK, etcd, Consul
      • Streaming: Storm, Spark (low latencies, high throughput)
      • Log/Queues: Kafka, Kestrel, Rabbit, {Iron,Active}MQ, HornetQ, Beanstalk, SQS, Celery
        • “The only one I know that won’t lose data in a partition is Kafka”
        • “Queues can get you out of a bind when you’ve chosen a poor runtime” HAHA SO TRUE 😭
  • Don’t distribute where you don’t have to
    • You can get computers with 6 TB of RAM
    • “Production JVM HTTP services I’ve known have pushed 50K requests/sec”
    • “Protocol buffers over TCP: 10 million events/sec”
      • With 10-100 events in a batch, processed in-memory
  • If we have to distribute, can we push the work onto some other software?
  • Distributed systems aren’t just characterized by latency, but by recurrent, partial failure
  • Writing recovery-first code keeps you from punting on error handling
  • Backups are essentially sequential consistency, BUT you lose a window of ops.
  • “I’m not a huge believer in active-spare; active-active wherever possible”
    • Not sure I agree with this; what if you run into trouble maintaining consistency between the two active nodes? What if one crashes? What if one runs out of disk space but the other doesn’t?
  • “I generally want three copies of data, for important stuff, 4 or 5”
  • Common DR strategy: Paxos across 5 nodes; 3 or 4 in primary DC
    • Ops can complete as soon as the local nodes ack; low latencies
    • Resilient to single-node failure (though latencies will spike)
    • But you still have a sequentially consistent backup in the other DC
      • So in the event you lose an entire DC, all’s not lost
  • “Redundancy improves availability so long as failures are uncorrelated”. Examples in screenshot: Screen Shot 2021-07-09 at 3.15.49 PM.png
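
  Back-of-the-envelope numbers for the “uncorrelated” caveat (my arithmetic): with independent failures, unavailability multiplies away, but any shared dependency (same switch, same EBS outage, same bad deploy) puts a hard floor under it.

  ```python
  # Each replica is independently up 99% of the time.
  per_node_availability = 0.99

  # Truly uncorrelated failures: three replicas are only down together when
  # all three fail at once -> ~99.9999% available.
  independent = 1 - (1 - per_node_availability) ** 3

  # If a shared dependency takes all replicas down 1% of the time, redundancy
  # can't help during that 1%: availability is capped around 99%.
  correlated_outage = 0.01
  correlated = (1 - correlated_outage) * independent

  print(f"uncorrelated: {independent:.6f}")  # ~0.999999
  print(f"correlated:   {correlated:.6f}")   # ~0.990
  ```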
  • Sharding is a specific case of a more general pattern: avoiding coordination
  • At scale, ID structure can make or break you
    • Sequential IDs require coordination: can you avoid them?
    • Flake IDs are an alternative to UUIDs (sketch after this list)
    • Can your ID map directly to a shard?
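
    A rough sketch of a Flake-style ID in the Snowflake-ish layout (timestamp | worker id | per-millisecond sequence); the bit widths and epoch here are illustrative assumptions. IDs are roughly time-ordered and need no coordination beyond giving each worker a unique id.

    ```python
    import time

    class FlakeIdGenerator:
        """64-bit-ish IDs: 41 bits of millisecond timestamp, 10 bits of worker id,
        12 bits of per-millisecond sequence. Real implementations also handle the
        clock going backwards and sequence overflow."""
        EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch

        def __init__(self, worker_id):
            assert 0 <= worker_id < 1024
            self.worker_id = worker_id
            self.last_ms = -1
            self.sequence = 0

        def next_id(self):
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
            else:
                self.sequence = 0
                self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

    gen = FlakeIdGenerator(worker_id=7)
    print(gen.next_id(), gen.next_id())  # increasing, roughly time-ordered
    ```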
  • Immutability
    • Data that never changes is trivial to store
    • LSM-trees, Kafka
    • Extremely high availability and durability, tunable write latency
    • Requires GC
    • Mutable identities: Not sure if I understand this correctly, but I think the core idea is to store your actual data in an immutable AP system, and store a semantic layer composed of pointers to the AP store in a CP store.
  • Backpressure: when a service is overloaded, the options are
    • Consume resources and explode
    • Shed load. Start dropping requests.
    • Reject requests. Ignore the work and tell clients it failed.
    • Apply backpressure to clients, asking them to slow down.
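
    A small sketch of the reject-requests option combined with the bounded-queue advice further down (names are illustrative): accept work only while the queue has room, otherwise fail fast so clients learn to back off instead of the server tipping over.

    ```python
    import queue

    work_queue = queue.Queue(maxsize=1000)  # every queue must be bounded

    def handle_request(request):
        """Enqueue the request if there's room; otherwise shed it immediately."""
        try:
            work_queue.put_nowait(request)
            return 202  # accepted for asynchronous processing
        except queue.Full:
            return 429  # reject: tell the client to slow down
    ```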
  • SOA
    • OO approach: each noun is a service (user, video, index services)
    • Functional approach: each verb is a service (auth, search, routing services)
    • Services come with overhead: have as few as possible
    • Try to build trees instead of webs
    • Avoid having outsiders manipulate a service’s data store directly
    • Initially, all your services should be libraries
    • Coordination between services requires special protocols
  • Migrations
    • Migrations are hard, no silver bullet
    • Hard cut: write the new system and a migration to copy old data into it. Turn off the old system, copy the data, start up the new system. Downtime!
    • Incremental: write new system, deploy alongside. Dependent services talk to both. Consistency is an issue. No downtime.
    • Wrapper: create a wrapper service that talks to the old system. All dependents talk to the wrapper. The wrapper talks to the new system as well. Phase out the old system. The wrapper talks only to the new system.
  • Production
    • Developers have to care about production. Ops has to care about implementation. Good communication enables faster diagnosis.
    • Type systems are great for preventing logical errors, but you still need a test suite.
    • Testing distributed systems is much, much harder than testing local ones
    • Jeff Hodges: The worst bug you’ll ever hear is “it’s slow”
      • Because the system is distributed, have to profile multiple nodes
      • Profilers are good at finding CPU problems, but slowness is more often an IO problem
      • Latency variance between nodes doing the same work is an important signal
      • Tail latencies are magnified by fanout workloads (Tail at Scale)
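
      The fanout effect in numbers (my arithmetic, not from the class): if each backend call is slow only 1% of the time, a request that fans out to 100 backends and waits for all of them hits at least one slow call about 63% of the time.

      ```python
      p_slow = 0.01  # probability a single backend call lands in its latency tail
      fanout = 100   # backends touched by one user request

      p_request_slow = 1 - (1 - p_slow) ** fanout
      print(f"{p_request_slow:.2%}")  # ~63% of user requests see tail latency
      ```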
    • Slowness (and outright errors) in prod stem from the interactions between systems. This is hard to test for; you need high-frequency monitoring in production instead:
      • Production behaviors can take place on 1ms scales
      • TCP incast (many senders converge on one receiver, the Ethernet/switch buffers fill up and start dropping packets)
      • Ideally you’d want millisecond-level latency (and maybe millisecond resolution too), but that’s usually cost-prohibitive; back off to 1s or 10s resolution
    • Superpower: distributed tracing infra (Zipkin, Dapper, etc)
    • Logging is less useful at scale. Structured logging is more useful than unstructured logging.
    • Load tests are only useful insofar as the simulated load matches the actual load. Consider dumping production traffic to mimic actual load:
      • Awesome: send a process SIGUSR1 and have it dump the last five minutes of request load
      • Awesome: tcpdump/tcpreplay harnesses for requests
      • Awesome: shadowing live prod traffic to your staging/QA nodes
    • Every queue is a place for things to go horribly, horribly wrong
      • No node has unbounded memory. Your queues must be bounded.
      • But how big? Nobody knows
      • Queues exist to smooth out fluctuations in load

Resources


I skipped (or didn’t understand):

  • Confluence