Notes on Distributed Systems for Young Bloods

https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Almost all of these are now obvious to me in hindsight, but this would’ve been an invaluable read a couple of years ago.

  • Distributed systems are different because they fail often

  • Writing robust distributed systems costs more than writing robust single-machine systems

  • Robust, open source distributed systems are much less common than robust, single-machine systems

  • Coordination is very hard

  • If you can fit your problem in memory, it’s probably trivial

  • “It’s slow” is the hardest problem you’ll ever debug

  • Implement backpressure throughout your system

  • Find ways to be partially available

  • Metrics are the only way to get your job done

  • Use percentiles, not averages
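    A toy illustration of why averages mislead for latency — the numbers are made up, but the shape (mostly fast, a few slow outliers) is typical:

```python
# 100 request latencies in ms: 97 fast requests plus 3 slow outliers.
latencies = [10] * 97 + [500, 800, 1200]

ordered = sorted(latencies)
mean = sum(latencies) / len(latencies)   # 34.7 ms -- looks healthy
p50 = ordered[len(ordered) // 2]         # 10 ms
p99 = ordered[int(len(ordered) * 0.99)]  # 1200 ms -- the real problem
```

    The mean of 34.7 ms hides the fact that one in a hundred requests takes over a second; the p99 surfaces it immediately.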

  • Learn to estimate your capacity

  • Choose id spaces wisely

    • This is probably the only point here that was non-obvious to me. Essentially, how your data is uniquely identified has implications for storage and retrieval.

    Consider version 1 of the Twitter API. All operations to get, create, and delete tweets were done with respect to a single numeric id for each tweet. The tweet id is a simple 64-bit number that is not connected to any other piece of data. As the number of tweets goes up, it becomes clear that user tweet timelines and the timelines of other users’ subscriptions could be constructed efficiently if all of the tweets by the same user were stored on the same machine. But the public API requires every tweet to be addressable by the tweet id alone. To partition tweets by user, a lookup service that knows which user owns which tweet id would have to be built — doable, if necessary, but at a non-trivial cost. An alternative API could have required the user id in any tweet lookup and, initially, simply used the tweet id for storage until user-partitioned storage came online. Another alternative would have embedded the user id in the tweet id itself, at the cost of tweet ids no longer being k-sortable and purely numeric.
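    A minimal sketch of that last alternative — embedding the user id in the tweet id so the owning partition can be derived from the id alone, with no lookup service. The field widths and helper names are illustrative assumptions, not Twitter's actual scheme:

```python
# Hypothetical id layout: high 32 bits = user id, low 32 bits = a
# per-user sequence number. Field widths are illustrative only.
USER_BITS = 32
SEQ_BITS = 32

def make_tweet_id(user_id: int, seq: int) -> int:
    """Compose a 64-bit tweet id from an owner and a per-user sequence."""
    assert 0 <= user_id < (1 << USER_BITS)
    assert 0 <= seq < (1 << SEQ_BITS)
    return (user_id << SEQ_BITS) | seq

def user_of(tweet_id: int) -> int:
    """Recover the owning user from the id alone -- no lookup service."""
    return tweet_id >> SEQ_BITS

def shard_of(tweet_id: int, num_shards: int) -> int:
    """Route any tweet id to its user-partitioned shard."""
    return user_of(tweet_id) % num_shards

tid = make_tweet_id(user_id=42, seq=7)
assert user_of(tid) == 42
assert shard_of(tid, num_shards=16) == 42 % 16
```

    The trade-off the note mentions is visible here: these ids sort by user and then by sequence, not globally by creation time, so they are no longer k-sortable across users.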

  • Exploit data-locality

    Networks have more failures and more latency than pointer dereferences and fread(3).

  • Writing cached data back to persistent storage is bad

  • Computers can do more than you think they can

  • Use the CAP theorem to critique systems

  • Extract services
