6.824 Lecture 3 - GFS


  • Distributed storage (S3, GFS, etc.) is a very useful abstraction
  • Distributed systems are largely either storage systems or systems built on storage systems
  • Tradeoff between consistency & performance; this is sort of a loop:
    • You want a performant system, so you shard your data
    • Sharding across 1000s of machines makes faults essentially certain (something is always failing)
    • You build fault tolerance using replication
    • Replication can potentially cause 2+ inconsistent copies of the data
    • You fix this inconsistency with a more consistent design, which reduces performance
  • Strong consistency: pretend that the system is a single machine & there’s one copy of the data (this is pretty much exactly DDIA’s definition)
  • GFS was explicitly designed to be used in a single data center (lots of synchronous network IO before operations are done)
  • Used internally at Google, not directly available to customers
  • Ideas in this paper aren’t novel, but the fact that this system is used in the real world at a relatively large scale makes it important nonetheless.
  • GFS doesn’t provide strong consistency; this idea was “heretical” at the time in academic circles. Google was able to get away with this because:
    • It controlled all clients, so it wasn’t unreasonable to have the clients do more work (checksumming, being careful about record boundaries)
    • GFS' use cases could tolerate stale results - search, etc.
  • The client application performs IO in terms of files and byte offsets within the entire (potentially very large) file. The GFS client library and the master translate that to a chunked view.
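A minimal sketch of that offset→chunk translation, assuming the paper’s 64 MB chunk size (function and variable names here are illustrative, not the actual GFS client API):

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB chunks, per the GFS paper

// chunkLocation translates a byte offset within a file into the
// (chunk index, offset within chunk) pair the client library works with;
// the client asks the master for that chunk's handle and locations by index.
func chunkLocation(offset int64) (chunkIndex, chunkOffset int64) {
	return offset / chunkSize, offset % chunkSize
}

func main() {
	// A read at byte 150,000,000 lands in chunk 2, about 15 MB into it.
	idx, off := chunkLocation(150_000_000)
	fmt.Printf("chunk index %d, offset within chunk %d\n", idx, off)
}
```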
  • When the master reboots, it must wait (for each chunk) until it knows about at least one chunkserver whose version for that chunk is >= the master’s stored version for that chunk.
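A rough sketch of that recovery rule, with hypothetical types (the real master tracks far more metadata):

```go
package main

import "fmt"

type chunkState struct {
	masterVersion    int64   // version number persisted in the master's metadata
	reportedVersions []int64 // versions reported by chunkservers since the reboot
}

// chunkReady reports whether the master may start serving this chunk:
// it needs at least one replica that is at least as new as its own record.
func chunkReady(c chunkState) bool {
	for _, v := range c.reportedVersions {
		if v >= c.masterVersion {
			return true
		}
	}
	return false // keep waiting for more chunkserver reports
}

func main() {
	fmt.Println(chunkReady(chunkState{masterVersion: 7, reportedVersions: []int64{6, 7}})) // true
	fmt.Println(chunkReady(chunkState{masterVersion: 7, reportedVersions: []int64{6}}))    // false
}
```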
  • When the master can’t reach a primary replica (for a chunk), it needs to wait until the lease expires before it designates a new primary
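The wait avoids two simultaneous primaries (split brain). A sketch of the rule, assuming the paper’s 60-second lease (type and field names are made up):

```go
package main

import (
	"fmt"
	"time"
)

const leaseDuration = 60 * time.Second // GFS leases initially last 60 seconds

type lease struct {
	primary   string
	grantedAt time.Time
}

// canDesignateNewPrimary is true only once the outstanding lease has expired;
// before that, the unreachable primary might still be serving mutations.
func canDesignateNewPrimary(l lease, now time.Time) bool {
	return now.After(l.grantedAt.Add(leaseDuration))
}

func main() {
	l := lease{primary: "chunkserver-3", grantedAt: time.Now().Add(-30 * time.Second)}
	fmt.Println(canDesignateNewPrimary(l, time.Now())) // false: must keep waiting
}
```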
  • View of 3 replicas after a couple of appends (lecture diagram, not reproduced here; the file is drawn upside down):
    • D is a record whose append failed and that the client chose not to retry
    • B is a record whose append failed and that the client then (successfully) retried
    • Reading the entire replica in order with deduplication is not guaranteed to provide a consistent view (because of D’s scenario: D may exist on some replicas but not others)
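One way clients cope with duplicates like B is to embed a unique ID in every record and filter while reading. A minimal sketch with an assumed record format (this drops duplicates, but it cannot recover a record like D that never reached this replica, hence the caveat above):

```go
package main

import "fmt"

type record struct {
	ID   string // unique ID chosen by the writer when appending
	Data string
}

// dedup scans a replica in order and drops records whose ID was already seen
// (e.g. a record that was appended twice because the first attempt failed).
func dedup(records []record) []record {
	seen := make(map[string]bool)
	var out []record
	for _, r := range records {
		if seen[r.ID] {
			continue
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}

func main() {
	replica := []record{{"A", "a"}, {"B", "b"}, {"B", "b"}, {"C", "c"}}
	fmt.Println(dedup(replica)) // [{A a} {B b} {C c}]
}
```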
  • What concerns might be relevant if we were to “upgrade” GFS to a strongly consistent system?
    • Primaries perform deduplication
    • Secondary drift shouldn’t be allowed; secondaries should either stay in sync or be removed from contention if they can’t comply
    • Multiple phases so partial writes aren’t necessarily persisted (2PC); see the sketch after this list
    • A newly designated primary may need to catch up before accepting requests (if replication is async, presumably)
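A minimal sketch of the multi-phase idea from the list above: the primary acts as a coordinator that only commits a write once every replica has voted yes in a prepare phase. This is not GFS’s actual protocol; interfaces and names are hypothetical:

```go
package main

import "fmt"

type replica interface {
	Prepare(data string) bool // stage the write; vote yes/no
	Commit()                  // make the staged write durable/visible
	Abort()                   // discard the staged write
}

// twoPhaseWrite commits the write on all replicas or on none of them.
func twoPhaseWrite(data string, replicas []replica) bool {
	// Phase 1: prepare. Any "no" vote aborts the write everywhere.
	for _, r := range replicas {
		if !r.Prepare(data) {
			for _, r2 := range replicas {
				r2.Abort()
			}
			return false
		}
	}
	// Phase 2: commit. Every replica voted yes, so the write is applied.
	for _, r := range replicas {
		r.Commit()
	}
	return true
}

// memReplica is an in-memory stand-in for a chunkserver replica.
type memReplica struct{ staged, committed string }

func (m *memReplica) Prepare(d string) bool { m.staged = d; return true }
func (m *memReplica) Commit()               { m.committed = m.staged }
func (m *memReplica) Abort()                { m.staged = "" }

func main() {
	rs := []replica{&memReplica{}, &memReplica{}, &memReplica{}}
	fmt.Println(twoPhaseWrite("record-B", rs)) // true: all replicas committed
}
```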
  • Master failover isn’t automatic, and this proved to be a problem