6.824 Lecture 3 - GFS


  • Distributed storage (S3, GFS, etc.) is a very useful abstraction
  • Distributed systems are largely either storage systems or systems built on storage systems
  • Tradeoff between consistency & performance; this is sort of a loop:
    • You want a performant system, so you shard your data
    • Sharding across 1000s of machines makes faults essentially certain (something is always failing)
    • You build fault tolerance using replication
    • Replication can potentially cause 2+ inconsistent copies of the data
    • You fix this inconsistency with a more consistent design, which reduces performance
  • Strong consistency: pretend that the system is a single machine & there’s one copy of the data (this is pretty much exactly DDIA’s definition)
  • GFS was explicitly designed to be used in a single data center (lots of synchronous network IO before operations are done)
  • Used internally at Google, not directly available to customers
  • Ideas in this paper aren’t novel, but the fact that this system is used in the real world at a relatively large scale makes it important nonetheless.
  • GFS doesn’t provide strong consistency; this idea was “heretical” at the time in academic circles. Google was able to get away with this because:
    • It controlled all clients, so it wasn’t unreasonable to have the clients do more work (checksumming, being careful about record boundaries)
    • GFS' use cases could tolerate stale results - search, etc.
  • The client application performs IO in terms of files and byte offsets within the entire (potentially very large) file. The GFS client library and the master translate that to a chunked view.
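A minimal sketch of that offset→chunk translation, assuming the paper’s 64 MB chunk size (function and variable names here are illustrative, not the actual GFS client API):

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB chunks, per the GFS paper

// chunkLocation translates a byte offset within a file into the
// (chunk index, offset within chunk) pair the client library works with;
// the client asks the master for that chunk's handle and locations by index.
func chunkLocation(offset int64) (chunkIndex, chunkOffset int64) {
	return offset / chunkSize, offset % chunkSize
}

func main() {
	// A read at byte 150,000,000 lands in chunk 2, about 15 MB into it.
	idx, off := chunkLocation(150_000_000)
	fmt.Printf("chunk index %d, offset within chunk %d\n", idx, off)
}
```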
  • When the master reboots, it must wait (for each chunk) until it knows about at least one chunkserver whose version for that chunk is >= the master’s stored version for that chunk.
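A rough sketch of that recovery rule, with hypothetical types (the real master tracks far more metadata):

```go
package main

import "fmt"

type chunkState struct {
	masterVersion    int64   // version number persisted in the master's metadata
	reportedVersions []int64 // versions reported by chunkservers since the reboot
}

// chunkReady reports whether the master may start serving this chunk:
// it needs at least one replica that is at least as new as its own record.
func chunkReady(c chunkState) bool {
	for _, v := range c.reportedVersions {
		if v >= c.masterVersion {
			return true
		}
	}
	return false // keep waiting for more chunkserver reports
}

func main() {
	fmt.Println(chunkReady(chunkState{masterVersion: 7, reportedVersions: []int64{6, 7}})) // true
	fmt.Println(chunkReady(chunkState{masterVersion: 7, reportedVersions: []int64{6}}))    // false
}
```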
  • When the master can’t reach a primary replica (for a chunk), it needs to wait until the lease expires before it designates a new primary
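The wait avoids two simultaneous primaries (split brain). A sketch of the rule, assuming the paper’s 60-second lease (type and field names are made up):

```go
package main

import (
	"fmt"
	"time"
)

const leaseDuration = 60 * time.Second // GFS leases initially last 60 seconds

type lease struct {
	primary   string
	grantedAt time.Time
}

// canDesignateNewPrimary is true only once the outstanding lease has expired;
// before that, the unreachable primary might still be serving mutations.
func canDesignateNewPrimary(l lease, now time.Time) bool {
	return now.After(l.grantedAt.Add(leaseDuration))
}

func main() {
	l := lease{primary: "chunkserver-3", grantedAt: time.Now().Add(-30 * time.Second)}
	fmt.Println(canDesignateNewPrimary(l, time.Now())) // false: must keep waiting
}
```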
  • View of 3 replicas after a couple of appends (lecture diagram, not reproduced here; the file is drawn upside down):
    • D is a record whose append failed and that the client chose not to retry
    • B is a record whose append failed and that the client then (successfully) retried
    • Reading the entire replica in order with deduplication is not guaranteed to provide a consistent view (because of D’s scenario: D may exist on some replicas but not others)
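One way clients cope with duplicates like B is to embed a unique ID in every record and filter while reading. A minimal sketch with an assumed record format (this drops duplicates, but it cannot recover a record like D that never reached this replica, hence the caveat above):

```go
package main

import "fmt"

type record struct {
	ID   string // unique ID chosen by the writer when appending
	Data string
}

// dedup scans a replica in order and drops records whose ID was already seen
// (e.g. a record that was appended twice because the first attempt failed).
func dedup(records []record) []record {
	seen := make(map[string]bool)
	var out []record
	for _, r := range records {
		if seen[r.ID] {
			continue
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}

func main() {
	replica := []record{{"A", "a"}, {"B", "b"}, {"B", "b"}, {"C", "c"}}
	fmt.Println(dedup(replica)) // [{A a} {B b} {C c}]
}
```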
  • What concerns might be relevant if we were to “upgrade” GFS to a strongly consistent system?
    • Primaries perform deduplication
    • Secondary drift shouldn’t be allowed; secondaries should either stay in sync or be removed from contention if they can’t comply
    • Multiple phases so partial writes aren’t necessarily persisted (2PC); see the sketch after this list
    • A newly designated primary may need to catch up before accepting requests (if replication is async, presumably)
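A minimal sketch of the multi-phase idea from the list above: the primary acts as a coordinator that only commits a write once every replica has voted yes in a prepare phase. This is not GFS’s actual protocol; interfaces and names are hypothetical:

```go
package main

import "fmt"

type replica interface {
	Prepare(data string) bool // stage the write; vote yes/no
	Commit()                  // make the staged write durable/visible
	Abort()                   // discard the staged write
}

// twoPhaseWrite commits the write on all replicas or on none of them.
func twoPhaseWrite(data string, replicas []replica) bool {
	// Phase 1: prepare. Any "no" vote aborts the write everywhere.
	for _, r := range replicas {
		if !r.Prepare(data) {
			for _, r2 := range replicas {
				r2.Abort()
			}
			return false
		}
	}
	// Phase 2: commit. Every replica voted yes, so the write is applied.
	for _, r := range replicas {
		r.Commit()
	}
	return true
}

// memReplica is an in-memory stand-in for a chunkserver replica.
type memReplica struct{ staged, committed string }

func (m *memReplica) Prepare(d string) bool { m.staged = d; return true }
func (m *memReplica) Commit()               { m.committed = m.staged }
func (m *memReplica) Abort()                { m.staged = "" }

func main() {
	rs := []replica{&memReplica{}, &memReplica{}, &memReplica{}}
	fmt.Println(twoPhaseWrite("record-B", rs)) // true: all replicas committed
}
```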
  • Master failover isn’t automatic, and this proved to be a problem