The Google File System

https://pdos.csail.mit.edu/6.824/papers/gfs.pdf

Notes

  • This is from 2003, one year before the MapReduce paper
  • Largest deployment at Google stores 300+ TB on 1000+ nodes
  • Not used at Google anymore in favor of Colossus, which doesn’t seem to have a (public, anyway) paper.
  • The fact that masters store a single copy of metadata in-memory makes bookkeeping so much easier!
  • The design uses synchronous network IO to maintain consistency; this (presumably) makes GFS unsuitable for use in multi-datacenter deployments:
    • Leases require all replicas to synchronously increment their version numbers first
    • Clients must wait until all replicas have received the data before issuing a write request
    • Primary replicas wait for an ack from all secondaries before reporting a successful write
  • The client (library) needs to do significantly more work compared to something like S3
  • How do chunkservers perform file IO? mmap? Manual direct IO with something like io_submit?

2. Design

  • Designed for large (multi-GB) files that each contain many “objects” instead of a much larger number of small files

    • These objects may be surrounded by junk; failed writes aren’t rolled back, but successful writes return the exact offsets of the written area
  • Optimizes for files that are 100MB or larger; smaller files are supported but probably won’t be performant

  • Assumes that files are primarily (but not exclusively) modified by appending to them

  • Large contiguous/streaming reads are preferred; applications should batch point reads. Relatedly, optimize for throughput over latency.

  • Provides an atomic append operation so multiple clients can write to the same file concurrently

    Our files are often used as producer/consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file.

  • Filesystem operations: create, delete, open, close, read, write, snapshot (copy), record append.

Architecture

Pasted image 20210726134340.png

  • One master, many “chunkservers”, many clients

  • Clients can be colocated with chunkservers if necessary (MapReduce does this to minimize network usage)

  • Files are divided into chunks, each with a unique 64-bit id

  • Each chunk is replicated onto 3 (by default) chunkservers, each of which stores it on a local fs

  • Master maintains fs metadata, including the file→chunk mapping and the locations of each chunk

  • Master also takes care of deleting unused/orphaned chunks and “chunk migration”

  • Master↔chunkserver heartbeat messages: the master periodically contacts each chunkserver to give instructions and collect its state

  • Clients connect to the master for metadata, but connect directly to chunkservers for data operations

  • No GFS-level data caches on clients or chunkservers: most workloads stream through files too large to cache usefully.

    • The Linux page cache on the chunkservers is the only caching layer for data
  • Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge.

  • The client is free to contact any of the replicas a chunk is present on, and usually picks the nearest

  • Default chunk size is 64MB, but isn’t allocated eagerly. So a 4MB GFS file will end up as a 4MB file on disk, even though the chunk is logically 64MB.

    • This is much larger than the typical 4kB used for fs blocks.
    • Larger chunks mean the master needs to be contacted less, and the amount of metadata required for a multi-TB file is a lot smaller.
    • This also means fewer chunks, so the master has to store fewer metadata items, making it easier to store this in memory.
    • Disadvantage: popular files that span only one or a few chunks can turn the chunkservers holding them into hotspots. Google mitigated this by increasing the replication factor for just those files.
  • Clients keep a TCP connection open to a chunkserver if they need to perform more than one operation on a chunk. What protocol over TCP?

Metadata

  • Three types of metadata: file/chunk namespaces, file→chunk mapping, chunk locations

  • The master stores all metadata in-memory.

    • Namespaces and file→chunk mappings are written to a WAL.
    • Chunk locations are retrieved from the chunkservers if the master crashes or otherwise needs to restart, and periodically via heartbeats thereafter.
  • Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers.

  • In practice the master stores under 64 bytes of metadata per chunk (upper bound: ~15.6M chunks per GB of memory) and under 64 bytes of namespace data per file (see the arithmetic after this list).

  • The master doesn’t store chunk locations persistently because it’s too hard (and unnecessary) to keep this data in sync with the actual chunkservers

  • The WAL (or operation log) provides a total order of fs events

    • Each file and chunk’s ID is the position of its respective create operation in this order (possibly hashed)
    • The log is replicated (synchronously, but batched) to multiple remote machines in addition to being written to locally
    • Master recovers from a crash/shutdown by replaying the WAL
    • State is checkpointed to disk as a B-Tree (is this also replicated?) now and then so the amount of WAL to replay is minimized
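
Back-of-the-envelope, assuming 64 B of metadata per chunk and fully used 64 MB chunks:

$$\frac{1\ \text{GB}}{64\ \text{B/chunk}} \approx 15.6 \times 10^{6}\ \text{chunks}, \qquad 15.6 \times 10^{6} \times 64\ \text{MB} \approx 1\ \text{PB}$$

so each GB of master memory indexes on the order of a petabyte of file data.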

Consistency

  • A file region can be inconsistent, consistent, or defined in the context of a single mutation

    • Inconsistent: all clients potentially see different data
    • Consistent: all clients see the same data regardless of replica
    • Defined: consistent + all clients see the specific mutation in its entirety
  • A write causes data to be written at a given offset

  • A record append causes data to be appended at least once at an offset that GFS can choose

  • Guarantees

    • File namespace mutations (creation/deletion/etc.) are atomic
    • Writes
      • Non-concurrent successful writes to a chunk leave the written portion defined
      • Concurrent successful writes to a chunk leave the written portion consistent but not defined
        • This typically ends up as an interleaving of all the writes
        • Is each write atomic in this context? No, because a write specifies a starting offset, so it potentially conflicts with other writes using overlapping offsets. Use record appends instead if you can’t guarantee that you’re the only writer to a given region
      • Failed writes to a chunk leave the written portion (if any) inconsistent
    • Record Append
      • GFS picks an offset and writes to it, and then returns that offset to the client
      • The range $[\text{offset},\ \text{offset} + \text{length of the data})$ is a record
      • The record is defined, but GFS may insert padding or duplicate data around it, which is inconsistent
      • Records that failed to fully write are inconsistent
  • A sequence of successful mutations leaves behind a defined region or record.

    • sequence is key here, I think. Is there a way this could be true for concurrent writes?
  • Mutations are applied to all replicas in the same order. Stale replicas (because the chunkserver went down, for example) are detected, and mutations are never applied to them

  • Application-level considerations (reader-side record handling sketched below)

    Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state.

    • In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents.

    • These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google.
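
A minimal sketch of what that reader-side record handling could look like; the framing (length prefix, CRC32, record id) is my assumption, not anything GFS specifies:

```python
import struct
import zlib

def read_valid_records(blob: bytes):
    """Yield valid, de-duplicated records from a region of appended data.

    Assumed framing per record: 4-byte length, 4-byte CRC32, 8-byte record id,
    then the payload. Padding and partial records fail the CRC and are skipped;
    duplicates (from retried record appends) are dropped via the record id.
    """
    seen, offset = set(), 0
    while offset + 16 <= len(blob):
        length, crc, rec_id = struct.unpack_from(">IIQ", blob, offset)
        payload = blob[offset + 16 : offset + 16 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            offset += 1                      # not a record boundary; resynchronize
            continue
        offset += 16 + length
        if rec_id not in seen:
            seen.add(rec_id)
            yield payload
```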

3. System Interactions

Leases

  • Leases to enforce consistent mutation ordering across replicas; not required for reads
  • The master gives one replica the lease; this becomes the primary.
  • A lease lasts 60 seconds by default, but the primary can request extensions as long as mutations continue to occur.
    • What if the lease expires during a write and the extension is denied?
  • The primary picks an ordering for all mutations to that chunk, not just for a single client; “global mutation order”
  • Leases and extensions piggyback on heartbeat messages
  • Write flow (see the code sketch after this list)
    • Client→master: which chunkserver is the primary for this chunk?
    • Master grants a lease to a replica (if a primary doesn’t already exist)
      • If a lease is granted, the master increments its version number for that chunk in memory, and waits until all replicas have written this to disk.
      • Once that’s done, it writes the new version number to disk.
    • The master responds to the client with the primary and a list of secondaries
    • Client pushes data to all replicas (in any order); the replicas cache this data
      • The client doesn’t need to manually send the data out $N$ times though; each replica can forward data to other replicas.
      • Client sends a write to a replica, that replica sends to another, and so on until all replicas have the write
      • “Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.”
      • This reduces wasted bandwidth; each chunkserver sends out only one copy of the data
    • Once all replicas have acked this data, the client sends a write request to the primary
    • The primary assigns a serial number to each mutation requested, and applies the mutations in that order
    • The primary forwards the write request to all secondaries along with the assigned serial numbers
    • Secondaries apply the mutations in the same order and then ack the write to the primary
      • What happens if the acked data has been evicted from the cache by this point?
    • After the primary receives acks from all secondaries, it reports a successful write to the client
    • Write errors
      • If the write fails at one or more secondaries, the write isn’t rolled back on the other secondaries or the primary.
      • The region is simply considered inconsistent
      • The primary then reports a failed write to the client, which then presumably retries
  • The client needs to break a write up into multiple writes if it spans chunks
  • Record Append flow
    • Same as the write flow with a few small differences
    • When the primary receives a record append, it checks whether appending the record would overflow the chunk.
      • If so, it pads the chunk to the max size and asks the secondaries to do the same
      • why perform padding at all? what if the next append is smaller?
      • Once they’ve acked, it asks the client to retry on the next chunk
      • Max record size is $\frac{1}{4}$ the chunk size
    • In the happy case, the primary tells the secondaries the exact offset at which to perform the append
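
A toy, in-memory sketch of the write flow above (data push, serial assignment at the primary, forwarding to secondaries); all names are mine, and the real system does each step over RPC between machines:

```python
import itertools

CHUNK_SIZE = 1 << 10           # 64 MB in GFS; tiny here so the toy runs instantly
_data_ids = itertools.count()  # stand-in for the identifiers attached to pushed data
_serials = itertools.count()   # serial numbers the primary assigns to order mutations

class Chunkserver:
    """In-memory stand-in for one replica of a single chunk."""
    def __init__(self):
        self.data = bytearray(CHUNK_SIZE)
        self.staged = {}                       # pushed-but-uncommitted data (the cache)

    def push_data(self, data_id, payload):     # client pushes data (chained in reality)
        self.staged[data_id] = payload

    def apply(self, serial, data_id, offset):  # mutation applied in serial-number order
        payload = self.staged.pop(data_id)
        self.data[offset:offset + len(payload)] = payload

def write(primary, secondaries, payload, offset):
    """One mutation: push data everywhere, then commit through the primary."""
    data_id = next(_data_ids)
    for cs in [primary] + secondaries:         # 1. all replicas receive and cache the data
        cs.push_data(data_id, payload)
    serial = next(_serials)                    # 2. primary picks the global mutation order
    primary.apply(serial, data_id, offset)     # 3. ...applies it locally...
    for cs in secondaries:                     # 4. ...and forwards it; success is reported
        cs.apply(serial, data_id, offset)      #    only after every secondary has acked
    return offset                              # a record append would instead return a
                                               # primary-chosen offset (end of chunk)

# replicas = [Chunkserver() for _ in range(3)]
# write(replicas[0], replicas[1:], b"hello", offset=0)
```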

Snapshots

  • Uses copy-on-write so snapshots are generated instantaneously
  • Snapshot flow (see the sketch after this list)
    • Master receives snapshot request
    • It revokes all leases on chunks that are part of the snapshot
    • Writes a record of the snapshot to the WAL
    • Duplicates the namespace metadata to create the new file / directory tree
    • When a client then wants to write to one of the chunks involved, it must query the master for a lease
      • Because the primaries for these chunks will have stopped accepting writes after their leases were revoked
    • The master checks the reference count for the chunk (the number of files referencing it)
    • If the reference count is >1, it asks each chunkserver holding the chunk to create a physical copy of it locally
    • file→chunk metadata is updated (presumably - the paper doesn’t mention this)
    • One of these new chunks then gets the lease and that info is sent to the client
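
A sketch of the master-side bookkeeping for copy-on-write snapshots; the data structures and the clone() call on chunkservers are my assumptions:

```python
class Master:
    """Toy master holding only the state the snapshot path touches."""
    def __init__(self):
        self.files = {}      # path -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files referencing it
        self.leases = {}     # chunk handle -> current primary chunkserver

    def snapshot(self, src, dst):
        for handle in self.files[src]:
            self.leases.pop(handle, None)    # revoke leases so primaries stop accepting writes
            self.refcount[handle] += 1
        # (a WAL record would be logged here) then only the metadata is duplicated
        self.files[dst] = list(self.files[src])

    def grant_lease(self, path, chunk_index, replicas):
        handle = self.files[path][chunk_index]
        if self.refcount[handle] > 1:        # shared chunk: copy before allowing writes
            new_handle = max(self.refcount) + 1
            for cs in replicas:
                cs.clone(handle, new_handle) # each chunkserver copies the chunk locally
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.files[path][chunk_index] = new_handle
            handle = new_handle
        self.leases[handle] = replicas[0]    # the lease (and primary) is on the new chunk
        return handle, replicas[0]
```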

4. Master Operation

Locking

  • State is locked over regions of the namespace so long-running operations don’t block other operations
  • The namespace is like S3's: no explicit directories, just a map from full pathname to file or directory metadata
    • The actual representation in memory is prefix-compressed
  • Each node in the tree has a RW lock (multiple readers, but writes are exclusive), allocated lazily
  • Each metadata operation acquires read locks on every node in the path except the final one, as well as a read or write lock on the final node itself
  • This allows multiple writes to occur in the same directory
  • Locks are acquired in a deterministic order to avoid deadlock:

    locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level
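
A sketch of that acquisition rule (read locks on every ancestor, a read or write lock on the leaf, ordered by depth then name); the helper is mine:

```python
def locks_needed(op_path: str, write_leaf: bool = True):
    """Namespace locks a metadata operation on op_path must take, in order:
    read locks on every ancestor, a read or write lock on the leaf itself,
    sorted by tree depth then lexicographically so all operations agree."""
    parts = op_path.strip("/").split("/")
    paths = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    needed = [(p, "write" if (p == paths[-1] and write_leaf) else "read") for p in paths]
    return sorted(needed, key=lambda pl: (pl[0].count("/"), pl[0]))

# locks_needed("/home/user/foo")
# -> [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```

Two concurrent file creations in /home/user both take only read locks on /home/user, which is why they don't serialize on the directory.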

Chunk Replica Creation

  • The master decides where to place new chunk replicas (scoring sketched after this list) based on:
    1. Disk utilization
    2. Few recent replica creations on that chunkserver
      • A new chunk is almost always followed by write traffic to populate that chunk
      • This helps spread out the write load
    3. Spread across racks
      • Improves availability if an entire rack fails and lets reads exploit the aggregate bandwidth of multiple racks (writes pay a cross-rack cost)
  • The master re-replicates a chunk when it becomes underreplicated:
    • If many re-replications are pending, prioritize based on:
      • Number of missing replicas
      • Whether the chunk is “live” or has been deleted
      • Is a client blocked waiting for this chunk to be replicated?
    • The master picks a chunk and instructs a chunkserver to copy it directly from an existing valid replica
    • Are all ongoing writes to that chunk blocked while this is happening? If not, how does the new chunk catch up?
  • The master rebalances replicas now and again:
    • Improves disk utilization and IO bandwidth
    • A new chunkserver gradually has replicas rebalanced onto it, so it isn’t swamped with heavy write load
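
A sketch of how the master might score both decisions; the field names and ordering of the criteria are made up, only the criteria themselves come from the paper:

```python
def placement_key(cs, racks_already_used):
    """Lower sorts first: below-average disk use, few recent creations, a new rack."""
    return (cs["disk_utilization"],
            cs["recent_creations"],
            1 if cs["rack"] in racks_already_used else 0)

def choose_replicas(chunkservers, n=3):
    """Greedily pick n chunkservers to hold a new chunk's replicas."""
    chosen, racks = [], set()
    for _ in range(n):
        best = min((cs for cs in chunkservers if cs not in chosen),
                   key=lambda cs: placement_key(cs, racks))
        chosen.append(best)
        racks.add(best["rack"])
    return chosen

def rereplication_priority(chunk):
    """Higher sorts first when the master works through pending re-replications."""
    return (chunk["goal_replicas"] - chunk["live_replicas"],  # furthest below its goal
            chunk["file_is_live"],                            # live files before deleted ones
            chunk["blocking_a_client"])                       # a client is stuck waiting on it
```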

Garbage Collection

  • Files are deleted by editing the metadata namespace; the actual replicas are deleted lazily (see the sketch after this list)
  • When a delete is logged, the file is renamed to a hidden name that includes the deletion time
  • The master periodically scans the namespace and removes hidden files deleted more than 3 days ago (configurable); their chunks become orphaned and are erased by the chunkservers later
  • GC is expedited when a file is deleted soon after it was created (temporary files)
  • Clients / users can apply different GC strategies to different parts of the namespace
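
A sketch of the lazy-delete path; the hidden-name format and the flat namespace dict are mine:

```python
import time

GRACE_SECONDS = 3 * 24 * 3600   # the 3-day default; configurable per part of the namespace

def delete_file(namespace, path):
    """Deletion is metadata-only: rename to a hidden name carrying the deletion time."""
    namespace[f"{path}.__deleted__{int(time.time())}"] = namespace.pop(path)

def gc_scan(namespace, now=None):
    """Periodic master scan: drop hidden files past the grace period; their chunks
    become orphaned and are erased by chunkservers during later heartbeat exchanges."""
    now = now or time.time()
    for name in list(namespace):
        if "__deleted__" in name:
            deleted_at = int(name.rsplit("__deleted__", 1)[1])
            if now - deleted_at > GRACE_SECONDS:
                del namespace[name]
```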

Stale Replicas

  • Master keeps track of a version number for each chunk, and of the version each replica holds (sketched after this list)
  • The version number is incremented every time a new lease on that chunk is granted
  • The new version number must be written to persistent storage on the master and all replicas before the lease is returned to the client
  • If a replica falls behind (node restart, etc.), its version number will be out of date, and the master pretends it doesn’t exist (and eventually GCs it)
  • What happens if a chunk server falls behind after a lease is granted? Is that client SOL until the lease expires?
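
A sketch of the version bookkeeping; the structures and persist_version() calls are assumptions:

```python
def grant_lease(master, chunk, up_to_date_replicas):
    """Bump the chunk version on every new lease; the master and all up-to-date
    replicas persist it before any client is told who the primary is."""
    chunk["version"] += 1
    for replica in up_to_date_replicas:
        replica.persist_version(chunk["id"], chunk["version"])
    master.persist_version(chunk["id"], chunk["version"])
    chunk["primary"] = up_to_date_replicas[0]

def fresh_replicas(chunk, reported):
    """Replicas reporting an older version missed mutations while down; the
    master treats them as nonexistent and garbage-collects them later."""
    return [r for r in reported if r["version"] == chunk["version"]]
```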

5. Fault Tolerance

  • All nodes can load their state + start in seconds; no difference between crash-start and stop-start
  • Master data (checkpoint + WAL) is replicated onto multiple machines, but these other machines aren’t running master processes
  • If the master crashes, an external monitor starts a new master process on one of these other machines
    • All clients + chunkservers use a DNS alias for master comms, and it's repointed during this step
    • Split brain? What if the external monitor makes a mistake?
  • Shadow masters are master replicas that do run master processes, but only respond to read requests
    • Metadata from these masters can be stale, so clients should only use this for seldom-mutated data
    • Note that file data can’t be stale because they live on the chunkservers
  • Data on-disk is checksummed to guard against corruption (verification sketched after this list)
    • Chunks are broken into 64KB blocks; each gets a 32-bit checksum
    • Checksums are verified before responding to a read request
    • In the event of an invalid checksum, the client is instructed to read from a different replica, and the master is notified so it can create a new replica
  • RPC logs can recreate/trace the entire distributed execution to diagnose issues
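
A sketch of the read-side verification on a chunkserver; the CRC32 choice and in-memory layout are assumptions, only the 64 KB / 32-bit figures are from the paper:

```python
import zlib

BLOCK = 64 * 1024   # each chunk is checksummed in 64 KB blocks

def checksums_for(chunk_data: bytes):
    """One 32-bit checksum per 64 KB block, kept in memory by the chunkserver."""
    return [zlib.crc32(chunk_data[i:i + BLOCK]) for i in range(0, len(chunk_data), BLOCK)]

def read(chunk_data, checksums, offset, length):
    """Verify every block the read overlaps before returning any data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            # Real GFS returns an error so the client reads another replica and
            # the master is told so it can re-replicate the chunk.
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]
```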

6. Benchmarks

Pasted image 20210726233808.png

I skipped/skimmed most of this section

7. Experiences

Some of our biggest problems were disk and Linux related. Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked, but occasionally the mismatches would cause the drive and the kernel to disagree about the drive’s state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption, while concurrently we modified the kernel to handle these protocol mismatches.

Earlier we had some problems with Linux 2.2 kernels due to the cost of fsync(). Its cost is proportional to the size of the file rather than the size of the modified portion. This was a problem for our large operation logs especially before we implemented checkpointing. We worked around this for a time by using synchronous writes and eventually migrated to Linux 2.4.

Despite occasional problems, the availability of Linux code has helped us time and again to explore and understand system behavior. When appropriate, we improve the kernel and share the changes with the open source community.
