How fast are Linux pipes anyway?

https://mazzo.li/posts/fast-pipes.html

  • Inspired by this FizzBuzz implementation that hits 40GB/s by avoiding main memory as much as possible
  • Say we write a simpler pair of programs: a writer that writes 256KB blocks to stdout forever, and a reader that reads 10GB worth of these blocks from stdin (a minimal sketch of both follows these notes).
  • Use the vmsplice syscall instead of plain write: it points the pipe's ring buffer entries directly at userspace pages, so no pipe pages are allocated and no data is copied.
    • Write 128KB blocks this time, and set the pipe size (via F_SETPIPE_SZ) to 128KB (the default is 16 pages, i.e. 64KB with 4KB pages)
    • Use a pair of 128KB buffers: write into one while vmsplice-ing the other into the stdout pipe, then swap (see the double-buffered writer sketch after these notes)
    • Because the pipe is full after each splice, the next vmsplice blocks until the reader has fully consumed the previous buffer, so a buffer is never overwritten while the pipe still references it
      • Is this sufficient if the reader is using splice as well? Shouldn’t the writer wait until the reader is done processing the buffer? If not, this scenario seems possible:
        • Writer writes buffer, splices it into the pipe, reader reads buffer off the pipe
        • Writer writes new data to the buffer, reader is flushing the buffer to stdout concurrently
    • With vmsplice on the write end but plain read on the read end, throughput jumps to ~12GB/s
    • And with splice on the read end, another jump to ~32GB/s (a splice-based reader sketch follows these notes)
  • Next, the writer is spending 17% of its time in iov_iter_get_pages
    • Which converts virtual page(-range)s to struct page entries representing physical pages
    • For each virtual page, it walks the page table in software (4 levels, or apparently 5 as of Ice Lake) to find the corresponding struct page, if one exists
    • If one doesn’t exist yet (e.g. due to overcommit), it either allocates a physical page and returns it, or returns a struct page that isn’t backed by physical memory
      • As far as I can tell this choice is made based on the flags passed in
      • But either one requires holding mmap_lock, so this is the slow path
    • Use 2MB hugepages (so a 128KB buffer spans a single page instead of 32 4KB pages) for a 50% improvement (50GB/s); see the hugepage allocation sketch after these notes
      • I’m assuming this is because the in-software traversal of the page table can’t use the TLB
      • Is there no instruction that can resolve a virt->physical address in hardware? LEA?
    • Does this use a single page for each 128KB buffer or for both together?
  • Busy-loop retrying vmsplice (instead of blocking) when the pipe is full, for another 25% improvement (62GB/s); a sketch of the busy loop follows below
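
A minimal sketch of the plain writer/reader pair described above, assuming the 256KB block size and 10GB total from the notes (not the post's exact code):

```c
/* naive-writer.c -- write 256KB blocks of 'X' to stdout forever */
#include <string.h>
#include <unistd.h>

#define BLOCK (256 * 1024)

int main(void) {
    static char buf[BLOCK];
    memset(buf, 'X', sizeof buf);
    for (;;) {
        /* Dies on SIGPIPE (or sees EPIPE) once the reader exits. */
        if (write(STDOUT_FILENO, buf, sizeof buf) < 0)
            return 1;
    }
}
```

```c
/* naive-reader.c -- read 10GB from stdin, then exit */
#include <unistd.h>

#define BLOCK (256 * 1024)

int main(void) {
    static char buf[BLOCK];
    unsigned long long left = 10ull * 1024 * 1024 * 1024;
    while (left > 0) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n <= 0)
            return 1;
        left -= (unsigned long long)n;
    }
    return 0;
}
```

Timing `./naive-writer | ./naive-reader` gives the baseline pipe throughput that the later tricks improve on.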
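A sketch of the double-buffered vmsplice writer (128KB buffers, pipe resized to match). This is my reconstruction of the technique, not the post's code, and error handling is minimal:

```c
/* vmsplice-writer.c -- double-buffered vmsplice into stdout */
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice, F_SETPIPE_SZ */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

#define BUF_SIZE (128 * 1024)

/* Splice one full buffer into the pipe, handling short vmsplices. */
static void splice_all(int fd, char *buf, size_t len) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, 0);
        if (n < 0) { perror("vmsplice"); exit(1); }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
}

int main(void) {
    /* Make the pipe exactly one buffer big: each vmsplice fills it, so the
     * next vmsplice blocks until the reader has drained the previous one. */
    if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE) < 0) {
        perror("F_SETPIPE_SZ");
        return 1;
    }

    static char bufs[2][BUF_SIZE];
    memset(bufs, 'X', sizeof bufs);

    /* Alternate buffers: while one sits in the pipe, the other is free to be
     * refilled (a real producer would write fresh data into bufs[cur] here). */
    for (int cur = 0;; cur = !cur)
        splice_all(STDOUT_FILENO, bufs[cur], BUF_SIZE);
}
```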
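On the read side, splicing the pipe's contents into /dev/null is one way to drain it without copying into userspace; the post's reader may differ in the details:

```c
/* splice-reader.c -- drain 10GB from stdin by splicing it into /dev/null */
#define _GNU_SOURCE
#include <fcntl.h>   /* splice, open */
#include <stdio.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

int main(void) {
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0) { perror("open /dev/null"); return 1; }

    unsigned long long left = 10ull * 1024 * 1024 * 1024;  /* 10GB total */
    while (left > 0) {
        /* Move data from the stdin pipe straight to /dev/null; the pages are
         * never copied into this process's address space. */
        ssize_t n = splice(STDIN_FILENO, NULL, devnull, NULL, CHUNK, 0);
        if (n <= 0) { perror("splice"); return 1; }
        left -= (unsigned long long)n;
    }
    return 0;
}
```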
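One way to back the splice buffers with a 2MB huge page, using MAP_HUGETLB (madvise(MADV_HUGEPAGE) on a normal mapping is the transparent-huge-page alternative). This is an assumption about the setup, not the post's code; in this sketch both 128KB buffers share one huge page, though the post may allocate them separately:

```c
/* hugepage-buffers.c -- place both 128KB splice buffers in one 2MB page */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE (2 * 1024 * 1024)
#define BUF_SIZE  (128 * 1024)

int main(void) {
    /* MAP_HUGETLB asks for an explicitly reserved 2MB page
     * (requires vm.nr_hugepages > 0 on the host). */
    char *region = mmap(NULL, HUGE_PAGE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (region == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    /* Both buffers live in the same huge page, so the whole double-buffer is
     * covered by a single page-table entry. */
    char *buf0 = region;
    char *buf1 = region + BUF_SIZE;
    (void)buf0; (void)buf1;  /* hand these to the vmsplice loop */
    return 0;
}
```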
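Finally, a sketch of busy-looping on a non-blocking vmsplice instead of letting it block; splice_all_busy is a drop-in replacement for splice_all in the writer sketch above, and the SPLICE_F_NONBLOCK/EAGAIN handling is my reading of "spin when the pipe is busy":

```c
/* busy-vmsplice-writer.c -- spin instead of blocking when the pipe is full */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUF_SIZE (128 * 1024)

/* Splice one full buffer; when the pipe is full, retry in a tight loop
 * rather than sleeping in the kernel. */
static void splice_all_busy(int fd, char *buf, size_t len) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, SPLICE_F_NONBLOCK);
        if (n < 0) {
            if (errno == EAGAIN)
                continue;  /* pipe still full: burn CPU instead of blocking */
            perror("vmsplice");
            exit(1);
        }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
}

int main(void) {
    if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE) < 0) {
        perror("F_SETPIPE_SZ");
        return 1;
    }
    static char bufs[2][BUF_SIZE];
    memset(bufs, 'X', sizeof bufs);
    for (int cur = 0;; cur = !cur)
        splice_all_busy(STDOUT_FILENO, bufs[cur], BUF_SIZE);
}
```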