How fast are Linux pipes anyway?

https://mazzo.li/posts/fast-pipes.html

  • Inspired by this FizzBuzz implementation that hits 40GB/s by avoiding main memory as much as possible
  • Say we write a simpler pair of programs: a writer that writes 256KB blocks to stdout forever, and a reader that reads 10GB worth of these blocks from stdin (a minimal sketch of both follows these notes).
  • Use the vmsplice syscall instead of plain write: it points the pipe's ring buffer entries directly at userspace pages, so no pipe pages are allocated and no data is copied.
    • Write 128KB blocks this time, and set the pipe size (via F_SETPIPE_SZ) to 128KB (the default is 16 pages, i.e. 64KB with 4KB pages)
    • Use a pair of 128KB buffers: write into one while vmsplice-ing the other into the stdout pipe, then swap (see the double-buffered writer sketch after these notes)
    • Because the pipe is full after each splice, the next vmsplice blocks until the reader has fully consumed the previous buffer, so a buffer is never overwritten while the pipe still references it
      • Is this sufficient if the reader is using splice as well? Shouldn’t the writer wait until the reader is done processing the buffer? If not, this scenario seems possible:
        • Writer writes buffer, splices it into the pipe, reader reads buffer off the pipe
        • Writer writes new data to the buffer, reader is flushing the buffer to stdout concurrently
    • With vmsplice on the write end but plain read on the read end, throughput jumps to ~12GB/s
    • And with splice on the read end, another jump to ~32GB/s (a splice-based reader sketch follows these notes)
  • Next, the writer is spending 17% of its time in iov_iter_get_pages
    • Which converts virtual page(-range)s to struct page entries representing physical pages
    • For each virtual page, it walks the page table in software (4 levels, or apparently 5 as of Ice Lake) to find the corresponding struct page, if one exists
    • If one doesn’t exist yet (e.g. due to overcommit), it either allocates a physical page and returns it, or returns a struct page that isn’t backed by physical memory
      • As far as I can tell this choice is made based on the flags passed in
      • But either one requires holding mmap_lock, so this is the slow path
    • Use 2MB hugepages (so a 128KB buffer spans a single page instead of 32 4KB pages) for a 50% improvement (50GB/s); see the hugepage allocation sketch after these notes
      • I’m assuming this is because the in-software traversal of the page table can’t use the TLB
      • Is there no instruction that can resolve a virt->physical address in hardware? LEA?
    • Does this use a single page for each 128KB buffer or for both together?
  • Busy-loop retrying vmsplice (instead of blocking) when the pipe is full, for another 25% improvement (62GB/s); a sketch of the busy loop follows below
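
A minimal sketch of the plain writer/reader pair described above, assuming the 256KB block size and 10GB total from the notes (not the post's exact code):

```c
/* naive-writer.c -- write 256KB blocks of 'X' to stdout forever */
#include <string.h>
#include <unistd.h>

#define BLOCK (256 * 1024)

int main(void) {
    static char buf[BLOCK];
    memset(buf, 'X', sizeof buf);
    for (;;) {
        /* Dies on SIGPIPE (or sees EPIPE) once the reader exits. */
        if (write(STDOUT_FILENO, buf, sizeof buf) < 0)
            return 1;
    }
}
```

```c
/* naive-reader.c -- read 10GB from stdin, then exit */
#include <unistd.h>

#define BLOCK (256 * 1024)

int main(void) {
    static char buf[BLOCK];
    unsigned long long left = 10ull * 1024 * 1024 * 1024;
    while (left > 0) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n <= 0)
            return 1;
        left -= (unsigned long long)n;
    }
    return 0;
}
```

Timing `./naive-writer | ./naive-reader` gives the baseline pipe throughput that the later tricks improve on.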
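A sketch of the double-buffered vmsplice writer (128KB buffers, pipe resized to match). This is my reconstruction of the technique, not the post's code, and error handling is minimal:

```c
/* vmsplice-writer.c -- double-buffered vmsplice into stdout */
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice, F_SETPIPE_SZ */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

#define BUF_SIZE (128 * 1024)

/* Splice one full buffer into the pipe, handling short vmsplices. */
static void splice_all(int fd, char *buf, size_t len) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, 0);
        if (n < 0) { perror("vmsplice"); exit(1); }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
}

int main(void) {
    /* Make the pipe exactly one buffer big: each vmsplice fills it, so the
     * next vmsplice blocks until the reader has drained the previous one. */
    if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE) < 0) {
        perror("F_SETPIPE_SZ");
        return 1;
    }

    static char bufs[2][BUF_SIZE];
    memset(bufs, 'X', sizeof bufs);

    /* Alternate buffers: while one sits in the pipe, the other is free to be
     * refilled (a real producer would write fresh data into bufs[cur] here). */
    for (int cur = 0;; cur = !cur)
        splice_all(STDOUT_FILENO, bufs[cur], BUF_SIZE);
}
```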
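On the read side, splicing the pipe's contents into /dev/null is one way to drain it without copying into userspace; the post's reader may differ in the details:

```c
/* splice-reader.c -- drain 10GB from stdin by splicing it into /dev/null */
#define _GNU_SOURCE
#include <fcntl.h>   /* splice, open */
#include <stdio.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

int main(void) {
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0) { perror("open /dev/null"); return 1; }

    unsigned long long left = 10ull * 1024 * 1024 * 1024;  /* 10GB total */
    while (left > 0) {
        /* Move data from the stdin pipe straight to /dev/null; the pages are
         * never copied into this process's address space. */
        ssize_t n = splice(STDIN_FILENO, NULL, devnull, NULL, CHUNK, 0);
        if (n <= 0) { perror("splice"); return 1; }
        left -= (unsigned long long)n;
    }
    return 0;
}
```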
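One way to back the splice buffers with a 2MB huge page, using MAP_HUGETLB (madvise(MADV_HUGEPAGE) on a normal mapping is the transparent-huge-page alternative). This is an assumption about the setup, not the post's code; in this sketch both 128KB buffers share one huge page, though the post may allocate them separately:

```c
/* hugepage-buffers.c -- place both 128KB splice buffers in one 2MB page */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE (2 * 1024 * 1024)
#define BUF_SIZE  (128 * 1024)

int main(void) {
    /* MAP_HUGETLB asks for an explicitly reserved 2MB page
     * (requires vm.nr_hugepages > 0 on the host). */
    char *region = mmap(NULL, HUGE_PAGE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (region == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    /* Both buffers live in the same huge page, so the whole double-buffer is
     * covered by a single page-table entry. */
    char *buf0 = region;
    char *buf1 = region + BUF_SIZE;
    (void)buf0; (void)buf1;  /* hand these to the vmsplice loop */
    return 0;
}
```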
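Finally, a sketch of busy-looping on a non-blocking vmsplice instead of letting it block; splice_all_busy is a drop-in replacement for splice_all in the writer sketch above, and the SPLICE_F_NONBLOCK/EAGAIN handling is my reading of "spin when the pipe is busy":

```c
/* busy-vmsplice-writer.c -- spin instead of blocking when the pipe is full */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUF_SIZE (128 * 1024)

/* Splice one full buffer; when the pipe is full, retry in a tight loop
 * rather than sleeping in the kernel. */
static void splice_all_busy(int fd, char *buf, size_t len) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, SPLICE_F_NONBLOCK);
        if (n < 0) {
            if (errno == EAGAIN)
                continue;  /* pipe still full: burn CPU instead of blocking */
            perror("vmsplice");
            exit(1);
        }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
}

int main(void) {
    if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE) < 0) {
        perror("F_SETPIPE_SZ");
        return 1;
    }
    static char bufs[2][BUF_SIZE];
    memset(bufs, 'X', sizeof bufs);
    for (int cur = 0;; cur = !cur)
        splice_all_busy(STDOUT_FILENO, bufs[cur], BUF_SIZE);
}
```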