How fast are Linux pipes anyway?
https://mazzo.li/posts/fast-pipes.html
- Inspired by this FizzBuzz implementation that hits 40GB/s by avoiding main memory as much as possible
- Say we write a simpler benchmark: a writer program that writes 256KB blocks to stdout forever, and a reader program that reads 10GB worth of these blocks from stdin
- The basic implementation using the `write` and `read` syscalls only hits ~4GB/s (see the sketch after this list)
  - Pipes are implemented as ring buffers in kernel space, and `write` works this way:
    - If the pipe is already full, wait for space and restart
    - If the buffer currently pointed at by `head` has space, fill that space first
    - While there are free slots and there are remaining bytes to write, allocate new pages and fill them, updating `head`
  - Each page has to be copied from userspace to kernelspace and back again, and every write (potentially) has to wait to allocate new pages, which may not be contiguous in physical memory
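- For reference, a minimal sketch of the `write`-based writer (my illustration, not the post's exact code); the reader is the mirror image, `read`-ing until it has seen 10GB:

  ```c
  /* Naive writer: write the same 256KB buffer to stdout forever with plain
   * write(2). The 'X' payload is arbitrary; contents don't matter here. */
  #include <string.h>
  #include <unistd.h>

  #define BUF_SIZE (256 * 1024)

  int main(void) {
      static char buf[BUF_SIZE];
      memset(buf, 'X', sizeof(buf));

      for (;;) {
          size_t off = 0;
          while (off < sizeof(buf)) {  /* write(2) may return a short count */
              ssize_t n = write(STDOUT_FILENO, buf + off, sizeof(buf) - off);
              if (n < 0)
                  return 1;
              off += (size_t)n;
          }
      }
  }
  ```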
- Use the `vmsplice` syscall instead, which directly points ring buffer entries at userspace pages, which means no page allocations or copying (sketched after this list)
  - Write 128KB blocks this time, and set the pipe size (via `F_SETPIPE_SZ`) to 128KB (the default is 16 pages)
  - Use a pair of 128KB buffers: write to one while vmsplicing the other into the stdout pipe, and vice versa
    - This ensures that a splice waits for the previous splice to be fully consumed by the reader before reusing the buffer, because the pipe is full after each splice
    - Is this sufficient if the reader is using splice as well? Shouldn't the writer wait until the reader is done processing the buffer? If not, this scenario seems possible:
      - Writer writes a buffer, splices it into the pipe, reader reads the buffer off the pipe
      - Writer writes new data into the buffer while the reader is still flushing it to stdout
  - With `vmsplice` on the write end but `read` on the read end, throughput jumps to ~12GB/s
  - And with `splice` on the read end as well, another jump to ~32GB/s
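- A sketch of the double-buffered `vmsplice` writer described above (my reconstruction from the notes, not the post's code); the reader side would `splice` from stdin to its output instead of calling `read`:

  ```c
  /* Assumptions (mine, per the notes): stdout is a pipe, and the pipe is
   * resized to exactly one buffer, so reusing a buffer only happens after the
   * previously spliced buffer has been drained by the reader. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #define BUF_SIZE (128 * 1024)

  int main(void) {
      /* Grow the pipe from the default 16 pages (64KB) to one buffer's worth. */
      if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, BUF_SIZE) < 0)
          return 1;

      static char bufs[2][BUF_SIZE];
      int cur = 0;

      for (;;) {
          /* "Produce" into the current buffer while the other one may still be
           * sitting in the pipe. */
          memset(bufs[cur], 'X', BUF_SIZE);

          /* Point the pipe's ring buffer entries at these pages; no copy.
           * A short splice is possible, hence the loop. Since the pipe holds
           * exactly one buffer, this can only complete once the previously
           * spliced buffer has been consumed (per the reasoning above; see the
           * open question about splice on the read side). */
          struct iovec iov = { .iov_base = bufs[cur], .iov_len = BUF_SIZE };
          while (iov.iov_len > 0) {
              ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
              if (n < 0)
                  return 1;
              iov.iov_base = (char *)iov.iov_base + n;
              iov.iov_len -= (size_t)n;
          }

          cur = 1 - cur;  /* alternate buffers */
      }
  }
  ```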
- Next, the write program is spending 17% of its time in `iov_iter_get_pages`
  - Which converts virtual page(-range)s to `struct page` entries representing physical pages
  - For each virtual page it walks (in software) the 4-level (5-level as of Ice Lake, apparently) page table to find a `struct page` if one exists
  - If one doesn't exist (overcommit), it either allocates a physical page and returns it, or returns a `struct page` that isn't backed by physical memory
    - As far as I can tell this choice is made based on the flags passed in
  - But either one requires holding `mmap_lock`, so this is the slow path
  - Use 2MB hugepages (a 128KB buffer then occupies a single page instead of 32) for a 50% improvement (50GB/s); see the sketch below
    - I'm assuming this is because the in-software traversal of the page table can't use the TLB
    - Is there no instruction that can resolve a virtual-to-physical address in hardware? LEA?
    - Does this use a single page for each 128KB buffer or for both together?
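- One way to get hugepage-backed buffers (illustrative; assumes hugepages have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages, and not necessarily how the post allocates them):

  ```c
  /* Back the two splice buffers with 2MB hugepages via mmap(MAP_HUGETLB). */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  #define HUGE_PAGE_SIZE (2 * 1024 * 1024)

  static void *alloc_hugepage(void) {
      void *p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      return p == MAP_FAILED ? NULL : p;
  }

  int main(void) {
      /* One 2MB hugepage per buffer in this sketch: each 128KB block then sits
       * in a single page, so iov_iter_get_pages has 1 page to look up per
       * buffer instead of 32. */
      void *bufs[2] = { alloc_hugepage(), alloc_hugepage() };
      if (!bufs[0] || !bufs[1]) {
          perror("mmap(MAP_HUGETLB)");
          return 1;
      }
      /* ... fill and vmsplice bufs[0]/bufs[1] as in the earlier sketch ... */
      return 0;
  }
  ```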
- Spin against `vmsplice` (instead of waiting async) when the pipe is busy, for another 25% improvement (62GB/s)
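- A sketch of the spinning variant, as a drop-in replacement for the splice loop in the earlier sketch: instead of letting `vmsplice` block while the pipe is full, retry in a tight loop with `SPLICE_F_NONBLOCK` (my interpretation of "spin against vmsplice"; the post's exact structure may differ):

  ```c
  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>
  #include <sys/uio.h>

  /* Splice an entire buffer into the pipe, busy-waiting while it is full. */
  static int spin_vmsplice_all(int fd, char *buf, size_t len) {
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      while (iov.iov_len > 0) {
          ssize_t n = vmsplice(fd, &iov, 1, SPLICE_F_NONBLOCK);
          if (n < 0) {
              if (errno == EAGAIN)
                  continue;  /* pipe still full: burn CPU instead of sleeping */
              return -1;
          }
          iov.iov_base = (char *)iov.iov_base + n;
          iov.iov_len -= (size_t)n;
      }
      return 0;
  }
  ```

- The trade-off is a fully busy core on the write side in exchange for never sleeping and re-waking between blocks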