TCP/IP Illustrated, Vol 1
1: Introduction
- First edition, 1994
- During the 1990s we have come to realize that this new, bigger island consisting of a single network doesn’t make sense either. People are combining multiple networks together into an internetwork, or an internet. An internet is a collection of networks that all use the same protocol suite.
- The application and transport layers are end-to-end in that they’re untouched by intermediate systems. The network layer and below are the opposite: every router rewrites headers at those layers
- Bridges connect networks at the link layer; routers connect networks at the network layer
- IP (v4) address classes:
- A: 0.0.0.0 -> 127.255.255.255
- B: 128.0.0.0 -> 191.255.255.255
- C: 192.0.0.0 -> 223.255.255.255
- D: 224.0.0.0 -> 239.255.255.255
- E: 240.0.0.0 -> 255.255.255.255
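The ranges above depend only on the first octet, which a small Go sketch can make concrete. The function name `addressClass` and its string return values are made up for illustration; classful addressing itself is historical.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addressClass returns the historical class (A-E) of a dotted-decimal IPv4
// address, decided purely by the value of its first octet.
// The name and return values are illustrative, not from the book.
func addressClass(s string) string {
	addr, err := netip.ParseAddr(s)
	if err != nil || !addr.Is4() {
		return "invalid"
	}
	first := addr.As4()[0]
	switch {
	case first < 128: // leading bit 0
		return "A"
	case first < 192: // leading bits 10
		return "B"
	case first < 224: // leading bits 110
		return "C"
	case first < 240: // leading bits 1110 (multicast)
		return "D"
	default: // leading bits 1111
		return "E"
	}
}

func main() {
	for _, s := range []string{"10.1.2.3", "172.16.0.1", "192.168.1.26", "224.0.0.1", "255.255.255.255"} {
		fmt.Printf("%-16s class %s\n", s, addressClass(s))
	}
}
```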
- Three types of IP addrs: unicast, broadcast, multicast (anycast?)
- Packet encapsulation
- Headers typically include a field denoting the (higher-level) protocol that their payload carries
- TCP servers are typically concurrent, UDP servers are typically iterative, because it rarely makes sense to use concurrent “connections” for a connectionless protocol
- Well known port numbers are listed in /etc/services
2: Link Layer
- Hardware (MAC) addresses are typically 48 bits long
- The ARP/RARP protocols map between hardware and IP addresses
- The loopback interface allows a client and server on the same host to communicate with each other using TCP/IP. Most implementations don’t short-circuit the TCP layer when going over the loopback interface. IP packets are prepared and sent out, but no lower layers are involved.
- The network layer (IP) fragments data into multiple packets because lower levels impose a hard limit (MTU) on frame size
- Different networks can have different MTUs, so fragmentation can occur at any hop, not just at the source (essentially anytime the IP header is rewritten to use a new destination IP, the payload could also be fragmented to accommodate a smaller MTU)
- Are fragmented packets ever reassembled? Any ordering guarantees between the fragmented packets?
3: IP: Internet Protocol
- Connectionless, unreliable datagram service
- No ordering guarantees
- Big endian ordering regardless of the endianness of the machines involved (network byte order)
- The TOS field allowed specifying one of these optimizations: minimize delay, maximize throughput, maximize reliability, minimize cost
  - But is now (as of 1998) deprecated
- TTL sets an upper limit on the number of hops the packet can take
- Header checksum only, payloads must have their own checksum
  - 16-bit 1’s complement of the 1’s complement sum (carries are added) of all 16-bit words in the header
  - https://github.com/timothyandrew/tcp/blob/91cc59b/header.go#L15-L37
  - Each router incrementally updates the checksum as the TTL changes
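A minimal Go sketch of the checksum described above, plus the incremental update a router can apply when it decrements the TTL. The function names are hypothetical, the sample bytes are an illustrative 20-byte header rather than anything from these notes, and the incremental form assumes the `~(~HC + ~m + m')` identity from RFC 1624.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fold adds the carries above bit 15 back into the low 16 bits.
func fold(sum uint32) uint16 {
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return uint16(sum)
}

// headerChecksum returns the 1's complement of the 1's complement sum of all
// 16-bit big-endian words in hdr. The checksum field must be zero on input
// when computing; a header with a valid checksum in place yields 0.
func headerChecksum(hdr []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(hdr); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(hdr[i : i+2]))
	}
	return ^fold(sum)
}

// updateChecksum patches an existing checksum after one 16-bit header word
// changed from oldWord to newWord, without re-summing the whole header.
func updateChecksum(old, oldWord, newWord uint16) uint16 {
	sum := uint32(^old) + uint32(^oldWord) + uint32(newWord)
	return ^fold(sum)
}

func main() {
	// Illustrative header: TTL 0x40, protocol 0x11 (UDP), checksum zeroed.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
		0x40, 0x11, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
		0xc0, 0xa8, 0x00, 0xc7,
	}
	sum := headerChecksum(hdr)
	fmt.Printf("checksum: %#04x\n", sum)

	// A router decrements the TTL: the TTL/protocol word goes 0x4011 -> 0x3f11.
	fmt.Printf("after TTL decrement: %#04x\n", updateChecksum(sum, 0x4011, 0x3f11))
}
```

The incremental form touches only the changed word, which is why forwarding does not need to re-sum the header at every hop.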
- Routing
  - When a node receives an IP packet whose destination address doesn’t match one of its own, it can be configured to route the packet onwards using a routing table
  - Entries in the routing table map a given destination address (for a single node [/32] or a network) to the IP address of a “next-hop router”
  - Packets are forwarded to the next hop router, which requires rewriting all the headers lower down the stack than IP (a different MAC address for example)
  - Routers typically have multiple NICs and forward packets from one to another
- The ability to specify a route to a network, and not have to specify a route to every host, is another fundamental feature of IP routing. Doing this allows the routers on the Internet, for example, to have a routing table with thousands of entries, instead of a routing table with more than one million entries.
- Subnets
  - Reserve portions of the IP address for sub-networks that are transparent externally
  - As opposed to having each of these sub-networks advertised to the internet individually
  - Subnet masks determine (for a given host) how many bits are used for the network/subnet ID and how many for the host ID
  - The book divides an IP address up into “(network ID, subnet ID, host ID)”, but this is outdated (as of 1993)
  - The current strategy is CIDR (classless inter-domain routing) where IPs are just (network/subnet id, host ID)
  - https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
  - Subnets are typically /24, but can be larger
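To make the (network/subnet ID, host ID) split concrete, here is a small Go sketch that applies a prefix length to an address. The function name `splitPrefix` is hypothetical, and the example address is just one that appears elsewhere in these notes.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// splitPrefix takes an address in CIDR notation and returns the network part
// (as a masked prefix) and the host ID (the bits the mask leaves free).
func splitPrefix(cidr string) (string, uint32, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", 0, err
	}
	if !p.Addr().Is4() {
		return "", 0, fmt.Errorf("not an IPv4 prefix: %s", cidr)
	}
	a4 := p.Addr().As4()
	addr := binary.BigEndian.Uint32(a4[:])
	// Host bits are the low (32 - prefix length) bits of the address.
	hostMask := uint32(1)<<(32-p.Bits()) - 1
	return p.Masked().String(), addr & hostMask, nil
}

func main() {
	network, host, err := splitPrefix("192.168.1.26/24")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("network %s, host ID %d\n", network, host)
}
```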
4: ARP: Address Resolution Protocol
- Provides a mapping between hardware addresses and IP addresses
- ARP uses a broadcast mechanism to query for the hardware address of a given IP address
- Broadcasts use a special hardware address: all ones
- Mappings are cached on each host (arp -e)
❯ sudo tshark -i enp0s10 -f "arp"
1 0.000000000 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
2 1.027814907 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
3 2.051482603 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
4 3.075296457 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
5 4.099433088 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
6 5.124312772 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
7 11.909737922 e2:9b:eb:e0:63:1a → Broadcast ARP 42 Who has 192.168.1.5? Tell 192.168.1.26
8 11.916431821 Netgear_ff:e1:2d → e2:9b:eb:e0:63:1a ARP 60 192.168.1.5 is at 08:36:c9:ff:e1:2d
9 17.024220603 Netgear_ff:e1:2d → e2:9b:eb:e0:63:1a ARP 60 Who has 192.168.1.26? Tell 192.168.1.5
10 17.024277268 e2:9b:eb:e0:63:1a → Netgear_ff:e1:2d ARP 42 192.168.1.26 is at e2:9b:eb:e0:63:1a
11 33.715456792 Netgear_ff:e1:2d → Broadcast ARP 60 Who has 192.168.1.1? Tell 192.168.1.5
- A node can send an ARP request for its own IP to see if any other node on the network is using that IP
5: RARP: Reverse Address Resolution Protocol
- Get the IP address for a given hardware address
- Usually used by a host to figure out its own IP when booting up
- Supplanted by bootp and later DHCP
6: ICMP: Internet Control Message Protocol
- Communicates error & query messages
- ICMP error payloads contain the header of the IP packet that generated the error, as well as the first 8 bytes of its payload
- ICMP can be used for timestamp queries (predating NTP?)
9: IP Routing
- Routing tables map host/network IDs to gateways (and the interfaces those gateways can be reached on), either specifically or via defaults
❯ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 192.168.1.1 0.0.0.0 UG 0 0 0 enp0s10
10.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 tun_tcp
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 enp0s10
- Flags: U->UP, G->Gateway (if unset, the destination is directly connected), H->Host (destination is not a network)
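A lookup over a table like the one above can be sketched in Go as a longest-prefix match. The struct, the function name `nextHop`, and the returned strings are illustrative; the three entries mirror the route output above with each Genmask written as a prefix length. That the most specific matching entry wins is an assumption about how the kernel picks a route, not something the output itself shows.

```go
package main

import (
	"fmt"
	"net/netip"
)

// route is one routing table entry. An empty gateway means the destination
// is directly connected (the G flag is unset).
type route struct {
	dest    netip.Prefix
	gateway string
	iface   string
}

// table mirrors the three entries in the route output above.
var table = []route{
	{netip.MustParsePrefix("0.0.0.0/0"), "192.168.1.1", "enp0s10"}, // default
	{netip.MustParsePrefix("10.0.0.0/24"), "", "tun_tcp"},
	{netip.MustParsePrefix("192.168.1.0/24"), "", "enp0s10"},
}

// nextHop picks the matching entry with the longest prefix and describes
// where a packet for dst would be sent.
func nextHop(dst string) string {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return "invalid address"
	}
	best := -1
	for i, r := range table {
		if r.dest.Contains(addr) && (best < 0 || r.dest.Bits() > table[best].dest.Bits()) {
			best = i
		}
	}
	if best < 0 {
		return "no route"
	}
	r := table[best]
	if r.gateway == "" {
		return "direct on " + r.iface
	}
	return "via " + r.gateway + " on " + r.iface
}

func main() {
	for _, dst := range []string{"192.168.1.16", "10.0.0.7", "8.8.8.8"} {
		fmt.Printf("%-14s -> %s\n", dst, nextHop(dst))
	}
}
```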
- The book has 127.0.0.0/8 in netstat -r output, but this doesn’t show up in Linux
- Linux uses multiple routing tables:
❯ ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
- And the local table looks like:
❯ ip route list table local
broadcast 10.0.0.0 dev tun_tcp proto kernel scope link src 10.0.0.1
local 10.0.0.1 dev tun_tcp proto kernel scope host src 10.0.0.1
broadcast 10.0.0.255 dev tun_tcp proto kernel scope link src 10.0.0.1
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.1.0 dev enp0s10 proto kernel scope link src 192.168.1.26
local 192.168.1.26 dev enp0s10 proto kernel scope host src 192.168.1.26
- A routing error is returned to the calling application when a route can’t be found on the same machine that the datagram originated from
- ICMP “host unreachable” error is sent for routing errors on intermediate machines/routers
- Hosts typically ignore rather than forward packets whose final destination is a different host. Linux can be configured to act as a router with the net.ipv4.ip_forward sysctl
- ICMP redirect
  - Possibly a rudimentary form of dynamic routing
  - When a router receives a packet and the routing decision it makes sends it back out on the same interface
  - That’s a clue that the router that sent the packet can skip the current router entirely
  - So say A sends a packet to B and B forwards it to C, but B both receives and forwards the packet on the same interface
  - In which case it’s likely that A and C are directly connected and B is an unnecessary hop
  - Here B would send an ICMP redirect message to A telling it about C
- ICMP can also be used to discover local routers
  - Either by broadcasting a router solicitation message
  - Or waiting for a periodic broadcast from the router (router advertisement)
10: Dynamic Routing Protocols
- Dynamic routing: routers broadcast routes to adjacent routers using a routing protocol
- The Internet is organized into a collection of autonomous systems (ASs), each of which is normally administered by a single entity.
- Dynamic routing within an AS uses an “interior gateway protocol” like RIP or OSPF
- Routing between routers in different ASs uses an “exterior gateway protocol” like EGP or BGP
- RIP: routing information protocol
- Layered over UDP, allows sending 25 routes in a single message
- Each message has a hop count that’s incremented at every propagation
- Each router chooses whether or not to apply a route based on the hop count
- No notion of subnetting, only network ID or host ID
- RIP v2 supports cross-AS routing, and contains a header field with an AS number
- OSPF: open shortest path first
- Each router tests the state of its links to all adjacent nodes
- And sends this info for each node to all other adjacent nodes
- Stabilizes faster than RIP after a partition or a device going down
- “State of a link” is a cost model based on any dimension, like throughput, RTT, reliability, etc.
- Cost-based load balancing when multiple valid routes exist
- BGP: border gateway protocol
- Two systems running BGP establish a TCP connection and exchange their entire routing tables, after which the connection stays open for incremental updates
- Once this happens between many ASes each one has built up a graph of AS connectivity, not just the very next hop
- Three types of ASes:
- Stub: connected to only one other AS, and only carries traffic destined for nodes in the AS
- Multihomed: connected to more than one other AS, but only carries traffic destined for nodes in the AS
- Transit: connected to more than one AS, and carries both local and transit traffic
11: UDP: User Datagram Protocol
- Because IP headers have a protocol field, UDP and TCP port numbers each occupy entirely different namespaces, even on Linux.
- Checksum covers payload, unlike IP, and includes a pseudo-header with fields from the IP header
- UDP + IP fragmentation could be problematic. If one of the fragments is lost, the receiver can’t reassemble the original datagram, so the sender has to retransmit the entire thing
- Path MTU: the smallest MTU in the entire routed path to the target
- If the IP header says “don’t fragment” but an intermediate router would need to fragment because the packet is larger than the outgoing link’s MTU, it drops the packet and sends an ICMP “fragmentation required” error
- Can use this ICMP error (like traceroute) to determine the path MTU by choosing a large value and decrementing until the error no longer shows up
12: Broadcasting & Multicasting
- Doesn’t make sense for TCP, only connectionless protocols
- NICs typically see every ethernet frame but only receive ones that have the right address (host MAC or the broadcast address)
- NICs can be placed in promiscuous mode where they receive all frames
Broadcast
- Broadcast IPs
  - Limited broadcast IP: 255.255.255.255
    - Never forwarded by a router, only local
  - Net-directed broadcast / all-subnets-directed broadcast: <net_id>.255.255.255
  - Subnet-directed broadcast: subnet ID followed by all ones
- This works with things like ping for example:
❯ ping -b 255.255.255.255
WARNING: pinging broadcast address
PING 255.255.255.255 (255.255.255.255) 56(84) bytes of data.
64 bytes from 192.168.1.3: icmp_seq=1 ttl=64 time=1.11 ms
64 bytes from 192.168.1.15: icmp_seq=1 ttl=64 time=15.4 ms
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=299 ms
64 bytes from 192.168.1.18: icmp_seq=1 ttl=64 time=299 ms
64 bytes from 192.168.1.8: icmp_seq=1 ttl=64 time=299 ms
64 bytes from 192.168.1.3: icmp_seq=2 ttl=64 time=1.35 ms
64 bytes from 192.168.1.15: icmp_seq=2 ttl=64 time=9.49 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=191 ms
64 bytes from 192.168.1.18: icmp_seq=2 ttl=64 time=194 ms
64 bytes from 192.168.1.8: icmp_seq=2 ttl=64 time=194 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=522 ms
64 bytes from 192.168.1.3: icmp_seq=3 ttl=64 time=1.05 ms
64 bytes from 192.168.1.15: icmp_seq=3 ttl=64 time=21.6 ms
64 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=112 ms
64 bytes from 192.168.1.18: icmp_seq=3 ttl=64 time=114 ms
64 bytes from 192.168.1.8: icmp_seq=3 ttl=64 time=114 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=223 ms
^C
--- 255.255.255.255 ping statistics ---
3 packets transmitted, 3 received, +14 duplicates, 0% packet loss, time 2008ms
rtt min/avg/max/mdev = 1.054/153.604/522.394/141.550 ms
Multicast
- Hosts can elect to be a part of host groups (via IGMP).
- A multicast group address is the combination of the high-order 4 bits of 1110 and the multicast group ID. These are normally written as dotted-decimal numbers and are in the range 224.0.0.0 through 239.255.255.255.
- Some multicast addresses are well-known. For example, 224.0.0.1 means “all systems on this subnet,” and 224.0.0.2 means “all routers on this subnet.” The multicast address 224.0.1.1 is for NTP, the Network Time Protocol, 224.0.0.9 is for RIP-2 (Section 10.5), and 224.0.1.2 is for SGI’s (Silicon Graphics) dogfight application.
13: IGMP: Internet Group Management Protocol
- Hosts send IGMP messages saying they’re either joining or leaving a given multicast group
- Multicast routers store this mapping and refresh it periodically by sending IGMP queries
- Routers only need to know whether a given group has at least one active host
- When the router receives an IP packet destined for the group, it multicasts it into the local network if a receiver exists for it
14: DNS
- Hierarchical namespace
- Fully qualified domain names (FQDNs) end with a period
- A zone is a subtree that’s managed separately
- Each domain at the second level and below can have authoritative name servers that cover that zone
- There’s a set of root name servers with known IPs that know all the authoritative name servers for top-level domains (TLDs)
- A DNS packet contains a variable number of: question, answer, authority, and additional fields
- Reverse DNS Queries
  - To look up the domain for an IP, use the pseudo domain in-addr.arpa
    - Specifically for IP A.B.C.D, look for a PTR record on the domain D.C.B.A.in-addr.arpa., which should give you the domain(s) that point to A.B.C.D
❯ dig news.ycombinator.com
news.ycombinator.com. 1 IN A 50.112.136.166
❯ dig -t PTR 166.136.112.50.in-addr.arpa.
166.136.112.50.in-addr.arpa. 191 IN PTR ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
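Building the PTR query name is just a matter of reversing the octets. A minimal Go sketch, with `reverseName` as a hypothetical helper name and IPv4 only:

```go
package main

import (
	"fmt"
	"net/netip"
)

// reverseName turns the IPv4 address A.B.C.D into the fully qualified
// name D.C.B.A.in-addr.arpa. that holds its PTR record.
func reverseName(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	if !addr.Is4() {
		return "", fmt.Errorf("not an IPv4 address: %s", ip)
	}
	b := addr.As4()
	return fmt.Sprintf("%d.%d.%d.%d.in-addr.arpa.", b[3], b[2], b[1], b[0]), nil
}

func main() {
	name, err := reverseName("50.112.136.166")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```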
- Record types
  - A: defines an IP address
  - PTR: for reverse (IP->name) queries
  - CNAME: “canonical name”, for aliasing
  - MX: mail exchange
  - NS: “name server”; specify the authoritative name server for a domain
  - SOA: “start of authority”, used to designate the primary name server and administrator responsible for a zone; the presence of these records indicates the root of a zone
- The DNS protocol includes a “truncated” flag for large responses; when it is set, the client should redo the query using TCP
- DNS lookup flow (for example: news.ycombinator.com)
  - Stub resolver (on a client machine) forwards query to recursive resolver
  - Recursive resolver looks up (or knows) the root nameservers, and then:
    - Sends query com. to a root nameserver and receives authoritative nameservers for com.
    - Sends query ycombinator.com. to a com. nameserver and receives authoritative nameservers for ycombinator.com.
    - Sends query news.ycombinator.com. to a ycombinator.com. nameserver and receives an A record
- A basic resolver is actually pretty simple to write:
// Fragment: assumes "fmt" and github.com/miekg/dns are imported (the package
// is inferred from the API used) and that name holds the FQDN to resolve.
nameserver := "198.41.0.4" // one of the 13 root nameservers
var answer []dns.RR
c := new(dns.Client)
for {
	fmt.Printf("Querying for %s against nameserver %s\n", name, nameserver)
	m := new(dns.Msg)
	m.SetQuestion(name, dns.TypeA)
	in, _, err := c.Exchange(m, fmt.Sprintf("%s:53", nameserver))
	if err != nil {
		panic(err)
	}
	if len(in.Answer) > 0 {
		answer = in.Answer
		break
	}
	// Follow the referral using the first A record in the additional
	// section; a bare type assertion on Extra[0] panics if it is an AAAA.
	next := ""
	for _, extra := range in.Extra {
		if a, ok := extra.(*dns.A); ok {
			next = a.A.String()
			break
		}
	}
	if next == "" {
		panic("EMPTY RESPONSE")
	}
	nameserver = next
}
fmt.Println(answer)
- With the caveat that it doesn’t work when an intermediate step returns an NS record in the authority section without the IPs of that NS record in the additional section (which would normally require an extra out-of-band lookup)
- But it works fine when all intermediaries return A records for nameservers in the additional section:
❯ ./dns news.ycombinator.com.
Querying for com. against nameserver 198.41.0.4
Querying for ycombinator.com. against nameserver 192.12.94.30
Querying for news.ycombinator.com. against nameserver 205.251.192.225
[news.ycombinator.com. 1 IN A 50.112.136.166]
17: TCP: Transmission Control Protocol
- Connection-oriented protocol to provide a reliable byte-stream over an unreliable medium
- No markers: writes don’t necessarily correspond with reads. Blocks that are written 4kB at a time may be read 2kB at a time (or 50kB at a time)
- Each byte in the byte stream has a “sequence number”
  - Sequence numbers wrap after 2^32 - 1 and are selected randomly (to start with)
  - The SYN and FIN messages during connection setup/teardown each consume a sequence number
- Connections are identified by (source port, source IP, dest port, dest IP)
- In a TCP header:
  - Sequence number: identifies the first byte in the current packet’s payload -OR- identifies the initial sequence number if the SYN flag is set
  - Acknowledgment number: the next sequence number the sender of this packet expects to receive, only meaningful if the ACK flag is set
  - Window size: number of bytes the sender of this packet has available to store incoming data (flow control)
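Because sequence numbers wrap, "earlier" and "later" can't be decided with a plain `<`. One common trick, sketched here in Go, is to look at the sign of the 32-bit difference. `seqBefore` is a hypothetical name, and the comparison is only meaningful when the two numbers are less than 2^31 apart.

```go
package main

import "fmt"

// seqBefore reports whether sequence number a comes before b, treating the
// 32-bit space as circular. The unsigned subtraction wraps, and reading the
// result as signed gives the direction of the shorter way around.
func seqBefore(a, b uint32) bool {
	return int32(a-b) < 0
}

func main() {
	var isn uint32 = 0xFFFFFFFF // an initial sequence number right at the edge
	next := isn + 1             // the SYN consumes one sequence number; wraps to 0

	fmt.Println(next)                 // the value after the wrap
	fmt.Println(isn < next)           // naive comparison gets this wrong
	fmt.Println(seqBefore(isn, next)) // circular comparison
}
```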
18: TCP Connection Establishment and Termination
- Establishing a connection
  - Three-way handshake
    1. Client sends a SYN specifying the start of the client’s sequence space
    2. Server sends a SYN specifying the start of the server’s sequence space
    3. Server sends an ACK to acknowledge receipt of the sequence number sent in 1.
    4. Client sends an ACK to acknowledge receipt of the sequence number sent in 2.
    - Steps 2 & 3 are typically combined, so it’s SYN,SYN+ACK,ACK
- Closing a connection
  - One side can close its half of the connection and continue receiving data from the other
    - The closer must send ACKs as normal without sending any data
    - This could be a reasonable method to signal EOF
  - Four-step termination
    1. One side sends a FIN
    2. Other side sends an ACK to acknowledge the FIN
    3. (Later…) Other side sends a FIN
    4. The initial closer sends an ACK to acknowledge the FIN from 3.
- Maximum Segment Size (MSS)
- This is an “option” that can be sent with a SYN to announce the largest-sized segment the receiver is willing to receive
- Used to avoid fragmentation
- Connection states
  - All states + transitions:
  - States during connection/termination:
  - Connections must stay in TIME_WAIT for 2x the “maximum segment lifetime” to avoid mixing up segments between connections.
    - All segments received against a TIME_WAIT connection are discarded.
    - Can set the SO_REUSEADDR socket option (via setsockopt) to allow conflicts with TIME_WAIT sockets, which is required to restart servers without waiting for MSL expiry
  - A connection can be stuck in the half-closed FIN_WAIT_2 state forever, so most implementations use a timeout here
- Quiet time
  - Wait for MSL seconds after a crash to avoid sending stale segments past the MSL
- Reset/RST
- TCP allows for simultaneous opens, where two machines connect to each other at well-known ports at the same time, leading to one single connection, not two
- This is hard to artificially replicate - can only be triggered if both SYNs are in flight simultaneously
- Also simultaneous closes, where both sides independently send FINs at the same time
- When a FIN is received when in FIN_WAIT_1 but the accompanying ACK doesn’t cover the FIN that was just sent out
- Then it must be a simultaneous close
- When a server’s accept queue is full, it typically (this is true as of Linux 5.18.0) drops incoming SYNs without sending back a RST, encouraging retransmission
19: TCP Interactive Data Flow
- Delayed acknowledgements
  - Wait for a bit (up to 200ms) to allow for new data to piggyback on the ACK
  - Disable this on Linux with TCP_QUICKACK
- Nagle’s Algorithm
  - Coalesce many small payloads into fewer, larger payloads
  - Only allow a single un-ACKed small segment to be in flight at a given time
  - In the meantime, small segments are buffered until an ACK comes back
  - When the buffered data becomes larger than the segment size, it isn’t considered “small” anymore, and can be sent without waiting for an ACK
  - Disable this on Linux with TCP_NODELAY
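The send decision in Nagle’s algorithm, as described in these notes, can be sketched as a single predicate in Go. This is a simplification for illustration: the function name and parameters are made up, and real stacks track more state than a single flag.

```go
package main

import "fmt"

// canSendNow reports whether buffered data may go out immediately.
//   - queued: bytes currently buffered for sending
//   - mss: the segment size; data at least this large isn't "small"
//   - smallInFlight: whether an un-ACKed small segment is already outstanding
func canSendNow(queued, mss int, smallInFlight bool) bool {
	if queued == 0 {
		return false // nothing to send
	}
	if queued >= mss {
		return true // a full-sized segment never waits for an ACK
	}
	// A small segment may go only if no other small segment is un-ACKed;
	// otherwise it stays buffered until the ACK comes back.
	return !smallInFlight
}

func main() {
	const mss = 1460
	fmt.Println(canSendNow(1, mss, false))   // first keystroke: send
	fmt.Println(canSendNow(3, mss, true))    // more keystrokes, ACK pending: buffer
	fmt.Println(canSendNow(mss, mss, true))  // buffer reached a full segment: send
}
```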
20: TCP Bulk Data Flow
- If an ACK is sent with a small (or zero) window size, TCP may send a second ACK once the window grows larger (if no data is received in that interval). This is called a window update
- Receipt of out-of-order segments must trigger duplicate ACKs
- Sliding window
  - And from the RFC:

                          Send Sequence Space

                   1         2          3          4
              ----------|----------|----------|----------
                     SND.UNA    SND.NXT    SND.UNA
                                          +SND.WND

        1 - old sequence numbers which have been acknowledged
        2 - sequence numbers of unacknowledged data
        3 - sequence numbers allowed for new data transmission
        4 - future sequence numbers which are not yet allowed

                          Send Sequence Space

                               Figure 4.

- The PSH flag is used to tell a receiver to immediately flush the read buffer to the application and not wait around for more data. Sounds like this was already semi-deprecated in 1992, although nc still sends it
- Slow start
  - Senders transmitting enough data to fill the receiver’s window may overwhelm intermediary routers/etc.
  - Senders maintain a congestion window, which starts at one segment in length
  - Every ACK increases the size of this window by one segment
  - The sender doesn’t ever transmit past the congestion window
21-23: TCP Timeout and Retransmission + Other Timers
- Four timers
- Retransmission: senders set a timer and retransmit if not ACKed when the timer fires
- Persist: keep window updates flowing
- Keepalive: detect disconnects/crashes on an idle connection
- 2MSL: move a connection from TIME_WAIT to CLOSED
Retransmission Timer
- Retransmission
  - Retransmission timeouts are based on the measured RTT of the network, and use a constant factor (either 2 or 4 according to this book) to create exponential backoff
  - The RTT is measured by recording the time between sending a segment and having it ACKed. This doesn’t work for a retransmission though, because you can’t tell if the ACK was for the original segment or the retransmitted one. Ambiguous RTT values are not used in the timeout calculation.
- Congestion avoidance
  - Slow start isn’t sufficient, because at some point you’re going to hit the limit of an intervening router/etc. anyway
  - Keep track of an ssthresh variable: this is the number of segments at which slow start stops and a congestion avoidance algorithm takes over
  - A new connection starts off with a congestion window of 1 segment. This repeatedly doubles as ACKs are received (this is slow start), until the congestion window’s size passes ssthresh.
  - At this point the size of the congestion window is now controlled by a congestion avoidance algorithm, which is more conservative than slow start.
  - The window’s growth curve flattens once the connection switches from slow start to a congestion avoidance algorithm.
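The two growth phases can be sketched per round trip in Go. The notes only say that congestion avoidance is "more conservative"; the one-segment-per-RTT increase used here is an assumption (the classic additive increase), and the function names and the ssthresh value are made up for the example.

```go
package main

import "fmt"

// nextCwnd returns the congestion window (in segments) one round trip later,
// assuming every segment in the current window was ACKed.
func nextCwnd(cwnd, ssthresh int) int {
	if cwnd < ssthresh {
		// Slow start: +1 segment per ACK, so the window doubles per RTT.
		return cwnd * 2
	}
	// Congestion avoidance (assumed additive increase): +1 segment per RTT.
	return cwnd + 1
}

// trace returns the window size at the start of each of n round trips,
// beginning from a window of one segment.
func trace(n, ssthresh int) []int {
	sizes := make([]int, 0, n)
	cwnd := 1
	for i := 0; i < n; i++ {
		sizes = append(sizes, cwnd)
		cwnd = nextCwnd(cwnd, ssthresh)
	}
	return sizes
}

func main() {
	// With ssthresh = 16 the growth is exponential up to 16, then linear.
	fmt.Println(trace(8, 16))
}
```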
- Fast retransmit
  - Receivers send a duplicate ACK when receiving a segment that’s out of order, like ack 6657 here:
  - When a sender receives a duplicate ACK, this may mean one of:
    - The receiver received segments out of order, but all segments are reliably delivered
    - The receiver received segments out of order because one segment was irrevocably dropped
  - In the first case we don’t really want to retransmit, but in the second we do. To disambiguate, wait for three duplicate ACKs, which is a strong signal that we’re seeing the second scenario. In this case, perform a retransmit immediately.
  - After performing a fast retransmit, apply congestion control because we’re assuming that a segment was dropped due to congestion. Here are two common schemes (from Wikipedia):
    - Tahoe: if three duplicate ACKs are received (i.e. four ACKs acknowledging the same packet, which are not piggybacked on data and do not change the receiver’s advertised window), Tahoe performs a fast retransmit, sets the slow start threshold to half of the current congestion window, reduces the congestion window to 1 MSS, and resets to slow start state.
    - Reno: if three duplicate ACKs are received, Reno will perform a fast retransmit and skip the slow start phase by instead halving the congestion window (instead of setting it to 1 MSS like Tahoe), setting the ssthresh equal to the new congestion window, and enter a phase called fast recovery.
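The difference between the two reactions is easiest to see side by side. A rough Go sketch, working in whole segments: the function name is hypothetical, and the floor of 2 segments on the halved value is my addition (following the `max(FlightSize/2, 2*SMSS)` rule in RFC 5681), not something the quoted text states.

```go
package main

import "fmt"

// halve returns half the window, but never less than 2 segments
// (the floor is an assumption borrowed from RFC 5681).
func halve(cwnd int) int {
	if h := cwnd / 2; h > 2 {
		return h
	}
	return 2
}

// onTripleDupACK returns the new congestion window and ssthresh (both in
// segments) after three duplicate ACKs trigger a fast retransmit.
func onTripleDupACK(algo string, cwnd int) (newCwnd, newSsthresh int) {
	switch algo {
	case "tahoe":
		// ssthresh = half the window; restart slow start from 1 segment.
		return 1, halve(cwnd)
	case "reno":
		// Halve the window, set ssthresh to match, skip slow start.
		w := halve(cwnd)
		return w, w
	}
	return cwnd, cwnd // unknown scheme: leave the window alone
}

func main() {
	for _, algo := range []string{"tahoe", "reno"} {
		cwnd, ssthresh := onTripleDupACK(algo, 20)
		fmt.Printf("%-5s cwnd=%d ssthresh=%d\n", algo, cwnd, ssthresh)
	}
}
```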
- Some of this data (RTT, congestion window size, ssthresh) is saved against the route (ip route) in the routing table for future connections
- Use ss to check these metrics for a connection:
❯ ss -ti 'sport == 4001 || dport == 4001'
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 127.0.0.1:50294 127.0.0.1:4001
	 cubic wscale:7,7 rto:204 rtt:0.059/0.027 mss:32768 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:12 bytes_acked:13 segs_out:4 segs_in:3 data_segs_out:2 send 44.4Gbps lastsnd:27768 lastrcv:71148 lastack:27768 pacing_rate 87.9Gbps delivery_rate 7.71Gbps delivered:3 app_limited rcv_space:65495 rcv_ssthresh:65495 minrtt:0.034 snd_wnd:65536
ESTAB 0 0 127.0.0.1:4001 127.0.0.1:50294
	 cubic wscale:7,7 rto:200 rtt:0.057/0.028 ato:40 mss:32768 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_received:12 segs_out:2 segs_in:4 data_segs_in:2 send 46Gbps lastsnd:71148 lastrcv:27768 lastack:69836 pacing_rate 92Gbps delivered:1 rcv_space:65483 rcv_ssthresh:65483 minrtt:0.057 snd_wnd:65536
- Retransmits don’t have to resend the same exact packet that was sent the first time. More data can be stuffed in there if necessary.
Persist Timer
- If the receiver advertises a zero window (full buffer) and the ACK carrying the later window update is lost, it’s possible for the connection to be deadlocked: the sender waits for a window update while the receiver waits for data
- To avoid this, senders use a persist timer to periodically check whether the receiver is now able to accept data
- Silly window syndrome: receivers advertise small windows instead of waiting until they can advertise larger windows, which increases per-segment overhead
Keepalive Timer
- TCP connections are kept alive simply by each peer holding connection state, but no active measures (like polling) are necessary to maintain a connection
- This only applies when both hosts have not crashed though. It’s possible for one host to crash, and for the other to think the connection is still up when it isn’t
- TCP implementations (but not the RFC) include a keepalive timer to periodically send packets on idle connections to make sure both hosts are up
24: TCP Futures and Performance
Path MTU discovery
- TCP starts off with the MTU of the outgoing interface or the MSS announced by the other side, whichever is smaller, and sets the don’t fragment (DF) bit on the IP packet
- If an intermediate router is unable to transmit this packet because of a smaller MTU, it generates an ICMP message. TCP sees this and retransmits with a smaller segment size.
- Routes can change dynamically, so after a while TCP gradually increases this value back up to the original (after 10 minutes by default according to RFC 1191)
Long Fat Pipes
- Nomenclature
  - Define the capacity of a connection to be bandwidth * RTT, which is a measure for the max amount of data that can be in flight on that connection in a given instant
  - Also called the “bandwidth-delay product”, or simply the size of the pipe
  - Networks with a high capacity are “long fat” networks, and connections on these networks are “long fat pipes”
- Long fat pipes are bad for latency but can be great for throughput
- TCP can’t (by default) optimize for throughput because the window size maxes out at 64kB (16-bit field)
  - There’s a “window scale” option that both sides have to use during their SYNs
  - Set the “window scale” to a value between 0 and 14
  - The “window” field is then interpreted as window * (2 ^ scaling_factor)
  - Max window size is now 1GB
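A worked example ties the two ideas together: compute the bandwidth-delay product of a link, then find the smallest window scale that lets a 16-bit window field cover it. This is a Go sketch with made-up function names, and the 1 Gbit/s, 100 ms link is an arbitrary example.

```go
package main

import "fmt"

// bdpBytes returns bandwidth * RTT in bytes, for a bandwidth in bits per
// second and an RTT in milliseconds.
func bdpBytes(bitsPerSec, rttMillis uint64) uint64 {
	return bitsPerSec * rttMillis / 1000 / 8
}

// minWindowScale returns the smallest scale (0..14) for which the largest
// 16-bit window value, shifted left by that scale, covers wantBytes.
// ok is false when even the maximum scale of 14 is not enough.
func minWindowScale(wantBytes uint64) (scale int, ok bool) {
	for scale = 0; scale <= 14; scale++ {
		if uint64(65535)<<scale >= wantBytes {
			return scale, true
		}
	}
	return 14, false
}

func main() {
	bdp := bdpBytes(1_000_000_000, 100) // 1 Gbit/s link with a 100 ms RTT
	scale, ok := minWindowScale(bdp)
	fmt.Printf("pipe size: %d bytes, window scale needed: %d (fits: %v)\n", bdp, scale, ok)
	fmt.Printf("largest possible window: %d bytes\n", uint64(65535)<<14)
}
```

Without scaling, the 64kB window would keep only a small fraction of that pipe full, which is the throughput problem the option addresses.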
Timestamp
- Insert a monotonic counter into every segment, the receiver returns this value unchanged with an ACK, and the sender can determine how many times the counter ticked in between
- This is a better means of measuring RTT than having to maintain local state against every transmitted segment, but also:
- Wrapped & duplicated sequence numbers are measured individually
- Retransmitted segments are measured individually
- The timestamp can also be used to guard against wrapped sequence numbers by identifying segments with plausible sequence numbers but stale timestamps relative to other timestamps being received in the same area of the sequence number space