Implementing TCP in Rust (Attempt 1 - Rust)
Resources
Questions
- How does Linux deal with port conflicts between `SOCK_RAW` and `SOCK_STREAM`? Does it multicast messages to 1+ sockets? Does the raw socket override all other sockets on the kernel?
  - I tried this using the `pnet` crate, and it looks like raw sockets can receive all packets for a given protocol, but this doesn’t stop the “regular” stream/datagram sockets from receiving this data too.
  - So I see what jonhoo meant about potentially running into conflicts if we were to implement a userspace TCP stack this way; we’d have to prevent the kernel from registering any other sockets.
  - With that said, “for a given protocol” is probably key here - the main reason for `SOCK_RAW`’s existence is probably to define new transport protocols, so that can presumably provide enough isolation. IP protocol numbers `0x90-0xFC` (144-252) are unassigned.
  - If I set the `pnet` socket’s protocol to 1 (ICMP), I can spy on all `ping` traffic but nothing else.
- Do both the `bind`er and the `connect`or need to maintain both send & recv queues? (I think yes)
  - YES
- Does the 32-bit seq number limit imply that only ~4.3GB (2^32 bytes) of data can be transferred over a single connection, or is it safe to assume that conflicts won’t occur when we wrap around?
- Why are `bind` and `listen` separate calls?
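On the wraparound question: sequence numbers are compared modulo 2^32, so a connection isn't capped at 2^32 bytes; the comparison stays unambiguous as long as the two numbers being compared are less than 2^31 apart. Here's a minimal sketch of that comparison. The helper names `seq_lt` and `is_between_wrapped` are my own, not from the RFC or any crate.

```rust
/// Returns true if sequence number `a` comes strictly before `b`
/// in modulo-2^32 sequence space.
/// Assumption: `a` and `b` are less than 2^31 apart.
pub fn seq_lt(a: u32, b: u32) -> bool {
    // If `a` is behind `b`, the wrapped difference a - b lands in the
    // upper half of the u32 range.
    a.wrapping_sub(b) > (1 << 31)
}

/// Returns true if `start < x < end` in wrapped sequence space.
pub fn is_between_wrapped(start: u32, x: u32, end: u32) -> bool {
    seq_lt(start, x) && seq_lt(x, end)
}
```

A plain `a < b` would get this wrong as soon as the numbers straddle the wrap point, e.g. `u32::MAX - 5` really is "before" `3`.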
Notes
- Getting a bit boring at the 3:41:00 mark. Implementing the nitty-gritty of the actual protocol (window checking, etc.) isn’t super interesting to me; I’d rather stop there and learn more about the Linux networking space in general.
Impl Notes
- `pnet` crate for raw sockets: https://docs.rs/pnet/0.28.0/pnet
- Not using this because the kernel’s impl. of TCP can interfere (how?); using tun/tap instead.
- The tun_tap crate allows sending packets represented as byte-buffers. What layer of the stack do these packets live at? Layer 1, 2, or 3?
  - Ok it looks like this is the difference between TUN and TAP: TUN carries layer 3 (IP) packets, while TAP carries layer 2 (Ethernet) frames. /Screen Shot 2021-06-28 at 3.30.12 PM.png
- You need to be root to create IP packets (how is this enforced?), but the `CAP_NET_ADMIN` capability is sufficient even if you aren’t root.
- Not sure how tun/tap is going to be used to create a TCP stack yet; it seems to be the kind of thing you’d use to create a whole new virtual network in userspace.
- TUN packets can carry 4 bytes of extra info at the start of each message: 2 bytes of flags followed by 2 bytes of protocol. The protocol seems to match the “EtherType” field in Ethernet frames, so 0x0800 is IPv4 and 0x86DD is IPv6.
- This seems to be a theme, with lower-level protocols having knowledge of the higher-level protocol being encapsulated. IP packets have a protocol field too: 0x01 is ICMP and 0x06 is TCP.
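A rough sketch of peeling off those two layers of "what's inside" information from a raw buffer read off a TUN device. The type and function names (`TunFrame`, `parse_tun_frame`, `ipv4_protocol`) are made up for illustration and are not part of the tun_tap crate. I'm assuming the flags are in host byte order and the protocol is big-endian, which is how I read the kernel's `struct tun_pi`; treat that as an assumption.

```rust
pub const ETHERTYPE_IPV4: u16 = 0x0800;
pub const ETHERTYPE_IPV6: u16 = 0x86DD;
pub const IPPROTO_ICMP: u8 = 0x01;
pub const IPPROTO_TCP: u8 = 0x06;

/// A TUN message split into its 4-byte prefix and the packet that follows.
#[derive(Debug)]
pub struct TunFrame<'a> {
    pub flags: u16,
    pub proto: u16,
    pub payload: &'a [u8],
}

/// Splits off the 4-byte prefix; returns None if the buffer is too short.
pub fn parse_tun_frame(buf: &[u8]) -> Option<TunFrame<'_>> {
    if buf.len() < 4 {
        return None;
    }
    // Assumption: flags are host-order, proto is network-order (big-endian).
    let flags = u16::from_ne_bytes([buf[0], buf[1]]);
    let proto = u16::from_be_bytes([buf[2], buf[3]]);
    Some(TunFrame { flags, proto, payload: &buf[4..] })
}

/// Reads the protocol field (byte 9) of an IPv4 header.
/// Returns None if the slice is shorter than a minimal 20-byte header
/// or the version nibble is not 4.
pub fn ipv4_protocol(packet: &[u8]) -> Option<u8> {
    if packet.len() < 20 || packet[0] >> 4 != 4 {
        return None;
    }
    Some(packet[9])
}
```

In a receive loop this would let us skip anything where `proto != ETHERTYPE_IPV4` or `ipv4_protocol(..) != Some(IPPROTO_TCP)` before looking at TCP headers at all.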
- A TCP connection is identified by the `(srcport, srcaddr, destport, destaddr)` quad (where srcport is randomly generated per-connection).
- The first TCP packet (the SYN) is header-only, so it has a zero-byte payload.
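Since the quad identifies a connection, it makes a natural hash-map key for per-connection state. A small sketch, assuming IPv4 only; `Quad`, `Connection`, and `on_segment` are illustrative names, and the `bytes_received` counter is just a stand-in for real connection state.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// (address, port) for each end of the connection.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Quad {
    pub src: (Ipv4Addr, u16),
    pub dst: (Ipv4Addr, u16),
}

/// Placeholder for real per-connection state.
#[derive(Debug, Default)]
pub struct Connection {
    pub bytes_received: usize,
}

/// Looks up (or creates) the connection for `quad`, records the payload
/// length, and returns the running total for that connection.
pub fn on_segment(
    conns: &mut HashMap<Quad, Connection>,
    quad: Quad,
    payload_len: usize,
) -> usize {
    let conn = conns.entry(quad).or_default();
    conn.bytes_received += payload_len;
    conn.bytes_received
}
```

Two segments that differ only in source port end up in different entries, which is the whole point of including the (random) srcport in the key.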
- Server flow:
  - Start listening, state: LISTEN
  - Receive a SYN, respond with a SYN_ACK, state: SYN RCVD
  - Receive an ACK for the SYN_ACK, state: ESTAB
- The RFC expects “remembered variables” (state) to be stored in a “Transmission Control Block” (TCB). These variables are:

      Send Sequence Variables
        SND.UNA - send unacknowledged
        SND.NXT - send next
        SND.WND - send window
        SND.UP  - send urgent pointer
        SND.WL1 - segment sequence number used for last window update
        SND.WL2 - segment acknowledgment number used for last window update
        ISS     - initial send sequence number

      Receive Sequence Variables
        RCV.NXT - receive next
        RCV.WND - receive window
        RCV.UP  - receive urgent pointer
        IRS     - initial receive sequence number
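One way these variables might map onto Rust structs. The field names mirror the RFC's names; the struct names and the `on_syn` constructor are my own invention for the passive-open case (we got a SYN and are about to reply with a SYN,ACK), and the zero placeholders for WL1/WL2/UP are an assumption rather than something the RFC prescribes at this point.

```rust
/// Send Sequence Variables (SND.* and ISS).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SendSequenceSpace {
    pub una: u32, // SND.UNA - send unacknowledged
    pub nxt: u32, // SND.NXT - send next
    pub wnd: u16, // SND.WND - send window
    pub up: u16,  // SND.UP  - send urgent pointer
    pub wl1: u32, // SND.WL1 - seq number used for last window update
    pub wl2: u32, // SND.WL2 - ack number used for last window update
    pub iss: u32, // ISS     - initial send sequence number
}

/// Receive Sequence Variables (RCV.* and IRS).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RecvSequenceSpace {
    pub nxt: u32, // RCV.NXT - receive next
    pub wnd: u16, // RCV.WND - receive window
    pub up: u16,  // RCV.UP  - receive urgent pointer
    pub irs: u32, // IRS     - initial receive sequence number
}

/// Transmission Control Block: everything remembered about one connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Tcb {
    pub snd: SendSequenceSpace,
    pub rcv: RecvSequenceSpace,
}

impl Tcb {
    /// Hypothetical constructor: the peer's SYN carried sequence number
    /// `irs` and window `peer_wnd`; we answer with a SYN,ACK using `iss`.
    pub fn on_syn(iss: u32, irs: u32, peer_wnd: u16, our_wnd: u16) -> Self {
        Tcb {
            snd: SendSequenceSpace {
                una: iss,                 // our SYN is not yet acked
                nxt: iss.wrapping_add(1), // the SYN uses up one sequence number
                wnd: peer_wnd,
                up: 0,  // placeholder: no urgent data
                wl1: 0, // placeholder: no window update recorded yet
                wl2: 0, // placeholder: no window update recorded yet
                iss,
            },
            rcv: RecvSequenceSpace {
                nxt: irs.wrapping_add(1), // the peer's SYN used up one number
                wnd: our_wnd,
                up: 0,
                irs,
            },
        }
    }
}
```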
- `ISS` specifies the zeroth index of the byte buffer that we’re trying to transmit (the SYN itself uses `ISS`, so the first data byte is `ISS + 1`); it doesn’t have to be zero, and wraps at some point (32 bits)
- Here’s a graphical representation of the send buffer: /Untitled-2021-06-28-2304.png
  - `SND.UNA` marks the acked/un-acked boundary
  - `SND.NXT` marks the sent/unsent boundary
  - `SND.WND` specifies how many bytes can be sent, and this range starts from `SND.UNA`
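Putting the three markers together: the bytes between `SND.UNA` and `SND.NXT` are in flight, and whatever is left of the window past `SND.NXT` is what we may still send. A minimal sketch (the function names are mine), using wrapping arithmetic because the markers live in 32-bit sequence space:

```rust
/// Bytes sent but not yet acknowledged: SND.NXT - SND.UNA (mod 2^32).
pub fn bytes_in_flight(una: u32, nxt: u32) -> u32 {
    nxt.wrapping_sub(una)
}

/// Bytes we may still send right now: SND.UNA + SND.WND - SND.NXT,
/// clamped at zero in case the window shrank below what is in flight.
pub fn usable_window(una: u32, nxt: u32, wnd: u16) -> u32 {
    u32::from(wnd).saturating_sub(bytes_in_flight(una, nxt))
}
```

For example, with `SND.UNA = 100`, `SND.NXT = 150` and `SND.WND = 200`, 50 bytes are in flight and 150 more can go out before we have to wait for an ACK.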
- The receive buffer is simpler: `RCV.NXT` marks the boundary of the received (+acked) bytes, and `RCV.WND` sets how many bytes past `RCV.NXT` we’re willing to accept.
- Every byte (octet) gets a sequence number
- Segment == packet
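- Initial sequence number is randomly (sort of - RFC 793 uses an ever-increasing, clock-driven counter) chosen so that segments left over from an earlier incarnation of a connection aren’t mistaken for segments of a new one. Concurrent connections use different quads, so this guards against reuse of the same quad over time rather than against other live connections.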
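- A segment stays in the network for a maximum bounded time (MSL - Max. Segment Lifetime), which RFC 793 sets to 2 minutes. The 4.55 hours figure is something else: it’s roughly how long the RFC says the ISN generator takes to cycle.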
- This is the reason the second part of a handshake is a `SYN_ACK` (steps 2 and 3 are combined into a single segment):

      1) A --> B  SYN my sequence number is X
      2) A <-- B  ACK your sequence number is X
      3) A <-- B  SYN my sequence number is Y
      4) A --> B  ACK your sequence number is Y
- In this example the client is advertising a window size. The window field says how much data the advertiser is willing to receive, not how much it is going to send. What if it sets it SUPER high? DDoS?
- Woohoo!

      2 0.347657799 192.168.0.1 → 192.168.0.5 TCP 60 59846 → 5000 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=175804761 TSecr=0 WS=128
      3 0.347846056 192.168.0.5 → 192.168.0.1 TCP 40 5000 → 59846 [SYN, ACK] Seq=0 Ack=1 Win=10 Len=0
      4 0.347905879 192.168.0.1 → 192.168.0.5 TCP 40 59846 → 5000 [ACK] Seq=1 Ack=1 Win=64240 Len=0
- Here’s a very useful overview of the handshake itself: /Screen Shot 2021-06-29 at 4.35.13 PM.png
- ACKs don’t take up space in the byte stream, and so don’t need to be ACKed!
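To make the "what takes up sequence space" rule concrete: payload bytes each use one sequence number, SYN and FIN use one each, and a bare ACK uses none. That's why the capture above shows `Ack=1` in reply to a zero-length SYN with `Seq=0`. A small sketch; `segment_seq_len` is a name I made up.

```rust
/// How many sequence numbers a segment occupies: one per payload byte,
/// plus one for SYN and one for FIN. A pure ACK occupies none.
pub fn segment_seq_len(payload_len: u32, syn: bool, fin: bool) -> u32 {
    payload_len + u32::from(syn) + u32::from(fin)
}

/// The acknowledgment number to send back for a segment that starts at
/// `seq`: the next sequence number we expect (wrapping at 32 bits).
pub fn ack_for(seq: u32, payload_len: u32, syn: bool, fin: bool) -> u32 {
    seq.wrapping_add(segment_seq_len(payload_len, syn, fin))
}
```

A segment with `segment_seq_len(..) == 0` never needs acknowledging, which is what stops the two ends from ACKing each other's ACKs forever.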
- Generate a `RST` when you see an unintended segment
Connection States
                                  +---------+ ---------\      active OPEN
                                  |  CLOSED |            \    -----------
                                  +---------+<---------\   \   create TCB
                                    |     ^              \   \  snd SYN
                       passive OPEN |     |   CLOSE        \   \
                       ------------ |     | ----------       \   \
                        create TCB  |     | delete TCB         \   \
                                    V     |                      \   \
                                  +---------+            CLOSE    |    \
                                  |  LISTEN |          ---------- |     |
                                  +---------+          delete TCB |     |
                       rcv SYN      |     |     SEND              |     |
                      -----------   |     |    -------            |     V
     +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
     |         |<-----------------           ------------------>|         |
     |   SYN   |                    rcv SYN                     |   SYN   |
     |   RCVD  |<-----------------------------------------------|   SENT  |
     |         |                    snd ACK                     |         |
     |         |------------------           -------------------|         |
     +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
       |           --------------   |     |   -----------
       |                  x         |     |     snd ACK
       |                            V     V
       |  CLOSE                   +---------+
       | -------                  |  ESTAB  |
       | snd FIN                  +---------+
       |                   CLOSE    |     |    rcv FIN
       V                  -------   |     |    -------
     +---------+          snd FIN  /       \   snd ACK          +---------+
     |  FIN    |<-----------------           ------------------>|  CLOSE  |
     | WAIT-1  |------------------                              |   WAIT  |
     +---------+          rcv FIN  \                            +---------+
       | rcv ACK of FIN   -------   |                            CLOSE  |
       | --------------   snd ACK   |                           ------- |
       V        x                   V                           snd FIN V
     +---------+                  +---------+                   +---------+
     |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
     +---------+                  +---------+                   +---------+
       |                rcv ACK of FIN |                 rcv ACK of FIN |
       |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
       |  -------              x       V    ------------        x       V
        \ snd ACK                 +---------+delete TCB         +---------+
         ------------------------>|TIME WAIT|------------------>| CLOSED  |
                                  +---------+                   +---------+

                          TCP Connection State Diagram
                                   Figure 6.
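The diagram translates fairly directly into a `match` over `(state, event)` pairs. This is only a sketch of the transitions, not of the segments that get sent along the way; the `State`/`Event` names and the `next_state` function are my own, and I've left out the `LISTEN --SEND--> SYN SENT` edge to keep it short. Any pair not in the diagram returns `None`, which is roughly where a real stack would decide whether the segment was unintended and deserves a `RST`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed, Listen, SynRcvd, SynSent, Estab,
    FinWait1, FinWait2, CloseWait, Closing, LastAck, TimeWait,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    PassiveOpen, ActiveOpen, Close,
    RcvSyn, RcvSynAck, RcvAckOfSyn, RcvFin, RcvAckOfFin,
    Timeout2Msl,
}

/// Returns the next state for a transition that appears in the diagram,
/// or None if the diagram has no edge for this (state, event) pair.
pub fn next_state(state: State, event: Event) -> Option<State> {
    use Event::*;
    use State::*;
    match (state, event) {
        // Opening
        (Closed, PassiveOpen) => Some(Listen),
        (Closed, ActiveOpen) => Some(SynSent),
        (Listen, RcvSyn) => Some(SynRcvd),
        (Listen, Close) => Some(Closed),
        (SynSent, RcvSyn) => Some(SynRcvd), // simultaneous open
        (SynSent, RcvSynAck) => Some(Estab),
        (SynSent, Close) => Some(Closed),
        (SynRcvd, RcvAckOfSyn) => Some(Estab),
        // Closing, when we close first
        (SynRcvd, Close) | (Estab, Close) => Some(FinWait1),
        (FinWait1, RcvAckOfFin) => Some(FinWait2),
        (FinWait1, RcvFin) => Some(Closing),
        (FinWait2, RcvFin) => Some(TimeWait),
        (Closing, RcvAckOfFin) => Some(TimeWait),
        (TimeWait, Timeout2Msl) => Some(Closed),
        // Closing, when the peer closes first
        (Estab, RcvFin) => Some(CloseWait),
        (CloseWait, Close) => Some(LastAck),
        (LastAck, RcvAckOfFin) => Some(Closed),
        _ => None,
    }
}
```

Walking the server flow from the notes through this gives `Closed -> Listen -> SynRcvd -> Estab`, matching the left-hand side of the diagram.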