Implementing TCP in Rust (Attempt 2 - Golang)
Create a virtual NIC using TUN, which works at the IP layer
#!/usr/bin/env bash
sudo ip tuntap add user tim mode tun tun_tcp
sudo ip link set tun_tcp up
sudo ip addr add 10.0.0.1/24 dev tun_tcp
go build && ./tcp
Listen on the virtual NIC and print raw IP packets:
config := water.Config{DeviceType: water.TUN}
config.Name = "tun_tcp"
ifce, err := water.New(config)
if err != nil {
return err
}
buf := make([]byte, 1500)
for {
n, err := ifce.Read(buf)
if err != nil {
return err
}
log.Printf("Raw IP payload: % x\n", buf[:n])
}
Test using ping
:
# Terminal window 1
ping -b 10.0.0.10
# Terminal window 2 (where the listener is running)
2022/07/24 10:38:00 Raw IP payload: 45 00 00 54 00 00 40 00 40 01 26 9f 0a 00 00 01 0a 00 00 0a 08 00 d7 fb e5 af 00 01 48 59 dd 62 00 00 00 00 4e c4 07 00 00 00 00 00 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37
2022/07/24 10:38:01 Raw IP payload: 45 00 00 54 00 00 40 00 40 01 26 9f 0a 00 00 01 0a 00 00 0a 08 00 22 c6 e5 af 00 02 49 59 dd 62 00 00 00 00 02 f9 07 00 00 00 00 00 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37
Parse IP packets:
RFC: https://datatracker.ietf.org/doc/html/rfc791#section-3.1
Byte ordering:
The order of transmission of the header and data described in this
document is resolved to the octet level. Whenever a diagram shows a
group of octets, the order of transmission of those octets is the normal
order in which they are read in English. For example, in the following
diagram the octets are transmitted in the order they are numbered.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 1 | 2 | 3 | 4 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 5 | 6 | 7 | 8 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 9 | 10 | 11 | 12 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Transmission Order of Bytes
Figure 10.
Whenever an octet represents a numeric quantity the left most bit in the
diagram is the high order or most significant bit. That is, the bit
labeled 0 is the most significant bit. For example, the following
diagram represents the value 170 (decimal).
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1 0 1 0 1 0 1 0|
+-+-+-+-+-+-+-+-+
Significance of Bits
Figure 11.
Similarly, whenever a multi-octet field represents a numeric quantity
the left most bit of the whole field is the most significant bit. When
a multi-octet quantity is transmitted the most significant octet is
transmitted first.
Header format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| IHL |Type of Service| Total Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identification |Flags| Fragment Offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Time to Live | Protocol | Header Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Destination Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options | Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Example Internet Datagram Header
Parsing an IP header
type IP struct {
Version, HeaderLength, TypeOfService, Flags, TimeToLive, Protocol uint8
TotalLength, Identification, FragmentOffset, HeaderChecksum uint16
SourceAddress, DestinationAddress uint32
}
Read the first byte, and interpret the first four bytes as the version (byte » 4) and the next four bytes as the header length (byte & 0x0F)
func parseIPHeader(raw []byte) (ip IP, err error) {
buf := bytes.NewBuffer(raw)
ip.Version, err = buf.ReadByte()
if err != nil {
return
}
ip.HeaderLength = ip.Version & 0x0F
ip.Version = ip.Version >> 4
return
}
Printing out these fields for the ping example:
❯ ./run.sh
2022/07/24 11:09:36 IP Version: 4
2022/07/24 11:09:36 Header Length: 5
2022/07/24 11:09:37 IP Version: 4
2022/07/24 11:09:37 Header Length: 5
Read multi-byte fields as big-endian:
err = binary.Read(buf, binary.BigEndian, &ip.TotalLength)
if err != nil {
return
}
Parse/pretty-print IP addresses:
func formatIPAddress(addr uint32) string {
return fmt.Sprintf("%d.%d.%d.%d", addr>>24, (addr>>16)&0xFF, (addr>>8)&0xFF, addr&0xFF)
}
With the entire header (except “options”) parsed:
# Terminal 1
$ curl 10.0.0.10:4000
# Terminal 2
$ ./run.sh
IP version: 4
Header length: 5
Type of service: 0
Total length: 60
Identification: 23548
Flags: 10
Fragment offset: 0
TTL: 64
Protocol: 6
Header checksum: 51893
Source IP: 10.0.0.1
Dest IP: 10.0.0.10
And the same packet from termshark:
Do the same thing for the TCP payloads that the IP packets encapsulate
First seek forward to the end of the IP header:
// Jump to the end of the header, which comprises `HeaderLength` 32-bit words
buf.Seek(int64(ip.HeaderLength)*4, io.SeekStart)
And parse this:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Acknowledgment Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data | |U|A|P|R|S|F| |
| Offset| Reserved |R|C|S|S|Y|I| Window |
| | |G|K|H|T|N|N| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | Urgent Pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options | Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| data |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Which looks like:
# Terminal 1
$ curl 10.0.0.10:4000
# Terminal 2
----- IP Header -----
IP version: 4
Header length: 5
Type of service: 0
Total length: 60
Identification: 21716
Flags: 10
Fragment offset: 0
TTL: 64
Protocol: 6
Header checksum: 53725
Source IP: 10.0.0.1
Dest IP: 10.0.0.10
----- TCP Header -----
Source port: 38536
Destination port: 4000
Sequence number: 1963088686
Acknowledgement number: 0
Data offset: 10
Reserved: 0
Control bits: [SYN]
Window: 64240
Checksum: 25038
Urgent pointer: 0
Connections are identified by the 4-tuple (source port, source IP, dest port, dest IP)
What state do we store for each connection?
The maintenance of a TCP connection requires the remembering of several variables. We conceive of these variables being stored in a connection record called a Transmission Control Block or TCB.
Send Sequence Variables
SND.UNA - send unacknowledged
SND.NXT - send next
SND.WND - send window
SND.UP - send urgent pointer
SND.WL1 - segment sequence number used for last window update
SND.WL2 - segment acknowledgment number used for last window update
ISS - initial send sequence number
Receive Sequence Variables
RCV.NXT - receive next
RCV.WND - receive window
RCV.UP - receive urgent pointer
IRS - initial receive sequence number
Current Segment Variables
SEG.SEQ - segment sequence number
SEG.ACK - segment acknowledgment number
SEG.LEN - segment length
SEG.WND - segment window
SEG.UP - segment urgent pointer
SEG.PRC - segment precedence value
Send Sequence Space
1 2 3 4
----------|----------|----------|----------
SND.UNA SND.NXT SND.UNA
+SND.WND
1 - old sequence numbers which have been acknowledged
2 - sequence numbers of unacknowledged data
3 - sequence numbers allowed for new data transmission
4 - future sequence numbers which are not yet allowed
Receive Sequence Space
1 2 3
----------|----------|----------
RCV.NXT RCV.NXT
+RCV.WND
1 - old sequence numbers which have been acknowledged
2 - sequence numbers allowed for new reception
3 - future sequence numbers which are not yet allowed
A connection progresses through a series of states during its
lifetime. The states are: LISTEN, SYN-SENT, SYN-RECEIVED,
ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK,
TIME-WAIT, and the fictional state CLOSED.
Let’s start with all those variables and a state:
type ConnectionState string
type Segment struct {
SequenceNumber, AcknowledgementNumber, Length uint32
Window, UrgentPointer, PrecedenceValue uint32
}
type Connection struct {
State ConnectionState
SendUnacknowledged, SendNext, SendWindow, SendUrgentPointer uint32
SendWL1, SendWL2, InitialSendSequenceNumber uint32
ReceiveNext, ReceiveUrgentPointer, InitialReceiveSequenceNumber uint32
ReceiveWindow uint16
CurrentSendSegment Segment
CurrentReceiveSegment Segment
}
A fundamental notion in the design is that every octet of data sent over a TCP connection has a sequence number. Since every octet is sequenced, each of them can be acknowledged. The acknowledgment mechanism employed is cumulative so that an acknowledgment of sequence number X indicates that all octets up to but not including X have been received.
Initialize some of these variables:
func (c *Connection) Initialize(header *TCP) {
c.State = "LISTEN"
c.InitialSendSequenceNumber = 512
c.SendUnacknowledged = c.InitialSendSequenceNumber
c.SendNext = c.InitialSendSequenceNumber + 1
c.SendWindow = 1024
c.InitialReceiveSequenceNumber = header.SequenceNumber
c.ReceiveNext = header.SequenceNumber + 1
c.ReceiveWindow = 1024
}
And assemble a SYN-ACK response:
func (c *Connection) HandleSegment(header *TCP, payload *bytes.Reader) (response TCP) {
response.SourcePort = header.DestinationPort
response.DestinationPort = header.SourcePort
response.SequenceNumber = c.SendNext
response.AcknowledgmentNumber = c.ReceiveNext
response.DataOffset = 5
response.ControlBits |= 0x02 // SYN
response.ControlBits |= 0x10 // ACK
response.Window = c.ReceiveWindow
c.State = "SYN_RCVD"
return
}
This works but stops short:
❯ sudo tshark -i tun_tcp
1 0.000000000 10.0.0.1 → 10.0.0.10 TCP 60 58802 → 4000 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=4259497560 TSecr=0 WS=128
2 0.000474022 10.0.0.10 → 10.0.0.1 TCP 40 4000 → 58802 [SYN, ACK] Seq=0 Ack=1 Win=1024 Len=0
Maybe checksums are always required and can’t be skipped?
❯ sudo tshark -i tun_tcp -o tcp.check_checksum:TRUE
1 0.000000000 10.0.0.1 → 10.0.0.10 TCP 60 46480 → 4000 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=4259554401 TSecr=0 WS=128
2 0.000182867 10.0.0.10 → 10.0.0.1 TCP 40 4000 → 46480 [SYN, ACK] Seq=0 Ack=1 Win=1024 [TCP CHECKSUM INCORRECT] Len=0
Reference for the math involved: https://datatracker.ietf.org/doc/html/rfc1071
Checksum:
func add1sComplement(x, y uint16) uint16 {
temp := int32(x) + int32(y)
sum := uint16(temp & 0xFFFF)
sum += uint16(temp >> 16)
return sum
}
func (ip *IP) CalcChecksum() uint16 {
var temp1, temp2 uint16
temp1 |= uint16(ip.Version) << 12
temp1 |= uint16(ip.HeaderLength) << 8
temp1 |= uint16(ip.TypeOfService)
temp2 |= uint16(ip.Flags) << 13
temp2 |= uint16(ip.FragmentOffset)
sum := uint16(0)
sum = add1sComplement(sum, temp1)
sum = add1sComplement(sum, ip.TotalLength)
sum = add1sComplement(sum, ip.Identification)
sum = add1sComplement(sum, temp2)
sum = add1sComplement(sum, uint16(ip.TimeToLive)<<8)
sum = add1sComplement(sum, uint16(ip.Protocol))
sum = add1sComplement(sum, uint16(ip.SourceAddress>>16))
sum = add1sComplement(sum, uint16(ip.SourceAddress&0xFFFF))
sum = add1sComplement(sum, uint16(ip.DestinationAddress>>16))
sum = add1sComplement(sum, uint16(ip.DestinationAddress&0xFFFF))
return ^sum
}
func (t *TCP) CalcChecksum(ip *IP) uint16 {
var temp uint16
temp |= uint16(t.DataOffset) << 12
temp |= uint16(t.ControlBits)
sum := uint16(0)
sum = add1sComplement(sum, uint16(ip.SourceAddress>>16))
sum = add1sComplement(sum, uint16(ip.SourceAddress&0xFFFF))
sum = add1sComplement(sum, uint16(ip.DestinationAddress>>16))
sum = add1sComplement(sum, uint16(ip.DestinationAddress&0xFFFF))
sum = add1sComplement(sum, uint16(ip.Protocol))
sum = add1sComplement(sum, uint16(20))
sum = add1sComplement(sum, t.SourcePort)
sum = add1sComplement(sum, t.DestinationPort)
sum = add1sComplement(sum, uint16(t.SequenceNumber>>16))
sum = add1sComplement(sum, uint16(t.SequenceNumber&0xFFFF))
sum = add1sComplement(sum, uint16(t.AcknowledgmentNumber>>16))
sum = add1sComplement(sum, uint16(t.AcknowledgmentNumber&0xFFFF))
sum = add1sComplement(sum, temp)
sum = add1sComplement(sum, t.Window)
sum = add1sComplement(sum, t.UrgentPointer)
return ^sum
}
And it works!
1 0.000000000 10.0.0.1 → 10.0.0.10 TCP 60 58680 → 4000 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=4261213851 TSecr=0 WS=128
2 0.000183952 10.0.0.10 → 10.0.0.1 TCP 40 4000 → 58680 [SYN, ACK] Seq=0 Ack=1 Win=1024 Len=0
3 0.000249325 10.0.0.1 → 10.0.0.10 TCP 40 58680 → 4000 [ACK] Seq=1 Ack=1 Win=64240 Len=0
Wrap up the handshake by updating the local send/receive pointers and move to the ESTAB
state:
if c.State == "SYN_RCVD" {
c.SendUnacknowledged = header.AcknowledgmentNumber
c.SendNext = header.AcknowledgmentNumber + 1
c.ReceiveNext = header.SequenceNumber + 1
c.State = "ESTAB"
}
Once the connection is established, accept incoming data and print it to stdout:
if c.State == "ESTAB" {
var n int64
n, err = io.Copy(os.Stdout, payload)
if err != nil {
return response, err
}
// Update recv pointers
c.ReceiveNext = header.SequenceNumber + uint32(n)
response.AcknowledgmentNumber = c.ReceiveNext
response.ControlBits |= 0x10 // ACK
// Update send pointers (not actually sending anything yet)
response.SequenceNumber = c.SendNext
response.SourcePort = header.DestinationPort
response.DestinationPort = header.SourcePort
response.DataOffset = 5
response.Window = c.ReceiveWindow
return
}
And it works!
The implementation doesn’t do any offset checking yet, and assumes that any incoming data is correct; I’ll need to implement these checks:
In response to sending data the TCP will receive acknowledgments. The
following comparisons are needed to process the acknowledgments.
SND.UNA = oldest unacknowledged sequence number
SND.NXT = next sequence number to be sent
SEG.ACK = acknowledgment from the receiving TCP (next sequence
number expected by the receiving TCP)
SEG.SEQ = first sequence number of a segment
SEG.LEN = the number of octets occupied by the data in the segment
(counting SYN and FIN)
SEG.SEQ+SEG.LEN-1 = last sequence number of a segment
A new acknowledgment (called an "acceptable ack"), is one for which
the inequality below holds:
SND.UNA < SEG.ACK =< SND.NXT
A segment on the retransmission queue is fully acknowledged if the sum
of its sequence number and length is less or equal than the
acknowledgment value in the incoming segment.
When data is received the following comparisons are needed:
RCV.NXT = next sequence number expected on an incoming segments, and
is the left or lower edge of the receive window
RCV.NXT+RCV.WND-1 = last sequence number expected on an incoming
segment, and is the right or upper edge of the receive window
SEG.SEQ = first sequence number occupied by the incoming segment
SEG.SEQ+SEG.LEN-1 = last sequence number occupied by the incoming
segment
A segment is judged to occupy a portion of valid receive sequence
space if
RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
or
RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
The first part of this test checks to see if the beginning of the
segment falls in the window, the second part of the test checks to see
if the end of the segment falls in the window; if the segment passes
either part of the test it contains data in the window.
Actually, it is a little more complicated than this. Due to zero
windows and zero length segments, we have four cases for the
acceptability of an incoming segment:
Segment Receive Test
Length Window
------- ------- -------------------------------------------
0 0 SEG.SEQ = RCV.NXT
0 >0 RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
>0 0 not acceptable
>0 >0 RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
Note that when the receive window is zero no segments should be
acceptable except ACK segments. Thus, it is be possible for a TCP to
maintain a zero receive window while transmitting data and receiving
ACKs. However, even when the receive window is zero, a TCP must
process the RST and URG fields of all incoming segments.
But first, let’s have the server write back to the client. Set up a basic REPL + indexing for the connections we hold:
func dispatch(line string, connections *Connections) {
if line == "c" || line == "connections" {
connections.Inspect()
}
if strings.HasPrefix(line, "write") {
words := strings.Split(line, " ")
if len(words) != 3 {
fmt.Fprintf(os.Stderr, "usage: write <conn_id> <text>\n")
return
}
connId, err := strconv.ParseInt(words[1], 10, 64)
if err != nil {
fmt.Fprintf(os.Stderr, "conn_id must be a number")
return
}
text := words[2]
connections.m[connections.ids[connId]].Write([]byte(text))
}
}
func repl(connections *Connections) {
reader := bufio.NewReader(os.Stdin)
for {
fmt.Print("> ")
line, err := reader.ReadString('\n')
if err != nil {
log.Fatal(err)
}
dispatch(strings.TrimSpace(line), connections)
}
}
Writes increment snd.nxt
, and snd.una
is incremented when the writes are acknowledged.
Checksum code needed to be updated
Don’t send an ACK when receiving an ACK without a payload (this is a duplicate ACK)
Mutex on a connection
When both sides are writing, our write could be merged with an ACK, but this coordination is a bit more complex so I’m going to live with the duplication.
I started by setting snd.wnd
to a hardcoded buffer size, but realized it actually represents the number of bytes that the other side of the connection has room for. So I changed it to be the value in the header’s Window field every time we receive a packet.
Closing a connection is essentially:
- One side sends a FIN
- Other side ACKs the FIN
- Other side sends a FIN (may or may not be combined with 2.)
- The original initiator acks the FIN
- Connection is closed
So far so good, but we’re currently doing no checking against sequence numbers on either side. We’re supposed to:
- Ignore incoming segments that have already been acknowledged
- Ignore incoming segments that are ACKing data that has already been ACKed or data that hasn’t been sent yet
To demonstrate I added a large time.Sleep
before ACKing a packet, and I see this:
Fix by checking if an incoming segment is valid. The RFC defines “valid” as one of:
- The start of the segment is >=
rcv.nxt
<rcv.nxt + window size
- The end of the segment is >=
rcv.nxt
<rcv.nxt + window size
So technically this allows sloppy segments that partially contain already-acknowledged data, or contain some data outside the window. The second case is easy to manage: just accept data up to the window and discard the rest. But the first is a bit trickier because the window now has a gap, and some bookkeeping is required.
Anyway that check prevents duplicates but doesn’t prevent reordering:
The simple solution is to only accept segments where seg.seq
exactly equals rcv.nxt
and avoid buffering out-of-order segments. This is not incorrect but is inefficient because a redundant retransmission is required. I’ll start with this and maybe add buffering for out-of-order segments later.
This solves the ordering issue, but is very slow if the server is multi-threaded
Next steps:
- Implement index checks for the rcv side
- Implement index checks for acks
- Validate checksums
- Buffer out-of-order segments
- Retransmission
- Run with simulated packet loss
- Respect the receiver’s window size when sending data
- Per-connection write buffers (maybe)
- Read the RFC fully