2026-05-10
Writing SIMD/AVX-512 Code in Go
Learning about Golang's assembler by building a simple SIMD-based grep alternative.
Rust includes the asm! macro, which lets you drop down into assembly:
use std::arch::asm;
// Multiply x by 6 using shifts and adds
let mut x: u64 = 4;
unsafe {
asm!(
"mov {tmp}, {x}",
"shl {tmp}, 1",
"shl {x}, 2",
"add {x}, {tmp}",
x = inout(reg) x,
tmp = out(reg) _,
);
}
assert_eq!(x, 4 * 6);
I recently discovered that Golang ships with an assembler that you can use to acheieve a similar result. Go’s assembly doesn’t map directly to the underlying machine and is instead based on Plan 9 syntax. Here’s the Go equivalent of the Rust code above:
package main
import "fmt"
func mul6(x uint64) uint64
func main() {
got := mul6(uint64(4))
fmt.Printf("mul6(%d) = %d\n", x, got)
}#include "textflag.h"
TEXT ·mul6(SB), NOSPLIT, $0-16
MOVQ x+0(FP), AX
MOVQ AX, CX // tmp = x
SHLQ $1, CX // tmp <<= 1 (2x)
SHLQ $2, AX // x <<= 2 (4x)
ADDQ CX, AX // x += tmp (6x)
MOVQ AX, ret+8(FP)
RETYou can run this the usual way:
❯ GOARCH=amd64 go run .
mul6(4) = 24
SIMD
CPU instructions typically operate on a single word at a time (typically 64 or 32 bits). SIMD (single instruction, multiple data) specifies a separate set of instructions that are able to operate on larger amounts of data with a single instruction. Of course this requires hardware support, including larger registers that are able to store multiple words of data.
This visualization illustrates SIMD vs. non-SIMD implementations of a search across 32 bytes. Scalar is a simple one-byte-at-a-time for loop, SWAR packs each word as tightly as possible, and SIMD uses 16-byte SIMD instructions.
Common SIMD implementations include:
| Size | Vendor | |
|---|---|---|
| Neon | 128-bit | ARM |
| AVX2 | 256-bit | Intel |
| AVX-512 | 512-bit | Intel |
AVX-512 in particular is able to fetch an entire (64-byte) cache line using a single instruction, and all AVX-512 instructions can operate on that cache line in its entirety.
Text Search Using SIMD
I wanted to explore this by building something a bit more substantial, and after some detours I eventually settled on a simple SIMD-based grep replacement. Let’s use Golang to build a command-line tool that will accept a filename and a pattern, and count the number of times the pattern occurs in the file using SIMD instructions. Let’s target AVX-512 in particular, and limit the size of allowed patterns so they always fit in a single AVX-512 register (512 bits or 64 bytes).
Taking inspiration from prior art in this space, we’ll use a SIMD-accelerated inner loop to identify candidates, and verify those candidates using “regular” Golang code. The inner loop will identify candidates only by their first and last characters. Put another way, it will find all substrings that are the same length as the pattern and share both its first and last characters.
Here’s the high-level algorithm. For each iteration of this inner loop, we start by loading four 512-bit (or 64-byte) registers:
- Register ZMMF: create (or broadcast) a 512-bit vector containing just the first character of the pattern
- Register ZMML: create (or broadcast) a 512-bit vector containing just the last character of the pattern
- Register ZMMA: load the first 512 bits of the text you’re searching through
- Register ZMMB: load 512 bits of the text offset by the pattern length
We then perform two equality comparisons. In both cases, the result is a 64-bit value with a raised bit if both ZMM bytes at that position are equal.
- m1: Use
VPCMPEQBto compare (byte-level equality) ZMMF against ZMMA. - m2: Use
VPCMPEQBto compare (byte-level equality) ZMML against ZMMB
We then AND the results of the equality comparisons:
- AND: use
KANDQto combine m1 and m2
The output of this final AND operation is yet another 64-bit value (one bit per ZMM byte) that has a raised bit for every position in ZMMA that has a candidate (same length as pattern, and same starting and ending character).
Here is a visualization of the behavior described above:
buf[20..26] = "jumped" Illustration of the SIMD-based inner loop for pattern matching. All green cells in the last row indicate possible matches that are then verified in an outer (pure-Golang) loop.
Writing SIMD Code in Golang
Golang doesn’t perform vectorization (implicitly convert non-SIMD code into SIMD code), so the only way you can use SIMD from Golang is by dropping down into assembly.
Here’s the generated assembly for the SIMD inner loop I described above:
// ============================================================================
// ScanAVX512: scan haystack for the first position where haystack[i]==first
// AND haystack[i+offset]==last. Returns that i, or -1.
//
// Plan 9 ABI — args sit on the caller's frame at fixed (FP) offsets:
// haystack []byte base= +0(FP) len= +8(FP) cap= +16(FP) (unused)
// first byte +24(FP)
// last byte +25(FP)
// offset int +32(FP)
// ret int +40(FP)
//
// Frame: $0-48 = 0 bytes of locals, 48 bytes of args+ret on caller frame.
// NOSPLIT: skip stack-growth check (function uses no Go stack).
// ============================================================================
// func ScanAVX512(haystack []byte, first byte, last byte, offset int) int
// Requires: AVX512BW (byte-granular VPCMPEQB on ZMM), BMI (TZCNT)
TEXT ·ScanAVX512(SB), NOSPLIT, $0-48
// ---- Load arguments into GPRs --------------------------------------
MOVQ haystack_base+0(FP), AX // AX = &haystack[0] (base ptr)
MOVQ haystack_len+8(FP), CX // CX = len(haystack)
MOVB first+24(FP), DL // DL = first byte (anchor #1)
MOVB last+25(FP), BL // BL = last byte (anchor #2)
MOVQ offset+32(FP), SI // SI = offset = plen-1
MOVBLZX DL, DX // DX = zero-extended first (for VPBROADCASTB)
MOVBLZX BL, BX // BX = zero-extended last
// ---- Compute the largest legal start index --------------------------
// The loop loads two 64-byte windows: one at i and one at i+offset.
// The last legal i therefore satisfies i + offset + 64 <= len.
// Reuse CX as that bound: CX = len - offset - 64.
SUBQ SI, CX // CX = len - offset
SUBQ $0x40, CX // CX = len - offset - 64 (max i)
MOVQ $-1, DI // DI = sentinel "not found" return
CMPQ CX, $0x00
JL done_notfound // buffer too small for one window → bail
// ---- Broadcast anchors across full ZMM registers --------------------
VPBROADCASTB DX, Z0 // Z0 = 64 copies of `first`
VPBROADCASTB BX, Z1 // Z1 = 64 copies of `last`
XORQ DX, DX // DX = i = 0 (loop induction var)
loop:
// ---- Bounds check: i must be <= max-i (CX) --------------------------
// Note: JG is signed; CX>=0 here so this is fine. JG (not JGE) means
// we still process the iteration where i == CX (the last valid window).
CMPQ DX, CX
JG done_notfound
// ---- Load the two candidate windows ---------------------------------
VMOVDQU8 (AX)(DX*1), Z2 // Z2 = haystack[i .. i+63] (the "first-byte" window)
MOVQ DX, BX
ADDQ SI, BX // BX = i + offset
VMOVDQU8 (AX)(BX*1), Z3 // Z3 = haystack[i+offset .. i+offset+63]
// ---- Anchor compares (64 byte-tests in parallel each) ---------------
VPCMPEQB Z0, Z2, K1 // K1[j] = (Z2[j] == first) for j in 0..63
VPCMPEQB Z1, Z3, K2 // K2[j] = (Z3[j] == last)
KANDQ K1, K2, K1 // K1 = K1 AND K2 (both anchors hit)
// ---- Inspect the candidate mask -------------------------------------
KMOVQ K1, BX // BX = 64-bit candidate mask
TESTQ BX, BX
JZ next // no candidates in this window → advance
// Lowest set bit = earliest candidate offset within the window.
TZCNTQ BX, AX // AX = trailing-zero count = first match offset
ADDQ DX, AX // AX = i + that offset = absolute index
MOVQ AX, ret+40(FP)
RET
next:
// Advance one full ZMM stride (64 bytes). Note: this *cannot* miss a
// hit because anchor #1 lives entirely within the [i, i+64) window we
// just compared; a candidate starting at i+64 will be covered next iter.
ADDQ $0x40, DX
JMP loop
done_notfound:
MOVQ DI, ret+40(FP) // return -1
RETYou call ScanAVX512 just like a regular Go function, and it will return the index of the first matching candidate (if any) in the haystack. Wrapper code verifies matches and handles windowing across the whole haystack.
How Does SIMD Perform?
I used hyperfine to compare the SIMD implementation above against grep, ripgrep, and a pure-Golang text search that I wrote as a baseline. These were executed on an Intel Xeon Platinum 8168 (Skylake-SP, AVX-512BW) against a 1GB haystack. Warmup runs were used to fault the haystack into memory/page cache before benchmarking runs.
| Tool | Time | Throughput | Speedup |
|---|---|---|---|
| sgrep scalar | 799 ms | ~1.3 GB/s | 1.00× |
| sgrep AVX-512 | 231 ms | ~4.6 GB/s | 3.46× |
ripgrep (-c -F) | 262 ms | ~4.1 GB/s | 3.05× |
GNU grep (-c) | 707 ms | ~1.5 GB/s | 1.13× |
Overall, not too bad! A significant speedup over grep for not too much extra complexity, especially if you are specifically targeting amd64 servers that are likely to support AVX-512.
The memory bandwidth on this machine was measured to be 11GB/s. So while the SIMD approach is 3X faster than both the pure-Golang implementation and grep, there’s a long way to go until the theoretical limit.
An exercise for another day!
Administrivia
- All code for this implementation is here: https://github.com/timothyandrew/sgrep
- All text above was written by hand. Code was written with AI assistance.
- Thanks to mazzo.li for the inspiration!