Writing SIMD/AVX-512 Code in Go

Rust includes the asm! macro, which lets you drop down into assembly:

use std::arch::asm;

// Multiply x by 6 using shifts and adds
let mut x: u64 = 4;
unsafe {
    asm!(
        "mov {tmp}, {x}",
        "shl {tmp}, 1",
        "shl {x}, 2",
        "add {x}, {tmp}",
        x = inout(reg) x,
        tmp = out(reg) _,
    );
}
assert_eq!(x, 4 * 6);

I recently discovered that Golang ships with an assembler that you can use to acheieve a similar result. Go’s assembly doesn’t map directly to the underlying machine and is instead based on Plan 9 syntax. Here’s the Go equivalent of the Rust code above:

main.go

package main

import "fmt"

func mul6(x uint64) uint64

func main() {
	got := mul6(uint64(4))
	fmt.Printf("mul6(%d) = %d\n", x, got)
}

mul6_amd64.s

#include "textflag.h"

TEXT ·mul6(SB), NOSPLIT, $0-16
	MOVQ	x+0(FP), AX
	MOVQ	AX, CX		// tmp = x
	SHLQ	$1, CX		// tmp <<= 1   (2x)
	SHLQ	$2, AX		// x   <<= 2   (4x)
	ADDQ	CX, AX		// x  += tmp   (6x)
	MOVQ	AX, ret+8(FP)
	RET

You can run this the usual way:

❯ GOARCH=amd64 go run .
mul6(4) = 24

SIMD

CPU instructions typically operate on a single word at a time (typically 64 or 32 bits). SIMD (single instruction, multiple data) specifies a separate set of instructions that are able to operate on larger amounts of data with a single instruction. Of course this requires hardware support, including larger registers that are able to store multiple words of data.

SWAR 7× SIMD 16×

0 64 128 192 256 288

cyc

scalar

0/288

SWAR

✓ done

0/44

SIMD

✓ done

0/18

load compare set/extract store branch —

This visualization illustrates SIMD vs. non-SIMD implementations of a search across 32 bytes. Scalar is a simple one-byte-at-a-time for loop, SWAR packs each word as tightly as possible, and SIMD uses 16-byte SIMD instructions.

Common SIMD implementations include:

	Size	Vendor
Neon	128-bit	ARM
AVX2	256-bit	Intel
AVX-512	512-bit	Intel

AVX-512 in particular is able to fetch an entire (64-byte) cache line using a single instruction, and all AVX-512 instructions can operate on that cache line in its entirety.

Text Search Using SIMD

I wanted to explore this by building something a bit more substantial, and after some detours I eventually settled on a simple SIMD-based grep replacement. Let’s use Golang to build a command-line tool that will accept a filename and a pattern, and count the number of times the pattern occurs in the file using SIMD instructions. Let’s target AVX-512 in particular, and limit the size of allowed patterns so they always fit in a single AVX-512 register (512 bits or 64 bytes).

Taking inspiration from prior art in this space, we’ll use a SIMD-accelerated inner loop to identify candidates, and verify those candidates using “regular” Golang code. The inner loop will identify candidates only by their first and last characters. Put another way, it will find all substrings that are the same length as the pattern and share both its first and last characters.

Here’s the high-level algorithm. For each iteration of this inner loop, we start by loading four 512-bit (or 64-byte) registers:

Register ZMM_F: create (or broadcast) a 512-bit vector containing just the first character of the pattern
Register ZMM_L: create (or broadcast) a 512-bit vector containing just the last character of the pattern
Register ZMM_A: load the first 512 bits of the text you’re searching through
Register ZMM_B: load 512 bits of the text offset by the pattern length

We then perform two equality comparisons. In both cases, the result is a 64-bit value with a raised bit if both ZMM bytes at that position are equal.

m₁: Use VPCMPEQB to compare (byte-level equality) ZMM_F against ZMM_A.
m₂: Use VPCMPEQB to compare (byte-level equality) ZMM_L against ZMM_B

We then AND the results of the equality comparisons:

AND: use KANDQ to combine m₁ and m₂

The output of this final AND operation is yet another 64-bit value (one bit per ZMM byte) that has a raised bit for every position in ZMM_A that has a candidate (same length as pattern, and same starting and ending character).

Here is a visualization of the behavior described above:

candidates 3 match i=20

pattern

jumped

first j last d offset 5

haystack

the quick brown fox jumped over a jagged log and joined a jaded crow.

ZMM_F

jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj

ZMM_A

the quick brown fox jumped over a jagged log and joined a jaded

m₁

ZMM_L

dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

ZMM_B

uick brown fox jumped over a jagged log and joined a jaded crow.

m₂

AND

↑ TZCNT = 20

verify

j ✓ u ✓ m ✓ p ✓ e ✓ d ✓

✓ match at buf[20..26] = "jumped"

first anchor last anchor candidate idle

Illustration of the SIMD-based inner loop for pattern matching. All green cells in the last row indicate possible matches that are then verified in an outer (pure-Golang) loop.

Writing SIMD Code in Golang

Golang doesn’t perform vectorization (implicitly convert non-SIMD code into SIMD code), so the only way you can use SIMD from Golang is by dropping down into assembly.

Here’s the generated assembly for the SIMD inner loop I described above:

scan_amd64.s

// ============================================================================
// ScanAVX512: scan haystack for the first position where haystack[i]==first
//             AND haystack[i+offset]==last. Returns that i, or -1.
//
// Plan 9 ABI — args sit on the caller's frame at fixed (FP) offsets:
//   haystack []byte  base=  +0(FP)   len= +8(FP)   cap= +16(FP) (unused)
//   first    byte         +24(FP)
//   last     byte         +25(FP)
//   offset   int          +32(FP)
//   ret      int          +40(FP)
//
// Frame:  $0-48 = 0 bytes of locals, 48 bytes of args+ret on caller frame.
// NOSPLIT: skip stack-growth check (function uses no Go stack).
// ============================================================================
// func ScanAVX512(haystack []byte, first byte, last byte, offset int) int
// Requires: AVX512BW (byte-granular VPCMPEQB on ZMM), BMI (TZCNT)
TEXT ·ScanAVX512(SB), NOSPLIT, $0-48
      // ---- Load arguments into GPRs --------------------------------------
      MOVQ         haystack_base+0(FP), AX     // AX = &haystack[0]      (base ptr)
      MOVQ         haystack_len+8(FP), CX      // CX = len(haystack)
      MOVB         first+24(FP), DL            // DL = first byte (anchor #1)
      MOVB         last+25(FP), BL             // BL = last  byte (anchor #2)
      MOVQ         offset+32(FP), SI           // SI = offset = plen-1
      MOVBLZX      DL, DX                      // DX = zero-extended first  (for VPBROADCASTB)
      MOVBLZX      BL, BX                      // BX = zero-extended last

      // ---- Compute the largest legal start index --------------------------
      // The loop loads two 64-byte windows: one at i and one at i+offset.
      // The last legal i therefore satisfies  i + offset + 64 <= len.
      // Reuse CX as that bound:  CX = len - offset - 64.
      SUBQ         SI, CX                      // CX = len - offset
      SUBQ         $0x40, CX                   // CX = len - offset - 64   (max i)

      MOVQ         $-1, DI                     // DI = sentinel "not found" return
      CMPQ         CX, $0x00
      JL           done_notfound               // buffer too small for one window → bail

      // ---- Broadcast anchors across full ZMM registers --------------------
      VPBROADCASTB DX, Z0                      // Z0 = 64 copies of `first`
      VPBROADCASTB BX, Z1                      // Z1 = 64 copies of `last`

      XORQ         DX, DX                      // DX = i = 0  (loop induction var)

  loop:
      // ---- Bounds check: i must be <= max-i (CX) --------------------------
      // Note: JG is signed; CX>=0 here so this is fine. JG (not JGE) means
      // we still process the iteration where i == CX (the last valid window).
      CMPQ     DX, CX
      JG       done_notfound

      // ---- Load the two candidate windows ---------------------------------
      VMOVDQU8 (AX)(DX*1), Z2                  // Z2 = haystack[i .. i+63]   (the "first-byte" window)
      MOVQ     DX, BX
      ADDQ     SI, BX                          // BX = i + offset
      VMOVDQU8 (AX)(BX*1), Z3                  // Z3 = haystack[i+offset .. i+offset+63]

      // ---- Anchor compares (64 byte-tests in parallel each) ---------------
      VPCMPEQB Z0, Z2, K1                      // K1[j] = (Z2[j] == first)  for j in 0..63
      VPCMPEQB Z1, Z3, K2                      // K2[j] = (Z3[j] == last)
      KANDQ    K1, K2, K1                      // K1 = K1 AND K2  (both anchors hit)

      // ---- Inspect the candidate mask -------------------------------------
      KMOVQ    K1, BX                          // BX = 64-bit candidate mask
      TESTQ    BX, BX
      JZ       next                            // no candidates in this window → advance

      // Lowest set bit = earliest candidate offset within the window.
      TZCNTQ   BX, AX                          // AX = trailing-zero count = first match offset
      ADDQ     DX, AX                          // AX = i + that offset = absolute index
      MOVQ     AX, ret+40(FP)
      RET

  next:
      // Advance one full ZMM stride (64 bytes). Note: this *cannot* miss a
      // hit because anchor #1 lives entirely within the [i, i+64) window we
      // just compared; a candidate starting at i+64 will be covered next iter.
      ADDQ $0x40, DX
      JMP  loop

  done_notfound:
      MOVQ DI, ret+40(FP)                      // return -1
      RET

You call ScanAVX512 just like a regular Go function, and it will return the index of the first matching candidate (if any) in the haystack. Wrapper code verifies matches and handles windowing across the whole haystack.

How Does SIMD Perform?

I used hyperfine to compare the SIMD implementation above against grep, ripgrep, and a pure-Golang text search that I wrote as a baseline. These were executed on an Intel Xeon Platinum 8168 (Skylake-SP, AVX-512BW) against a 1GB haystack. Warmup runs were used to fault the haystack into memory/page cache before benchmarking runs.

Tool	Time	Throughput	Speedup
sgrep scalar	799 ms	~1.3 GB/s	1.00×
sgrep AVX-512	231 ms	~4.6 GB/s	3.46×
ripgrep (`-c -F`)	262 ms	~4.1 GB/s	3.05×
GNU grep (`-c`)	707 ms	~1.5 GB/s	1.13×

Overall, not too bad! A significant speedup over grep for not too much extra complexity, especially if you are specifically targeting amd64 servers that are likely to support AVX-512.

The memory bandwidth on this machine was measured to be 11GB/s. So while the SIMD approach is 3X faster than both the pure-Golang implementation and grep, there’s a long way to go until the theoretical limit.

An exercise for another day!

Administrivia

All code for this implementation is here: https://github.com/timothyandrew/sgrep
All text above was written by hand. Code was written with AI assistance.
Thanks to mazzo.li for the inspiration!