Tim Andrew May 2026 home blog learning rss

2026-05-12

FuzzyMatchV2: fzf's scoring algorithm, visualized

A visual walkthrough of how fzf threads a pattern through text using a modified Smith–Waterman DP, with an interactive playground for the score and consecutive-run matrices.

Inspired by Replacing a 3 GB SQLite database with a 7 MB FST binary — which sent me down the rabbit hole of how fzf actually scores matches.

fzf treats fuzzy matching as an optimization problem: among all ways a pattern can be threaded through the text in order, which way scores highest? FuzzyMatchV2 is a modified Smith–Waterman that answers that with a dynamic programming table — no character omissions or mismatches allowed, just choices about where the gaps go.

The four phases

Every call to FuzzyMatchV2 runs four phases:

  • Phase 1 — ASCII gate. A cheap two-pointer scan over the bytes (asciiFuzzyIndex) finds the leftmost minIdx and rightmost maxIdx that could possibly contain the pattern. If even this fails, return -1 immediately.
  • Phase 2 — bonuses & first row. Walk the text once, computing a per-position bonus B[i], recording F[k] = first index where pattern[k] appears, and filling the first row of the DP matrix H₀.
  • Phase 3 — DP fill. Fill H[1..M-1] row by row using the recurrence below. The max value in the last row is the best score.
  • Phase 4 — backtrace. Walk back through H (and the consecutive-run matrix C) to recover which text positions each pattern character matched.

Phase 2 is also doing double duty as a feasibility check: if it can’t even greedily locate every pattern char in order, the function exits with no match and Phase 3 is skipped entirely.

The scoring rules

SignalConstantValueWhy
Character matchscoreMatch+16Base reward for landing a pattern char.
Gap (skip) startsscoreGapStart−3Big penalty when you first open a gap.
Gap continuesscoreGapExtension−1Cheaper to keep extending an already-open gap.
Word boundarybonusBoundary+8Matching at the start of a word is meaningful.
After whitespace / BOSbonusBoundaryWhite+10Even stronger boundary.
After / , : ; |bonusBoundaryDelimiter+9Path/CSV-style separators.
camelCase / letter→digitbonusCamel123+7Implicit boundary inside a word.
Consecutive matchbonusConsecutive+4Reward runs so foob hits foobar, not foo-bar.
First-char multiplierbonusFirstCharMultiplier×2The first char you typed weighs more.

The boundary bonus is calibrated to be cancelled out by ~8 chars of gap. That keeps fzf a fuzzy finder rather than an acronym matcher: short tight matches beat sprawling boundary-hopping ones.

The recurrence

Let H[i][j] be the best score for matching pattern[0..i] against text[..j] where pattern[i] matches at position j. Two candidates compete:

s1 = H[i-1][j-1] + scoreMatch + bonus        (consume a pattern char)
s2 = H[i  ][j-1] + (inGap ? gapExt : gapStart)  (skip text char, stay on same pattern char)

H[i][j] = max(s1, s2, 0)

A parallel matrix C[i][j] tracks the length of the consecutive-match run ending at (i, j). When the run is > 1, the bonus is upgraded to the maximum of (this position’s bonus, the chunk’s leading bonus, the consecutive constant) — so a run of three matches gets credit for being a run, not just three isolated matches.

A subtle rule: if the gap branch is genuinely better even after applying that upgraded run-bonus to the match branch, the run is reset to zero — the algorithm prefers the high-scoring gap path and treats this match as an isolated start.

Try it

positions: score:

Bonus per character (B)

Each text position gets a static bonus based on the class transition from the previous character (e.g. delimiter → letter scores +9). These don't depend on the pattern at all — they're computed once, then reused.

DP matrix — hover any cell to see how it was computed

backtrace path positive score (heatmap) zero

Reading the matrix

Rows are pattern characters (top row = pattern[0]). Columns are text positions, but only the columns from F[0] (first viable start) through lastIdx (last position of pattern[M-1]) — the matrix is trimmed to the feasible window.

Cells in the last row are the candidate end positions. The argmax of that row is where the backtrace starts. The backtrace then steps diagonally on a match (consume a pattern char) or left on a gap (skip a text char), comparing H[i-1][j-1] vs. H[i][j-1] at each step.

Why a separate C matrix?

Without it, three consecutive matches would each be scored independently and a single high-bonus match (e.g. at a word boundary, +8) could beat them. C records how long the current run is, so when a run reaches length 2+, the per-cell bonus is upgraded to max(currentBonus, runStartBonus, bonusConsecutive). That’s what makes foob prefer “foobar” over “foo-bar”.

Scaling up: a list of candidates

So far we’ve only looked at scoring one piece of text. But in practice fzf is reading lines from stdin — often hundreds of thousands of them — and you want the top few ranked instantly as you type. The single-candidate scorer is the inner loop; everything around it is a pipeline.

The pipeline

  1. Ingest. Candidates arrive on stdin and are sliced into fixed-size chunks (Chunk, ~256 items each in fzf’s core.go).
  2. Score in parallel. A pool of goroutines pulls chunks off a queue and runs FuzzyMatchV2 on each candidate. Scoring is embarrassingly parallel — each candidate is independent.
  3. Filter. Anything with score ≤ 0 is discarded.
  4. Sort. Survivors go into a min-heap keyed by (−score, length, index). fzf only needs the top N visible rows, so this is effectively partial sort.
  5. Render. The TUI redraws on every keystroke.

The cross-keystroke cache

The most interesting optimization is what fzf does between keystrokes. If your query goes from foo to foob, the new query is a strict extension of the old one — and any candidate that didn’t contain foo certainly doesn’t contain foob. So fzf keeps the previous match set in a cache, and when the new query starts with the previous query, it only re-scores those survivors instead of the full corpus.

For a corpus of 100k items where only 50 match your first character, this turns 100k×len-of-query work per keystroke into ~50×len-of-query after that first keystroke. The cache is invalidated as soon as you delete a character or change the query in a way that isn’t a strict extension (e.g. switching modes between fuzzy and exact match).

matches of scored this keystroke saved by cache

    Toggle the cache off and watch the “scored” count: with the cache, it shrinks as you type; without it, fzf has to score the whole list on every keystroke.

    Pattern grammar

    The query string isn’t just one fuzzy pattern — fzf parses it into a small AND-of-terms expression:

    • foo bar — both terms must match (each independently fuzzy-scored, scores summed).
    • 'foo — exact substring match (no fuzziness).
    • ^foo — must appear at start.
    • foo$ — must appear at end.
    • !foo — must not appear (negation).

    So 'src ^algo means “contains the literal substring src AND starts with algo”. Each candidate has to satisfy every term; failing any one drops it.

    Tiebreakers

    When two candidates score the same, fzf breaks the tie with — in order — shorter length, then earlier original position. That’s why typing algo in a Go repo surfaces algo.go before algorithm_test.go: same fuzzy score, shorter wins.

    Bailouts and edge cases

    • Empty pattern → score 0, immediate return.
    • Pattern longer than text → no match.
    • Single-char pattern → skip Phase 3; just take the max of H₀. With forward=true, exit early at the first position with a boundary bonus.
    • N×M too big for the slab, or M>1000 → fall back to the O(n) greedy V1. The 1000 cap exists because H uses int16 and very long patterns could overflow.
    • ASCII fast path: Phase 1 also returns maxIdx, trimming the text window the DP has to consider — often a big win on long lines.

    Source

    The canonical implementation is in src/algo/algo.go; the function starts at FuzzyMatchV2 around line 428. The top-of-file comment walks through the scoring rationale with the same examples used here.