A journey to searching Have I Been Pwned database in 49μs

https://news.ycombinator.com/item?id=22459661

This is a neat anecdote: progressively optimizing the process of searching through a 22 GB file of SHA-1 hashes. At a high level:

  1. grep: too slow, ~30s for a single pass.
  2. trie: not a great fit because the input is too random to benefit from the shared-prefix compression that tries provide.
  3. sort the file + binary search: this is better; at 22 GB the data doesn’t fit comfortably in RAM, so an external merge sort is really the only sort that makes sense, and each lookup then costs only ~30 probes (see the first sketch after this list).
  4. sort but don’t merge: skip the final merge pass and keep 256 sorted buckets, one per first-byte prefix 00 to FF. The bucket already encodes each hash’s first byte, so that byte can be dropped from every record, which compresses the data as well (second sketch below).
  5. B-Tree: this is the most flexible, allowing for efficient insertions in addition to retrieval, and is the approach behind okon, the 49μs entry in the benchmark below.
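
As a sketch of step 3 (and only a sketch), here is what the binary search can look like in C++. Everything concrete is an assumption for illustration: the 40-character hex hashes are preprocessed into raw 20-byte binary records concatenated in sorted order, and `pwned-sorted.bin` is a made-up file name.

```cpp
// Minimal sketch of step 3: binary search over one big sorted file.
// Assumes (hypothetically) the hex hashes were converted to raw 20-byte
// records and sorted into "pwned-sorted.bin"; the layout is illustrative.
#include <array>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>

using Sha1 = std::array<std::uint8_t, 20>;

// Classic binary search, except each probe is a seek plus a 20-byte read.
bool contains(std::ifstream& file, std::uint64_t record_count, const Sha1& needle) {
    std::uint64_t lo = 0, hi = record_count;  // half-open interval [lo, hi)
    Sha1 current;
    while (lo < hi) {
        const std::uint64_t mid = lo + (hi - lo) / 2;
        file.seekg(static_cast<std::streamoff>(mid * sizeof(Sha1)));
        file.read(reinterpret_cast<char*>(current.data()), sizeof(Sha1));
        const int cmp = std::memcmp(current.data(), needle.data(), sizeof(Sha1));
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
}

int main() {
    std::ifstream file("pwned-sorted.bin", std::ios::binary | std::ios::ate);
    const auto records = static_cast<std::uint64_t>(file.tellg()) / sizeof(Sha1);
    Sha1 needle{};  // fill with the 20 raw bytes of the hash being looked up
    std::cout << (contains(file, records, needle) ? "pwned" : "clean") << '\n';
}
```

With roughly half a billion records, a lookup touches only ~30 of them, but every probe is a seek, which is why the bucketed and B-Tree variants still pay off.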

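Step 4, under the same assumptions: one sorted file per leading byte, so the search range shrinks by 8 bits before the first probe. The `buckets/XX.bin` layout is likewise hypothetical, not okon’s actual format.

```cpp
// Minimal sketch of step 4: 256 sorted buckets, one per leading byte.
// The first byte is implied by the bucket file's name, so each record
// shrinks from 20 to 19 bytes; the "buckets/XX.bin" layout is made up.
#include <array>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <sstream>

using Sha1 = std::array<std::uint8_t, 20>;
constexpr std::size_t kTail = 19;  // a hash minus its first (bucket) byte

bool contains_bucketed(const Sha1& needle) {
    // Route the query to its bucket, e.g. first byte 0xA3 -> "buckets/a3.bin".
    std::ostringstream name;
    name << "buckets/" << std::hex << std::setw(2) << std::setfill('0')
         << static_cast<unsigned>(needle[0]) << ".bin";
    std::ifstream file(name.str(), std::ios::binary | std::ios::ate);
    if (!file) return false;

    std::uint64_t lo = 0;
    std::uint64_t hi = static_cast<std::uint64_t>(file.tellg()) / kTail;
    std::array<std::uint8_t, kTail> current;
    while (lo < hi) {  // same binary search, now over 19-byte tails
        const std::uint64_t mid = lo + (hi - lo) / 2;
        file.seekg(static_cast<std::streamoff>(mid * kTail));
        file.read(reinterpret_cast<char*>(current.data()), kTail);
        const int cmp = std::memcmp(current.data(), needle.data() + 1, kTail);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
}
```

Each bucket holds roughly 1/256 of the records, the byte saved per record is the “compression” the list item mentions, and no final merge pass is needed when building the files.
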
Micro-benchmark:

|                  |   time [μs] |   % of okon |
|-----------------:|------------:|------------:|
|             okon |          49 |         100 |
|     grep '^hash' | 135'350'000 | 276'224'489 |
|             grep | 135'480'000 | 276'489'795 |
| C++ line by line | 135'720'201 | 276'980'002 |