A journey to searching Have I Been Pwned database in 49μs
https://news.ycombinator.com/item?id=22459661
This is a neat anecdote: progressively optimizing the process of searching through a 22 GB file of SHA-1 hashes. At a high level:
- grep: too slow, ~30s for a single pass.
- trie: not a great fit because the input is too random to benefit from the shared-prefix compression that tries provide.
- sort the file + binary search: this is better; an external merge sort is really the only sorting approach that makes sense at this size.
- sort but don’t merge: create 256 buckets keyed on the first byte (prefixes 00 to FF). Since every hash in a bucket shares that leading byte, it can be dropped from each entry, which buys some compression as well (see the sketch after this list).
- B-tree: the most flexible, allowing for efficient insertions in addition to retrieval; this is what the author's okon library (benchmarked below) implements.
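To make the bucket-plus-binary-search step concrete, here's a minimal C++ sketch. It assumes the text file has already been split into 256 sorted files of raw 20-byte digests under a hypothetical `buckets/` directory; the file names and layout are my own illustration, not okon's actual on-disk format, and for simplicity the shared first byte is kept rather than stripped:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr long kDigestSize = 20; // raw (binary) SHA-1 digest length

// Binary search one sorted bucket file for a 20-byte digest.
// The bucket is picked by the hash's first byte, e.g. "buckets/5B.bin".
bool bucket_contains(const uint8_t digest[20]) {
    char path[32];
    std::snprintf(path, sizeof(path), "buckets/%02X.bin", digest[0]);

    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);
    long lo = 0, hi = std::ftell(f) / kDigestSize; // half-open record range

    uint8_t record[20];
    bool found = false;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        std::fseek(f, mid * kDigestSize, SEEK_SET);
        if (std::fread(record, 1, kDigestSize, f) != (size_t)kDigestSize) break;
        int cmp = std::memcmp(record, digest, kDigestSize);
        if (cmp == 0) { found = true; break; }
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    std::fclose(f);
    return found;
}
```

A lookup then costs only O(log n) seeks within a single bucket instead of a scan over the whole file.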
Micro-benchmark:
| method           | time [μs]   | % of okon   |
|-----------------:|------------:|------------:|
| okon | 49 | 100 |
| grep '^hash' | 135'350'000 | 276'224'489 |
| grep | 135'480'000 | 276'489'795 |
| C++ line by line | 135'720'201 | 276'980'002 |
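For context on where such numbers come from, a single lookup could be timed with std::chrono along these lines (a sketch reusing `bucket_contains` from above; the article's actual benchmark harness isn't shown):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical digest to look up; in practice parse it from hex input.
    uint8_t digest[20] = {0x5B, 0xAA, 0x61, 0xE4};

    auto t0 = std::chrono::steady_clock::now();
    bool found = bucket_contains(digest); // from the sketch above
    auto t1 = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    std::printf("found=%d, took %lld us\n", found, (long long)us.count());
}
```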