Why the Sorbet typechecker is fast

https://blog.nelhage.com/post/why-sorbet-is-fast/

  • Design with performance in mind + performance degradation directly affects developer productivity = ✨✨✨
  • Fascinating look at C++ perf optimization (which I’m not familiar with at any level).

Sorbet is fast. Many of our early users commented specifically on how fast it was, and how much they appreciated this speed. Our informal benchmarks on Stripe’s codebase clocked it as typechecking around 100,000 lines of code per second per core, making it one of the fastest production typecheckers we are aware of.

Writing in C++ doesn’t automatically make your program fast, and a program does not need to be written in C++ to be fast. However, using C++ well gives an experienced team a fairly unique set of tools to write high-performance software.

Modern hardware is incredibly sensitive to hit rates on CPU caches, and so our core data structures were designed (among other objectives) to maximize cache locality.

One key design decision is that symbols and strings in a GlobalState are not individually heap-allocated. Instead, the GlobalState maintains large flat arrays of objects, and we refer to them using 32-bit indexes.

Symbols are stored in a flat array, not scattered throughout the heap, helping caches perform better.

By using 32-bit indexes to refer to objects, instead of 64-bit pointers, we fit more references into each cache line. This also lets us get much more mileage out of the hardware caches. We pay the price of indexing into the array as opposed to directly dereferencing pointers, but this is a single instruction in practice, which is essentially free if it saves even a few cache misses.
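The pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not Sorbet's actual code: the type and member names (`GlobalState`, `Symbol`, `SymbolRef`, `enterSymbol`, `deref`) are stand-ins for the idea of one flat array plus 32-bit index handles.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// A symbol lives inside one contiguous vector owned by GlobalState,
// not in its own heap allocation.
struct Symbol {
    std::string name;
};

// A SymbolRef is a 32-bit index, half the size of a 64-bit pointer,
// so twice as many references fit in each cache line.
struct SymbolRef {
    uint32_t id;
};

class GlobalState {
public:
    SymbolRef enterSymbol(std::string name) {
        symbols_.push_back(Symbol{std::move(name)});
        return SymbolRef{static_cast<uint32_t>(symbols_.size() - 1)};
    }

    // Resolving a ref is a single indexed load into a contiguous array,
    // rather than a pointer chase through scattered allocations.
    const Symbol &deref(SymbolRef ref) const { return symbols_[ref.id]; }

private:
    std::vector<Symbol> symbols_; // one flat, cache-friendly allocation
};
```

The trade mentioned in the quote shows up in `deref`: the index arithmetic is one instruction, which is cheap if the compact representation avoids even a few cache misses.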

absl::flat_hash_map, which is one of the fastest general-purpose hash tables I know of, due in part to a design optimized for cache usage.

Operations on strings are, in general, much slower than operations on fixed-size data structures and integer comparisons. Sorbet must accept strings (in the form of Ruby source code) as input, but we aim to avoid string operations as much as possible after the parse stage.

The key abstractions we use to this end are the NameRef and SymbolRef types, which I mentioned previously. A GlobalState stores a table of every string we have ever encountered, which are interned so that the same string always has the same 32-bit NameRef identifier.
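A minimal interning sketch, assuming a simplified table (the `NameTable`, `enterName`, and `show` names here are illustrative, not Sorbet's API): each distinct string is stored once and gets a 32-bit id on first insertion, so later equality checks are integer comparisons instead of string comparisons.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// A NameRef is just the 32-bit id a string received when it was interned.
struct NameRef {
    uint32_t id;
    // Comparing two interned names is a single integer comparison.
    bool operator==(const NameRef &other) const { return id == other.id; }
};

class NameTable {
public:
    NameRef enterName(const std::string &s) {
        auto it = index_.find(s);
        if (it != index_.end()) {
            return NameRef{it->second}; // already interned: same string, same id
        }
        uint32_t id = static_cast<uint32_t>(strings_.size());
        strings_.push_back(s);
        index_.emplace(s, id);
        return NameRef{id};
    }

    // Going back from id to text is an indexed load into a flat array.
    const std::string &show(NameRef ref) const { return strings_[ref.id]; }

private:
    std::vector<std::string> strings_;                 // id -> string
    std::unordered_map<std::string, uint32_t> index_;  // string -> id
};
```

After parsing, code that would otherwise compare identifiers by string can pass these 4-byte ids around and compare them for free.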

Early on in development, we talked about setting up a dedicated performance-monitoring setup that would continuously benchmark every revision and allow early detection of regressions (à la https://speed.pypy.org/ or https://perf.rust-lang.org/ or https://arewefastyet.com/). We unfortunately never made time to prioritize it, and I’m confident we lost performance to many unnoticed regressions because of it.

Several of the performance optimizations mentioned here were initially implemented, not to make release builds faster, but in order to make our CI speed tolerable for our own purposes.
