Modern Compiler Construction

https://news.ycombinator.com/item?id=22214687

IDEs use compilers to provide niceties like “jump to definition” and more advanced features, but these compilers are different from the compilers you’d use for actual production code, and the gap keeps widening.

Classic Compiler Architecture

  • Frontend: Lexer → Parser → Typechecker
  • Backend: Code generator → Emitter

Lexers break source into tokens; parsers build an AST from those tokens. None of this is instant; the full lexer → emitter pipeline might take around a minute to run.
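To make the first two frontend stages concrete, here is a minimal sketch of a lexer and a recursive-descent parser for a toy expression language. The `Token` and `Expr` shapes and the `lex` and `parse` functions are hypothetical names invented for this illustration; they are not from the talk or from any real compiler.

```typescript
// Hypothetical token and AST shapes for a toy expression language.
interface Token {
  kind: "number" | "ident" | "punct";
  text: string;
}

type Expr =
  | { kind: "num"; value: number }
  | { kind: "name"; name: string }
  | { kind: "binary"; op: string; left: Expr; right: Expr };

// Lexer: split the source text into a flat list of tokens, skipping whitespace.
function lex(source: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /(\d+)|([A-Za-z_]\w*)|(\S)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(source)) !== null) {
    if (m[1] !== undefined) tokens.push({ kind: "number", text: m[0] });
    else if (m[2] !== undefined) tokens.push({ kind: "ident", text: m[0] });
    else tokens.push({ kind: "punct", text: m[0] });
  }
  return tokens;
}

// Parser: build a tree from the tokens, with "*" binding tighter than "+".
function parse(tokens: Token[]): Expr {
  let pos = 0;

  function primary(): Expr {
    const tok = tokens[pos++];
    if (tok === undefined) throw new Error("unexpected end of input");
    if (tok.kind === "number") return { kind: "num", value: Number(tok.text) };
    if (tok.kind === "ident") return { kind: "name", name: tok.text };
    throw new Error(`unexpected token ${tok.text}`);
  }

  // Parse a left-associative chain of `next` operands joined by any of `ops`.
  function binary(next: () => Expr, ops: string[]): Expr {
    let left = next();
    while (
      pos < tokens.length &&
      tokens[pos].kind === "punct" &&
      ops.indexOf(tokens[pos].text) !== -1
    ) {
      const op = tokens[pos++].text;
      left = { kind: "binary", op, left, right: next() };
    }
    return left;
  }

  const term = (): Expr => binary(primary, ["*"]);
  const expr = binary(term, ["+"]);
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos].text}`);
  return expr;
}
```

Calling `parse(lex("a + 2 * b"))` yields a `+` node whose right child is the `*` node. Note that this parser simply throws on bad input, which is the “correct code” assumption of a batch compiler; an editor-oriented parser would need to recover and keep going.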

Text Editor Compilers

  • Lexers are sufficient to provide syntax coloring.
  • Calculating the bounds of collapsible regions might require parsers.
  • Autocomplete might require the typechecker (for semantically-correct suggestions). Autocomplete also needs results fast (100s of ms at most), but a full run of the typechecker might take minutes for a large codebase.
  • A regular compiler assumes “correct code”, but an editor compiler cannot: it mostly sees in-progress code, so its error recovery has to be good.
  • Keep an in-memory AST for each file. This is especially useful because the user is only changing one file at a time, so the cached ASTs for all other files that aren’t currently being edited are guaranteed to be accurate.
  • “Helicopter”-in to a specific point and request information about that AST node without necessarily starting from the top. Two models of doing this:
    • Maintain an incremental database of the state of the world; on every key press, update this database to reflect the change. This is too fiddly and complicated to work in practice.
    • Every key press erases the state of the world, and you quickly scramble to build up a new representation by reusing pieces of existing data structures (like the cached ASTs from other files).
  • Use functional data structures (I think this specifically refers to immutability, but I could be wrong) so you can easily invalidate only the things that have changed and still reason about the state of the world.
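The “rebuild on every key press, but reuse what you can” model can be sketched with immutable snapshots. This is a simplified illustration under stated assumptions, not how any particular editor compiler is implemented: `Ast`, `parseFile`, `Snapshot`, and the `parseCount` counter are all hypothetical, and the “AST” is just a frozen list of lines standing in for a real tree.

```typescript
// Hypothetical stand-in for a real syntax tree; frozen so it can be shared safely.
interface Ast {
  readonly fileName: string;
  readonly lines: readonly string[];
}

// Counts parses so the reuse is observable; a real compiler would not need this.
let parseCount = 0;

function parseFile(fileName: string, text: string): Ast {
  parseCount++;
  return Object.freeze({ fileName, lines: Object.freeze(text.split("\n")) });
}

// An immutable view of the whole program: file name -> parsed tree.
class Snapshot {
  constructor(readonly trees: Readonly<Record<string, Ast>>) {}

  // Build the first snapshot by parsing every file once.
  static fromSources(sources: Record<string, string>): Snapshot {
    const trees: Record<string, Ast> = {};
    for (const name of Object.keys(sources)) {
      trees[name] = parseFile(name, sources[name]);
    }
    return new Snapshot(Object.freeze(trees));
  }

  // Produce a new snapshot after an edit to one file. Only that file is
  // reparsed; every other tree is carried over by reference.
  withEdit(fileName: string, newText: string): Snapshot {
    const trees = { ...this.trees, [fileName]: parseFile(fileName, newText) };
    return new Snapshot(Object.freeze(trees));
  }
}
```

Because nothing is mutated in place, the old snapshot stays valid while the new one is being built, and checking whether a file changed between two snapshots is a reference comparison on its tree.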

Misc

  • Initial C# version: every feature had to be implemented twice, once in the regular compiler and once in the editor compiler. This (among other things) prompted the “Roslyn” project (the .NET Compiler Platform), where the compiler was built with an API.
  • Used a similar architecture for TypeScript, although its backend (emits JS) is a lot simpler than C#’s (emits IL).
  • Laziness is important in the context of self-referential types, where eagerly resolving the type might never terminate.
  • “TypeScript is a compiler for tooling’s sake rather than for codegen’s sake.”
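The point about laziness and self-referential types can be illustrated with a small sketch. Here a type's members are only resolved when asked for, and resolved types are cached, so a type that refers to itself does not send the resolver into infinite recursion. The `declarations` table, `ResolvedType`, and `resolve` are hypothetical names made up for this example and do not reflect the internals of the TypeScript or C# compilers.

```typescript
// Hypothetical declarations: each type maps property names to the name of
// the property's type.
const declarations: Record<string, Record<string, string>> = {
  number: {},
  List: { value: "number", next: "List" }, // refers to itself
};

interface ResolvedType {
  readonly name: string;
  // Resolved on demand: nothing is looked up until a caller asks for a member.
  member(prop: string): ResolvedType | undefined;
}

// Each type name is resolved at most once; later requests reuse the same object.
const cache: Record<string, ResolvedType> = {};

function resolve(name: string): ResolvedType {
  const cached = cache[name];
  if (cached !== undefined) return cached;
  const decl = declarations[name];
  if (decl === undefined) throw new Error(`unknown type ${name}`);
  const resolved: ResolvedType = {
    name,
    member(prop) {
      const target = decl[prop];
      return target === undefined ? undefined : resolve(target);
    },
  };
  cache[name] = resolved;
  return resolved;
}
```

An eager resolver that expanded every member up front would have to expand `List.next`, then `List.next.next`, and so on without end. With the lazy version, `resolve("List").member("next")` returns the same cached `List` object, and callers can walk as deep as they actually need.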