Scaling Instagram

  • The talk was broadly themed around the idea of incremental, just-enough improvements at the infra level, and realizing when to make the next one.
  • Instagram scaled to 30M MAUs between 2010 and 2012, and to 300M MAUs by 2015 (when this talk was given).
    • (image: 2012 stack)
    • (image: 2015 stack)
  • Internal policy is to have all requests return within 3 seconds.
    • This seems almost too liberal; is this possibly still true?
  • Initially used Gearman as a task queue, and this served them well for 18 months.
    • Once they outgrew this, they used a home-grown sharding scheme which took them a little further.
    • They eventually switched to Celery + RabbitMQ (sketched after this list).
    • Gearman’s enqueue latency was 60ms at p50 and 1 second at p95.
      • With RabbitMQ this dropped to 5ms at p50 and tens of milliseconds at p95.
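    • A minimal sketch of what the Celery + RabbitMQ setup might look like (the broker URL and task are illustrative assumptions; the talk doesn’t show code):

```python
# A minimal Celery app backed by RabbitMQ. Names and the broker URL
# are assumptions for illustration, not Instagram's actual code.
from celery import Celery

app = Celery("app", broker="amqp://guest:guest@localhost:5672//")

@app.task
def fanout_to_followers(photo_id, user_id):
    # Hypothetical async work, e.g. pushing a new photo into follower feeds.
    pass

# Enqueueing is a single AMQP publish, which is consistent with the
# ~5ms p50 enqueue latency they saw after the switch:
# fanout_to_followers.delay(123, 456)
```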
  • Company culture of getting existing infra to work for them over switching to new infra unless absolutely necessary.
    • I can see how this would be a double-edged sword, but this seems like sound advice for a small engineering team.
  • Deployments started with Fabric and git pull.
    • Eventually moved to uploading a build artifact to S3 and having Fabric pull it down to each machine, then switch a symlink via ln (a sketch follows this list).
    • Once this grew cumbersome they built what seems to be a manual lock service, so engineers could ensure they weren’t stepping on each other’s toes.
    • This eventually led to a real CD pipeline.
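    • A sketch of the S3-artifact + symlink-swap step as a Fabric 1.x task (bucket, paths, and the use of the AWS CLI are assumptions; the talk only describes the shape):

```python
# fabfile.py -- pull a prebuilt artifact from S3 onto each host and
# switch a "current" symlink to the new release. All names are illustrative.
from fabric.api import run, task

@task
def deploy(version):
    release = "/srv/app/releases/%s" % version
    run("mkdir -p %s" % release)
    # Fetch the artifact that was built once and uploaded to S3.
    run("aws s3 cp s3://example-bucket/app-%s.tar.gz /tmp/app.tar.gz" % version)
    run("tar -xzf /tmp/app.tar.gz -C %s" % release)
    # ln -sfn replaces the symlink in place, so the cutover is near-atomic
    # and rolling back is just pointing the link at the previous release.
    run("ln -sfn %s /srv/app/current" % release)

# Usage (hosts are illustrative): fab -H web1,web2 deploy:version=v42
```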
  • “Don’t automate something until you understand it really well by doing it manually a lot”
  • Their search system started with Postgres LIKE queries, which is actually not that bad for prefix searches (those can use an index), but effectively O(n) for anything else, e.g. leading-wildcard patterns (see the sketch after this list).
    • This gave way to a Solr box that allowed more expressive searches; the lack of clustering meant they eventually outgrew this piece of infra too.
    • Next up was Elasticsearch, which worked well, but was too expensive in terms of ops load for a team of their size.
      • Confident that they could’ve made this work with a dedicated search team.
    • Used Facebook’s internal graph database (Unicorn), which allowed for very expressive s-expression-based searches.
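    • A sketch of why the Postgres LIKE approach breaks down (table, column, and index names are made up):

```python
# Prefix LIKE can use a btree index; a leading wildcard cannot.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# With an index such as:
#   CREATE INDEX users_username_idx ON users (username text_pattern_ops);
# a prefix pattern is an index range scan -- cheap:
cur.execute("SELECT id FROM users WHERE username LIKE %s", ("kev%",))

# A leading wildcard defeats the btree index and forces a sequential
# scan over the whole table -- effectively O(n):
cur.execute("SELECT id FROM users WHERE username LIKE %s", ("%kev%",))
```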
  • The Explore page was entirely driven by the expressiveness of the searches they could perform.
    • Started off with top-liked images globally
    • Moved to “things people you follow like”
      • Realized this didn’t work because you don’t necessarily share the same taste as everyone you follow.
    • Moved to “things people you liked like” (photos liked by the people whose photos you liked), which ended up working really well (sketched below).
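    • The three Explore iterations map onto increasingly nested graph queries. A hedged sketch in the s-expression style of Facebook’s Unicorn paper (the apply operator comes from that paper; the edge names here are invented):

```python
# Hypothetical Unicorn-style queries for each Explore iteration.
# Only the nesting structure reflects the talk; edge names are invented.

# v1: globally top-liked photos -- no personalization at all.
v1 = "(top-by like-count photos)"

# v2: photos liked by the people you follow -- read inside-out:
# users I follow -> photos those users liked.
v2 = "(apply likes: (apply follows: me))"

# v3: photos liked by the authors of photos *I* liked -- taste-based,
# the variant that ended up working well:
# photos I liked -> their authors -> photos those authors liked.
v3 = "(apply likes: (apply authors: (apply liked: me)))"
```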

Follow-Ups
