- The talk was broadly themed around the idea of incremental, just-enough improvements at the infra level, and realizing when to make the next one.
- Instagram scaled to 30m MAUs from 2010-12, and to 300m MAUs by 2015 (when this talk was given).
- 2012 Stack
- 2015 Stack
- Internal policy is to have all requests return within 3 seconds.
- This seems almost too liberal; is this possibly still true?
- Initially used “gearman” as a task queue, and this served them well for 18 months.
- Once they outgrew this, they used a home-grown sharding scheme which took them a little further.
- They eventually switched to Celery + RabbitMQ.
- Gearman’s average enqueue time was 60ms at p50 and 1 second at p95.
- With RabbitMQ this dropped to 5ms at p50 and tens of ms at p95.
- Company culture of getting existing infra to work for them over switching to new infra unless absolutely necessary.
- I can see how this would be a double-edged sword, but this seems like sound advice for a small engineering team.
- Deployments started with fabric and
git pull
- Eventually moved to uploading an artifact to S3, and having Fabric pull this artifact down to each machine + perform the switch via
ln
- Once this grew cumbersome they built what seems to be a manual lock service, so engineers could ensure they weren’t stepping on each others' toes.
- This eventually led to a real CD pipeline.
- “Don’t automate something until you understand it really well by doing it manually a lot”
- Their search system started with Postgres
LIKE
queries, which is actually not that bad for prefix searches, but pretty much O(n)
for anything else.
- This gave way to a
solr
box that allowed more expressive searches. Lack of clustering led them to outgrow this piece of infra.
- Next up was Elasticsearch, which worked well, but was too expensive in terms of ops load for a team of their size.
- Confident that they could’ve made this work with a dedicated search team.
- Used Facebook’s internal graph database (unicorn), which allowed for very expressive s-exp based searches.
- The Explore page was entirely driven by the expressiveness of the searches they could perform.
- Started of with top-liked images globally
- Moved to things people you follow like
- Realized this didn’t work because you don’t necessarily share the same taste as everyone you follow.
- Moved to things people you liked like, which ended up working really well.
Follow-Ups