
Scalability


A system that works beautifully at 1,000 users often collapses at 100,000. Scalability is the property that determines whether a system can grow to meet demand without a fundamental redesign. It is not just about adding servers — it requires deliberate architectural choices made early, because retrofitting scalability into a system that was never designed for it is expensive and risky. Understanding scalability deeply means knowing which parts of a system are easy to scale, which are hard, and why.

What Is Scalability?

Scalability is the ability of a system to handle a growing amount of work by adding resources. A scalable system can accommodate growth — in users, data volume, request rate, or geographic reach — without a proportional degradation in performance or a rewrite of its core architecture.

Scalability is closely related to but distinct from performance. Performance is how fast a system responds at a given load. Scalability is how performance changes as load increases. A system can be fast but not scalable (great at 1,000 RPS, terrible at 10,000 RPS) or scalable but not fast (consistent mediocre response time at any load).

💡
Latency vs Throughput

Latency is the time to complete a single request (milliseconds). Throughput is the number of requests a system can handle per unit of time (RPS, QPS). Scalability generally means increasing throughput while keeping latency acceptable. A system with high throughput but poor latency may be able to process many requests but makes each user wait too long.
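The relationship between latency, concurrency, and throughput can be sketched with a back-of-envelope calculation (Little's law: concurrency = throughput × latency). The function name and numbers below are illustrative, not from any particular system:

```python
# Back-of-envelope bound on throughput from Little's law:
# concurrency = throughput x latency, so throughput <= concurrency / latency.

def max_throughput_rps(concurrent_requests: int, avg_latency_s: float) -> float:
    """Upper bound on sustained throughput for a server that can hold
    `concurrent_requests` in flight, each taking `avg_latency_s` on average."""
    return concurrent_requests / avg_latency_s

# A server with 200 worker slots and 50 ms average latency:
print(max_throughput_rps(200, 0.050))  # 4000.0 RPS
```

Note what this implies for scaling: halving latency or doubling concurrency each doubles the throughput ceiling.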

Vertical Scaling

Vertical scaling (scaling up) means making individual machines more powerful: adding more CPU cores, more RAM, faster storage, or a faster network interface. The same server handles more load because it has more raw capacity.

Advantages

  - No application changes: the same code simply runs on a bigger machine.
  - Operationally simple: one server to manage, no load balancer, no distributed state.
  - Works for stateful components (like databases) that are hard to distribute.

Disadvantages

  - Hard ceiling: you cannot scale beyond the largest available instance.
  - Cost grows faster than capacity at the high end.
  - No fault tolerance: the machine remains a single point of failure.
  - Upgrades usually require downtime or a failover.

Vertical scaling is a good short-term strategy and works well for stateful components like databases that are difficult to distribute. Most systems start vertically and only move to horizontal scaling when they hit the limits.

Horizontal Scaling

Horizontal scaling (scaling out) means adding more machines to the pool and distributing the load across all of them. Instead of one large server, you run many commodity servers behind a load balancer.

Advantages

  - Effectively unlimited: keep adding commodity machines as demand grows.
  - Fault tolerant: a single node failure is absorbed by the rest of the pool.
  - Cost-efficient at scale, and capacity can track demand up and down.

Disadvantages

  - Requires stateless application design, or state must be externalized.
  - Higher operational complexity: load balancing, deployment, and monitoring across many nodes.
  - Introduces distributed-systems concerns such as partial failure and consistency.

Vertical vs Horizontal

                             Vertical Scaling                 Horizontal Scaling
Mechanism                    Bigger machine                   More machines
Upper limit                  Largest available instance       Effectively unlimited
Fault tolerance              None (still a single node)       High (failures isolated to nodes)
Cost efficiency at scale     Poor                             Good
Application changes needed   None                             Stateless design required
Operational complexity       Low                              High
Best for                     Databases, early-stage systems   Web/app tier, stateless services

In practice, most production systems use both. The stateless application tier scales horizontally. The database scales vertically until it must be sharded or replicated.

Stateless Design

Horizontal scaling of application servers requires stateless design: each server treats every request as independent. No server stores session data, user state, or in-flight context locally. Any server can handle any request.

The Stateless Rule

If you shut down any one server and the user experience is unaffected (assuming the request is routed to another server), the application is stateless. If users lose their login session, shopping cart, or in-progress work when their server is restarted, the application has state that must be externalized.

Externalizing State

State does not disappear — it moves out of the application server into dedicated storage:

  - Session data: a shared in-memory store such as Redis or Memcached.
  - Persistent user and business data: the database.
  - Uploaded files and other blobs: object storage such as S3.
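The externalized-state pattern can be sketched as follows. `FakeStore` is a stand-in for a shared store such as Redis; all names here are illustrative:

```python
# Minimal sketch of externalized session state: the server keeps nothing
# locally, so any instance can load the session from the shared store.

class FakeStore:
    """In-memory stand-in for a shared session store such as Redis."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def handle_request(store, session_id, server_name):
    """Any server can serve the request because the session lives in `store`."""
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.set(session_id, session)
    return f"{server_name} served view #{session['views']}"

store = FakeStore()
print(handle_request(store, "sess-1", "server-a"))  # server-a served view #1
print(handle_request(store, "sess-1", "server-b"))  # server-b served view #2
```

Note that the second request lands on a different server yet continues the same session: this is exactly the property the stateless rule tests for.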

Sticky Sessions Are a Workaround, Not a Solution

Sticky sessions (session affinity) configure the load balancer to always send a user’s requests to the same server. This “solves” the stateful server problem without code changes, but it undermines scaling: load is no longer evenly distributed, and if that server fails, the user’s session is lost anyway. Externalize state properly instead.

Database Scaling

Databases are the hardest component to scale because they are inherently stateful. The common techniques:

Read Replicas

Add one or more read-only replicas that receive a copy of all writes from the primary. Route read queries to replicas, write queries to the primary. Most applications are read-heavy (80–95% reads), so this can dramatically increase throughput. Replicas can lag slightly behind the primary (replication lag), so reads may return slightly stale data.
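Read/write routing can be sketched as below, assuming round-robin selection among replicas; the connection objects are plain strings for illustration:

```python
# Sketch of primary/replica query routing. Reads rotate across replicas;
# writes (and anything ambiguous) go to the primary.
import itertools

class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Reads may return slightly stale data because of replication lag.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))   # replica-1
print(router.route("SELECT * FROM orders"))  # replica-2
print(router.route("UPDATE users SET ..."))  # primary
```

A real router also needs an escape hatch for read-your-own-writes cases, where a read must go to the primary to avoid replication lag.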

Connection Pooling

Database connections are expensive to establish. A connection pool (PgBouncer for PostgreSQL, ProxySQL for MySQL) maintains a pool of open connections and multiplexes many application threads over a smaller number of database connections. This allows hundreds or thousands of application instances to share a database without exhausting its connection limit.
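The multiplexing idea can be sketched with a tiny pool: a fixed number of real connections shared among many callers, so the database never sees more than `size` connections. `open_connection` is a placeholder for a real driver call:

```python
# Minimal connection-pool sketch: callers block until a connection is free,
# then return it when done, so concurrency at the database is capped at `size`.
import queue
from contextlib import contextmanager

def open_connection(i):
    return f"conn-{i}"  # placeholder for e.g. a real driver's connect() call

class ConnectionPool:
    def __init__(self, size: int):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(open_connection(i))

    @contextmanager
    def acquire(self):
        conn = self._pool.get()      # blocks if all connections are in use
        try:
            yield conn
        finally:
            self._pool.put(conn)     # return it for the next caller

pool = ConnectionPool(size=2)
with pool.acquire() as conn:
    print(conn)  # conn-0
```

Tools like PgBouncer apply the same idea as a separate process, so the cap holds across every application instance, not just within one.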

Caching

An in-memory cache (Redis, Memcached) in front of the database absorbs a large fraction of read traffic. Frequently-accessed data is served from cache in microseconds without hitting the database at all. This is covered in depth in the Caching guide.

Sharding

Sharding (horizontal partitioning) splits the database into independent shards, each holding a subset of the data. A shard key (user ID, geographic region, hash) determines which shard stores each record. Each shard can live on a separate machine, distributing both storage and query load. Sharding adds significant complexity: cross-shard queries, rebalancing when adding shards, and hotspot management. Covered in the Sharding guide.
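Hash-based shard routing can be sketched in a few lines. The key and shard counts below are illustrative:

```python
# Hash-based shard routing: the shard key (here, a user ID) deterministically
# maps each record to one of N shards. Note that changing N remaps most keys,
# which is why rebalancing is a real cost of naive modulo sharding.
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Same key always lands on the same shard:
print(shard_for("user-1001", 4) == shard_for("user-1001", 4))  # True
```

Consistent hashing is the usual refinement: it limits how many keys move when a shard is added or removed.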

Vertical Scaling

Before any of the above, simply upgrade the database server. A larger instance with more RAM (larger buffer pool for caching pages), faster SSDs, and more CPU handles significantly more load. This is often the right first step and may be sufficient for years.

Scalability Patterns

Auto-Scaling

Cloud providers (AWS Auto Scaling, GCP Managed Instance Groups) can automatically add or remove instances based on metrics like CPU utilization or request queue depth. Auto-scaling handles traffic spikes without over-provisioning for constant peak capacity. It requires fast instance startup (pre-baked AMIs, container images) and stateless servers so new instances can accept traffic immediately.
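The scaling decision itself can be sketched as target tracking: desired capacity scales in proportion to the observed metric versus its target. The clamping bounds and numbers here are illustrative, not a specific provider's defaults:

```python
# Target-tracking scale decision sketch: if CPU runs hot relative to the
# target, add instances proportionally; if it runs cool, remove them,
# clamped to configured minimum and maximum fleet sizes.
import math

def desired_instances(current: int, observed_cpu: float, target_cpu: float,
                      min_n: int = 2, max_n: int = 50) -> int:
    desired = math.ceil(current * observed_cpu / target_cpu)
    return max(min_n, min(max_n, desired))

print(desired_instances(10, observed_cpu=90.0, target_cpu=60.0))  # 15
print(desired_instances(10, observed_cpu=30.0, target_cpu=60.0))  # 5
```

Real policies add cooldown periods and scale-in dampening so the fleet does not oscillate around the target.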

Asynchronous Processing

Decouple slow or CPU-intensive operations from the request path using a message queue. Instead of processing a video upload synchronously (blocking the HTTP response for minutes), the web server enqueues a job and immediately returns a 202 Accepted. Background workers pull jobs from the queue and process them independently. Workers scale horizontally based on queue depth without any changes to the web tier.
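The enqueue-and-return pattern can be sketched with an in-process queue and worker thread; `queue.Queue` stands in for a real broker (RabbitMQ, SQS, etc.) and the job names are illustrative:

```python
# Async processing sketch: the request handler enqueues a job and returns
# 202 immediately; a background worker drains the queue independently.
import queue
import threading

jobs = queue.Queue()
results = []

def handle_upload(video_id: str) -> int:
    jobs.put(video_id)       # enqueue instead of processing inline
    return 202               # HTTP 202 Accepted: work happens later

def worker():
    while True:
        video_id = jobs.get()
        if video_id is None:                      # sentinel to stop the worker
            break
        results.append(f"transcoded {video_id}")  # the slow work
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
print(handle_upload("vid-1"))  # 202 -- response returns before work is done
jobs.put(None)
t.join()
print(results)  # ['transcoded vid-1']
```

With a real broker, the worker pool runs on separate machines and is scaled on queue depth, exactly as described above.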

Content Delivery Networks

Moving static asset delivery to a CDN removes a large class of requests from your origin servers entirely. An origin that serves 100 million page views per day might only receive 5 million requests once static assets are cached at the edge. This is the cheapest per-request scaling lever available. Covered in the CDN guide.

Microservices

Breaking a monolith into independent services allows each service to scale independently. If the search service is under heavy load, scale it without touching the payment or notification services. The tradeoff is significantly increased operational complexity — microservices are a scaling strategy that introduces its own class of distributed systems problems.

Database Read/Write Separation

Beyond just read replicas, architect the application to route all write traffic to one connection pool and all read traffic to another. This makes it trivial to add more read capacity (more replicas) independently of write capacity, and forces explicit thinking about which operations mutate state.

Scalability in System Design

Scalability questions in system design interviews are often phrased as: “Your system works for 1 million users. How do you scale it to 100 million?” The right approach is to identify the bottleneck at each scale tier and address it specifically.

A Practical Scaling Progression

Most systems follow a predictable progression as they grow:

  1. Single server — web app, database, and cache all on one machine. Suitable up to a few hundred concurrent users.
  2. Separate database server — move the database to its own machine so it gets dedicated resources. The web server can now be scaled independently.
  3. Add caching — put Redis in front of the database. Absorbs most read traffic.
  4. Scale the web tier horizontally — put a load balancer in front of multiple stateless web servers. Requires externalizing session state.
  5. Scale the database — add read replicas, then a larger primary instance, then connection pooling.
  6. CDN for static assets — move all static content to the edge.
  7. Async processing — offload heavy work (email, image processing, ML inference) to queues and workers.
  8. Geographic distribution — deploy to multiple regions with DNS-based routing for global users.
  9. Sharding — as a last resort when the database write path is the bottleneck and read replicas are insufficient.

⚠️
Don’t Over-Engineer Early

Premature optimization is the root of much unnecessary complexity. A startup serving 10,000 users does not need sharding, microservices, or multi-region deployment. Start simple, measure where the actual bottlenecks are, and scale specifically those components. The progression above is roughly the order in which bottlenecks appear — skip steps only if you have evidence you’ll need to.

💡
In System Design Interviews

When asked to scale a system, be systematic: identify the stateful components (databases, caches, file storage) and the stateless ones (application servers). State which components are bottlenecks at which scale. Propose vertical scaling first (acknowledge the ceiling), then horizontal scaling with the required stateless design changes. Mention read replicas before sharding — sharding is complex and often unnecessary. Always bring up caching as a force multiplier that defers database scaling. Interviewers reward candidates who can articulate why a technique applies, not just name it.