Scalability
A system that works beautifully at 1,000 users often collapses at 100,000. Scalability is the property that determines whether a system can grow to meet demand without a fundamental redesign. It is not just about adding servers — it requires deliberate architectural choices made early, because retrofitting scalability into a system that was never designed for it is expensive and risky. Understanding scalability deeply means knowing which parts of a system are easy to scale, which are hard, and why.
What Is Scalability?
Scalability is the ability of a system to handle a growing amount of work by adding resources. A scalable system can accommodate growth — in users, data volume, request rate, or geographic reach — without a proportional degradation in performance or a rewrite of its core architecture.
Scalability is closely related to but distinct from performance. Performance is how fast a system responds at a given load. Scalability is how performance changes as load increases. A system can be fast but not scalable (great at 1,000 RPS, terrible at 10,000 RPS) or scalable but not fast (consistent mediocre response time at any load).
Latency is the time to complete a single request (milliseconds). Throughput is the number of requests a system can handle per unit of time (RPS, QPS). Scalability generally means increasing throughput while keeping latency acceptable. A system with high throughput but poor latency may be able to process many requests but makes each user wait too long.
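A quick way to connect the two quantities is Little's Law: the average number of in-flight requests equals throughput multiplied by latency. A minimal sketch (function name is illustrative):

```python
def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x time in system."""
    return throughput_rps * latency_s

# At 10,000 RPS with 50 ms average latency, ~500 requests are in flight at any
# moment -- the system needs capacity (threads, connections) for all of them.
print(required_concurrency(10_000, 0.050))  # 500.0
```

This is why reducing latency and raising throughput are coupled: halving latency at the same request rate halves the concurrency the system must sustain.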
Vertical Scaling
Vertical scaling (scaling up) means making individual machines more powerful: adding more CPU cores, more RAM, faster storage, or a faster network interface. The same server handles more load because it has more raw capacity.
Advantages
- Simplicity: No changes to application code. The same single-server architecture works — you just run it on bigger hardware.
- No distribution complexity: Data stays in one place. No need for coordination, consensus, or network calls between nodes.
- Lower operational overhead: One server is simpler to monitor, deploy to, and debug than a fleet.
Disadvantages
- Hard ceiling: There is a maximum size for a single machine. Once you hit the largest available instance type, you cannot scale further vertically.
- Cost curve: Doubling CPU and RAM does not double performance, and the price of top-tier hardware grows faster than the performance gain. The biggest machines (like AWS x2iedn.32xlarge) cost disproportionately more per unit of compute.
- Single point of failure: One big machine is still one machine. If it fails, everything goes down.
- Downtime for upgrades: Resizing a server typically requires a restart, causing a service interruption.
Vertical scaling is a good short-term strategy and works well for stateful components like databases that are difficult to distribute. Most systems start vertically and only move to horizontal scaling when they hit the limits.
Horizontal Scaling
Horizontal scaling (scaling out) means adding more machines to the pool and distributing the load across all of them. Instead of one large server, you run many commodity servers behind a load balancer.
Advantages
- No hard ceiling: You can keep adding servers, within the limits of your load balancer and coordination overhead, giving near-linear scalability in theory.
- Fault tolerance: Individual server failures do not bring down the system. The load balancer routes around failed nodes.
- Commodity hardware: Cheaper per unit of compute than top-tier single machines. Cloud providers make it easy to add or remove instances in seconds.
- No downtime for capacity changes: Adding a server to the pool is live. Rolling deployments allow code updates without taking any servers offline.
Disadvantages
- Application must be designed for it: Horizontal scaling requires stateless application servers. Session state, local caches, and in-memory data structures break when requests can land on any server.
- Distributed coordination: Shared state (caches, locks, counters) must live in an external system (Redis, ZooKeeper). Cross-server consistency is harder to reason about.
- Operational complexity: A fleet of servers requires service discovery, health monitoring, distributed logging, and orchestration (Kubernetes, ECS).
Vertical vs Horizontal
| | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Mechanism | Bigger machine | More machines |
| Upper limit | Largest available instance | Effectively unlimited |
| Fault tolerance | None (still a single node) | High (failures isolated to nodes) |
| Cost efficiency at scale | Poor | Good |
| Application changes needed | None | Stateless design required |
| Operational complexity | Low | High |
| Best for | Databases, early-stage systems | Web/app tier, stateless services |
In practice, most production systems use both. The stateless application tier scales horizontally. The database scales vertically until it must be sharded or replicated.
Stateless Design
Horizontal scaling of application servers requires stateless design: each server treats every request as independent. No server stores session data, user state, or in-flight context locally. Any server can handle any request.
The Stateless Rule
If you shut down any one server and the user experience is unaffected (assuming the request is routed to another server), the application is stateless. If users lose their login session, shopping cart, or in-progress work when their server is restarted, the application has state that must be externalized.
Externalizing State
State does not disappear — it moves out of the application server into dedicated storage:
- Session data → Redis or a database. The session token in the cookie is a key; the data lives in Redis. Any server can look it up.
- In-memory caches → Distributed cache (Redis, Memcached). Local caches on each server lead to cache inconsistency and wasted memory.
- File uploads / user-generated content → Object storage (S3, GCS). Never write files to a local disk on an application server.
- Ephemeral computation state → Message queues. Long-running background jobs should be decoupled from request handlers and queued for separate workers.
Sticky sessions (session affinity) configure the load balancer to always send a user’s requests to the same server. This “solves” the stateful server problem without code changes, but it undermines scaling: load is no longer evenly distributed, and if that server fails, the user’s session is lost anyway. Externalize state properly instead.
Database Scaling
Databases are the hardest component to scale because they are inherently stateful. The techniques, in order of increasing complexity:
Read Replicas
Add one or more read-only replicas that receive a copy of all writes from the primary. Route read queries to replicas, write queries to the primary. Most applications are read-heavy (80–95% reads), so this can dramatically increase throughput. Replicas can lag slightly behind the primary (replication lag), so reads may return slightly stale data.
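The routing itself can be very simple. A naive sketch (connection names are placeholders for real database connections; real routers classify statements more carefully than this prefix check):

```python
import itertools

class ReplicaRouter:
    """Route writes to the primary and reads round-robin across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql: str):
        # Naive classification: anything that isn't a SELECT goes to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM users"))   # replica-1
print(router.connection_for("UPDATE users SET ..."))  # primary
print(router.connection_for("SELECT 1"))              # replica-2
```

Note the caveat from above applies here: a read routed to a replica immediately after a write may not see that write yet because of replication lag.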
Connection Pooling
Database connections are expensive to establish. A connection pool (PgBouncer for PostgreSQL, ProxySQL for MySQL) maintains a pool of open connections and multiplexes many application threads over a smaller number of database connections. This allows hundreds or thousands of application instances to share a database without exhausting its connection limit.
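The core mechanism is small enough to sketch: a fixed set of pre-opened connections handed out and returned, so callers block briefly instead of opening new connections. This is a simplified illustration, not how PgBouncer is implemented; `open_conn` stands in for a real driver's connect call:

```python
import queue

class ConnectionPool:
    """Minimal pool: N pre-opened connections multiplexed across many callers."""
    def __init__(self, open_conn, size: int):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(open_conn())

    def acquire(self, timeout: float = 5.0):
        # Blocks until a connection is free, bounding total DB connections at `size`.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(open_conn=lambda: object(), size=2)
conn = pool.acquire()
# ... run queries ...
pool.release(conn)
```

The key property: a thousand application threads can share a pool of twenty connections, and the database only ever sees twenty.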
Caching
An in-memory cache (Redis, Memcached) in front of the database absorbs a large fraction of read traffic. Frequently-accessed data is served from cache in microseconds without hitting the database at all. This is covered in depth in the Caching guide.
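The usual access pattern is cache-aside: check the cache, fall back to the database on a miss, and populate the cache for the next reader. A self-contained sketch (a dict with TTLs stands in for Redis; `fetch_from_db` is a placeholder):

```python
import time

cache = {}           # stands in for Redis: key -> (value, expiry timestamp)
TTL_SECONDS = 60

def get_user(user_id, fetch_from_db):
    """Cache-aside read: try the cache, fall back to the database, populate cache."""
    entry = cache.get(user_id)
    if entry is not None and time.monotonic() < entry[1]:
        return entry[0]                       # cache hit
    row = fetch_from_db(user_id)              # cache miss: hit the database
    cache[user_id] = (row, time.monotonic() + TTL_SECONDS)
    return row

calls = []
def fake_db(uid):
    calls.append(uid)
    return {"id": uid}

get_user(1, fake_db)
get_user(1, fake_db)
print(len(calls))  # 1 -- the second read was served from cache
```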
Sharding
Sharding (horizontal partitioning) splits the database into independent shards, each holding a subset of the data. A shard key (user ID, geographic region, hash) determines which shard stores each record. Each shard can live on a separate machine, distributing both storage and query load. Sharding adds significant complexity: cross-shard queries, rebalancing when adding shards, and hotspot management. Covered in the Sharding guide.
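Routing by hashed shard key can be sketched in a few lines. One important detail: use a stable hash (not Python's per-process-randomized `hash()`), so the same key routes to the same shard across processes and restarts. The shard count and key choice here are illustrative:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Hash the shard key so records spread evenly across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same user always lands on the same shard.
print(shard_for(12345) == shard_for(12345))  # True
```

Note that naive modulo routing like this is exactly what makes rebalancing painful: changing NUM_SHARDS remaps almost every key, which is why production systems reach for consistent hashing or directory-based routing.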
Vertical Scaling
Before any of the above, simply upgrade the database server. A larger instance with more RAM (larger buffer pool for caching pages), faster SSDs, and more CPU handles significantly more load. This is often the right first step and may be sufficient for years.
Scalability Patterns
Auto-Scaling
Cloud providers (AWS Auto Scaling, GCP Managed Instance Groups) can automatically add or remove instances based on metrics like CPU utilization or request queue depth. Auto-scaling handles traffic spikes without over-provisioning for constant peak capacity. It requires fast instance startup (pre-baked AMIs, container images) and stateless servers so new instances can accept traffic immediately.
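The core of a target-tracking policy is a simple proportional calculation: size the fleet so that average utilization lands near a target. This is a conceptual sketch of how such policies behave, not any provider's actual algorithm; the defaults are illustrative:

```python
import math

def desired_instances(current: int, cpu_utilization: float,
                      target: float = 0.6, min_n: int = 2, max_n: int = 20) -> int:
    """Scale the fleet so average CPU lands near `target`, clamped to bounds."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, wanted))

print(desired_instances(current=4, cpu_utilization=0.9))   # 6  (scale out)
print(desired_instances(current=10, cpu_utilization=0.3))  # 5  (scale in)
```

The min/max clamps matter in practice: a floor keeps the service alive through metric glitches, and a ceiling caps cost during runaway traffic or a feedback bug.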
Asynchronous Processing
Decouple slow or CPU-intensive operations from the request path using a message queue. Instead of processing a video upload synchronously (blocking the HTTP response for minutes), the web server enqueues a job and immediately returns a 202 Accepted. Background workers pull jobs from the queue and process them independently. Workers scale horizontally based on queue depth without any changes to the web tier.
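The pattern can be sketched in-process with a thread and a queue standing in for a real broker and worker fleet (names and the "transcoding" step are illustrative):

```python
import queue
import threading

jobs = queue.Queue()     # stands in for a real message broker
results = []

def handle_upload(video_id: str) -> int:
    """Request handler: enqueue the job and return immediately (202 Accepted)."""
    jobs.put(video_id)
    return 202

def worker():
    """Background worker: pulls jobs and does the slow work off the request path."""
    while True:
        video_id = jobs.get()
        if video_id is None:          # shutdown sentinel
            break
        results.append(f"transcoded:{video_id}")

t = threading.Thread(target=worker)
t.start()
print(handle_upload("v1"))  # 202 -- returned before processing finished
jobs.put(None)
t.join()
```

Because workers only interact with the queue, scaling them is just starting more worker processes; the web tier never changes.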
Content Delivery Networks
Moving static asset delivery to a CDN removes a large class of requests from your origin servers entirely. An origin that serves 100 million page views per day might only receive 5 million requests once static assets are cached at the edge. This is the cheapest per-request scaling lever available. Covered in the CDN guide.
Microservices
Breaking a monolith into independent services allows each service to scale independently. If the search service is under heavy load, scale it without touching the payment or notification services. The tradeoff is significantly increased operational complexity — microservices are a scaling strategy that introduces its own class of distributed systems problems.
Database Read/Write Separation
Beyond just read replicas, architect the application to route all write traffic to one connection pool and all read traffic to another. This makes it trivial to add more read capacity (more replicas) independently of write capacity, and forces explicit thinking about which operations mutate state.
Scalability in System Design
Scalability questions in system design interviews are often phrased as: “Your system works for 1 million users. How do you scale it to 100 million?” The right approach is to identify the bottleneck at each scale tier and address it specifically.
A Practical Scaling Progression
Most systems follow a predictable progression as they grow:
1. Single server — web app, database, and cache all on one machine. Suitable up to a few hundred concurrent users.
2. Separate database server — move the database to its own machine so it gets dedicated resources. The web server can now be scaled independently.
3. Add caching — put Redis in front of the database. Absorbs most read traffic.
4. Scale the web tier horizontally — put a load balancer in front of multiple stateless web servers. Requires externalizing session state.
5. Scale the database — add read replicas, then a larger primary instance, then connection pooling.
6. CDN for static assets — move all static content to the edge.
7. Async processing — offload heavy work (email, image processing, ML inference) to queues and workers.
8. Geographic distribution — deploy to multiple regions with DNS-based routing for global users.
9. Sharding — as a last resort when the database write path is the bottleneck and read replicas are insufficient.
Premature optimization is the root of much unnecessary complexity. A startup serving 10,000 users does not need sharding, microservices, or multi-region deployment. Start simple, measure where the actual bottlenecks are, and scale specifically those components. The progression above is roughly the order in which bottlenecks appear — skip steps only if you have evidence you’ll need to.
When asked to scale a system, be systematic: identify the stateful components (databases, caches, file storage) and the stateless ones (application servers). State which components are bottlenecks at which scale. Propose vertical scaling first (acknowledge the ceiling), then horizontal scaling with the required stateless design changes. Mention read replicas before sharding — sharding is complex and often unnecessary. Always bring up caching as a force multiplier that defers database scaling. Interviewers reward candidates who can articulate why a technique applies, not just name it.