How Load Balancers Work
A single server can only handle so much traffic. Load balancers are the components that let you scale horizontally — spreading work across a fleet of servers so no single machine becomes the bottleneck, and so that failures in individual servers don’t take down the whole system. Understanding how load balancers work, and the trade-offs between different algorithms and balancer types, is foundational knowledge for every system design interview.
What Is a Load Balancer?
A load balancer sits between clients and a pool of backend servers. Every incoming request arrives at the load balancer, which decides which server should handle it and forwards the request there. The client never communicates directly with a backend server — it only ever sees the load balancer’s address.
This indirection gives you three things:
- Horizontal scalability — add more backend servers to handle more traffic without changing anything the client sees.
- High availability — if one server fails, the load balancer stops sending traffic to it. Other servers absorb the load.
- Operational flexibility — you can deploy, restart, or drain individual servers without taking the service offline.
A load balancer is a specific type of reverse proxy. Every load balancer is a reverse proxy, but not every reverse proxy is a load balancer: a reverse proxy with a single backend is still a reverse proxy, and only becomes a load balancer when it distributes traffic across multiple backends.
Load Balancing Algorithms
The algorithm determines which server handles each request. The right choice depends on whether your servers are identical, whether requests have varying cost, and whether client state matters.
Round Robin
The simplest algorithm. Requests are distributed sequentially — first request goes to server 1, second to server 2, third to server 3, then back to server 1, and so on.
Best for: Servers with identical specs where each request takes roughly the same time.
Weakness: Ignores server load. A slow server accumulates a queue while fast servers sit idle.
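The rotation can be sketched in a few lines of Python (server names here are placeholders):

```python
import itertools

servers = ["server-1", "server-2", "server-3"]
rotation = itertools.cycle(servers)  # endless sequential rotation

# Five requests in a row wrap back around to server-1:
assignments = [next(rotation) for _ in range(5)]
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2']
```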
Weighted Round Robin
An extension of Round Robin where each server is assigned a weight proportional to its capacity. A server with weight 3 receives three requests for every one request sent to a server with weight 1.
Best for: Heterogeneous fleets where some servers are more powerful than others.
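One naive way to implement the weighting is to expand each server into the rotation in proportion to its weight (a sketch only; production balancers use smoother interleavings, such as nginx's smooth weighted round robin):

```python
import itertools

def weighted_rotation(weights):
    # Expand each server into the cycle proportionally to its weight:
    # {"big": 3, "small": 1} yields big, big, big, small, big, big, ...
    pool = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(pool)

rotation = weighted_rotation({"big": 3, "small": 1})
first_four = [next(rotation) for _ in range(4)]  # big gets 3 of every 4 requests
```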
Least Connections
Routes each new request to the server with the fewest active connections at that moment.
Best for: Workloads where requests vary significantly in duration — long-lived connections like WebSockets, file uploads, or database queries. Naturally adapts to slower servers without manual weighting.
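Given the balancer's connection table, the selection itself is a one-liner (a sketch; the names are illustrative):

```python
def pick_least_connections(active_connections):
    # active_connections maps server name -> number of in-flight requests
    return min(active_connections, key=active_connections.get)

choice = pick_least_connections({"a": 12, "b": 3, "c": 7})  # -> "b"
```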
Weighted Least Connections
Combines least connections with per-server weights: each request goes to the server with the lowest ratio of active connections to assigned capacity.
Best for: Mixed fleets handling variable-duration requests.
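A minimal sketch of that ratio-based selection (server names and weights are made up):

```python
def pick_weighted_least_connections(active_connections, weights):
    # Lowest connections-per-unit-of-capacity wins: a weight-4 server
    # with 8 connections (ratio 2.0) beats a weight-1 server with 3 (3.0).
    return min(active_connections,
               key=lambda s: active_connections[s] / weights[s])

choice = pick_weighted_least_connections(
    {"big": 8, "small": 3}, {"big": 4, "small": 1})  # -> "big"
```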
IP Hash (Source IP Affinity)
The client’s IP address is hashed to deterministically select a server. The same client always reaches the same backend (as long as the server pool doesn’t change).
Best for: Applications that store session state on individual servers, or where cache locality matters. This is a simple form of sticky sessions.
Weakness: Uneven distribution when many clients share one IP (e.g. behind a corporate NAT). Reshuffles all assignments when servers are added or removed.
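The deterministic mapping can be sketched with a stable hash (md5 here is for distribution, not security; any stable hash works):

```python
import hashlib

def pick_by_ip(client_ip, servers):
    # Same IP always hashes to the same index while the pool is unchanged.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["s1", "s2", "s3"]
# Repeated calls for the same client IP return the same server.
same = pick_by_ip("203.0.113.7", servers) == pick_by_ip("203.0.113.7", servers)
```

Note that resizing `servers` remaps most clients, which is the reshuffling weakness described above; consistent hashing mitigates it but is beyond this sketch.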
Random
A server is picked randomly for each request. Statistically approaches Round Robin at high volume with less coordination overhead.
Best for: Stateless, homogeneous fleets at high request rates where the overhead of tracking connections or sequence would outweigh the benefit.
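In code this is the smallest of the algorithms, with no state at all:

```python
import random

def pick_random(servers):
    # No counters, no connection table; evens out over many requests.
    return random.choice(servers)
```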
| Algorithm | State tracked | Best when |
|---|---|---|
| Round Robin | Sequence counter | Equal servers, equal request cost |
| Weighted Round Robin | Sequence + weights | Unequal server capacity |
| Least Connections | Active connection count | Variable request duration |
| Weighted Least Connections | Connections + weights | Variable duration + unequal capacity |
| IP Hash | Hash of client IP | Session affinity needed |
| Random | None | High-throughput stateless services |
Least connections is a safe default for most web services. It handles variable request durations well without requiring you to pre-configure weights. Round robin works fine when requests are uniformly fast (e.g. a simple REST API with database-backed responses of consistent cost).
L4 vs L7 Load Balancers
Load balancers operate at different layers of the OSI model. The layer determines how much the balancer understands about the traffic it’s routing.
Layer 4 (Transport Layer)
An L4 load balancer operates on IP addresses and TCP/UDP ports. It sees: source IP, destination IP, source port, destination port. It does not inspect the payload — it just forwards packets.
- Faster — minimal parsing, can be implemented in hardware or the kernel.
- Protocol-agnostic — works with any TCP or UDP protocol.
- Cannot route based on content — no URL-based routing, no cookie inspection, no HTTP header awareness.
L4 load balancers are common in front of databases, game servers, and any non-HTTP service.
Layer 7 (Application Layer)
An L7 load balancer understands the application protocol — typically HTTP/HTTPS. It can inspect headers, cookies, URLs, and the request body before deciding where to route.
This enables powerful routing rules:
- Route /api/* to API servers and /static/* to a CDN or file server.
- Inspect the Cookie header to implement sticky sessions.
- Route based on the Host header for multi-tenant virtual hosting.
- Terminate TLS at the balancer and forward plain HTTP to backends (TLS offloading).
- Rewrite URLs or inject headers before forwarding.
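A hypothetical L7 routing decision might look like this inside the balancer (pool names and hostnames are invented for illustration):

```python
def route(host, path):
    # Content-based routing: only possible at L7, where the balancer
    # has parsed the HTTP request line and headers.
    if path.startswith("/api/"):
        return "api-pool"
    if path.startswith("/static/"):
        return "static-pool"
    if host == "tenant-a.example.com":   # Host-header virtual hosting
        return "tenant-a-pool"
    return "default-pool"
```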
- More CPU-intensive — must fully parse HTTP requests.
- Requires TLS termination to inspect HTTPS traffic.
- Feature-rich — the standard choice for web applications.
| Dimension | L4 | L7 |
|---|---|---|
| OSI layer | Transport (4) | Application (7) |
| Inspects | IP + port | HTTP headers, URL, cookies, body |
| Content-based routing | No | Yes |
| TLS termination | Optional | Standard (to inspect HTTPS) |
| Performance | Higher | Lower (parsing overhead) |
| Use cases | Databases, game servers, raw TCP | Web APIs, microservices, multi-tenant apps |
AWS ALB, GCP HTTPS Load Balancer, and Nginx acting as a proxy are all L7. AWS NLB and GCP TCP Load Balancer are L4. In most web system designs, you’ll default to L7.
Health Checks
A load balancer is only useful if it routes to healthy backends. Health checks let the balancer continuously verify that each server is alive and capable of serving requests.
Passive Health Checks
The balancer monitors actual request traffic. If a server returns 5xx errors or times out repeatedly, it is marked unhealthy and removed from rotation. No extra traffic is generated — the health signal comes from real requests.
Downside: Clients see failures before the server is marked unhealthy.
Active Health Checks
The load balancer periodically sends synthetic probe requests (e.g. GET /health) to each backend, independent of client traffic. If a probe fails (connection refused, timeout, wrong status code), the server is ejected. Once probes succeed again, it is re-added.
Advantage: Catches failures before they affect clients.
A typical active health check configuration:
- Interval: every 10–30 seconds
- Timeout: 5 seconds
- Unhealthy threshold: 2–3 consecutive failures to eject
- Healthy threshold: 2–3 consecutive successes to re-add
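The threshold logic above amounts to a small state machine per backend; a sketch (not any particular balancer's implementation):

```python
class BackendHealth:
    """Flips healthy/unhealthy only after N consecutive opposing probes."""

    def __init__(self, unhealthy_after=3, healthy_after=2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive probe results opposing current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0          # result agrees with current state
            return self.healthy
        self._streak += 1
        limit = self.unhealthy_after if self.healthy else self.healthy_after
        if self._streak >= limit:
            self.healthy = probe_ok   # flip after N consecutive results
            self._streak = 0
        return self.healthy
```

A single failed probe does not eject the server; this hysteresis prevents transient blips from flapping backends in and out of rotation.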
Your /health endpoint should test actual readiness — can the server connect to its database, cache, and dependencies? A server that returns 200 but can’t reach its DB will cause application errors even after passing a superficial health check. Return a 503 if any critical dependency is unavailable.
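A readiness-style handler in that spirit might look like this (the check callables stand in for real database and cache pings):

```python
def readiness(checks):
    """Return (status_code, body) for a /health endpoint.

    checks maps a dependency name to a zero-arg callable that returns
    True when that dependency is reachable.
    """
    failed = []
    for name, check in checks.items():
        try:
            if not check():
                failed.append(name)
        except Exception:          # a crashing check counts as unreachable
            failed.append(name)
    if failed:
        return 503, "unavailable: " + ", ".join(failed)
    return 200, "ok"
```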
Sticky Sessions
Some applications store client state (shopping carts, session tokens, in-progress uploads) on the specific server that handled the first request. Sticky sessions (also called session affinity or session persistence) ensure all requests from the same client go to the same backend.
Cookie-Based Stickiness
The load balancer sets a cookie (e.g. SERVERID=backend-3) on the first response. Subsequent requests from that browser include the cookie, and the balancer routes them to the same server.
Most reliable — survives client IP changes (mobile users on cellular). Used by AWS ALB’s sticky sessions feature.
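The routing decision for cookie-based stickiness, sketched (the SERVERID cookie name mirrors the example above; `pick` is whatever fallback algorithm the balancer normally uses):

```python
def route_sticky(cookies, servers, pick):
    # cookies: dict parsed from the request's Cookie header.
    # Returns (chosen server, cookies to set on the response).
    stuck = cookies.get("SERVERID")
    if stuck in servers:
        return stuck, {}                     # honor existing affinity
    chosen = pick(servers)                   # first visit: normal algorithm
    return chosen, {"SERVERID": chosen}      # pin subsequent requests
```

Note the `stuck in servers` guard: if the pinned server has been removed from the pool, the client is quietly re-assigned rather than routed to a dead backend.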
IP-Hash Stickiness
The client’s IP is hashed to a backend (equivalent to the IP Hash algorithm above). Simpler but breaks when clients share an IP or when they change networks.
Drawbacks of Sticky Sessions
Sticky sessions introduce coupling between clients and servers:
- If a server fails, all its “stuck” clients lose their session and must restart.
- Traffic distribution becomes uneven if some clients are more active than others.
- Horizontal scaling is harder — adding a new server doesn’t immediately attract new traffic from stuck clients.
The standard recommendation is to move session state out of application servers entirely — into a shared cache (Redis, Memcached) or a database. Stateless backends can be load-balanced with any algorithm, scale freely, and survive individual server failures without client impact.
Redundant Load Balancers
A single load balancer is a single point of failure. If it goes down, no traffic can reach your backends regardless of how healthy they are.
The standard solution is an active-passive or active-active pair:
Active-passive: Two load balancers share a virtual IP (VIP). The active one handles all traffic. The passive one continuously monitors the active. If the active fails, the passive takes over the VIP in seconds (via protocols like VRRP/HSRP). Clients notice a brief interruption at most.
Active-active: Both load balancers handle traffic simultaneously, typically behind a DNS round-robin or an upstream L4 router. More complex but doubles throughput and provides redundancy.
Cloud-managed load balancers (AWS ALB/NLB, GCP HTTPS LB) handle their own redundancy internally — you don’t manage this yourself. On-premises or self-managed deployments (HAProxy, Nginx, Envoy) need explicit HA configuration.
Managed Load Balancers
In practice, most teams use a managed service rather than running their own load balancer software:
- AWS Application Load Balancer (ALB) — L7, HTTP/HTTPS, WebSocket support, path/host-based routing, sticky sessions, WAF integration. The default choice for web services on AWS.
- AWS Network Load Balancer (NLB) — L4, ultra-low latency, static IPs, handles millions of requests/sec. Use for TCP/UDP workloads or when a static IP is required.
- GCP HTTPS Load Balancer — global L7 with anycast frontend, integrates with Cloud CDN and Cloud Armor (WAF).
- Azure Application Gateway — L7 with SSL termination, WAF, autoscaling.
- HAProxy — open-source, high-performance software load balancer/proxy. Industry standard for self-hosted setups. Supports both L4 and L7 modes.
- Nginx — widely used as both a web server and L7 load balancer/reverse proxy. Easier to configure than HAProxy for simple HTTP use cases.
- Envoy — cloud-native proxy designed for microservices. Used as the data plane in Istio service meshes. Supports advanced features like circuit breaking, retries, and distributed tracing.
When asked to design a scalable system, introduce a load balancer early — between clients and your web tier, and between your web tier and backend services. Mention L7 for HTTP workloads, active health checks, and stateless backends to avoid sticky-session complexity. If asked about fault tolerance, mention active-passive LB pairs or the managed service’s built-in redundancy.