Foundations

How Load Balancers Work


A single server can only handle so much traffic. Load balancers are the components that let you scale horizontally — spreading work across a fleet of servers so no single machine becomes the bottleneck, and so that failures in individual servers don’t take down the whole system. Understanding how load balancers work, and the trade-offs between different algorithms and balancer types, is foundational knowledge for every system design interview.

What Is a Load Balancer?

A load balancer sits between clients and a pool of backend servers. Every incoming request arrives at the load balancer, which decides which server should handle it and forwards the request there. The client never communicates directly with a backend server — it only ever sees the load balancer’s address.

This indirection gives you three things:

  1. Horizontal scalability — add more backend servers to handle more traffic without changing anything the client sees.
  2. High availability — if one server fails, the load balancer stops sending traffic to it. Other servers absorb the load.
  3. Operational flexibility — you can deploy, restart, or drain individual servers without taking the service offline.
💡
Load Balancer vs Reverse Proxy

A load balancer is a specific type of reverse proxy: all load balancers are reverse proxies, but not all reverse proxies are load balancers. A reverse proxy with a single backend is still a reverse proxy; it becomes a load balancer only when it distributes traffic across multiple backends.

Load Balancing Algorithms

The algorithm determines which server handles each request. The right choice depends on whether your servers are identical, whether requests have varying cost, and whether client state matters.

Round Robin

The simplest algorithm. Requests are distributed sequentially — first request goes to server 1, second to server 2, third to server 3, then back to server 1, and so on.

Best for: Servers with identical specs where each request takes roughly the same time.

Weakness: Ignores server load. A slow server accumulates a queue while fast servers sit idle.
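As a sketch, round robin is just a cycling iterator over the pool — the server names here are placeholders:

```python
from itertools import cycle

# hypothetical backend names; a real balancer would hold addresses
servers = ["s1", "s2", "s3"]
rotation = cycle(servers)  # the only state: a position in the sequence

def pick() -> str:
    return next(rotation)
```

Each call advances the cycle: s1, s2, s3, s1, and so on, regardless of how busy each server is.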

Weighted Round Robin

An extension of Round Robin where each server is assigned a weight proportional to its capacity. A server with weight 3 receives three requests for every one request sent to a server with weight 1.

Best for: Heterogeneous fleets where some servers are more powerful than others.
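A naive sketch of weighted round robin simply repeats each server in the rotation according to its weight (production implementations such as Nginx use a "smooth" variant to avoid sending the bursts this produces):

```python
from itertools import cycle

# weight 3 : weight 1, matching the example above
weights = {"big": 3, "small": 1}

# naive expansion: repeat each server `weight` times per cycle
rotation = cycle([s for s, w in weights.items() for _ in range(w)])

def pick() -> str:
    return next(rotation)
```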

Least Connections

Routes each new request to the server with the fewest active connections at that moment.

Best for: Workloads where requests vary significantly in duration — long-lived connections like WebSockets, file uploads, or database queries. Naturally adapts to slower servers without manual weighting.
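As a sketch, least connections tracks one counter per server and picks the minimum; the counter is incremented when a connection opens and decremented when it closes:

```python
# active connection count per server -- the state this algorithm tracks
active = {"s1": 0, "s2": 0, "s3": 0}

def pick() -> str:
    server = min(active, key=active.get)  # fewest active connections wins
    active[server] += 1                   # a connection just opened
    return server

def release(server: str) -> None:
    active[server] -= 1                   # the connection finished
```

A slow server holds its connections longer, so its count stays high and it naturally receives fewer new requests.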

Weighted Least Connections

Combines least connections with per-server weights: each new request goes to the server with the lowest ratio of active connections to weight — i.e. the most spare capacity for its size.

Best for: Mixed fleets handling variable-duration requests.

IP Hash (Source IP Affinity)

The client’s IP address is hashed to deterministically select a server. The same client always reaches the same backend (as long as the server pool doesn’t change).

Best for: Applications that store session state on individual servers, or where cache locality matters. This is a simple form of sticky sessions.

Weakness: Uneven distribution when many clients share one IP (e.g. behind a corporate NAT). Reshuffles all assignments when servers are added or removed.
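A sketch of the selection, using a stable hash (not Python's per-process `hash()`) so the mapping survives restarts:

```python
import hashlib

servers = ["s1", "s2", "s3"]

def pick(client_ip: str) -> str:
    # md5 here is for stable distribution, not security
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

The modulo over the pool size is exactly why adding or removing a server reshuffles nearly every mapping, as noted above.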

Random

A server is picked randomly for each request. Statistically approaches Round Robin at high volume with less coordination overhead.

Best for: Stateless, homogeneous fleets at high request rates where the overhead of tracking connections or sequence would outweigh the benefit.

| Algorithm | State tracked | Best when |
| --- | --- | --- |
| Round Robin | Sequence counter | Equal servers, equal request cost |
| Weighted Round Robin | Sequence + weights | Unequal server capacity |
| Least Connections | Active connection count | Variable request duration |
| Weighted Least Connections | Connections + weights | Variable duration + unequal capacity |
| IP Hash | Hash of client IP | Session affinity needed |
| Random | None | High-throughput stateless services |
💡
Default Choice

Least connections is a safe default for most web services. It handles variable request durations well without requiring you to pre-configure weights. Round robin works fine when requests are uniformly fast (e.g. a simple REST API with database-backed responses of consistent cost).

L4 vs L7 Load Balancers

Load balancers operate at different layers of the OSI model. The layer determines how much the balancer understands about the traffic it’s routing.

Layer 4 (Transport Layer)

An L4 load balancer operates on IP addresses and TCP/UDP ports. It sees: source IP, destination IP, source port, destination port. It does not inspect the payload — it just forwards packets.

L4 load balancers are common in front of databases, game servers, and any non-HTTP service.

Layer 7 (Application Layer)

An L7 load balancer understands the application protocol — typically HTTP/HTTPS. It can inspect headers, cookies, URLs, and the request body before deciding where to route.

This enables powerful routing rules: path-based routing (e.g. /api/* to one pool, /static/* to another), host-based routing for multi-tenant apps, and header- or cookie-based routing for A/B tests and canary releases.

| Dimension | L4 | L7 |
| --- | --- | --- |
| OSI layer | Transport (4) | Application (7) |
| Inspects | IP + port | HTTP headers, URL, cookies, body |
| Content-based routing | No | Yes |
| TLS termination | Optional | Standard (to inspect HTTPS) |
| Performance | Higher | Lower (parsing overhead) |
| Use cases | Databases, game servers, raw TCP | Web APIs, microservices, multi-tenant apps |
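As a sketch of content-based routing, an L7 balancer might match the URL path against a route table before choosing a backend pool — the paths and pool names here are hypothetical:

```python
# hypothetical route table, checked top to bottom (most specific first)
routes = [
    ("/api/", "api-pool"),
    ("/static/", "cdn-pool"),
    ("/", "web-pool"),  # catch-all
]

def choose_pool(path: str) -> str:
    for prefix, pool in routes:
        if path.startswith(prefix):
            return pool
    return "web-pool"  # unreachable given the "/" catch-all, kept for safety
```

An L4 balancer cannot do this at all: by the time you know the path, you have already parsed HTTP.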
💡
Most Cloud LBs Are L7

AWS ALB, GCP HTTPS Load Balancer, and Nginx acting as a proxy are all L7. AWS NLB and GCP TCP Load Balancer are L4. In most web system designs, you’ll default to L7.

Health Checks

A load balancer is only useful if it routes to healthy backends. Health checks let the balancer continuously verify that each server is alive and capable of serving requests.

Passive Health Checks

The balancer monitors actual request traffic. If a server returns 5xx errors or times out repeatedly, it is marked unhealthy and removed from rotation. No extra traffic is generated — the health signal comes from real requests.

Downside: Clients see failures before the server is marked unhealthy.

Active Health Checks

The load balancer periodically sends synthetic probe requests (e.g. GET /health) to each backend, independent of client traffic. If a probe fails (connection refused, timeout, wrong status code), the server is ejected. Once probes succeed again, it is re-added.

Advantage: Catches failures before they affect clients.

A typical active health check probes each backend every few seconds, ejects a server after a small number of consecutive failures (commonly 2–3), and re-adds it after a similar number of consecutive successes.
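As a sketch of the eject/re-add logic behind such a configuration — the interval and thresholds below are illustrative, not defaults from any particular product:

```python
# illustrative values -- real products expose these as config knobs
CHECK_INTERVAL_S = 5    # probe every 5 seconds
UNHEALTHY_AFTER = 3     # consecutive failures before ejection
HEALTHY_AFTER = 2       # consecutive successes before re-adding

class BackendHealth:
    def __init__(self):
        self.healthy = True
        self.fails = 0
        self.oks = 0

    def record(self, probe_ok: bool) -> None:
        if probe_ok:
            self.fails = 0
            self.oks += 1
            if not self.healthy and self.oks >= HEALTHY_AFTER:
                self.healthy = True   # re-add to rotation
        else:
            self.oks = 0
            self.fails += 1
            if self.healthy and self.fails >= UNHEALTHY_AFTER:
                self.healthy = False  # eject from rotation
```

In a real balancer, `record()` would be fed by a probe such as a GET /health with a short timeout, run every `CHECK_INTERVAL_S` seconds per backend.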

⚠️
Health Check Endpoint Design

Your /health endpoint should test actual readiness — can the server connect to its database, cache, and dependencies? A server that returns 200 but can’t reach its DB will cause application errors even after passing a superficial health check. Return a 503 if any critical dependency is unavailable.
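A minimal sketch of such a readiness handler, with the dependency checks passed in as callables (the check names are placeholders for real DB/cache probes):

```python
def health(checks: dict) -> tuple[int, dict]:
    """Run every dependency check; return 503 if any critical one fails."""
    results = {name: bool(check()) for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results
```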

Sticky Sessions

Some applications store client state (shopping carts, session tokens, in-progress uploads) on the specific server that handled the first request. Sticky sessions (also called session affinity or session persistence) ensure all requests from the same client go to the same backend.

Cookie-Based Stickiness

The load balancer sets a cookie (e.g. SERVERID=backend-3) on the first response. Subsequent requests from that browser include the cookie, and the balancer routes them to the same server.

Most reliable — survives client IP changes (mobile users on cellular). Used by AWS ALB’s sticky sessions feature.
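A sketch of the routing decision, assuming the balancer reads and sets a SERVERID cookie as described (server names are placeholders):

```python
import random

servers = ["backend-1", "backend-2", "backend-3"]

def route(request_cookies: dict) -> tuple[str, dict]:
    """Return (chosen server, cookies to set on the response)."""
    pinned = request_cookies.get("SERVERID")
    if pinned in servers:             # sticky: honour the existing pin
        return pinned, {}
    chosen = random.choice(servers)   # first request: pick and pin
    return chosen, {"SERVERID": chosen}
```

Note the fallback when the pinned server has left the pool: the client is simply re-pinned, which is when any server-local session state is lost.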

IP-Hash Stickiness

The client’s IP is hashed to a backend (equivalent to the IP Hash algorithm above). Simpler but breaks when clients share an IP or when they change networks.

Drawbacks of Sticky Sessions

Sticky sessions introduce coupling between clients and servers: load can become uneven when a few heavy clients pin to one backend, removing or restarting a server loses the sessions it holds, and draining or autoscaling becomes harder because in-flight sessions are tied to specific machines.

Prefer Stateless Backends

The standard recommendation is to move session state out of application servers entirely — into a shared cache (Redis, Memcached) or a database. Stateless backends can be load-balanced with any algorithm, scale freely, and survive individual server failures without client impact.

Redundant Load Balancers

A single load balancer is a single point of failure. If it goes down, no traffic can reach your backends regardless of how healthy they are.

The standard solution is an active-passive or active-active pair:

Active-passive: Two load balancers share a virtual IP (VIP). The active one handles all traffic. The passive one continuously monitors the active. If the active fails, the passive takes over the VIP in seconds (via protocols like VRRP/HSRP). Clients notice a brief interruption at most.

Active-active: Both load balancers handle traffic simultaneously, typically behind a DNS round-robin or an upstream L4 router. More complex but doubles throughput and provides redundancy.

Cloud-managed load balancers (AWS ALB/NLB, GCP HTTPS LB) handle their own redundancy internally — you don’t manage this yourself. On-premises or self-managed deployments (HAProxy, Nginx, Envoy) need explicit HA configuration.

Managed Load Balancers

In practice, most teams use a managed service (AWS ALB/NLB, GCP's load balancers) rather than running their own load balancer software (HAProxy, Nginx, Envoy) — trading fine-grained control for built-in redundancy and far less operational overhead.

💡
In System Design Interviews

When asked to design a scalable system, introduce a load balancer early — between clients and your web tier, and between your web tier and backend services. Mention L7 for HTTP workloads, active health checks, and stateless backends to avoid sticky-session complexity. If asked about fault tolerance, mention active-passive LB pairs or the managed service’s built-in redundancy.