Clustering
A single server has hard limits: fixed CPU, memory, disk, and network bandwidth. When your service needs to exceed those limits — or survive individual machine failures — the answer is clustering: grouping multiple servers so they function as a unified, coordinated system. Clustering is the architectural foundation beneath databases like Cassandra, message brokers like Kafka, container orchestrators like Kubernetes, and virtually every service you depend on at scale.
What Is a Cluster?
A cluster is a group of servers (called nodes) that work together and appear to clients as a single system. Each node has its own CPU, memory, and storage, but they are connected via a high-speed network and coordinated by cluster software.
From a client’s perspective, the cluster looks like one powerful machine. Internally, the nodes:
- Share workload across multiple machines
- Monitor each other’s health (heartbeating)
- Elect a coordinator when one is needed
- Redistribute work when a node fails or joins
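The health-monitoring loop above can be sketched in a few lines. This is an illustrative model, not any particular cluster's implementation; the node names and timeout values are assumptions:

```python
import time

FAILURE_TIMEOUT = 3.0  # seconds of silence before a peer is declared dead (assumed)

class ClusterMembership:
    """Tracks the last heartbeat received from each peer node."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_seen = {peer: now for peer in peers}

    def record_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from a peer.
        self.last_seen[peer] = time.monotonic()

    def failed_nodes(self):
        """Peers that have been silent longer than the timeout."""
        now = time.monotonic()
        return [p for p, t in self.last_seen.items()
                if now - t > FAILURE_TIMEOUT]

members = ClusterMembership(["node-a", "node-b", "node-c"])
members.record_heartbeat("node-a")
# In a real cluster every node runs this loop and shares its view;
# a node appearing in failed_nodes() triggers work redistribution
# or, if it was the coordinator, a new election.
```

Real implementations layer gossip protocols and phi-accrual detectors on top of this basic idea to avoid false positives on slow networks.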
Why Cluster?
Clustering solves three distinct problems:
High Availability
If one node fails, the cluster continues serving requests using the remaining nodes. There is no single point of failure. This is why production databases, message brokers, and orchestrators run on clusters rather than single machines.
Scalability
Adding more nodes increases total capacity. You can scale beyond the limits of any single machine without re-architecting your application.
Performance
Work is distributed across nodes, so requests are handled in parallel. Some cluster types (HPC, storage) specifically target raw throughput — running computations or serving data faster by splitting the problem across many machines.
Vertical scaling (scaling up) means upgrading a single server to a bigger machine. It’s simple but has an upper bound and creates downtime during upgrades. Horizontal scaling (scaling out) means adding more nodes to a cluster. It’s more complex but has no practical ceiling and allows rolling upgrades with zero downtime.
Active-Passive vs Active-Active
The most important design decision for any cluster is how nodes share responsibility for serving traffic.
Active-Passive (Failover Cluster)
One node is the active primary; it handles all requests. The other nodes are passive standbys; they monitor the primary and stay ready to take over. If the primary fails, a standby is promoted to primary — this is called failover.
Passive nodes are not wasted: they typically replicate data from the primary so they are ready to serve immediately after promotion.
- Simpler to reason about — only one node handles writes at a time, avoiding conflicts.
- Failover introduces brief downtime — usually seconds while the new primary is elected and DNS or clients switch over.
- Underutilizes hardware — standby nodes sit idle until needed.
Used by: Traditional relational databases (MySQL primary-replica, PostgreSQL streaming replication), Redis Sentinel.
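The promotion mechanics can be sketched as follows. This is a deliberately simplified model: promotion order is fixed by list position, whereas real systems (e.g. PostgreSQL with Patroni) elect the standby with the most up-to-date replicated data:

```python
class FailoverCluster:
    """Minimal active-passive sketch: one primary, ordered standbys."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # nodes[0] is the active primary
        self.healthy = set(nodes)

    @property
    def primary(self):
        return self.nodes[0]

    def handle_write(self, record):
        # Only the primary accepts writes; standbys replicate from it.
        return f"{self.primary} wrote {record!r}"

    def report_failure(self, node):
        self.healthy.discard(node)
        if node == self.primary:
            self._failover()

    def _failover(self):
        # Promote the first healthy standby to primary.
        self.nodes = [n for n in self.nodes if n in self.healthy]
        if not self.nodes:
            raise RuntimeError("no healthy nodes left")

cluster = FailoverCluster(["db-1", "db-2", "db-3"])
cluster.report_failure("db-1")    # primary dies
assert cluster.primary == "db-2"  # first standby promoted
```

The window between `report_failure` and the promoted primary accepting traffic is exactly the "seconds of downtime" described above.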
Active-Active (Multi-Primary Cluster)
All nodes handle requests simultaneously. Traffic is distributed across the entire cluster, usually via a load balancer or client-side routing. All nodes are peers.
- Maximum resource utilization — every machine is earning its keep.
- Higher throughput and lower latency — requests are spread evenly.
- More complex — writes to multiple nodes must be coordinated to avoid conflicts. Systems either require consensus protocols or accept eventual consistency.
- No failover downtime — if a node fails, its traffic is absorbed by the others with no promotion step.
Used by: Cassandra, CockroachDB, Kafka brokers, Elasticsearch clusters, Kubernetes control plane.
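Client-side routing across active peers can be sketched like this. The hash-modulo placement is illustrative only; production systems use consistent hashing so that a node failure remaps far fewer keys:

```python
import hashlib

class ActiveActiveCluster:
    """Sketch of active-active routing: every node serves traffic,
    and keys map deterministically onto whichever nodes are alive."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes)

    def owner(self, key):
        # Deterministic placement: hash the key onto a live node.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def remove(self, node):
        # No promotion step: the failed node simply leaves the pool
        # and its keys remap onto the surviving peers.
        self.nodes.remove(node)

cluster = ActiveActiveCluster(["n1", "n2", "n3"])
key = "user:42"
first_owner = cluster.owner(key)
cluster.remove(first_owner)       # that node fails
second_owner = cluster.owner(key) # a surviving peer absorbs the key
```

Note what is missing here: nothing coordinates concurrent writes to the same key through different nodes. That coordination (consensus or conflict resolution) is precisely the added complexity of active-active designs.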
| Dimension | Active-Passive | Active-Active |
|---|---|---|
| Nodes serving traffic | One (primary) | All |
| Failover downtime | Seconds (promotion) | None |
| Write conflicts | Impossible (single writer) | Must be resolved |
| Resource efficiency | Low (standbys idle) | High |
| Complexity | Lower | Higher |
| Common use | SQL databases, caches | NoSQL, message brokers, search |
Types of Clusters
Clusters are designed for different goals. Understanding the types helps you choose the right architecture for each layer of your system.
High Availability (HA) Clusters
The goal is eliminating downtime. Nodes monitor each other via heartbeats. If a node stops responding within the timeout window, the cluster initiates failover. The failed node’s virtual IP (VIP) or role is reassigned to a healthy node.
HA clusters are the standard pattern for production databases, web server pools, and any stateful service where downtime has business impact. Typical failover times are 5–30 seconds depending on the health check interval and election protocol.
Load Balancing Clusters
Traffic is distributed across multiple nodes to maximize throughput and minimize response time. All nodes are equivalent; the load balancer (or DNS) spreads requests among them. These clusters scale horizontally without limit, and node failures simply remove that node from the pool — the remaining nodes absorb the traffic.
Stateless application servers are the canonical example: any node can handle any request, so the cluster can grow or shrink dynamically.
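A minimal round-robin balancer over such a pool might look like this (a sketch; real load balancers add health checks, weights, and connection draining):

```python
class LoadBalancer:
    """Round-robin over a pool of interchangeable stateless backends."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0

    def next_backend(self):
        if not self.backends:
            raise RuntimeError("no healthy backends")
        # Rotate through whatever backends are currently in the pool.
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend

    def mark_down(self, backend):
        # A failed node is simply dropped from the pool;
        # the remaining nodes absorb its share of traffic.
        self.backends.remove(backend)

lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")  # requests now alternate between app-1 and app-3
```

Because the backends hold no state, removing or adding one requires no coordination at all, which is what makes this pattern so easy to scale.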
High Performance Computing (HPC) Clusters
The goal is raw computational throughput for CPU-intensive batch jobs — scientific simulations, ML model training, video transcoding, financial risk modeling. A single job is split into subtasks that run in parallel across all nodes. Nodes are often connected via high-speed interconnects (InfiniBand, 100 GbE) and share a fast distributed filesystem.
Examples: Hadoop clusters (MapReduce), Spark clusters, GPU training clusters (Ray, Slurm).
Storage Clusters
Data is distributed across nodes for capacity and throughput. Clusters like Ceph, GlusterFS, or HDFS stripe data across nodes and maintain replicas for durability. Reads and writes are parallelized across nodes. This is how cloud object storage (S3) and distributed databases (Cassandra, HBase) achieve petabyte scale.
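Replica placement in such systems can be sketched as hashing a key to a starting position and taking the next few nodes around a ring. This scheme is illustrative; real systems like Cassandra use consistent hashing with virtual nodes:

```python
import hashlib

def place_replicas(key, nodes, replication_factor=3):
    """Pick the nodes that store copies of `key`.

    Hash the key to a starting position, then take the next
    `replication_factor` nodes around the ring.
    """
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    start = digest % len(nodes)
    return [nodes[(start + i) % len(nodes)]
            for i in range(replication_factor)]

storage_nodes = ["s1", "s2", "s3", "s4", "s5"]
replicas = place_replicas("block-0017", storage_nodes)
# Three distinct nodes hold copies of the block, so losing
# any one of them still leaves two copies for durability.
```

Reads can then be served from any replica in parallel, which is where the throughput gains come from.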
Real clusters combine these types. Cassandra is simultaneously a storage cluster (data sharded across nodes), a load balancing cluster (any node can serve any read), and an HA cluster (replicas ensure data survives node loss). Kubernetes is an HA cluster for its control plane and a load balancing cluster for workloads.
Leader Election
Many clustered systems need one node to act as coordinator — a leader or master. The leader might be responsible for accepting writes, assigning work to workers, managing cluster membership, or coordinating distributed transactions. When the current leader fails, the cluster must elect a new one without human intervention.
Leader election is the process by which nodes agree on which one is the leader. This is a distributed consensus problem: all nodes must agree on the same leader, even if messages are delayed or dropped, and even if multiple nodes try to become leader at the same time.
What Good Election Algorithms Guarantee
- Safety — at most one leader is active at any moment. Two nodes simultaneously believing they are the leader (split-brain) produces divergent writes and data corruption.
- Liveness — a new leader is always eventually elected after the current one fails.
- Durability — decisions committed by a leader survive failures and restarts; a new leader never silently discards them.
Raft
Raft is the most widely used modern consensus algorithm, designed explicitly to be easier to understand than the notoriously complex Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.
In Raft, each node is in one of three states: follower, candidate, or leader. Time is divided into terms. In each term, at most one leader is elected. If a follower does not hear from a leader within an election timeout, it becomes a candidate and requests votes. A candidate that receives votes from a majority of nodes becomes the new leader for that term.
Raft is used in: etcd (the backing store for Kubernetes), CockroachDB, TiKV, Consul.
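The term-and-vote mechanics above can be illustrated with a toy model. This sketch covers only vote counting; real Raft also compares log completeness before granting a vote, and uses randomized election timeouts to break ties:

```python
class RaftNode:
    """Toy sketch of Raft's vote-request handling."""

    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = None  # at most one vote granted per term

    def request_vote(self, candidate, term):
        if term > self.current_term:
            # Newer term observed: adopt it and forget the old vote.
            self.current_term = term
            self.voted_for = None
        if term == self.current_term and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

def run_election(candidate, peers, term):
    # The candidate votes for itself, then asks every peer.
    votes = 1 + sum(p.request_vote(candidate.name, term) for p in peers)
    majority = (len(peers) + 1) // 2 + 1
    return votes >= majority
```

The one-vote-per-term rule is what enforces safety: once a majority has voted for one candidate in a term, no rival can assemble a second majority in that same term.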
Zookeeper Atomic Broadcast (ZAB)
Apache ZooKeeper uses ZAB, a protocol similar in spirit to Raft. ZooKeeper itself is a coordination service that other distributed systems use to implement leader election (Kafka, Hadoop YARN, HBase). ZAB guarantees reliable delivery and total ordering of state changes across the cluster.
Split-brain occurs when a network partition separates a cluster into two isolated groups, and each group elects its own leader. Both leaders accept writes, creating two divergent versions of state. When the partition heals, you have a conflict with no clean resolution. Consensus algorithms prevent split-brain by requiring a majority quorum (more than half the nodes) to elect a leader — only one partition can have a majority.
Quorum
A quorum is the minimum number of nodes that must agree for an operation to proceed. For a cluster of N nodes, quorum is typically floor(N/2) + 1 (a simple majority). This means:
- 3-node cluster: quorum = 2 (can tolerate 1 failure)
- 5-node cluster: quorum = 3 (can tolerate 2 failures)
- 7-node cluster: quorum = 4 (can tolerate 3 failures)
Quorum ensures that any two quorums overlap in at least one node, which prevents two separate majorities from making conflicting decisions. This is also why cluster sizes are almost always odd: a 4-node cluster needs a quorum of 3 and still tolerates only one failure — the same as a 3-node cluster — so the extra node adds cost without adding fault tolerance.
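The arithmetic can be checked directly:

```python
def quorum(n):
    """Minimum majority for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n):
    """Failures the cluster survives while still reaching quorum."""
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Running this shows that 4 nodes tolerate the same single failure as 3, and 6 tolerate the same two as 5, so even-sized clusters pay for an extra node without gaining fault tolerance.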
Coordination Services
Rather than implementing leader election and distributed consensus from scratch, most systems delegate to a dedicated coordination service. These services offer primitives like distributed locks, leader election, service registration, and key-value storage with strong consistency guarantees.
Apache ZooKeeper
The original coordination service for the Hadoop ecosystem. Provides a hierarchical namespace (like a filesystem) with watches: clients register callbacks that fire when a node changes. ZooKeeper underpins Kafka (broker election, topic metadata), HBase, and Hadoop YARN.
Architecture: A ZooKeeper ensemble is itself a cluster (typically 3 or 5 nodes). It uses ZAB for consensus. Reads can be served by any node; writes go through the leader.
etcd
A distributed key-value store built on Raft, originally developed at CoreOS and now the default state store for Kubernetes. Every cluster object (pods, deployments, secrets, configmaps) lives in etcd. etcd prioritizes consistency over availability: if it loses quorum, it stops accepting writes rather than risk divergent state.
Kubernetes dependency: The etcd cluster is the most critical component of a Kubernetes control plane. Losing etcd means losing all cluster state.
Consul
A service mesh and coordination tool from HashiCorp. Beyond key-value storage and leader election, Consul provides service discovery (services register themselves and clients query by name), health checking, and a built-in DNS interface. Often used when teams want a single tool for both coordination and service discovery.
| Service | Protocol | Primary use | Used by |
|---|---|---|---|
| ZooKeeper | ZAB | Coordination, distributed locks | Kafka, HBase, Hadoop |
| etcd | Raft | Cluster state store | Kubernetes, CockroachDB |
| Consul | Raft | Service discovery + coordination | Nomad, Vault, microservices meshes |
Clustering vs Load Balancing
The terms are related but distinct. Load balancing is a traffic distribution mechanism; clustering is a system architecture. Clusters often use load balancing internally.
| Dimension | Load Balancing | Clustering |
|---|---|---|
| What it does | Routes requests across servers | Groups servers into a coordinated unit |
| Awareness of state | Typically stateless (routes by algorithm) | Nodes share state and coordinate |
| Failover | Removes unhealthy backends from pool | Promotes standbys, re-elects leaders |
| Who manages it | Separate load balancer component | The cluster itself, via consensus |
When designing a stateful service (database, cache, message broker), mention clustering explicitly: how many nodes, active-passive or active-active, how leader election works, and what the quorum size is. For stateless application servers, a load balancer in front of a pool of identical nodes is usually sufficient — this is a degenerate form of a cluster without coordination overhead. Reserve the full clustering discussion for components where consistency and failover matter.