Availability
Every system fails eventually. Hard drives fail, networks partition, software panics, data centers flood. Availability is the measure of how often your system is usable when users need it. Designing for high availability means accepting that individual components will fail and building systems that survive those failures — staying online even as parts break around them. This is one of the most important properties in distributed system design, and nearly every architectural decision has availability implications.
What Is Availability?
Availability is the fraction of time a system is operational and able to serve requests. More precisely, it is the probability that a system is functioning correctly at any given moment.
Availability is typically expressed as a percentage:
Availability = Uptime / (Uptime + Downtime)
A system that is down for 1 hour in a month has an availability of roughly 99.86%. That sounds impressive, but 1 hour of downtime during peak traffic can be catastrophic for a business-critical service.
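The formula is easy to sanity-check in code. A minimal sketch that reproduces the example above (one hour of downtime in a roughly 730-hour month):

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), as a fraction."""
    return uptime_hours / (uptime_hours + downtime_hours)

# One hour down in a ~730-hour month:
a = availability(uptime_hours=729, downtime_hours=1)
print(f"{a:.2%}")  # → 99.86%
```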
Availability is about uptime: is the system reachable right now? Reliability is about correctness over time: does the system produce the right results consistently? A system can be highly available (always up) but unreliable (returning wrong data). In practice, both matter — but availability is the more commonly specified SLA metric.
Measuring Availability: The Nines
Availability targets are described in “nines” — the number of nines in the percentage. Each additional nine is an order-of-magnitude reduction in allowed downtime:
| Availability | Nines | Downtime / year | Downtime / month | Downtime / week |
|---|---|---|---|---|
| 90% | One nine | 36.5 days | ~73 hours | ~16.8 hours |
| 99% | Two nines | 3.65 days | ~7.3 hours | ~1.68 hours |
| 99.9% | Three nines | 8.77 hours | ~43.8 min | ~10.1 min |
| 99.99% | Four nines | 52.6 min | ~4.4 min | ~1.01 min |
| 99.999% | Five nines | 5.26 min | ~26.3 sec | ~6.05 sec |
| 99.9999% | Six nines | 31.5 sec | ~2.6 sec | ~0.6 sec |
The jump from three nines (8.77 hours/year) to four nines (52.6 minutes/year) requires significant architectural investment. The jump from four to five nines is even harder and usually demands active-active multi-region deployments.
Most web services target 99.9% (three nines). Financial, medical, and telecommunications systems often target 99.99% or beyond. Six nines is rare and typically applies only to critical infrastructure like telephone switching systems. The cost of each additional nine grows non-linearly — achieving 99.999% may cost 10–100× more than 99.9%.
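The downtime budgets in the table can be derived with a short helper. A sketch, assuming a 365.25-day year for the period length:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_budget(availability_pct: float,
                    period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for a given availability target."""
    return period_seconds * (1 - availability_pct / 100)

# Three nines vs four nines, per year:
print(downtime_budget(99.9) / 3600)  # ≈ 8.77 hours
print(downtime_budget(99.99) / 60)   # ≈ 52.6 minutes
```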
High Availability
High availability (HA) is the design goal of keeping a system operational despite component failures. HA systems are built on two principles:
- Eliminate single points of failure (SPOFs) — any component whose failure would bring down the entire system.
- Design for graceful degradation — when parts fail, the system continues serving requests at reduced capacity rather than failing completely.
Single Points of Failure
A single point of failure is any component that, if it fails, causes the entire system to fail. Common SPOFs include:
- A single web server — if it crashes, the site goes down.
- A single database — if it fails, no reads or writes are possible.
- A single load balancer — if it fails, traffic cannot be distributed.
- A single data center — power outage, network partition, or physical disaster takes everything offline.
- A single DNS server — if resolution fails, clients can’t find your service.
The process of designing for HA is largely the process of systematically identifying and eliminating SPOFs by adding redundancy at each layer.
Graceful Degradation
A well-designed HA system degrades gracefully rather than failing catastrophically. If the recommendation service is down, the main product pages still load — just without personalized recommendations. If the image CDN has issues, pages render with alt text rather than broken images. Features fail independently; the core path remains available.
The opposite of graceful degradation is a cascading failure: one component fails, overloading its neighbors, which then fail, propagating failure through the entire system. This is often worse than a single SPOF because the failure is non-obvious and hard to contain. Circuit breakers (covered in a later guide) are the primary tool for preventing cascading failures.
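In application code, graceful degradation often comes down to a try/fallback around each non-critical dependency. A minimal sketch of the recommendations example (the service call and its failure mode are hypothetical):

```python
def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical call to a recommendation service that is currently down."""
    raise TimeoutError("recommendation service unavailable")

def render_product_page(user_id: str) -> dict:
    # Core product data does not depend on the recommendation service.
    page = {"product": "core product data", "recommendations": []}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Degrade gracefully: the page still renders,
        # just without personalized recommendations.
        pass
    return page

print(render_product_page("u42"))  # core page renders despite the failure
```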
Redundancy
Redundancy means having multiple instances of a component so that if one fails, others continue to serve requests. It is the foundation of high availability.
Active-Passive (Standby) Redundancy
One instance is active (handling traffic) while one or more passive instances stand by in reserve. If the active instance fails, a passive instance is promoted to active and takes over.
- Hot standby: The passive instance is fully running, synchronized, and ready to take over in seconds. Failover is fast but you pay for the idle standby capacity.
- Warm standby: The passive instance is running but not fully synchronized. Failover takes longer (minutes) while it catches up. Cheaper than hot standby.
- Cold standby: The passive instance is not running — it must be started and provisioned when needed. Failover takes the longest (many minutes or hours). Used for cost-sensitive disaster recovery scenarios.
Active-Active Redundancy
All instances are active simultaneously, sharing the incoming load. A load balancer distributes requests across all of them. If one instance fails, the load balancer removes it from rotation and the remaining instances absorb its share of traffic.
- Advantage: No idle capacity. All resources are utilized. Failover is seamless — the load balancer simply stops sending traffic to the failed node without any promotion step.
- Disadvantage: More complex to implement, especially for stateful components like databases where all active nodes must remain consistent.
| | Active-Passive | Active-Active |
|---|---|---|
| Traffic handling | Single active instance | All instances |
| Resource utilization | 50% (standby is idle) | 100% |
| Failover time | Seconds to minutes | Near-instant |
| Complexity | Lower | Higher |
| Best for | Databases, stateful services | Stateless services, web tiers |
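The active-active behavior described above — a failed node is simply dropped from rotation, with no promotion step — can be sketched as a toy round-robin pool (instance names are hypothetical):

```python
class ActiveActivePool:
    """Toy active-active pool: round-robin across healthy instances."""

    def __init__(self, instances: list[str]):
        self.healthy = list(instances)
        self._next = 0

    def mark_failed(self, instance: str) -> None:
        # Health checks remove a failed node from rotation; the
        # remaining instances absorb its share of traffic.
        if instance in self.healthy:
            self.healthy.remove(instance)

    def route(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        instance = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return instance

pool = ActiveActivePool(["app-1", "app-2", "app-3"])
pool.mark_failed("app-2")
print([pool.route() for _ in range(4)])  # → ['app-1', 'app-3', 'app-1', 'app-3']
```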
Failover
Failover is the process of automatically switching to a redundant component when the primary one fails. It consists of three steps:
- Detection — health checks determine that a component has failed (no response, wrong status, timeout).
- Decision — the system decides to switch traffic away from the failed component.
- Switchover — traffic is redirected to the healthy component.
The speed of failover is determined by how quickly failure is detected. Health check intervals and timeout thresholds control this. A health check every 10 seconds with a 3-failure threshold means detection takes up to 30 seconds before failover begins. Reducing this improves recovery time but increases false positives (declaring a healthy-but-slow node as failed).
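The worst-case detection delay is just the check interval times the failure threshold (plus up to one check timeout). A sketch of the arithmetic:

```python
def worst_case_detection_seconds(check_interval: float,
                                 failure_threshold: int,
                                 timeout: float = 0.0) -> float:
    """Upper bound on time-to-detect: consecutive failed checks spaced
    check_interval apart, with the last check taking up to timeout."""
    return check_interval * failure_threshold + timeout

# Health check every 10s, 3 consecutive failures required:
print(worst_case_detection_seconds(10, 3))  # → 30.0
```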
Failover Challenges
Split-Brain
In active-passive setups, if the standby incorrectly believes the primary has failed (due to a network partition), both nodes may promote themselves to primary simultaneously. Both start accepting writes, leading to conflicting data. This is the split-brain problem. Quorum-based consensus (requiring a majority of nodes to agree before taking action) prevents it.
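The quorum rule is simple to state in code: a node may promote itself only if a strict majority of the cluster agrees. Since two sides of a partition can never both hold a majority, at most one side can promote. A sketch:

```python
def has_quorum(votes_for_promotion: int, cluster_size: int) -> bool:
    """Promotion requires a strict majority of the full cluster.
    Two partitions cannot both hold a majority, so at most one side
    of a split can promote -- preventing split-brain."""
    return votes_for_promotion > cluster_size // 2

# A 5-node cluster split 3 / 2 by a network partition:
print(has_quorum(3, 5))  # → True  (majority side may promote)
print(has_quorum(2, 5))  # → False (minority side must not)
```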
Data Loss Risk
If a primary database fails before its latest writes have been replicated to the standby, those writes may be lost during failover. The window of potential data loss is called the recovery point objective (RPO). Synchronous replication (the primary waits for the standby to confirm each write) eliminates this risk but adds latency. Asynchronous replication is faster but accepts a small data loss window.
Replication
Replication keeps copies of data on multiple nodes so that if one fails, others have the data. It is the mechanism that makes redundancy meaningful for stateful systems.
Synchronous Replication
The primary node waits for at least one replica to confirm it has written the data before acknowledging success to the client. Guarantees no data loss on primary failure (RPO = 0). Trades write latency for durability.
Asynchronous Replication
The primary acknowledges success immediately after writing locally, and replicates to replicas in the background. Lower write latency, but if the primary fails before replication completes, recently-written data may be lost (RPO > 0).
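The RPO difference between the two modes can be illustrated with a toy model — not a real replication protocol, just the acknowledgment ordering:

```python
class Primary:
    """Toy primary node illustrating sync vs async acknowledgment."""

    def __init__(self, replica_log: list, synchronous: bool):
        self.log: list[str] = []
        self.replica_log = replica_log  # the replica's log (a shared list)
        self.synchronous = synchronous

    def write(self, record: str) -> str:
        self.log.append(record)
        if self.synchronous:
            # Wait for the replica to hold the record before acking:
            # RPO = 0, at the cost of an extra round-trip per write.
            self.replica_log.append(record)
        # Async: ack immediately; replication happens in the background.
        # If the primary dies before replicate() runs, the write is lost.
        return "ack"

    def replicate(self) -> None:
        self.replica_log[:] = self.log

replica: list[str] = []
primary = Primary(replica, synchronous=False)
primary.write("order-1")  # acked to the client...
print(replica)            # → [] ...but not yet on the replica (RPO > 0)
```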
Database replication is explored in depth in the Database Replication guide.
Availability vs Consistency
Availability and consistency are in tension in distributed systems. This trade-off is formalized by the CAP theorem: in the presence of a network partition, a distributed system must choose between consistency (all nodes return the same data) and availability (every request gets a response).
Consider a scenario where two database nodes become partitioned from each other:
- Prioritizing consistency: Both nodes stop accepting writes until the partition heals. Users get errors, but they never read stale data. This is the approach of systems like traditional RDBMS clusters.
- Prioritizing availability: Both nodes continue accepting reads and writes independently. When the partition heals, conflicts are resolved. Users always get a response, but they may briefly read stale data. This is the approach of systems like Cassandra and DynamoDB.
In practice, network partitions are rare. The real trade-off is between latency and consistency: synchronous replication is consistent but slow; asynchronous replication is fast but eventually consistent. Most systems choose “strong consistency for writes, eventual consistency for reads” as a practical middle ground — write to the primary synchronously, allow reads from replicas that may lag slightly behind.
Availability Numbers in Series vs Parallel
When components are arranged in series (each request must pass through all of them), the system availability is the product of individual availabilities:
Availability (series) = A1 × A2 × A3 × …
Two 99.9% components in series gives 99.9% × 99.9% ≈ 99.8% — availability degrades as you add components to the critical path.
When components are arranged in parallel (a request succeeds if any one of them is available), the system availability improves:
Availability (parallel) = 1 − (1 − A)^n
Two 99.9% components in parallel gives 1 − (0.001)² = 99.9999% — drastically better. This is the mathematical justification for redundancy: parallel components multiply availability.
| Configuration | Components | Individual availability | System availability |
|---|---|---|---|
| Series | 2 | 99.9% | 99.8% |
| Series | 3 | 99.9% | 99.7% |
| Parallel | 2 | 99.9% | 99.9999% |
| Parallel | 3 | 99.9% | 99.9999999% |
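Both formulas, and the table's numbers, can be verified with a few lines:

```python
import math

def series_availability(*components: float) -> float:
    """Every component must be up: A = A1 * A2 * ..."""
    return math.prod(components)

def parallel_availability(a: float, n: int) -> float:
    """Any one of n identical components suffices: A = 1 - (1 - a)^n."""
    return 1 - (1 - a) ** n

print(f"{series_availability(0.999, 0.999):.4%}")  # → 99.8001%
print(f"{parallel_availability(0.999, 2):.6%}")    # → 99.999900%
```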
Availability in System Design
When designing a system for high availability, work through each layer and eliminate SPOFs:
Web/Application Tier
Run multiple stateless application servers behind a load balancer. Stateless means any request can be handled by any server — no session state stored locally. The load balancer health-checks all instances and removes failed ones. Use multiple load balancers in active-active or active-passive configuration to avoid making the load balancer itself a SPOF.
Database Tier
Use primary-replica replication with automatic failover. A cluster manager (like Patroni for PostgreSQL or MHA for MySQL) monitors the primary and promotes a replica if it fails. For extreme availability requirements, use multi-primary setups or globally-distributed databases (CockroachDB, Spanner).
Data Center / Region
A single data center is a SPOF. Natural disasters, power grid failures, and major hardware faults can take an entire facility offline. For four-nines or better availability, deploy across multiple availability zones within a region. For five nines or better, deploy across multiple geographic regions with traffic failover at the DNS or load-balancer level.
DNS
DNS failures can make your service unreachable even if all your servers are healthy. Use a highly-available DNS provider (AWS Route 53, Cloudflare) with health-check-based routing. Configure a low TTL on DNS records used for failover so clients pick up changes quickly.
State your availability target early (“we need 99.99% uptime”) and then derive architectural decisions from it. Identify SPOFs at each layer and explain how you eliminate them. Mention the parallel availability formula to justify redundancy. Distinguish between active-active and active-passive for different tiers — stateless web servers are active-active, primary databases are typically active-passive. Bring up the consistency trade-off if asked about multi-region designs. Interviewers look for systematic thinking about failure modes, not just a checklist of “add more servers.”