Foundations

Availability

Intermediate · 14 min read · reliability

Every system fails eventually. Hard drives fail, networks partition, software panics, data centers flood. Availability is the measure of how often your system is usable when users need it. Designing for high availability means accepting that individual components will fail and building systems that survive those failures — staying online even as parts break around them. This is one of the most important properties in distributed system design, and nearly every architectural decision has availability implications.

What Is Availability?

Availability is the fraction of time a system is operational and able to serve requests. More precisely, it is the probability that a system is functioning correctly at any given moment.

Availability is typically expressed as a percentage:

Availability = Uptime / (Uptime + Downtime)

A system that is down for 1 hour in a month has an availability of roughly 99.86%. That sounds impressive, but 1 hour of downtime during peak traffic can be catastrophic for a business-critical service.
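The formula is easy to check directly. A minimal sketch (the function name and the 30-day month are illustrative assumptions):

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage: uptime / (uptime + downtime)."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)

# A 30-day month is 720 hours; 1 hour of downtime leaves 719 hours up:
print(round(availability(719, 1), 2))  # → 99.86
```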

💡
Availability vs Reliability

Availability is about uptime: is the system reachable right now? Reliability is about correctness over time: does the system produce the right results consistently? A system can be highly available (always up) but unreliable (returning wrong data). In practice, both matter — but availability is the more commonly specified SLA metric.

Measuring Availability: The Nines

Availability targets are described in “nines” — the number of nines in the percentage. Each additional nine is an order-of-magnitude reduction in allowed downtime:

| Availability | Nines | Downtime / year | Downtime / month | Downtime / week |
|---|---|---|---|---|
| 90% | One nine | 36.5 days | ~73 hours | ~16.8 hours |
| 99% | Two nines | 3.65 days | ~7.3 hours | ~1.68 hours |
| 99.9% | Three nines | 8.77 hours | ~43.8 min | ~10.1 min |
| 99.99% | Four nines | 52.6 min | ~4.4 min | ~1.01 min |
| 99.999% | Five nines | 5.26 min | ~26.3 sec | ~6.05 sec |
| 99.9999% | Six nines | 31.5 sec | ~2.6 sec | ~0.6 sec |

The jump from three nines (8.77 hours/year) to four nines (52.6 minutes/year) requires significant architectural investment. The jump from four to five nines is even harder and usually demands active-active multi-region deployments.
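The downtime budgets in the table above follow directly from the target percentage. A small sketch (function name is an illustrative choice):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_budget_seconds(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(target, round(downtime_budget_seconds(target) / 60, 1), "min/year")
```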

What Target Should You Choose?

Most web services target 99.9% (three nines). Financial, medical, and telecommunications systems often target 99.99% or beyond. Six nines is rare and typically applies only to critical infrastructure like telephone switching systems. The cost of each additional nine grows non-linearly — achieving 99.999% may cost 10–100× more than 99.9%.

High Availability

High availability (HA) is the design goal of keeping a system operational despite component failures. HA systems are built on two principles:

  1. Eliminate single points of failure (SPOFs) — any component whose failure would bring down the entire system.
  2. Design for graceful degradation — when parts fail, the system continues serving requests at reduced capacity rather than failing completely.

Single Points of Failure

A single point of failure is any component that, if it fails, causes the entire system to fail. Common SPOFs include a lone load balancer in front of the web tier, a single database primary, a single DNS provider, and a single data center.

The process of designing for HA is largely the process of systematically identifying and eliminating SPOFs by adding redundancy at each layer.

Graceful Degradation

A well-designed HA system degrades gracefully rather than failing catastrophically. If the recommendation service is down, the main product pages still load — just without personalized recommendations. If the image CDN has issues, pages render with alt text rather than broken images. Features fail independently; the core path remains available.
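The recommendation-service example above can be sketched as a fallback around the non-critical call. This is a minimal illustration; the function names and page structure are assumptions, not a real API:

```python
def fetch_recommendations(user_id: str) -> list[str]:
    # Hypothetical downstream call; here it simulates an outage.
    raise TimeoutError("recommendation service unavailable")

def render_product_page(user_id: str) -> dict:
    page = {"product": "core product data", "recommendations": []}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        # Degrade gracefully: serve the page without recommendations
        # instead of failing the whole request.
        pass
    return page

print(render_product_page("u1"))
```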

⚠️
Cascading Failures

The opposite of graceful degradation is a cascading failure: one component fails, overloading its neighbors, which then fail, propagating failure through the entire system. This is often worse than a single SPOF because the failure is non-obvious and hard to contain. Circuit breakers (covered in a later guide) are the primary tool for preventing cascading failures.

Redundancy

Redundancy means having multiple instances of a component so that if one fails, others continue to serve requests. It is the foundation of high availability.

Active-Passive (Standby) Redundancy

One instance is active (handling traffic) while one or more passive instances stand by in reserve. If the active instance fails, a passive instance is promoted to active and takes over.

Active-Active Redundancy

All instances are active simultaneously, sharing the incoming load. A load balancer distributes requests across all of them. If one instance fails, the load balancer removes it from rotation and the remaining instances absorb its share of traffic.

|  | Active-Passive | Active-Active |
|---|---|---|
| Traffic handling | Single active instance | All instances |
| Resource utilization | 50% (standby is idle) | 100% |
| Failover time | Seconds to minutes | Near-instant |
| Complexity | Lower | Higher |
| Best for | Databases, stateful services | Stateless services, web tiers |
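The active-active pattern is simple to sketch: the load balancer rotates through healthy instances and drops failed ones from rotation. A minimal illustration (class and instance names are assumptions):

```python
class ActiveActivePool:
    """Round-robin over healthy instances; failed ones leave rotation."""

    def __init__(self, instances):
        self.healthy = list(instances)

    def mark_failed(self, instance):
        # Health checks would call this when an instance stops responding.
        if instance in self.healthy:
            self.healthy.remove(instance)

    def next_instance(self):
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        inst = self.healthy.pop(0)
        self.healthy.append(inst)  # rotate to the back
        return inst

pool = ActiveActivePool(["app-1", "app-2", "app-3"])
pool.mark_failed("app-2")
print([pool.next_instance() for _ in range(4)])  # app-2 never appears
```

The remaining instances absorb the failed node's share of traffic automatically, with no promotion step.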

Failover

Failover is the process of automatically switching to a redundant component when the primary one fails. It consists of three steps:

  1. Detection — health checks determine that a component has failed (no response, wrong status, timeout).
  2. Decision — the system decides to switch traffic away from the failed component.
  3. Switchover — traffic is redirected to the healthy component.

The speed of failover is determined by how quickly failure is detected. Health check intervals and timeout thresholds control this. A health check every 10 seconds with a 3-failure threshold means detection takes up to 30 seconds before failover begins. Reducing this improves recovery time but increases false positives (declaring a healthy-but-slow node as failed).
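The consecutive-failure threshold described above can be sketched in a few lines (names are illustrative):

```python
class HealthChecker:
    """Declare failure only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result; return True once the node
        should be declared failed."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold

hc = HealthChecker(threshold=3)
results = [hc.record(ok) for ok in (False, True, False, False, False)]
print(results)  # → [False, False, False, False, True]
```

With a 10-second check interval, the three consecutive misses at the end correspond to the up-to-30-second detection window in the text.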

Failover Challenges

Split-Brain

In active-passive setups, if the standby incorrectly believes the primary has failed (due to a network partition), both nodes may promote themselves to primary simultaneously. Both start accepting writes, leading to conflicting data. This is the split-brain problem. Quorum-based consensus (requiring a majority of nodes to agree before taking action) prevents it.
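The quorum rule is just a strict-majority check. A minimal sketch of why it prevents split-brain (function name is an assumption):

```python
def may_promote(votes_for_promotion: int, cluster_size: int) -> bool:
    """Promote to primary only with a strict majority of the cluster.

    During a partition, at most one side can hold a majority, so two
    nodes can never both promote themselves.
    """
    return votes_for_promotion > cluster_size // 2

# 5-node cluster partitioned 3 / 2: only the 3-node side may promote.
print(may_promote(3, 5), may_promote(2, 5))  # → True False
```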

Data Loss Risk

If a primary database fails before its latest writes have been replicated to the standby, those writes may be lost during failover. The window of potential data loss is called the recovery point objective (RPO). Synchronous replication (the primary waits for the standby to confirm each write) eliminates this risk but adds latency. Asynchronous replication is faster but accepts a small data loss window.

Replication

Replication keeps copies of data on multiple nodes so that if one fails, others have the data. It is the mechanism that makes redundancy meaningful for stateful systems.

Synchronous Replication

The primary node waits for at least one replica to confirm it has written the data before acknowledging success to the client. Guarantees no data loss on primary failure (RPO = 0). Trades write latency for durability.

Asynchronous Replication

The primary acknowledges success immediately after writing locally, and replicates to replicas in the background. Lower write latency, but if the primary fails before replication completes, recently-written data may be lost (RPO > 0).
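The difference between the two modes is when the primary acknowledges relative to replication. A toy model, not a real database API (all class names are assumptions):

```python
class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

class Primary:
    def __init__(self, replica, synchronous: bool):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet replicated (async mode)

    def write(self, record) -> str:
        self.log.append(record)
        if self.synchronous:
            self.replica.apply(record)   # wait for replica before ack
        else:
            self.pending.append(record)  # replicate in the background
        return "ack"

replica = Replica()
primary = Primary(replica, synchronous=False)
primary.write("x=1")
# If the primary dies now, "x=1" exists only in primary.pending: RPO > 0.
print(len(replica.log), len(primary.pending))  # → 0 1
```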

Database replication is explored in depth in the Database Replication guide.

Availability vs Consistency

Availability and consistency are in tension in distributed systems. This trade-off is formalized by the CAP theorem: in the presence of a network partition, a distributed system must choose between consistency (all nodes return the same data) and availability (every request gets a response).

Consider a scenario where two database nodes become partitioned from each other: if both sides keep accepting writes, the system stays available but the copies diverge; if the nodes refuse requests until they can coordinate again, the data stays consistent but the system is unavailable for the duration of the partition.

💡
Most Systems Need Both

In practice, network partitions are rare. The real trade-off is between latency and consistency: synchronous replication is consistent but slow; asynchronous replication is fast but eventually consistent. Most systems choose “strong consistency for writes, eventual consistency for reads” as a practical middle ground — write to the primary synchronously, allow reads from replicas that may lag slightly behind.
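The "write to the primary, read from replicas" pattern amounts to a routing rule. A minimal sketch (class and node names are illustrative):

```python
import random

class ReadWriteRouter:
    """Writes go to the primary; reads spread across replicas that may
    lag slightly behind (eventual consistency for reads)."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self.replicas = replicas

    def route(self, is_write: bool) -> str:
        if is_write or not self.replicas:
            return self.primary
        return random.choice(self.replicas)

router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route(is_write=True))   # → db-primary
print(router.route(is_write=False))  # one of the replicas
```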

Availability Numbers in Series vs Parallel

When components are arranged in series (each request must pass through all of them), the system availability is the product of individual availabilities:

Availability (series) = A1 × A2 × A3 × …

Two 99.9% components in series gives 99.9% × 99.9% = 99.8% — availability degrades as you add components to the critical path.

When components are arranged in parallel (a request succeeds if any one of them is available), the system availability improves:

Availability (parallel) = 1 − (1 − A)^n

Two 99.9% components in parallel gives 1 − (0.001)² = 99.9999% — drastically better. This is the mathematical justification for redundancy: parallel components multiply availability.

| Configuration | Components | Individual availability | System availability |
|---|---|---|---|
| Series | 2 | 99.9% | 99.8% |
| Series | 3 | 99.9% | 99.7% |
| Parallel | 2 | 99.9% | 99.9999% |
| Parallel | 3 | 99.9% | 99.9999999% |
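Both formulas are worth verifying numerically. A small sketch reproducing the series and parallel cases (function names are illustrative):

```python
def series_availability(*components: float) -> float:
    """Every component must be up: multiply availabilities."""
    a = 1.0
    for c in components:
        a *= c
    return a

def parallel_availability(a: float, n: int) -> float:
    """Any one of n identical components suffices: 1 - (1 - a)^n."""
    return 1 - (1 - a) ** n

print(round(series_availability(0.999, 0.999), 5))  # → 0.998
print(round(parallel_availability(0.999, 2), 7))    # → 0.999999
```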

Availability in System Design

When designing a system for high availability, work through each layer and eliminate SPOFs:

Web/Application Tier

Run multiple stateless application servers behind a load balancer. Stateless means any request can be handled by any server — no session state stored locally. The load balancer health-checks all instances and removes failed ones. Use multiple load balancers in active-active or active-passive configuration to avoid making the load balancer itself a SPOF.

Database Tier

Use primary-replica replication with automatic failover. A cluster manager (like Patroni for PostgreSQL or MHA for MySQL) monitors the primary and promotes a replica if it fails. For extreme availability requirements, use multi-primary setups or globally-distributed databases (CockroachDB, Spanner).

Data Center / Region

A single data center is a SPOF. Natural disasters, power grid failures, and major hardware faults can take an entire facility offline. For four-nines or better availability, deploy across multiple availability zones within a region. For five nines or better, deploy across multiple geographic regions with traffic failover at the DNS or load-balancer level.

DNS

DNS failures can make your service unreachable even if all your servers are healthy. Use a highly-available DNS provider (AWS Route 53, Cloudflare) with health-check-based routing. Configure a low TTL on DNS records used for failover so clients pick up changes quickly.

💡
In System Design Interviews

State your availability target early (“we need 99.99% uptime”) and then derive architectural decisions from it. Identify SPOFs at each layer and explain how you eliminate them. Mention the parallel availability formula to justify redundancy. Distinguish between active-active and active-passive for different tiers — stateless web servers are active-active, primary databases are typically active-passive. Bring up the consistency trade-off if asked about multi-region designs. Interviewers look for systematic thinking about failure modes, not just a checklist of “add more servers.”