Production Concerns

Disaster Recovery


Disaster recovery (DR) is the set of policies, procedures, and technologies that enable an organization to restore systems and data after a catastrophic event — hardware failure, data center outage, natural disaster, ransomware, or human error. The difference between a recoverable incident and a business-ending event often comes down to whether DR was planned before the disaster struck.

RTO and RPO

Two metrics define the boundaries of any DR plan:

| Metric | Full name | What it measures | Example |
| --- | --- | --- | --- |
| RTO | Recovery Time Objective | How long the system can be down before the outage causes unacceptable business damage | 4 hours: if systems are down for more than 4 hours, the business suffers significant revenue loss |
| RPO | Recovery Point Objective | How much data loss is acceptable, measured as the maximum time window of data you can afford to lose | 1 hour: losing up to the last hour of transactions is tolerable; losing more is not |

RTO drives how fast your recovery process must work. RPO drives how frequently you must back up or replicate data. A payment system might have an RTO of 15 minutes and an RPO of near-zero (synchronous replication). A marketing analytics dashboard might tolerate an RTO of 24 hours and an RPO of 24 hours (nightly backup).

Lower RTO and RPO cost more: they require more infrastructure, more replication bandwidth, and more operational complexity. Match your targets to the actual business cost of downtime and data loss — not to a vague desire for “maximum reliability.”
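To make the relationship between targets and schedules concrete, here is a minimal sketch that checks a backup plan against RTO and RPO targets. The names `DrTargets`, `worst_case_data_loss`, and `check_plan` are hypothetical, and the model assumes that a backup captures the state at the moment it starts and is unusable until it finishes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTargets:
    """Business-defined recovery targets for one system (hypothetical type)."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Assumed worst case: the failure hits just before the running backup
    completes, so everything since the previous backup started is lost."""
    return backup_interval + backup_duration


def check_plan(targets: DrTargets,
               backup_interval: timedelta,
               backup_duration: timedelta,
               measured_restore_time: timedelta) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems = []
    loss = worst_case_data_loss(backup_interval, backup_duration)
    if loss > targets.rpo:
        problems.append(
            f"RPO at risk: worst-case data loss {loss} exceeds {targets.rpo}")
    if measured_restore_time > targets.rto:
        problems.append(
            f"RTO at risk: restore took {measured_restore_time}, "
            f"target is {targets.rto}")
    return problems
```

Under these assumptions, a nightly backup that takes an hour to complete slightly overshoots a strict 24-hour RPO, which is the kind of gap that is easy to miss when targets are only written down and never checked.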

Figure: DR tiers (backup/restore, pilot light, warm standby, and active-active), each trading higher cost for lower RTO/RPO.

Backup Strategies

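Backups are the foundation of any DR plan. Three types are in common use:

Full: a complete copy of all data. Simplest to restore, but the slowest to take and the largest to store.

Incremental: only the data changed since the last backup of any type. Fastest and smallest, but a restore needs the last full backup plus every incremental taken since.

Differential: all data changed since the last full backup. Larger than an incremental, but a restore needs only the last full backup plus the latest differential.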

Backup storage: The 3-2-1 rule is the standard: keep 3 copies of the data, on 2 different media types, with 1 copy offsite. For example, a production database (copy 1) backed up to a separate local disk (copy 2) and replicated to cloud object storage such as S3 or GCS (copy 3, a second medium, offsite) satisfies 3-2-1.
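The 3-2-1 rule is mechanical enough to check automatically. The sketch below is one way to express it; `DataCopy` and `satisfies_3_2_1` are hypothetical names, and the media labels are free-form strings chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of the data set, including the live production copy."""
    name: str
    media: str      # illustrative labels, e.g. "local-disk", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite
```

For instance, a production database, a local disk backup, and a cloud object storage replica pass the check, while three copies on disks in the same building do not.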

⚠️
Test Your Restores

An untested backup is not a backup; it is a hope. Backup files get corrupted silently, restore procedures break when the environment changes, and backup jobs can fail quietly for weeks. Schedule regular restore drills, monthly for critical systems, so that you discover backup problems during a drill and not during a disaster.
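A restore drill can be as simple as restoring into a scratch location and comparing checksums against a manifest recorded at backup time. The following is a minimal sketch using only the Python standard library; it uses a zip archive of a directory as a stand-in for a real backup tool, and the function names are hypothetical.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> dict[str, str]:
    """Zip every file under source and return the manifest taken at backup time."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                zf.write(p, arcname=p.relative_to(source).as_posix())
    return checksums(source)


def restore_drill(archive: Path, manifest: dict[str, str]) -> list[str]:
    """Restore into a scratch directory; return files that are missing or differ."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        restored = checksums(Path(scratch))
    return [name for name, digest in manifest.items()
            if restored.get(name) != digest]
```

An empty list from `restore_drill` means every file in the manifest was restored intact. A real drill would also start the application against the restored data, since matching checksums only show that the bytes survived.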

DR Tiers

DR strategies exist on a spectrum from cheapest-and-slowest to most-expensive-and-fastest:

| Tier | Approach | RTO | RPO | Cost |
| --- | --- | --- | --- | --- |
| Backup & Restore | Restore from backup when disaster strikes. No standby environment. | Hours to days | Hours (backup frequency) | Lowest |
| Pilot Light | A minimal skeleton environment is always running: just the core components (database, authentication). Scale up application tier on failover. | 30 min – 2 hours | Minutes | Low |
| Warm Standby | A scaled-down but fully functional replica runs in the DR region. Receives data replication continuously. Scale up on failover. | Minutes to 30 min | Seconds to minutes | Medium |
| Active-Active (Multi-site) | Multiple regions run at full capacity simultaneously. Traffic is distributed across regions. Failover is seamless: just stop sending traffic to the failed region. | Near-zero | Near-zero | Highest (2× infrastructure) |
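One way to read the tiers is as a lookup: pick the cheapest tier whose typical RTO and RPO fit the targets. The sketch below does exactly that. The numeric values are illustrative placeholders chosen to fall inside the ranges described for each tier, not measurements, and `Tier`, `TIERS`, and `cheapest_tier` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    typical_rto: timedelta
    typical_rpo: timedelta


# Ordered cheapest first. All figures are assumed placeholders, not benchmarks.
TIERS = [
    Tier("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Tier("Pilot Light", timedelta(hours=2), timedelta(minutes=15)),
    Tier("Warm Standby", timedelta(minutes=30), timedelta(minutes=5)),
    Tier("Active-Active", timedelta(minutes=1), timedelta(seconds=1)),
]


def cheapest_tier(rto: timedelta, rpo: timedelta) -> Optional[Tier]:
    """Return the first (cheapest) tier that meets both targets, or None."""
    for tier in TIERS:
        if tier.typical_rto <= rto and tier.typical_rpo <= rpo:
            return tier
    return None
```

With these placeholder figures, a system with a 4-hour RTO and 1-hour RPO lands on Pilot Light, while a 15-minute RTO with a few seconds of RPO requires Active-Active. Real figures should come from your own restore drills and failover tests.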

Failover Patterns

Active-Passive: The primary region handles all traffic. The DR region is idle (or running at reduced capacity). On failure, a runbook or automated system promotes the DR region to primary and redirects DNS or load balancer rules.

Failover steps in an active-passive setup:

  1. Detect failure (health checks, alarms, PagerDuty alert).
  2. Promote the DR database replica to primary.
  3. Update DNS records to point to the DR region (TTL matters — low TTL = faster propagation).
  4. Scale up application tier in DR region if running pilot light.
  5. Validate health in DR region before declaring recovery complete.
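The steps above can be sketched as a small orchestration function. This is an outline under stated assumptions, not a production runbook: every callable passed in (`primary_healthy`, `promote_replica`, `update_dns`, `scale_up_app`, `dr_healthy`) is a hypothetical hook that you would implement against your own monitoring, database, and DNS tooling.

```python
from typing import Callable


def fail_over(primary_healthy: Callable[[], bool],
              promote_replica: Callable[[], None],
              update_dns: Callable[[], None],
              scale_up_app: Callable[[], None],
              dr_healthy: Callable[[], bool],
              pilot_light: bool = False) -> list[str]:
    """Run the active-passive failover steps in order.

    Returns the list of completed steps, or an empty list if the primary
    is healthy and no failover is needed.
    """
    # Step 1: detect failure.
    if primary_healthy():
        return []
    done = ["failure detected"]

    # Step 2: promote the DR database replica to primary.
    promote_replica()
    done.append("replica promoted")

    # Step 3: point DNS at the DR region.
    update_dns()
    done.append("DNS updated")

    # Step 4: only a pilot-light setup needs the application tier scaled up.
    if pilot_light:
        scale_up_app()
        done.append("application tier scaled up")

    # Step 5: do not declare recovery until the DR region passes validation.
    if not dr_healthy():
        raise RuntimeError("DR region failed validation; recovery is not complete")
    done.append("DR region validated")
    return done
```

Keeping each step behind a plain callable makes the sequence easy to rehearse with stubs during a drill before wiring it to real infrastructure.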

Active-Active: Both regions handle traffic simultaneously. A global load balancer or DNS-based traffic manager (AWS Route 53, Cloudflare, Akamai) distributes requests geographically. If one region fails, it routes all traffic to the healthy region. There is no failover procedure to run, only capacity planning to ensure one region can handle peak load alone.

Active-active is harder to build correctly: writes must be handled consistently across regions (synchronous replication is expensive; asynchronous risks conflicts). Multi-region write consistency requires careful design — conflict-free replicated data types (CRDTs), last-write-wins, or region-local writes with periodic merge.
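To illustrate the CRDT option, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and replicas merge by taking the per-region maximum, so merges can be applied in any order and repeated without double counting. The class is illustrative and not tied to any particular database.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by per-slot max."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        """Record a local write; a region only ever touches its own slot."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat or reorder."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

Two regions that each accept writes locally and exchange state later converge on the same total. The trade-off is that this only works for data whose operations can be expressed this way; arbitrary row updates still need a strategy such as last-write-wins or region-local ownership.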

Data Replication

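The RPO your system can actually achieve is bounded by how frequently, and with how much lag, data is replicated to the DR region. With asynchronous replication, whatever has not yet reached the DR region when the primary fails is lost.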
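Because replication lag translates directly into potential data loss, it is worth alerting on it before it reaches the RPO. The sketch below assumes you can obtain the timestamp of the last transaction applied on the DR replica (PostgreSQL, for example, exposes this through `pg_last_xact_replay_timestamp()`); the function names and the 80% warning threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta


def replication_lag(last_replayed_at: datetime, now: datetime) -> timedelta:
    """Time between the last change applied on the replica and now.

    Clamped at zero so small clock skew never yields a negative lag.
    """
    return max(now - last_replayed_at, timedelta(0))


def rpo_at_risk(lag: timedelta, rpo: timedelta,
                warn_fraction: float = 0.8) -> bool:
    """True once lag has consumed warn_fraction of the RPO budget."""
    return lag >= rpo * warn_fraction
```

With a 1-hour RPO and the default threshold, the check starts firing at 48 minutes of lag, leaving time to react before the target is breached. Note that on an idle primary a timestamp-based lag grows even though nothing is missing, so pair it with a write-activity signal.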

💡
DNS TTL is Part of Your RTO

If your DNS records have a TTL of 24 hours, failover can take up to 24 hours even if your systems switch instantly, because clients keep resolving the cached old IP until the record expires. Set TTLs to 60–300 seconds for critical records before you need to fail over. Many DNS providers let you lower the TTL preemptively when an incident is brewing, but the change only helps once resolvers have dropped records cached under the old, longer TTL.
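The effect of TTL on recovery time is simple arithmetic. The sketch below treats detection, replica promotion, and DNS expiry as strictly sequential, which is a simplifying assumption; the function names and the example durations are hypothetical.

```python
from datetime import timedelta


def worst_case_failover_time(detection: timedelta,
                             promotion: timedelta,
                             dns_ttl: timedelta) -> timedelta:
    """Upper bound on downtime for a client that cached the old record
    just before the DNS change (assumes the phases run back to back)."""
    return detection + promotion + dns_ttl


def fits_rto(rto: timedelta, detection: timedelta,
             promotion: timedelta, dns_ttl: timedelta) -> bool:
    return worst_case_failover_time(detection, promotion, dns_ttl) <= rto
```

With 2 minutes to detect and 5 minutes to promote, a 60-second TTL gives a worst case of 8 minutes, while a 24-hour TTL pushes it past 24 hours regardless of how fast everything else is.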

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into production systems to verify they behave as expected under failure conditions. Netflix’s Chaos Monkey (randomly terminates EC2 instances) is the canonical example.

The principle: if you don’t test failure, you don’t know what happens during failure. A DR plan that has never been tested is a guess about what will happen. Chaos engineering replaces guessing with evidence.

Chaos experiments at different levels:

Chaos experiments start small (single instance in staging), build confidence, and gradually expand scope (production, zone-level). Track a steady-state hypothesis — “error rate stays below 0.1%” — and verify the system recovers within defined bounds.
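A steady-state hypothesis can be expressed as a small harness: confirm the system is healthy, inject the failure, observe, and always roll back. The sketch below is a simplified outline; `measure_error_rate`, `inject_failure`, and `rollback` are hypothetical hooks, and the default threshold of 0.001 corresponds to the 0.1% error rate used as the example hypothesis.

```python
from typing import Callable


def run_experiment(measure_error_rate: Callable[[], float],
                   inject_failure: Callable[[], None],
                   rollback: Callable[[], None],
                   threshold: float = 0.001,
                   samples: int = 5) -> bool:
    """Return True if the error rate stays below threshold during the fault."""
    # Refuse to start if the system is already outside its steady state.
    if measure_error_rate() >= threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")

    inject_failure()
    try:
        observed = [measure_error_rate() for _ in range(samples)]
    finally:
        # Roll back even if measurement raises, to limit the blast radius.
        rollback()

    return max(observed) < threshold
```

A real harness would space the samples out over time and also verify recovery after rollback; this version only shows the control flow of checking the hypothesis around an injected fault.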

Design Considerations