Disaster Recovery

Disaster recovery (DR) is the set of policies, procedures, and technologies that enable an organization to restore systems and data after a catastrophic event — hardware failure, data center outage, natural disaster, ransomware, or human error. The difference between a recoverable incident and a business-ending event often comes down to whether DR was planned before the disaster struck.

RTO and RPO

Two metrics define the boundaries of any DR plan:

Metric	Full name	What it measures	Example
RTO	Recovery Time Objective	How long the system can be down before the outage causes unacceptable business damage	4 hours — if systems are down for more than 4 hours, the business suffers significant revenue loss
RPO	Recovery Point Objective	How much data loss is acceptable — measured as the maximum time window of data you can afford to lose	1 hour — losing up to the last hour of transactions is tolerable; losing more is not

RTO drives how fast your recovery process must work. RPO drives how frequently you must back up or replicate data. A payment system might have an RTO of 15 minutes and an RPO of near-zero (synchronous replication). A marketing analytics dashboard might tolerate an RTO of 24 hours and an RPO of 24 hours (nightly backup).

Lower RTO and RPO cost more: they require more infrastructure, more replication bandwidth, and more operational complexity. Match your targets to the actual business cost of downtime and data loss — not to a vague desire for “maximum reliability.”
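To make the link between RPO and backup frequency concrete, here is a minimal sketch. It assumes periodic backups that capture the state at the moment they start and only become usable once they finish; the function names are illustrative, not from any particular tool.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the failure hits just before the next backup finishes,
    so the newest usable backup reflects state from one interval plus one
    backup duration ago (assumption: backups snapshot state at their start)."""
    return backup_interval + backup_duration

def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the backup schedule can satisfy the stated RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

Under these assumptions a nightly backup fits a 24-hour RPO only if the backup itself is effectively instantaneous; an hourly backup that takes 10 minutes does not fit a 1-hour RPO.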

Figure: DR tiers (backup/restore, pilot light, warm standby, and active-active), each trading higher cost for lower RTO and RPO.

Backup Strategies

Backups are the foundation of any DR plan. Three types:

Full backup: A complete snapshot of all data. Simple to restore, but slow to create and storage-intensive. Run weekly or daily for most systems.
Incremental backup: Only the data changed since the last backup (full or incremental). Fast and storage-efficient. Restoration requires replaying the full backup plus every incremental since — more complex and slower to restore.
Differential backup: All data changed since the last full backup. Larger than incremental, but restoring requires only the last full backup plus the latest differential — simpler restore than incremental.
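The difference in restore complexity can be sketched as a function that returns which backups must be replayed. This is a simplified model with hypothetical names (`Backup`, `restore_chain`); it assumes a schedule uses either incrementals or differentials after each full backup, not a mix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken
    kind: str  # "full", "incremental", or "differential"

def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups to replay, in order, to restore the latest state."""
    ordered = sorted(backups, key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup to restore from")
    last_full = full_positions[-1]
    later = ordered[last_full + 1:]
    kinds = {b.kind for b in later}
    if not kinds:
        return [ordered[last_full]]
    if kinds == {"incremental"}:
        # Every incremental since the full backup is needed.
        return [ordered[last_full], *later]
    if kinds == {"differential"}:
        # Only the newest differential is needed.
        return [ordered[last_full], later[-1]]
    raise ValueError("mixed incremental/differential schedules are not modelled here")
```

With a full backup on day 0 and backups on days 1 to 3, the incremental chain has four entries while the differential chain has two.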

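Backup storage: The 3-2-1 rule is the standard: keep 3 copies of the data, on 2 different media types, with 1 copy offsite. For example, a production database (copy 1) backed up to a separate local disk (copy 2) and replicated to cloud object storage such as S3 or GCS (copy 3, offsite and on a different medium) satisfies 3-2-1.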
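A 3-2-1 check is easy to express in code. The sketch below uses a hypothetical `DataCopy` record; how you classify "media type" and "offsite" for your own storage is an assumption you have to make explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # e.g. "local-disk", "object-storage", "tape"
    offsite: bool

def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

copies = [
    DataCopy("production database", "local-disk", offsite=False),
    DataCopy("local backup", "local-disk", offsite=False),
    DataCopy("cloud bucket", "object-storage", offsite=True),
]
```

Here `satisfies_3_2_1(copies)` is true; dropping the cloud bucket leaves two onsite copies on one medium and the check fails.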

⚠️

Test Your Restores

An untested backup is not a backup; it’s a hope. Backup files get corrupted silently, restore procedures fail due to environment changes, and backup jobs fail quietly for weeks. Schedule regular restore drills, monthly for critical systems, so that you discover backup problems during drills, not during disasters.
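One small piece of a restore drill can be automated cheaply: comparing the restored file against a checksum recorded when the backup was taken. The sketch below assumes such a checksum exists (the `expected_sha256` argument is hypothetical). It only proves the bytes are intact; a full drill would also load the backup into a database and run queries against it.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large backups do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(restored: Path, expected_sha256: str) -> bool:
    """True if the restored file matches the checksum recorded at backup time."""
    return sha256_of(restored) == expected_sha256
```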

DR Tiers

DR strategies exist on a spectrum from cheapest-and-slowest to most-expensive-and-fastest:

Tier	Approach	RTO	RPO	Cost
Backup & Restore	Restore from backup when disaster strikes. No standby environment.	Hours to days	Hours (backup frequency)	Lowest
Pilot Light	A minimal skeleton environment is always running — just the core components (database, authentication). Scale up application tier on failover.	30 min – 2 hours	Minutes	Low
Warm Standby	A scaled-down but fully functional replica runs in the DR region. Receives data replication continuously. Scale up on failover.	Minutes to 30 min	Seconds to minutes	Medium
Active-Active (Multi-site)	Multiple regions run at full capacity simultaneously. Traffic is distributed across regions. Failover is seamless — just stop sending traffic to the failed region.	Near-zero	Near-zero	Highest (2× infrastructure)
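Choosing a tier from RTO and RPO targets can be sketched as a lookup for the cheapest tier that meets both. The numeric figures below are assumptions chosen to fall within the ranges in the table (for example, 24 hours for "hours to days"); replace them with figures measured in your own environment.

```python
from datetime import timedelta
from typing import NamedTuple, Optional

class Tier(NamedTuple):
    name: str
    rto: timedelta  # assumed planning figure, not a guarantee
    rpo: timedelta  # assumed planning figure, not a guarantee

# Ordered cheapest first, mirroring the table above.
TIERS = [
    Tier("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Tier("Pilot Light", timedelta(hours=2), timedelta(minutes=15)),
    Tier("Warm Standby", timedelta(minutes=30), timedelta(minutes=5)),
    Tier("Active-Active", timedelta(minutes=1), timedelta(0)),
]

def cheapest_tier(target_rto: timedelta, target_rpo: timedelta) -> Optional[str]:
    """Return the first (cheapest) tier whose assumed RTO and RPO meet the targets."""
    for tier in TIERS:
        if tier.rto <= target_rto and tier.rpo <= target_rpo:
            return tier.name
    return None  # no tier in this model meets the targets
```

With these assumed figures, a 4-hour RTO and 1-hour RPO lands on Pilot Light, while a 15-minute RTO with zero RPO requires Active-Active.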

Failover Patterns

Active-Passive: The primary region handles all traffic. The DR region is idle (or running at reduced capacity). On failure, a runbook or automated system promotes the DR region to primary and redirects DNS or load balancer rules.

Failover steps in an active-passive setup:

Detect failure (health checks, alarms, PagerDuty alert).
Promote the DR database replica to primary.
Update DNS records to point to the DR region (TTL matters — low TTL = faster propagation).
Scale up application tier in DR region if running pilot light.
Validate health in DR region before declaring recovery complete.
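The steps above can be sketched as an ordered runbook that stops at the first failing step, so a failed database promotion never leads to a DNS change. The step names and the callables passed in are hypothetical placeholders for your own tooling.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]

def run_failover(steps: list[Step]) -> list[str]:
    """Run steps in order; halt on the first failure.

    Returns the names of the completed steps, or raises RuntimeError
    naming the step that failed so an operator can take over.
    """
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(
                f"failover halted at step: {name}; completed: {completed}"
            )
        completed.append(name)
    return completed
```

A caller might pass `[("promote_replica", promote), ("update_dns", update_dns), ("scale_up", scale_up), ("validate_health", validate)]`, where each callable returns `True` on success.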

Active-Active: Both regions handle traffic simultaneously. A global load balancer (AWS Route 53, Cloudflare, Akamai) distributes requests geographically. If one region fails, the load balancer routes all traffic to the healthy region. No failover procedure — just capacity planning to ensure one region can handle peak load alone.

Active-active is harder to build correctly: writes must be handled consistently across regions (synchronous replication is expensive; asynchronous risks conflicts). Multi-region write consistency requires careful design — conflict-free replicated data types (CRDTs), last-write-wins, or region-local writes with periodic merge.
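As an illustration of the simplest of these strategies, here is a minimal last-write-wins merge. It assumes every write carries a timestamp and the region that accepted it; the `Versioned` type is hypothetical. Ties are broken by region name so that both regions reach the same result regardless of merge order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumes reasonably synchronized clocks across regions
    region: str

def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the later timestamp; break ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))
```

The trade-off is visible in the code: the losing write is silently discarded, and clock skew between regions can make the "last" write the wrong one. That is why CRDTs or region-local writes are often preferred for data where lost updates are unacceptable.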

Data Replication

The RPO your system can achieve is bounded by how far the DR copy lags behind the primary, which depends on how data is replicated to the DR region.

Synchronous replication: Every write is confirmed on both primary and DR before acknowledging to the client. Zero data loss (RPO = 0). Latency cost: every write waits for the cross-region round trip (50–150ms for intercontinental). Limits write throughput. Used for financial systems, payment data.
Asynchronous replication: Writes are acknowledged immediately; replication happens in the background. Latency impact is minimal. Data loss risk = replication lag at the time of failure (seconds to minutes typically). Used for most web applications.
Log shipping: Database transaction logs (WAL in PostgreSQL, binlog in MySQL) are shipped to the DR region and replayed. Similar to asynchronous replication but at the log level. Widely supported and simpler than full bidirectional replication.
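The data-loss window of asynchronous replication can be illustrated with a small model. It assumes a constant replication lag, which real systems do not have; the function name and arguments are illustrative only.

```python
def lost_writes(write_times: list[float], failure_time: float,
                replication_lag: float) -> list[float]:
    """Writes acknowledged by the primary before the failure that had not
    yet reached the DR region (assumes a constant replication lag)."""
    replicated_up_to = failure_time - replication_lag
    return [t for t in write_times if replicated_up_to < t <= failure_time]

writes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

With a failure at `t = 10.0` and a 3-second lag, `lost_writes(writes, 10.0, 3.0)` returns the writes at 8, 9, and 10 seconds. With a lag of zero, which is what synchronous replication provides, nothing is lost.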

💡

DNS TTL is Part of Your RTO

If your DNS records have a TTL of 24 hours, failover takes up to 24 hours even if your systems switch instantly, because clients are still resolving the old IP. Set TTLs to 60–300 seconds for critical records before you need to fail over. Many DNS providers let you lower the TTL preemptively, but the change only helps once caches holding the record under the old TTL have expired, so lower it well before an incident rather than during one.
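The arithmetic is simple enough to keep in a runbook. This sketch assumes the failover phases run one after another and that clients may cache the old record for the full TTL; the parameter names are illustrative.

```python
def worst_case_failover_seconds(detection_s: float, promotion_s: float,
                                dns_ttl_s: float, validation_s: float = 0.0) -> float:
    """Upper bound on client-visible downtime when phases run sequentially."""
    return detection_s + promotion_s + dns_ttl_s + validation_s

RTO_SECONDS = 4 * 3600  # the 4-hour example target from the RTO table

fast = worst_case_failover_seconds(60, 120, dns_ttl_s=300, validation_s=60)
slow = worst_case_failover_seconds(60, 120, dns_ttl_s=86_400, validation_s=60)
```

With a 300-second TTL the bound is 540 seconds; with a 24-hour TTL the same failover exceeds a 4-hour RTO by a wide margin.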

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into production systems to verify they behave as expected under failure conditions. Netflix’s Chaos Monkey (randomly terminates EC2 instances) is the canonical example.

The principle: if you don’t test failure, you don’t know what happens during failure. A DR plan that has never been tested is a guess about what will happen. Chaos engineering replaces guessing with evidence.

Chaos experiments at different levels:

Instance failure: Kill a random application server. Does the load balancer detect and reroute? Does the autoscaler replace it?
Dependency failure: Block network traffic to a downstream service. Do circuit breakers open? Does the application degrade gracefully?
Zone failure: Simulate an availability zone outage. Does traffic shift to other zones? Do health checks detect all the affected instances?
Region failure: Simulate a full regional outage. Does the DR failover process work end to end within the RTO target?

Chaos experiments start small (single instance in staging), build confidence, and gradually expand scope (production, zone-level). Track a steady-state hypothesis — “error rate stays below 0.1%” — and verify the system recovers within defined bounds.
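A steady-state hypothesis can be encoded as a small experiment harness. The three callables (`measure_error_rate`, `inject_failure`, `rollback`) are hypothetical hooks into your own monitoring and fault-injection tooling; the threshold default mirrors the 0.1% example above.

```python
from typing import Callable

def run_experiment(measure_error_rate: Callable[[], float],
                   inject_failure: Callable[[], None],
                   rollback: Callable[[], None],
                   max_error_rate: float = 0.001) -> str:
    """Check the steady state, inject a failure, and check again.

    Returns "aborted" if the system was already unhealthy, otherwise
    "passed" or "failed". The rollback always runs once a failure
    has been injected.
    """
    if measure_error_rate() >= max_error_rate:
        return "aborted"  # never experiment on a system that is already degraded
    inject_failure()
    try:
        observed = measure_error_rate()
    finally:
        rollback()
    return "passed" if observed < max_error_rate else "failed"
```

Checking the steady state before injecting anything is the important design choice: an experiment on an already degraded system tells you nothing and can make an incident worse.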

Design Considerations

Define RTO and RPO first. DR architecture follows from the numbers. Without them, every design decision is arbitrary. Get explicit agreement from stakeholders on what downtime is acceptable and at what cost before designing the recovery system.
Document the runbook before the incident. A DR runbook written during an outage is useless — engineers are stressed, documentation is incomplete, and steps are ambiguous. Write the runbook when systems are healthy; test it during chaos drills.
Automate failover where possible. Manual failover introduces human error, delays, and on-call fatigue. Automate DNS failover, database promotion, and health check-based traffic shifting. Reserve manual steps for edge cases where automation would be dangerous (ambiguous split-brain scenarios).
Separate DR from backups. Backups protect against data corruption and accidental deletion. DR protects against infrastructure unavailability. You need both. A database in a DR region that mirrors a corrupted primary is not a backup — it’s a mirror of corruption.
Account for application state. Stateless services fail over easily; stateful services require session migration or sticky session invalidation. In-flight transactions, user sessions, and local caches must all be considered in the failover plan. Prefer stateless architectures — they dramatically simplify DR.
Monitor the DR environment. The DR region must be healthy before you need it. Monitor its infrastructure, test its health checks, and verify replication lag continuously. A DR region you haven’t checked in six months may have accumulated drift from the primary.
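Drift between the primary and DR environments can be detected by diffing their configuration. The sketch below assumes each environment's settings can be exported as a flat dictionary; the keys shown in the usage note are made up for illustration.

```python
_MISSING = "<missing>"

def config_drift(primary: dict, dr: dict,
                 expected_differences: frozenset = frozenset()) -> dict:
    """Return {key: (primary_value, dr_value)} for every setting that differs,
    skipping keys that are intentionally different (e.g. instance counts
    in a scaled-down warm standby)."""
    keys = (primary.keys() | dr.keys()) - expected_differences
    return {
        key: (primary.get(key, _MISSING), dr.get(key, _MISSING))
        for key in sorted(keys)
        if primary.get(key, _MISSING) != dr.get(key, _MISSING)
    }
```

For example, comparing `{"db_version": "15.4", "instance_count": 12, "tls": "1.3"}` with `{"db_version": "15.2", "instance_count": 3}` while ignoring `instance_count` reports the database version mismatch and the missing TLS setting. Running a check like this on a schedule, alongside replication-lag alerts, turns drift into an alert instead of a surprise during failover.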