Database Replication
Replication is the practice of keeping copies of the same data on multiple database servers. It is the foundation of high availability, read scaling, and disaster recovery for nearly every production database deployment. Understanding how data flows from primary to replicas, where consistency breaks down, and how failover works prepares you to reason about the reliability of any data layer.
Why Replication?
A single database server is a single point of failure. When it goes down — due to hardware failure, OS crash, network partition, or maintenance — all reads and writes fail. Replication solves three distinct problems:
- High availability: If the primary fails, promote a replica to become the new primary. Downtime measured in seconds or minutes, not hours.
- Read scaling: Distribute read queries across multiple replicas. A single primary can support a handful of read replicas, multiplying read throughput without sharding.
- Geographic distribution: Place replicas in different regions to serve local reads with low latency. A user in Tokyo reads from a Tokyo replica instead of a US primary.
Primary-Replica Replication
The most common replication topology. One server is designated the primary (also called master or leader); one or more servers are replicas (also called standbys, followers, or read replicas). All writes go to the primary. The primary ships changes to replicas, which apply them in the same order.
How WAL-Based Replication Works
In PostgreSQL, MySQL, and most SQL databases, replication is driven by the Write-Ahead Log:
- The primary writes every committed transaction to its WAL before acknowledging the commit.
- Replicas connect to the primary and stream the WAL continuously.
- Each replica applies WAL entries to its own storage, bringing its data in sync with the primary.
- The replica acknowledges receipt (and optionally application) of each WAL position back to the primary.
This is called streaming replication (PostgreSQL) or binary log replication (MySQL). PostgreSQL's physical streaming makes the replica a byte-for-byte copy of the primary — not just the data, but the same physical layout, indexes, and sequences. MySQL's binary log replication is logical: the replica replays row changes (or statements), reaching the same logical state without being physically identical.
What Replicas Can Do
Read-only queries can be routed to replicas. Analytics queries, reporting, and background jobs that generate heavy read load can be offloaded to a dedicated replica without impacting the primary. Many applications use a read/write connection pool for the primary and a separate read-only pool that load-balances across replicas.
Writing to a replica while the primary is also writing causes divergence — the replica’s state no longer matches the primary. If you then promote the replica, you lose the primary’s writes or corrupt the replica’s data. Replicas should be strictly read-only. Most databases enforce this, but double-check your replica configuration.
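One common pattern from the paragraphs above is a router that sends writes to the primary and round-robins reads across replicas. This is a minimal sketch, not a real driver API: `primary` and `replicas` here are plain labels standing in for actual connection pools, and the SELECT-prefix check is a deliberately naive classifier (real routers also send `SELECT ... FOR UPDATE` and anything ambiguous to the primary).

```python
import itertools

class ConnectionRouter:
    """Sketch: route writes to the primary, load-balance reads
    across replicas. Pool objects are stand-in strings here."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin

    def for_query(self, sql):
        # Replicas are strictly read-only: only plain SELECTs go there.
        # Anything else (writes, DDL, ambiguous statements) hits the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ConnectionRouter("primary", ["replica-1", "replica-2"])
print(router.for_query("SELECT * FROM users"))          # a replica
print(router.for_query("UPDATE users SET name = 'x'"))  # primary
```

In production this logic usually lives in a proxy (PgBouncer, ProxySQL) or in the ORM's read/write split configuration rather than hand-rolled code.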
Synchronous vs Asynchronous Replication
The critical question is: when does the primary acknowledge a write to the client?
Asynchronous Replication
The primary writes to its WAL, acknowledges the commit to the client, and then ships the WAL to replicas in the background. The client gets a fast acknowledgment. The replica catches up shortly after.
Pros: Low write latency. The primary does not wait for any replica. Even if all replicas are down, the primary continues accepting writes.
Cons: If the primary crashes before the replica receives the WAL entries, those committed transactions are lost. The replica that gets promoted is missing the most recent writes — potential data loss of up to the replication lag at the time of failure.
This is the default in MySQL and PostgreSQL. For most workloads, the risk of losing a few seconds of writes during a primary crash is acceptable — but for financial data, it may not be.
Synchronous Replication
The primary only acknowledges a commit after at least one replica has written the WAL entry to its durable storage (disk). The client waits for this round-trip.
Pros: Zero data loss on primary failure. If the primary crashes the instant after acknowledging a commit, the replica has the data. Promote the replica with no data loss.
Cons: Write latency increases by one network round-trip to the replica. If the synchronous replica goes down, the primary stops accepting writes (waiting for a replica that will never acknowledge). Must be configured carefully to avoid this availability problem.
PostgreSQL lets you configure synchronous_standby_names to designate which replicas must acknowledge before the primary commits. A common pattern: require one synchronous replica for durability, keep others asynchronous for low-latency reads.
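A sketch of what that pattern looks like in `postgresql.conf` on the primary; the standby name `replica_a` is illustrative:

```
# postgresql.conf on the primary (standby name is illustrative)
synchronous_commit = on
# 'replica_a' must confirm a durable WAL write before commit returns;
# any other connected standbys remain asynchronous.
synchronous_standby_names = 'FIRST 1 (replica_a)'
```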
Semi-Synchronous Replication
A middle ground offered by MySQL. The primary waits for at least one replica to receive the WAL (not necessarily apply it) before acknowledging. Reduces the data loss window without requiring replicas to apply changes synchronously. Falls back to asynchronous if no replica acknowledges within a timeout.
| | Asynchronous | Semi-Synchronous | Synchronous |
|---|---|---|---|
| Write latency | Low (local only) | Medium (+network RTT) | High (+network RTT + replica disk write) |
| Data loss on primary failure | Up to replication lag | Near-zero | Zero |
| Primary availability if replica fails | Unaffected | Falls back to async | Blocks (if required replica is down) |
| Default in | PostgreSQL, MySQL | MySQL semi-sync plugin | Optional in PostgreSQL, CockroachDB |
Replication Lag
Replication lag is the delay between a write on the primary and its visibility on a replica. It is always present to some degree in asynchronous replication. During normal operation it is milliseconds. Under heavy write load or a slow replica, it can grow to seconds or minutes.
Problems Caused by Replication Lag
Read-your-own-writes violation: A user updates their profile and is immediately redirected to a page that reads from a replica. The replica hasn’t caught up. The user sees their old profile. The fix: for a short window after a write, route that user’s reads to the primary, or wait for the replica to catch up before the redirect.
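The read-your-own-writes fix can be sketched as a per-user grace window: after a user writes, pin that user's reads to the primary until replicas have plausibly caught up. The 5-second window is an assumption — it should exceed your typical replication lag.

```python
import time

GRACE_SECONDS = 5.0        # assumption: comfortably above typical lag
_last_write_at = {}        # user_id -> monotonic timestamp of last write

def record_write(user_id):
    _last_write_at[user_id] = time.monotonic()

def choose_target(user_id):
    # Recently-written users read from the primary; everyone else
    # can tolerate replica lag.
    wrote_at = _last_write_at.get(user_id)
    if wrote_at is not None and time.monotonic() - wrote_at < GRACE_SECONDS:
        return "primary"
    return "replica"

record_write("alice")
print(choose_target("alice"))  # primary (just wrote)
print(choose_target("bob"))    # replica (no recent write)
```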
Monotonic reads violation: A user reads from replica A (which has applied up to log position 1000), then reads from replica B (which is only at position 950). The user sees newer data and then older data — time appears to go backward. Fix: pin the user to the same replica for the session, or use a load balancer that enforces replica stickiness.
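Replica stickiness for monotonic reads can be as simple as hashing the session ID to a fixed replica, so a user never hops to a less caught-up node mid-session. A sketch (replica names are illustrative):

```python
import hashlib

REPLICAS = ["replica-1", "replica-2", "replica-3"]  # illustrative names

def replica_for_session(session_id: str) -> str:
    # Deterministic hash -> the same session always lands on the
    # same replica, so its reads never move backward in time.
    digest = hashlib.sha256(session_id.encode()).digest()
    return REPLICAS[digest[0] % len(REPLICAS)]

print(replica_for_session("session-42") == replica_for_session("session-42"))  # True
```

Note that plain modulo hashing reshuffles sessions when the replica list changes; consistent hashing avoids that, at the cost of more machinery.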
Causality violation: Process A writes a comment on a post. Process B reads the post (from a replica that caught up) and then tries to read the comment (from a slower replica that hasn’t). The comment is invisible even though the post indicates it should exist. Fix: use primary reads for operations where causal consistency matters, or use a system that tracks causal dependencies (e.g. version vectors).
Monitoring Replication Lag
In PostgreSQL:
SELECT client_addr,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
In MySQL:
SHOW REPLICA STATUS\G
-- look at Seconds_Behind_Source
Alert if lag exceeds your tolerable threshold (commonly 30–60 seconds). Persistent high lag indicates the replica cannot keep up — either under-specced hardware, a long-running transaction on the primary holding back replication, or a table lock on the replica preventing WAL application.
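The alerting logic itself is simple once you have the lag values. This sketch assumes rows shaped like the `pg_stat_replication` query above — `(client_addr, replay_lag)` pairs, with `replay_lag` as a timedelta or `None` when lag data is unavailable; the 30-second threshold is the assumption from the text.

```python
from datetime import timedelta

LAG_THRESHOLD = timedelta(seconds=30)  # tolerable lag, per the text

def lagging_replicas(rows):
    """Return addresses of replicas whose replay lag breaches the
    threshold. Unknown lag (None) is treated as unhealthy."""
    alerts = []
    for addr, replay_lag in rows:
        if replay_lag is None or replay_lag > LAG_THRESHOLD:
            alerts.append(addr)
    return alerts

rows = [("10.0.0.5", timedelta(milliseconds=40)),
        ("10.0.0.6", timedelta(seconds=95))]
print(lagging_replicas(rows))  # ['10.0.0.6']
```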
Multi-Primary Replication
In multi-primary (also called multi-master or active-active) replication, multiple nodes accept writes. Every node replicates its writes to all other nodes. This sounds appealing — writes to any node, reads from any node — but it introduces write conflicts.
If two nodes simultaneously update the same row, both updates are valid locally. When the updates propagate to the other node, the database must resolve the conflict. Resolution strategies:
- Last-write-wins: The write with the higher timestamp wins. The other is discarded. Problem: clocks are not perfectly synchronized across nodes.
- Application-defined merge: The application is notified of conflicts and decides how to merge. Complex but correct.
- CRDTs (Conflict-free Replicated Data Types): Data structures that can always be merged without conflicts (counters, sets, ordered lists). Used by systems like Riak.
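A grow-only counter (G-Counter) is the simplest CRDT and shows why merges never conflict: each node increments only its own slot, and merging takes the per-node maximum, so concurrent increments on different nodes are never discarded. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merge = element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count

    def increment(self, n=1):
        # Each node only ever touches its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)            # node a sees 3
b.increment(2)            # node b concurrently sees 2
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5 — both converge, nothing lost
```

Contrast this with last-write-wins on a plain counter, where one of the two concurrent increments would simply be discarded.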
Multi-primary is rarely used for SQL databases because conflict resolution for relational data is extremely complex. It is more common in geographically distributed NoSQL systems (DynamoDB global tables, CouchDB) where the data model is simpler and last-write-wins is tolerable.
Systems like CockroachDB and Google Spanner look like multi-primary but are actually coordinated through distributed consensus (Raft or Paxos). Writes require a quorum of nodes to agree before committing. There are no conflicts because the consensus protocol serializes all writes. The cost is write latency (a network round-trip to achieve quorum). This is the correct way to build a globally distributed SQL database — not naive multi-primary replication.
Failover and Promotion
When the primary fails, a replica must be promoted to take over. The process:
- Detect the failure. A health check or consensus protocol determines the primary is unavailable. Typically requires multiple consecutive failed health checks to avoid false positives (network blip vs. actual crash).
- Select the new primary. The most up-to-date replica (lowest replication lag) is preferred to minimize data loss. If multiple replicas are at the same log position, break ties by priority or recency.
- Promote the replica. The replica’s WAL receiver is stopped; it transitions to primary mode and starts accepting writes.
- Redirect traffic. The application must be told to connect to the new primary. This is done via a virtual IP (VIP) that floats between nodes, a DNS TTL change, or a proxy layer (PgBouncer, HAProxy, ProxySQL) that is reconfigured automatically.
- Fence the old primary. STONITH (Shoot The Other Node In The Head) — ensure the old primary cannot accept writes after the new one is promoted, to prevent split-brain. This usually involves a fencing agent that powers off the old node or revokes its network access.
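Step 2 above (selecting the new primary) reduces to a comparison over replica state. This is a hypothetical sketch — the field names are illustrative, not a real tool's API: pick the healthy replica with the highest applied WAL position, breaking ties by a configured priority (lower number = preferred).

```python
def pick_promotion_candidate(replicas):
    """Choose the most up-to-date healthy replica; ties broken by
    configured priority (lower number wins)."""
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replica to promote")
    # Highest WAL position minimizes data loss; -priority prefers
    # the lower priority number among equals.
    return max(healthy, key=lambda r: (r["wal_position"], -r["priority"]))

replicas = [
    {"name": "replica-1", "healthy": True,  "wal_position": 1000, "priority": 2},
    {"name": "replica-2", "healthy": True,  "wal_position": 1000, "priority": 1},
    {"name": "replica-3", "healthy": False, "wal_position": 1200, "priority": 1},
]
print(pick_promotion_candidate(replicas)["name"])  # replica-2
```

Note that replica-3 has the highest WAL position but is excluded as unhealthy — a real failover tool must also verify the candidate is reachable before promoting it.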
Automated Failover Tools
- PostgreSQL: Patroni (etcd/Consul/ZooKeeper for consensus), repmgr, pg_auto_failover. AWS RDS and Aurora handle failover automatically.
- MySQL: Orchestrator, MHA (MySQL High Availability), ProxySQL + Orchestrator, AWS RDS Multi-AZ.
Target recovery time (RTO): automated failover typically takes 30–60 seconds. Manual failover takes minutes to hours. With synchronous replication, recovery point objective (RPO) is zero. With asynchronous, RPO equals replication lag at the time of failure.
Replication in NoSQL
NoSQL databases use replication too, but with different trade-offs baked into their design.
Cassandra: Every node is equal — no primary/replica distinction. Data is replicated across N nodes (replication factor, e.g. RF=3). Any node can accept reads or writes. Consistency is tunable per operation: write to ONE for low latency, QUORUM for stronger consistency, ALL for maximum durability.
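The guarantee behind Cassandra's tunable consistency is simple arithmetic: a read and a write are guaranteed to overlap on at least one replica whenever R + W > RF, which is exactly what QUORUM reads plus QUORUM writes give you at RF=3.

```python
def overlaps(read_level: int, write_level: int, rf: int) -> bool:
    """True if any read replica set must intersect any write replica set."""
    return read_level + write_level > rf

RF = 3
QUORUM = RF // 2 + 1  # 2 of 3

print(overlaps(QUORUM, QUORUM, RF))  # True  — quorum reads see quorum writes
print(overlaps(1, 1, RF))            # False — ONE/ONE can miss recent writes
```

This is why ONE/ONE is fast but may return stale data, while QUORUM/QUORUM trades one extra replica round-trip for read-your-writes behavior.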
DynamoDB: AWS manages replication across three availability zones transparently. You configure read/write capacity; DynamoDB handles all replication internally. Global Tables add cross-region replication with last-write-wins conflict resolution.
MongoDB: Uses a replica set — one primary and up to 49 secondaries (50 members total, at most 7 voting). Writes go to the primary; secondaries replicate via oplog streaming. An election (using a Raft-based protocol) automatically promotes a secondary if the primary fails. Read preference can be configured to allow reads from secondaries.
Design Considerations
When designing a system that uses replication, address these questions explicitly:
- What is your RPO? How much data loss is acceptable if the primary crashes? Zero → synchronous replication. Seconds → asynchronous is fine.
- What is your RTO? How quickly must reads and writes resume after a primary failure? Sub-minute → automated failover with health checks. Minutes → manual failover may be acceptable.
- Do your reads tolerate stale data? Analytics and reporting → read from replica, accept lag. User-facing profile reads → use read-your-own-writes routing. Financial balances → always read from primary.
- How many replicas? At minimum two replicas for HA (one fails, one remains for failover). Add more for geographic distribution or heavy read traffic. Each replica adds replication overhead to the primary.
- How do you monitor lag? Set up alerts on replica lag. A replica 10 minutes behind is not providing meaningful high availability — if you promote it, you lose 10 minutes of writes.
When discussing a database layer, always mention replication explicitly. “We run a PostgreSQL primary with two replicas. Application reads are load-balanced across replicas; writes go to the primary. We use synchronous replication to one replica for zero-RPO durability, and asynchronous to the second for a cross-region read replica. Patroni with etcd handles automated failover with a 30-second RTO.” This level of specificity signals genuine experience with production databases.