Distributed Transactions

A distributed transaction coordinates multiple participants — separate databases, services, or nodes — so that either all of them commit their changes or all of them roll back. This is harder than a local transaction because participants can fail independently, network messages can be lost or delayed, and there is no single process with visibility into the global state.

The Problem

Imagine an e-commerce order: the Order Service must insert a row in its database, the Inventory Service must decrement stock in its database, and the Payment Service must charge the customer. All three must succeed together or none should take effect.

In a single database, a transaction handles this. Across three separate services with separate databases, there is no built-in transaction boundary. If the Order Service succeeds but the Inventory Service crashes mid-operation, you have an order with no reserved stock. This is a distributed systems consistency problem.
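The failure mode above can be made concrete with a small sketch. The service classes here are in-memory stand-ins, not a real API; the point is that each call commits durably in its own database, so a crash between calls leaves a partial result:

```python
# Sketch of the naive approach: three independent commits with no shared
# transaction boundary. Service classes are in-memory stand-ins.

class Service:
    def __init__(self):
        self.committed = []          # rows durably committed in this service's DB

    def commit(self, record):
        self.committed.append(record)

order_svc, inventory_svc, payment_svc = Service(), Service(), Service()

def place_order(order_id, item, amount, crash_before=None):
    order_svc.commit(("order", order_id))             # commits in DB 1
    if crash_before == "inventory":
        raise RuntimeError("Inventory Service crashed")
    inventory_svc.commit(("decrement", item))         # commits in DB 2
    if crash_before == "payment":
        raise RuntimeError("Payment Service crashed")
    payment_svc.commit(("charge", order_id, amount))  # commits in DB 3

# Simulate the Inventory Service crashing mid-operation:
try:
    place_order("o1", "widget", 999, crash_before="inventory")
except RuntimeError:
    pass

# The order exists but stock was never reserved: an inconsistent state
# that nothing will automatically repair.
assert order_svc.committed == [("order", "o1")]
assert inventory_svc.committed == []
```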

⚠️
Why Not Just Use a Shared Database?

A single database shared by multiple services solves the distributed transaction problem but couples those services at the storage layer. Schema changes in one service affect others; one service's write volume affects others; you lose independent deployability. Distributed transactions exist precisely because you want service independence and cannot share a database.

Two-Phase Commit (2PC)

Two-Phase Commit is the classical protocol for coordinating distributed transactions. It uses a coordinator (often the initiating service or a transaction manager) and multiple participants (the databases or services involved).

Phase 1 — Prepare (Voting):

  1. The coordinator sends a PREPARE message to all participants.
  2. Each participant performs its local work and writes the changes to its write-ahead log (WAL), but does not commit.
  3. Each participant responds YES (prepared, can commit) or NO (failed, cannot commit).

Phase 2 — Commit (or Abort):

  1. If all participants voted YES, the coordinator writes a commit record and sends COMMIT to all participants.
  2. If any participant voted NO (or timed out), the coordinator sends ABORT to all participants.
  3. Each participant applies the commit or rolls back, then releases locks and acknowledges the coordinator.

Two-Phase Commit: coordinator orchestrates prepare then commit across all participants
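The two phases can be sketched with in-memory objects. This is a minimal illustration of the protocol's happy path and abort path, not a real transaction manager; `Participant` and `Coordinator` are illustrative names, and durability (WAL writes, the coordinator's commit record) is reduced to comments:

```python
# Minimal 2PC sketch. Real implementations persist the prepared state
# and the coordinator's decision to a WAL before sending any message.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "idle"          # idle -> prepared -> committed/aborted

    def prepare(self):
        # Phase 1: do local work, log it durably, but do not commit.
        if self.can_commit:
            self.state = "prepared"  # locks are held from here on
            return "YES"
        return "NO"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

class Coordinator:
    def run(self, participants):
        votes = [p.prepare() for p in participants]            # Phase 1
        decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
        # Durably log the decision (commit record) before telling anyone.
        for p in participants:                                 # Phase 2
            p.commit() if decision == "COMMIT" else p.abort()
        return decision

a, b = Participant("orders"), Participant("inventory")
assert Coordinator().run([a, b]) == "COMMIT"

# One NO vote aborts everyone:
c, d = Participant("orders"), Participant("payments", can_commit=False)
assert Coordinator().run([c, d]) == "ABORT"
assert c.state == "aborted"
```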

2PC provides atomicity across participants. A participant that voted YES is "locked in" — it has promised to commit and will hold its locks until it hears from the coordinator. This guarantee is what makes 2PC correct.

2PC Failure Modes

2PC's correctness comes at the cost of blocking on coordinator failure:

Coordinator crashes after Phase 1 but before Phase 2: Participants that voted YES are in a state called "in-doubt" — they have locked resources and are waiting for the coordinator's decision. They cannot unilaterally commit or abort without risking inconsistency. They hold their locks indefinitely until the coordinator recovers and completes Phase 2. This blocking is 2PC's fundamental limitation.

Participant crashes after voting YES: When the participant recovers, it checks its WAL and finds the prepared state. It contacts the coordinator to learn the decision and applies it. Because it wrote its work to the WAL before voting YES, it can honor the commit even after a crash.

Network partition: If the coordinator cannot reach a participant during Phase 2, it retries until the participant is reachable. If the participant cannot reach the coordinator after voting YES, it blocks. This is the "in-doubt" window: the participant doesn't know whether to commit or abort.
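The participant-recovery rule described above can be sketched as a small function. `ask_coordinator` is a hypothetical RPC that returns the coordinator's decision; the key point is that a prepared participant never decides on its own:

```python
# Participant-side recovery sketch, assuming a "prepared" record was
# written to the WAL before voting YES. ask_coordinator is a
# hypothetical RPC returning "COMMIT" or "ABORT".

def recover(wal, ask_coordinator):
    if wal.get("state") == "prepared":
        # In-doubt: we promised to commit, so we may not decide alone.
        decision = ask_coordinator(wal["tx_id"])
        wal["state"] = "committed" if decision == "COMMIT" else "aborted"
    return wal["state"]

wal = {"tx_id": "tx42", "state": "prepared"}
assert recover(wal, lambda tx_id: "COMMIT") == "committed"
```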

💡
2PC in Practice

Despite its limitations, 2PC is widely used for database-level distributed transactions: PostgreSQL's PREPARE TRANSACTION / COMMIT PREPARED, MySQL's XA transactions, and Java's JTA/XA spec all implement 2PC. It works well for transactions that span databases within a controlled infrastructure where coordinator downtime is bounded and in-doubt windows are short.
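For illustration, PostgreSQL's 2PC commands look like this (the transaction identifier `tx_orders` is arbitrary, and the server must have `max_prepared_transactions` set above zero):

```sql
-- Run by a coordinator against one participating database:
BEGIN;
INSERT INTO orders (id, item) VALUES (42, 'widget');
PREPARE TRANSACTION 'tx_orders';   -- Phase 1: vote YES, hold locks

-- Later, once every participant has prepared:
COMMIT PREPARED 'tx_orders';       -- Phase 2
-- (on failure: ROLLBACK PREPARED 'tx_orders')
```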

Three-Phase Commit (3PC)

3PC adds a third phase to eliminate the in-doubt blocking of 2PC. Between Prepare and Commit, it adds a Pre-Commit phase where the coordinator informs all participants of the decision before applying it. Each participant can now safely abort if it doesn't receive the pre-commit within a timeout — because if any participant had to abort, the coordinator would not have sent pre-commit.

In practice, 3PC is rarely used. It requires synchronous messaging assumptions that are hard to guarantee over real networks, and the additional round trip makes it slower than 2PC. Modern systems prefer 2PC with short coordinator failure windows (via leader election) or alternative approaches like Sagas.

The Saga Pattern

A Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect if a later step fails.

For the order example:

  1. Order Service: INSERT order → compensating: DELETE order
  2. Inventory Service: DECREMENT stock → compensating: INCREMENT stock
  3. Payment Service: CHARGE customer → compensating: REFUND customer

If step 3 (Payment) fails, the Saga runs the compensating transactions for the steps that did complete, in reverse order: the payment never went through, so no refund is needed; inventory is incremented back, then the order is deleted. Each participant commits its local transaction immediately, so no global locks are held across steps.
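A minimal sketch of this compensation logic, with each step as an (action, compensation) pair (all names here are illustrative, not a library API):

```python
# Saga sketch: run each step's action; on failure, run the compensations
# of the steps that completed, in reverse order.

def run_saga(steps):
    completed = []
    try:
        for action, compensate in steps:
            action()                  # local transaction commits immediately
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()              # undo only what actually completed
        return "compensated"
    return "completed"

log = []

def charge_customer():
    raise RuntimeError("card declined")   # step 3 fails

steps = [
    (lambda: log.append("insert order"),    lambda: log.append("delete order")),
    (lambda: log.append("decrement stock"), lambda: log.append("increment stock")),
    (charge_customer,                       lambda: log.append("refund")),
]

assert run_saga(steps) == "compensated"
# The payment never succeeded, so no refund runs; the other steps are undone:
assert log == ["insert order", "decrement stock",
               "increment stock", "delete order"]
```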

Sagas trade atomicity for availability and independence:

|                  | 2PC                                        | Saga                                          |
|------------------|--------------------------------------------|-----------------------------------------------|
| Atomicity        | True atomicity (all-or-nothing)            | Eventual atomicity via compensation           |
| Locks held       | Across all participants during the protocol | Only during each local transaction            |
| Availability     | Blocked on coordinator failure             | Each step can proceed independently           |
| Isolation        | Configurable per participant               | No global isolation; interim states visible   |
| Failure handling | Coordinator manages rollback               | Application must write compensating logic     |

The key limitation of Sagas is that intermediate states are visible. Between steps 1 and 3 of the order example, the order exists and inventory is decremented but payment hasn't been charged. If another process queries the system in this window, it sees a partially complete state. Applications must tolerate this.

Choreography vs Orchestration

Sagas can be implemented in two styles:

Choreography: Each service publishes events after completing its local transaction. Other services subscribe to those events and execute their steps. There is no central coordinator — the saga flow emerges from the event subscriptions. Simple to implement for short sagas; becomes hard to reason about as the number of services grows.

OrderService → publishes OrderCreated
InventoryService → listens to OrderCreated → reserves stock → publishes StockReserved
PaymentService → listens to StockReserved → charges customer → publishes PaymentProcessed
OrderService → listens to PaymentProcessed → marks order confirmed
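The event flow above can be sketched with a tiny in-memory event bus. Each handler stands in for a service committing its local transaction and publishing the next event; there is no central coordinator:

```python
# Choreography sketch: handlers subscribe to events and publish the
# next event themselves. Event names follow the flow described above.

subscribers = {}
trace = []

def subscribe(event, handler):
    subscribers.setdefault(event, []).append(handler)

def publish(event):
    trace.append(event)
    for handler in subscribers.get(event, []):
        handler()

subscribe("OrderCreated",     lambda: publish("StockReserved"))      # Inventory
subscribe("StockReserved",    lambda: publish("PaymentProcessed"))   # Payment
subscribe("PaymentProcessed", lambda: trace.append("order confirmed"))  # Order

publish("OrderCreated")
assert trace == ["OrderCreated", "StockReserved",
                 "PaymentProcessed", "order confirmed"]
```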

Orchestration: A central saga orchestrator (a dedicated service or state machine) calls each participant in sequence and decides what to do next based on responses. The orchestrator explicitly manages forward steps and compensating transactions. Easier to understand and debug; the orchestrator is a new component to manage.

For complex sagas with many steps or non-linear flows, orchestration is generally preferable. Choreography works well for simple two- or three-step flows.

The Outbox Pattern

A common problem in distributed systems: after a local transaction commits, you need to publish an event to a message broker. If the broker call fails, the event is never published, leaving downstream services unaware. If you publish before committing, and then the commit fails, you've published an event for a transaction that didn't happen.

The outbox pattern solves this by making event publication part of the local transaction:

  1. Within the same local transaction, write your business data and an outbox row (the event to publish).
  2. Commit the local transaction — both the business data and the outbox row are atomically committed.
  3. A separate background process (the outbox relay) reads unpublished outbox rows and publishes them to the message broker, then marks them as published.
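The three steps above can be sketched with SQLite standing in for the service's database and a plain list standing in for the message broker (table and column names are illustrative):

```python
# Outbox sketch: business row and outbox row commit in one local
# transaction; a relay then publishes pending rows and marks them.
import sqlite3, json

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id, item):
    # Steps 1-2: both inserts commit atomically in one local transaction.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"event": "OrderCreated", "id": order_id}),))

def relay(broker):
    # Step 3: background relay publishes pending rows, then marks them.
    pending = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in pending:
        broker.append(json.loads(payload))  # in practice: Kafka, RabbitMQ, ...
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

broker = []
place_order(1, "widget")
relay(broker)
assert broker == [{"event": "OrderCreated", "id": 1}]
```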

The outbox relay provides at-least-once delivery: if it crashes after publishing but before marking a row as published, it will publish the same event again on retry. Consumers must therefore handle duplicate messages (idempotent consumers). In exchange, you get reliable, effectively-once event publication with no distributed transaction needed.
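An idempotent consumer typically deduplicates by event ID, as in this sketch (in a real system the set of seen IDs would be a durable table, updated in the same transaction as the consumer's own writes):

```python
# Idempotent consumer sketch: redelivered events are processed at most once.

class Consumer:
    def __init__(self):
        self.seen = set()     # durable in practice, e.g. a processed_ids table
        self.applied = []

    def handle(self, event):
        if event["id"] in self.seen:
            return            # duplicate delivery: skip
        self.seen.add(event["id"])
        self.applied.append(event)

c = Consumer()
event = {"id": "evt-1", "type": "OrderCreated"}
c.handle(event)
c.handle(event)               # relay retried, so the event arrives twice
assert len(c.applied) == 1
```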

Change Data Capture as the Outbox Relay

Instead of a polling outbox relay, many teams use Change Data Capture (CDC) tools like Debezium to stream the database WAL directly. When the outbox table gets a new row, Debezium publishes it to Kafka automatically — no polling, lower latency, and no extra queries on the main database.

Design Considerations