Production Concerns

Circuit Breaker


The circuit breaker is a resilience pattern that prevents a failing downstream service from taking down the services that depend on it. Named after the electrical circuit breaker that cuts power when current exceeds a safe threshold, it detects when a dependency is failing and “opens” — stopping all requests to that service for a period. This gives the failing service time to recover and prevents the caller from wasting threads waiting for timeouts.

Cascading Failures

In a microservices system, services call each other. Service A calls Service B; B calls C; C calls D. If D becomes slow or unresponsive, every request through the chain waits for D to respond or time out. Threads in A, B, and C pile up holding open connections to their downstream dependencies. Connection pools are exhausted. Memory fills with queued requests. A slow response from one service at the bottom of the call chain takes down the entire stack: a cascading failure.

Without a circuit breaker, even a 5-second timeout on each hop compounds: a user request that touches A → B → C → D can take 15+ seconds before returning an error, by which time hundreds of threads are blocked waiting for a service that’s clearly not responding.
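The "hundreds of threads" figure follows from simple arithmetic: the number of requests in flight is roughly the arrival rate multiplied by how long each request is held (Little's Law). A minimal sketch, where the rate of 100 requests per second and the 50 ms healthy latency are assumed figures for illustration, not measurements:

```python
def blocked_threads(requests_per_second: float, hold_seconds: float) -> float:
    """Little's Law estimate: in-flight requests = arrival rate x time held."""
    return requests_per_second * hold_seconds

# Assumed traffic: 100 requests/second arriving at Service A.
healthy = blocked_threads(100, 0.05)  # D answers in 50 ms
stalled = blocked_threads(100, 5.0)   # D hangs until the 5-second timeout fires

print(f"healthy: ~{healthy:.0f} threads busy, stalled: ~{stalled:.0f} threads busy")
```

Under these assumptions the same traffic that keeps about 5 threads busy when D is healthy ties up about 500 once every call waits out the timeout, which is more than many thread pools are sized for.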

How the Circuit Breaker Works

A circuit breaker wraps every call to an external dependency. It monitors the results of those calls — successes, failures, and timeouts. When the failure rate crosses a threshold, it “opens” the circuit: subsequent calls immediately return an error (or a fallback) without hitting the downstream service. After a configured wait, it allows a small number of test requests through. If those succeed, it “closes” the circuit and resumes normal operation.

Circuit breaker states: Closed (normal), Open (blocking), Half-Open (testing recovery)

The Three States

Closed (normal operation): Requests flow through to the downstream service. The circuit breaker counts successes and failures within a sliding window. If the failure rate stays below the threshold (e.g., <50% of the last 20 calls), the circuit remains closed.
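A count-based sliding window of this kind can be sketched in a few lines. The class name `CountWindow` and its methods are hypothetical, chosen for illustration; real libraries use their own names and usually add a time-based variant.

```python
from collections import deque

class CountWindow:
    """Tracks the outcomes of the most recent `size` calls."""

    def __init__(self, size: int = 20):
        # A deque with maxlen silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=size)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

window = CountWindow(size=20)
for _ in range(12):
    window.record(True)
for _ in range(8):
    window.record(False)

# 8 failures in the last 20 calls: below a 50% threshold, so the circuit stays closed.
print(window.failure_rate())
```

Because the window only remembers the last 20 outcomes, old failures age out on their own as newer calls succeed.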

Open (tripped): The failure threshold was exceeded. The circuit breaker immediately rejects all calls — no request reaches the downstream service. The caller receives a fast error or fallback response. This state lasts for a configured duration (e.g., 30 seconds), called the “sleep window.”

Half-Open (recovery probe): After the sleep window expires, the circuit transitions to half-open. A limited number of probe requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit re-opens for another sleep window. This probing prevents thundering herds: instead of all callers rushing back the moment a service recovers, only probe requests test the water first.
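The three states and their transitions can be sketched as a small state machine. This is an illustrative, single-threaded sketch rather than a production implementation: the names `CircuitBreaker` and `CircuitOpenError`, the parameter names, and the string state labels are all assumptions. It only evaluates the failure rate once the window is full, and it does not cap concurrent probes the way a real library would.

```python
import time
from collections import deque

class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=20,
                 sleep_window=30.0, probe_count=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.sleep_window = sleep_window
        self.probe_count = probe_count
        self.clock = clock  # injectable so tests need not sleep
        self.outcomes = deque(maxlen=window_size)
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.sleep_window:
                raise CircuitOpenError("circuit is open")  # fail fast
            # Sleep window expired: start letting probe requests through.
            self.state = "half_open"
            self.probe_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.probe_count:
                self.state = "closed"  # enough probes succeeded
                self.outcomes.clear()
        else:
            self.outcomes.append(True)

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()  # a failed probe re-opens the circuit
            return
        self.outcomes.append(False)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        rate = self.outcomes.count(False) / len(self.outcomes)
        if window_full and rate >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
```

A caller wraps each dependency call as `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any function that talks to the downstream service, and handles `CircuitOpenError` as the fast-failure path.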

Fast Failure vs Slow Timeout

An open circuit breaker fails fast — it returns an error in microseconds instead of waiting 5–30 seconds for a timeout. This means threads are freed immediately, connection pools don’t exhaust, and the user gets a degraded-but-fast response rather than a hung UI. Fast failure is almost always preferable to slow failure in distributed systems.

Fallbacks

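A circuit breaker without a fallback just turns downstream failures into upstream errors. A good implementation pairs the circuit breaker with a fallback strategy, such as serving cached data, returning a sensible default, or disabling the affected feature while the rest of the page still works.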
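One way to pair a protected call with a fallback is a small wrapper that catches the failure and substitutes a degraded result. In this sketch every name is hypothetical: `CircuitOpenError` stands in for whatever error the breaker raises when it rejects a call, and the recommendation service and bestseller list are invented for illustration.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""

def call_with_fallback(protected_call, fallback):
    """Run the breaker-protected call; on any failure, serve the fallback."""
    try:
        return protected_call()
    except Exception:
        # Covers fast rejections (CircuitOpenError) as well as real failures.
        return fallback()

# Hypothetical example: recommendations degrade to a static bestseller list.
def fetch_recommendations():
    raise CircuitOpenError("recommendation service circuit is open")

def bestsellers():
    return ["book-1", "book-2"]

print(call_with_fallback(fetch_recommendations, bestsellers))
```

Catching every exception keeps the sketch short; in practice it is usually worth logging the failure and catching only the error types for which a degraded answer is acceptable.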

Tuning Thresholds

Misconfigured circuit breakers create new problems. Tune based on real traffic patterns:

| Parameter | Too low | Too high | Typical starting point |
|---|---|---|---|
| Failure threshold | Opens on transient errors | Opens too late, threads already exhausted | 50% of last 20 requests |
| Minimum request count | Opens on 1 failure in low traffic | Too slow to detect problems | 20 requests per window |
| Sleep window | Half-opens too soon; re-trips immediately | Service recovers but circuit stays open too long | 30–60 seconds |
| Probe count (half-open) | One bad probe re-opens the circuit | Too many probes hit a recovering service | 3–5 probe requests |

Set the minimum request count high enough that a few failures during low-traffic periods (nights, weekends) don’t trip the circuit. Use a percentage threshold, not an absolute count, so the circuit scales with traffic volume.
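The interaction between the percentage threshold and the minimum request count can be sketched as a single check. The `BreakerConfig` fields and the `should_trip` function are hypothetical names; the defaults mirror the starting points in the table above rather than any particular library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: float = 0.5      # trip at 50% failures in the window
    minimum_calls: int = 20             # do not evaluate below this volume
    sleep_window_seconds: float = 30.0  # time spent open before probing
    probe_count: int = 3                # probes allowed while half-open

def should_trip(failures: int, total: int, cfg: BreakerConfig) -> bool:
    """Trip only when there is enough traffic to trust the failure rate."""
    if total < cfg.minimum_calls:
        return False
    return failures / total >= cfg.failure_threshold

cfg = BreakerConfig()
print(should_trip(1, 1, cfg))    # False: one failed call at 3 a.m. is not enough data
print(should_trip(12, 20, cfg))  # True: 60% of 20 calls failed
print(should_trip(9, 20, cfg))   # False: 45% is below the threshold
```

Without the `minimum_calls` guard, the first case would register a 100% failure rate and open the circuit on a single error.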

Implementations

Resilience4j (Java): The de facto standard circuit breaker library for JVM services. Supports count-based and time-based sliding windows, bulkheads, retry, and rate limiting. It ships integration modules for Spring Boot and Micronaut; Quarkus applications typically use SmallRye Fault Tolerance instead.

Hystrix (Java, deprecated): Netflix’s original circuit breaker. Widely referenced, but in maintenance mode since 2018 and no longer actively developed. The Hystrix README itself points new projects to Resilience4j.

Polly (.NET): Resilience policies for C# — circuit breaker, retry, timeout, bulkhead, and fallback. Composable into a “policy wrap.”

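Envoy Proxy / Istio: Circuit breaking at the infrastructure layer. Envoy offers two complementary mechanisms: per-cluster circuit breaker limits that cap connections, pending requests, and retries, and outlier detection, which tracks errors per upstream host and ejects unhealthy hosts from the load balancing pool. Istio exposes both through its DestinationRule resource. This applies circuit breaking without changing application code.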

AWS App Mesh / GCP Traffic Director: Service mesh solutions that provide circuit breaking, retries, and timeout policies as infrastructure configuration.

Design Considerations