Production Concerns

Circuit Breaker


The circuit breaker is a resilience pattern that prevents a failing downstream service from taking down the services that depend on it. Named after the electrical circuit breaker that cuts power when current exceeds a safe threshold, it detects when a dependency is failing and “opens” — stopping all requests to that service for a period. This gives the failing service time to recover and prevents the caller from wasting threads waiting for timeouts.

Cascading Failures

In a microservices system, services call each other. Service A calls Service B; B calls C; C calls D. If D becomes slow or unresponsive, every request through the chain waits for D to respond or time out. Threads in A, B, and C pile up, each holding an open connection to its downstream dependency. Connection pools are exhausted. Memory fills with queued requests. A slow response from one service at the bottom of the call chain takes down the entire stack: a cascading failure.

Without a circuit breaker, even a 5-second timeout on each hop is expensive. Every request through A → B → C → D holds a thread in each service for the full timeout, and if any hop retries or the timeouts are not coordinated, the delays stack: the user can wait 15+ seconds for an error, by which time hundreds of threads are blocked waiting for a service that’s clearly not responding.
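A back-of-the-envelope estimate shows how quickly blocked threads accumulate. The sketch below applies Little's law (average requests in flight = arrival rate × time each request is held). The load of 100 requests per second and the 50 ms healthy latency are assumed example figures, not measurements.

```python
def blocked_threads(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average requests in flight = arrival rate * time held."""
    return requests_per_second * hold_seconds


# Assumed example load: 100 requests/second reaching service A.
healthy = blocked_threads(100, 0.05)  # D answers in 50 ms
stalled = blocked_threads(100, 5.0)   # D hangs until the 5-second timeout fires

print(f"threads busy while D is healthy: {healthy:.0f}")
print(f"threads busy while D is stalled: {stalled:.0f}")
```

Under these assumptions the same traffic that needs a handful of threads when D is healthy needs hundreds once every call waits out the timeout, and the same multiplication applies in B and C.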

How the Circuit Breaker Works

A circuit breaker wraps every call to an external dependency. It monitors the results of those calls — successes, failures, and timeouts. When the failure rate crosses a threshold, it “opens” the circuit: subsequent calls immediately return an error (or a fallback) without hitting the downstream service. After a configured wait, it allows a small number of test requests through. If those succeed, it “closes” the circuit and resumes normal operation.

Circuit breaker states: Closed (normal), Open (blocking), Half-Open (testing recovery)

The Three States

Closed (normal operation): Requests flow through to the downstream service. The circuit breaker counts successes and failures within a sliding window. If the failure rate stays below the threshold (e.g., <50% of the last 20 calls), the circuit remains closed.

Open (tripped): The failure threshold was exceeded. The circuit breaker immediately rejects all calls — no request reaches the downstream service. The caller receives a fast error or fallback response. This state lasts for a configured duration (e.g., 30 seconds), called the “sleep window.”

Half-Open (recovery probe): After the sleep window expires, the circuit transitions to half-open. A limited number of probe requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit re-opens for another sleep window. This probing prevents thundering herds: instead of all callers rushing back the moment a service recovers, only probe requests test the water first.
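The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation: the class and parameter names (CircuitBreaker, failure_rate, min_calls, sleep_window, probe_count) are hypothetical, the window is count-based, and a real breaker would also need locking and a cap on concurrent probes.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    """Single-threaded sketch; a real breaker needs locking around state."""

    def __init__(self, failure_rate=0.5, window_size=20, min_calls=20,
                 sleep_window=30.0, probe_count=3, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.min_calls = min_calls
        self.sleep_window = sleep_window
        self.probe_count = probe_count
        self.clock = clock                        # injectable so tests can fake time
        self.state = "closed"
        self.results = deque(maxlen=window_size)  # True marks a failed call
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.sleep_window:
                raise CircuitOpenError("circuit open, failing fast")
            # Sleep window expired: start letting probe requests through.
            self.state = "half_open"
            self.probe_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            if failed:
                self._open()                      # a failed probe re-opens the circuit
            else:
                self.probe_successes += 1
                if self.probe_successes >= self.probe_count:
                    self.state = "closed"
                    self.results.clear()
            return
        self.results.append(failed)
        enough = len(self.results) >= self.min_calls
        if enough and sum(self.results) / len(self.results) >= self.failure_rate:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()


# Usage with a fake clock so the sleep window can be skipped instantly.
now = [0.0]
breaker = CircuitBreaker(window_size=4, min_calls=4, probe_count=2,
                         clock=lambda: now[0])


def failing_call():
    raise ConnectionError("dependency unreachable")


for _ in range(4):
    try:
        breaker.call(failing_call)
    except ConnectionError:
        pass
state_after_failures = breaker.state   # the window is full of failures

now[0] = 31.0                          # move past the 30-second sleep window
breaker.call(lambda: "pong")           # first probe
breaker.call(lambda: "pong")           # second probe
state_after_probes = breaker.state
```

Injecting the clock is a design choice made for testability; it lets the sleep window be exercised without waiting in real time.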

Fast Failure vs Slow Timeout

An open circuit breaker fails fast — it returns an error in microseconds instead of waiting 5–30 seconds for a timeout. This means threads are freed immediately, connection pools don’t exhaust, and the user gets a degraded-but-fast response rather than a hung UI. Fast failure is almost always preferable to slow failure in distributed systems.

Fallbacks

A circuit breaker without a fallback just turns downstream failures into upstream errors. A good implementation pairs the circuit breaker with a fallback strategy:

Cached response: Return the last known good value from cache. A product recommendation service returning yesterday’s recommendations is better than an error page.
Default value: Return an empty list, a default configuration, or a safe neutral response. An A/B test service returning the control variant when it’s down is benign.
Graceful degradation: Skip the non-essential feature entirely. If the ratings service is down, show products without star ratings rather than failing the page load.
Queue for retry: For write operations, queue the request for background retry when the service recovers. Use the outbox pattern to ensure durability.
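The cached-response and default-value strategies can be combined in one small wrapper. The sketch below is illustrative only: with_fallback, the in-memory _last_good cache, and the cache keys are hypothetical names, and catching every exception is a simplification that real code would narrow to timeouts, connection errors, and the breaker's own rejection error.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


_last_good = {}  # in-memory cache of the last successful value per key


def with_fallback(key, primary, default):
    """Call primary(); on failure serve the cached value, else the default."""
    try:
        value = primary()
    except Exception:  # simplification: includes CircuitOpenError
        return _last_good.get(key, default)
    _last_good[key] = value
    return value


def recommendations_ok():
    return ["book", "lamp"]


def recommendations_down():
    raise CircuitOpenError("recommendations circuit is open")


first = with_fallback("recs:user42", recommendations_ok, default=[])
second = with_fallback("recs:user42", recommendations_down, default=[])  # cached
third = with_fallback("recs:user7", recommendations_down, default=[])    # default
```

The second call serves the stale but usable cached list; the third has nothing cached and degrades to the empty default.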

Tuning Thresholds

Misconfigured circuit breakers create new problems. Tune based on real traffic patterns:

Parameter	Too low	Too high	Typical starting point
Failure threshold	Opens on transient errors	Opens too late, threads already exhausted	50% of last 20 requests
Minimum request count	Opens on 1 failure in low traffic	Too slow to detect problems	20 requests per window
Sleep window	Half-opens too soon; re-trips immediately	Service recovers but circuit stays open too long	30–60 seconds
Probe count (half-open)	One bad probe re-opens the circuit	Too many probes hit a recovering service	3–5 probe requests

Set the minimum request count high enough that a few failures during low-traffic periods (nights, weekends) don’t trip the circuit. Use a percentage threshold, not an absolute count, so the circuit scales with traffic volume.
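The interaction between the percentage threshold and the minimum request count can be sketched as a single decision function. The names BreakerConfig and should_open are hypothetical, and the defaults simply mirror the starting points in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5   # open at 50% failures or more
    minimum_calls: int = 20               # ignore windows with less traffic
    sleep_window_seconds: float = 30.0
    half_open_probes: int = 3


def should_open(failures: int, total: int, cfg: BreakerConfig) -> bool:
    """Trip only when there is enough traffic AND the failure rate is high."""
    if total < cfg.minimum_calls:
        return False
    return failures / total >= cfg.failure_rate_threshold


cfg = BreakerConfig()
quiet_night = should_open(failures=3, total=4, cfg=cfg)     # 75% but only 4 calls
real_outage = should_open(failures=10, total=20, cfg=cfg)   # 50% of 20 calls
mostly_fine = should_open(failures=9, total=20, cfg=cfg)    # 45% of 20 calls
```

With these defaults the low-traffic window does not trip the circuit despite its 75% failure rate, while the 20-call window at 50% does.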

Implementations

Resilience4j (Java): The standard circuit breaker library for JVM services. Supports count-based and time-based sliding windows, bulkheads, retry, and rate limiting. Integrates with Spring Boot, Micronaut, and Quarkus.

Hystrix (Java, deprecated): Netflix’s original circuit breaker. Widely referenced, but Netflix has placed it in maintenance mode and it is no longer actively developed. The Hystrix project itself recommends Resilience4j for new work.

Polly (.NET): Resilience policies for C# — circuit breaker, retry, timeout, bulkhead, and fallback. Composable into a “policy wrap.”

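Envoy Proxy / Istio: Circuit breaking at the infrastructure layer, without changing application code. Envoy offers two related mechanisms. Its circuit breakers cap concurrent connections, pending requests, and retries per upstream cluster, and reject traffic beyond those limits. Its outlier detection tracks errors per upstream host and temporarily ejects unhealthy hosts from the load-balancing pool, which is the closer match to the open and half-open behavior described above. Istio exposes both through a DestinationRule.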

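AWS App Mesh / GCP Traffic Director: Managed service mesh solutions that provide circuit breaking, retries, and timeout policies as infrastructure configuration. Check current vendor status before adopting either: AWS has announced end of support for App Mesh, and Traffic Director is now part of Google’s Cloud Service Mesh.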

Design Considerations

Scope circuit breakers per dependency, not globally. A circuit breaker protecting all outbound calls will open when any dependency fails — blocking calls to healthy services too. Each downstream dependency gets its own circuit breaker instance.
Expose circuit state in health checks. Your service’s /health endpoint should report the state of each circuit breaker. An open circuit to a critical dependency means the service is degraded — surface this so orchestrators and dashboards reflect reality.
Distinguish dependency failures from client errors. HTTP 503 (Service Unavailable) and timeouts indicate that the dependency is struggling and should count toward the failure threshold. HTTP 404 (Not Found) is a client error: the service answered correctly, so it shouldn’t trip the circuit. Configure which error types count as failures.
Pair with retries carefully. Retries and circuit breakers together can amplify load. With a 3-retry policy, each failed call can generate up to 4 actual requests (the original plus 3 retries), which pushes a struggling service further toward failure. Set retry limits low (1–2 retries max) and don’t retry when the circuit is open.
Alert on open circuits. An open circuit is a production incident in disguise. Even if your fallback is graceful, an open circuit means a dependency is failing. Alert immediately when a circuit opens so on-call engineers can investigate before the fallback’s limits are reached.
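Several of these considerations can be sketched together: classifying which responses count as failures, keeping one circuit state per dependency and reporting it in a health payload, and refusing to retry a call the breaker rejected. All names below (counts_as_failure, BreakerRegistry, call_with_retry, the dependency names, and the payload shape) are hypothetical, chosen only for illustration.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


def counts_as_failure(status_code=None, timed_out=False):
    """Timeouts and 5xx responses count; 4xx client errors do not."""
    if timed_out:
        return True
    return status_code is not None and status_code >= 500


class BreakerRegistry:
    """Tracks one circuit state per downstream dependency."""

    def __init__(self):
        self._states = {}

    def set_state(self, dependency, state):
        self._states[dependency] = state

    def health(self):
        """Assumed payload shape for a /health endpoint."""
        degraded = any(s != "closed" for s in self._states.values())
        return {"status": "degraded" if degraded else "ok",
                "circuits": dict(self._states)}


def call_with_retry(func, max_retries=1):
    """Retry ordinary failures, but never a call rejected by an open circuit."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return func()
        except CircuitOpenError:
            raise                     # retrying would only add load
        except Exception:
            if attempts > max_retries:
                raise


registry = BreakerRegistry()
registry.set_state("payments", "closed")
registry.set_state("ratings", "open")
report = registry.health()
```

Here the open circuit to the hypothetical ratings dependency marks the whole service as degraded while leaving the payments circuit untouched, which is the point of scoping breakers per dependency.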