A Circuit Breaker wraps a remote call and tracks recent failures. When failures cross a threshold, it "opens" — every subsequent call fails *immediately* without hitting the dependency, for a cooldown period. After cooldown it goes "half-open" — letting one trial call through. Success closes it; failure opens it again. This pattern prevents cascading failures, kills retry storms, and gives a struggling downstream room to recover.
In front of any remote dependency that can fail or slow down — databases, external APIs, payment processors, downstream microservices, third-party LLMs. Especially critical when many requests share a small set of dependencies — without a breaker, one slow downstream exhausts your thread pool and you go down too.
See it
Simulate it
Why this pattern is the resilience primitive
In a microservice fleet, every service has dependencies — databases, caches, other services, third-party APIs. When one of those dependencies starts failing or slowing down, the failure spreads in a predictable, awful pattern:
- Calls to the bad dependency start timing out (say, 30s).
- Those threads in the calling service are stuck waiting.
- New incoming requests pile up because there are no free threads.
- The calling service starts rejecting requests or running out of memory.
- Its callers now see it as the bad dependency. The pattern repeats up the call chain.
By failing fast during the dependency’s bad period, the calling service:
- Frees up threads (no more 30s timeouts holding them hostage).
- Gives the bad dependency fewer requests to chew on.
- Lets the calling service stay responsive to other (healthy) work.
The three-state machine
stateDiagram-v2
[*] --> Closed
Closed --> Open: failure rate ≥ threshold<br/>(e.g. 50% of last 20 calls)
Open --> HalfOpen: cooldown elapsed<br/>(e.g. 10s)
HalfOpen --> Closed: trial calls succeed
HalfOpen --> Open: trial calls fail<br/>(reset cooldown, possibly longer)
note right of Closed
Calls pass through.
Failures counted in
sliding window.
end note
note right of Open
Calls fail fast.
No actual call made.
Fallback returned.
end note
note right of HalfOpen
Limited trial calls
(e.g. 3) test recovery
without flooding.
end note
Why retries make things worse — without a breaker
flowchart TD
subgraph Without [Without circuit breaker]
D1[Bad downstream] -->|slow / fails| C1[1000 callers]
C1 -->|3× retries each| D1
D1 -->|now 3000 reqs/s<br/>was 1000| Dead[Total outage]
end
subgraph With [With circuit breaker]
D2[Bad downstream] -->|fails| C2[1000 callers]
C2 -->|breaker opens| Fast[Fail fast<br/>locally]
Fast -.no traffic.-> D2
D2 -->|recovery time| Heal[Recovers]
end
style D1 fill:#7e1d1d,stroke:#ef4444,color:#fff
style C1 fill:#1c2333,stroke:#475569,color:#e7eaf1
style Dead fill:#7f1d1d,stroke:#f43f5e,color:#fff
style D2 fill:#9a3412,stroke:#f97316,color:#fff
style C2 fill:#1c2333,stroke:#475569,color:#e7eaf1
style Fast fill:#0e7490,stroke:#06b6d4,color:#fff
style Heal fill:#365314,stroke:#84cc16,color:#fff
Configuration shape (Resilience4j-style)
slidingWindowSize: 20 # last 20 calls
minimumNumberOfCalls: 10 # need this many before tripping
failureRateThreshold: 50 # open when ≥50% failures
waitDurationInOpenState: 10s # cooldown
permittedNumberOfCallsInHalfOpenState: 3
slowCallDurationThreshold: 2s # treat slow as a failure
slowCallRateThreshold: 50
Where the breaker lives
flowchart LR
Caller[Calling service]
subgraph A [Option A: in-process library]
L[Resilience4j / Polly<br/><i>per call site</i>]
end
subgraph B [Option B: sidecar / service mesh]
SC[Istio / Linkerd / Envoy<br/><i>outlier detection</i>]
end
subgraph C [Option C: API gateway]
GW[Kong / NGINX / AWS API GW<br/><i>at the edge</i>]
end
Caller -.A: app code.-> L --> Down[Downstream]
Caller -.B: language-agnostic.-> SC --> Down
Caller -.C: edge boundary.-> GW --> Down
style Caller fill:#1c2333,stroke:#475569,color:#e7eaf1
style L fill:#0e7490,stroke:#06b6d4,color:#fff
style SC fill:#1e3a8a,stroke:#3b82f6,color:#fff
style GW fill:#581c87,stroke:#a855f7,color:#fff
style Down fill:#9a3412,stroke:#f97316,color:#fff
Real systems
- Netflix Hystrix — the pattern’s poster child, now retired in favor of Resilience4j.
- Istio / Envoy — outlier detection ejects unhealthy upstreams from the load-balancer pool, which is a distributed Circuit Breaker.
- AWS SDKs — adaptive retry mode includes per-host circuit-breaking under the hood.
- Kubernetes readiness probes — pods marked “not ready” are removed from service endpoints; same idea, different scope.
Fallbacks — the natural pairing
A fallback is what you return when the breaker is open. Not part of Circuit Breaker proper, but pairs perfectly:
- Cached value — last known good response
- Default response — empty list, “trending now” instead of personalized
- Degraded experience — read-only mode, “service unavailable” with retry hint
Trade-offs
Pros: Prevents cascading failures; lets struggling dependencies recover; keeps the calling service responsive; pairs perfectly with retries-with-backoff.
Comments 0
Discuss this page. Markdown supported. Be kind.