Circuit Breaker

Stop calling a failing dependency before it takes you down with it. Closed → Open → Half-Open.

TL;DR

A Circuit Breaker wraps a remote call and tracks recent failures. When failures cross a threshold, it "opens": for a cooldown period, every subsequent call fails *immediately* without hitting the dependency. After the cooldown it goes "half-open" and lets a small number of trial calls through (often just one). If they succeed, the breaker closes; if they fail, it opens again. This pattern prevents cascading failures, kills retry storms, and gives a struggling downstream room to recover.

When to use

In front of any remote dependency that can fail or slow down — databases, external APIs, payment processors, downstream microservices, third-party LLMs. Especially critical when many requests share a small set of dependencies — without a breaker, one slow downstream exhausts your thread pool and you go down too.

Why this pattern is the resilience primitive

In a microservice fleet, every service has dependencies — databases, caches, other services, third-party APIs. When one of those dependencies starts failing or slowing down, the failure spreads in a predictable, awful pattern:

  1. Calls to the bad dependency start timing out (say, 30s).
  2. Those threads in the calling service are stuck waiting.
  3. New incoming requests pile up because there are no free threads.
  4. The calling service starts rejecting requests or running out of memory.
  5. Its callers now see it as the bad dependency. The pattern repeats up the call chain.

By failing fast during the dependency’s bad period, the calling service:

  • Frees up threads (no more 30s timeouts holding them hostage).
  • Gives the bad dependency fewer requests to chew on.
  • Stays responsive to other (healthy) work.
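
The thread-starvation arithmetic behind this can be sketched in a few lines. Only the 30s timeout comes from the text above; the pool size, the healthy latency, and the fail-fast cost are illustrative assumptions, and the model assumes each in-flight call holds exactly one thread.

```python
def max_throughput(pool_size: int, seconds_per_call: float) -> float:
    """Upper bound on requests/second when each call holds one thread."""
    return pool_size / seconds_per_call


POOL_SIZE = 200          # assumed worker-thread count (illustrative)
HEALTHY_LATENCY = 0.05   # assumed 50 ms per call while the dependency is healthy
TIMEOUT = 30.0           # the 30s timeout from the list above
FAIL_FAST = 0.001        # assumed ~1 ms to reject a call locally

healthy = max_throughput(POOL_SIZE, HEALTHY_LATENCY)  # about 4000 req/s
stuck = max_throughput(POOL_SIZE, TIMEOUT)            # fewer than 7 req/s
rejecting = max_throughput(POOL_SIZE, FAIL_FAST)      # far above the healthy rate

print(f"healthy:   {healthy:.0f} req/s")
print(f"timing out: {stuck:.1f} req/s")
print(f"fail fast: {rejecting:.0f} req/s")
```

Under these assumptions the same pool that served thousands of requests per second can serve only a handful once every call waits out the full timeout, which is why rejecting locally keeps the service responsive.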

The three-state machine

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate ≥ threshold<br/>(e.g. 50% of last 20 calls)
    Open --> HalfOpen: cooldown elapsed<br/>(e.g. 10s)
    HalfOpen --> Closed: trial calls succeed
    HalfOpen --> Open: trial calls fail<br/>(reset cooldown, possibly longer)

    note right of Closed
        Calls pass through.
        Failures counted in
        sliding window.
    end note
    note right of Open
        Calls fail fast.
        No actual call made.
        Fallback returned.
    end note
    note right of HalfOpen
        Limited trial calls
        (e.g. 3) test recovery
        without flooding.
    end note
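
The state machine above can be sketched as a small class. This is a minimal, single-threaded illustration, not a production implementation: for brevity it trips on consecutive failures rather than a sliding-window failure rate, it has no locking, and every name in it (`CircuitBreaker`, `CircuitOpenError`, the constructor parameters) is hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=10.0, trial_calls=3,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to trip
        self.cooldown = cooldown                    # seconds to stay Open
        self.trial_calls = trial_calls              # successes needed in Half-Open
        self.clock = clock                          # injectable for tests
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast; dependency not called")
            # Cooldown elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failure in Half-Open reopens; in Closed, wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()  # restart the cooldown
            self.failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the failure streak
```

A caller would wrap each remote call as `breaker.call(fetch, arg)` and treat `CircuitOpenError` as the signal to return a fallback. The injectable `clock` is a design choice that lets tests advance time without sleeping.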

Why retries make things worse — without a breaker

flowchart TD
    subgraph Without [Without circuit breaker]
        D1[Bad downstream] -->|slow / fails| C1[1000 callers]
        C1 -->|3× retries each| D1
        D1 -->|now up to 4000 reqs/s<br/>was 1000| Dead[Total outage]
    end

    subgraph With [With circuit breaker]
        D2[Bad downstream] -->|fails| C2[1000 callers]
        C2 -->|breaker opens| Fast[Fail fast<br/>locally]
        Fast -.no traffic.-> D2
        D2 -->|recovery time| Heal[Recovers]
    end

    style D1 fill:#7e1d1d,stroke:#ef4444,color:#fff
    style C1 fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Dead fill:#7f1d1d,stroke:#f43f5e,color:#fff
    style D2 fill:#9a3412,stroke:#f97316,color:#fff
    style C2 fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Fast fill:#0e7490,stroke:#06b6d4,color:#fff
    style Heal fill:#365314,stroke:#84cc16,color:#fff
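
The amplification in the diagram can be worked through numerically. The sketch below assumes the worst case, where every attempt fails and each caller sends one original attempt plus its retries; the function names are hypothetical.

```python
def retry_load(callers_per_sec: int, retries_per_call: int) -> tuple[int, int]:
    """Return (total, extra) requests/second when every attempt fails."""
    extra = callers_per_sec * retries_per_call
    return callers_per_sec + extra, extra


def load_with_breaker(callers_per_sec: int, retries_per_call: int,
                      breaker_open: bool) -> int:
    """Requests/second that actually reach the downstream."""
    if breaker_open:
        return 0  # calls fail fast locally; nothing is sent
    total, _ = retry_load(callers_per_sec, retries_per_call)
    return total


total, extra = retry_load(callers_per_sec=1000, retries_per_call=3)
print(f"without breaker: {total} req/s ({extra} of them retries)")
print(f"breaker open:    {load_with_breaker(1000, 3, breaker_open=True)} req/s")
```

With 1000 callers and three retries each, the retries alone add 3000 requests per second on top of the original 1000. An open breaker drops the offered load to zero until the Half-Open trial calls begin.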

Configuration shape (Resilience4j-style)

slidingWindowSize: 20             # last 20 calls
minimumNumberOfCalls: 10          # need this many before tripping
failureRateThreshold: 50          # open when ≥50% failures
waitDurationInOpenState: 10s      # cooldown
permittedNumberOfCallsInHalfOpenState: 3
slowCallDurationThreshold: 2s     # treat slow as a failure
slowCallRateThreshold: 50
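
To make the interplay of these settings concrete, here is a rough sketch of how a count-based sliding window might evaluate them. It is not Resilience4j's code: the class and field names are hypothetical snake_case mirrors of the keys above, and it assumes a call counts as slow when it takes longer than the duration threshold.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    sliding_window_size: int = 20
    minimum_number_of_calls: int = 10
    failure_rate_threshold: float = 50.0        # percent
    slow_call_duration_threshold: float = 2.0   # seconds
    slow_call_rate_threshold: float = 50.0      # percent


class SlidingWindow:
    def __init__(self, config: BreakerConfig):
        self.config = config
        # Each entry is (failed, slow); old entries fall off automatically.
        self.calls = deque(maxlen=config.sliding_window_size)

    def record(self, failed: bool, duration: float) -> None:
        slow = duration > self.config.slow_call_duration_threshold
        self.calls.append((failed, slow))

    def should_open(self) -> bool:
        n = len(self.calls)
        if n < self.config.minimum_number_of_calls:
            return False  # too few samples to trust the rate
        failure_rate = 100.0 * sum(failed for failed, _ in self.calls) / n
        slow_rate = 100.0 * sum(slow for _, slow in self.calls) / n
        return (failure_rate >= self.config.failure_rate_threshold
                or slow_rate >= self.config.slow_call_rate_threshold)
```

Note that the slow-call check lets the breaker trip on a dependency that still returns successes but has become too slow to be useful.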

Where the breaker lives

flowchart LR
    Caller[Calling service]

    subgraph A [Option A: in-process library]
        L[Resilience4j / Polly<br/><i>per call site</i>]
    end

    subgraph B [Option B: sidecar / service mesh]
        SC[Istio / Linkerd / Envoy<br/><i>outlier detection</i>]
    end

    subgraph C [Option C: API gateway]
        GW[Kong / NGINX / AWS API GW<br/><i>at the edge</i>]
    end

    Caller -.A: app code.-> L --> Down[Downstream]
    Caller -.B: language-agnostic.-> SC --> Down
    Caller -.C: edge boundary.-> GW --> Down

    style Caller fill:#1c2333,stroke:#475569,color:#e7eaf1
    style L fill:#0e7490,stroke:#06b6d4,color:#fff
    style SC fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style GW fill:#581c87,stroke:#a855f7,color:#fff
    style Down fill:#9a3412,stroke:#f97316,color:#fff

Real systems

  • Netflix Hystrix: the pattern’s poster child, now in maintenance mode; its own README points new projects to Resilience4j.
  • Istio / Envoy: outlier detection ejects unhealthy upstreams from the load-balancer pool, which acts as a distributed Circuit Breaker.
  • AWS SDKs: the standard and adaptive retry modes draw retries from a client-side token bucket, so retries stop once too many calls have failed. This is circuit breaking applied to retries rather than a full three-state breaker.
  • Kubernetes readiness probes: pods marked “not ready” are removed from service endpoints; same idea, different scope.

Fallbacks — the natural pairing

A fallback is what you return when the breaker is open. Not part of Circuit Breaker proper, but pairs perfectly:

  • Cached value — last known good response
  • Default response — empty list, “trending now” instead of personalized
  • Degraded experience — read-only mode, “service unavailable” with retry hint
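
The first two fallback styles can be combined in one wrapper, sketched below. Everything here is hypothetical: `get_recommendations`, the `fetch_personalized` callable, the dict used as a cache, and the `DEFAULT_TRENDING` list are illustrative names, and `CircuitOpenError` stands in for whatever exception your breaker raises when it is open.

```python
class CircuitOpenError(Exception):
    """Stand-in for the error an open breaker raises."""


DEFAULT_TRENDING = ("trending-1", "trending-2")  # assumed generic default


def get_recommendations(user_id, fetch_personalized, cache):
    """Try the real call; fall back to cache, then to a generic default."""
    try:
        result = fetch_personalized(user_id)
    except (CircuitOpenError, TimeoutError):
        if user_id in cache:
            return cache[user_id]       # last known good response
        return list(DEFAULT_TRENDING)   # default response
    cache[user_id] = result             # remember for the next outage
    return result
```

Catching only the breaker and timeout errors is deliberate in this sketch: other exceptions, such as a bug in the caller, still surface instead of being masked by the fallback.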

Trade-offs

Pros: Prevents cascading failures; lets struggling dependencies recover; keeps the calling service responsive; pairs perfectly with retries-with-backoff.

🎯 Common interview questions

Q1. Why not just retry the failed call?

Retries *amplify* load on a struggling dependency. If 1000 callers each retry 3× when the downstream is dying, you've sent 3000 extra requests at exactly the wrong time — turning a partial outage into a total one. Circuit Breaker does the opposite — when failure is high, it sends *fewer* requests, giving the downstream room to recover. Retries and Circuit Breakers complement each other — retry on transient failures, but only when the breaker is closed.
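
"Retry only when the breaker is closed" might look like the sketch below. `TinyBreaker` is a deliberately stripped-down stand-in (consecutive failures, no cooldown or Half-Open state) so the example stays self-contained; the backoff uses exponential delays with full jitter, and all names and defaults are assumptions.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised instead of retrying when the breaker is open."""


class TinyBreaker:
    """Stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.threshold

    def record(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def call_with_retry(fn, breaker, max_retries=3, base_delay=0.1, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        if breaker.is_open:
            # Stop adding load: no attempt and no further retries.
            raise CircuitOpenError("breaker open; not retrying")
        try:
            result = fn()
        except Exception:
            breaker.record(False)
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            # Exponential backoff with full jitter before the next attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
        else:
            breaker.record(True)
            return result
```

Because the breaker is checked before every attempt, a retry loop that started while the breaker was closed stops as soon as the breaker trips, rather than finishing its remaining retries against a failing dependency.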

Q2. Walk me through the three states.

Closed — calls pass through normally; failures are counted in a sliding window. Open — calls fail fast (no actual call made) for a cooldown period. Half-Open — one (or a small number of) trial call(s) allowed; if they succeed the breaker returns to Closed, if they fail it returns to Open with a possibly longer cooldown. The Half-Open state is what makes recovery automatic instead of requiring a human to flip a switch.

Q3. How do you set the failure threshold?

Two common shapes: (1) consecutive failures (open after N failures in a row), and (2) failure rate over a sliding window (open at ≥50% failures in the last 20 calls). Rate-based is more robust: a healthy dependency, even at 30% load, can have momentary blips that aren't worth tripping on, and a single success resets a consecutive counter even while most calls are failing. Combine it with a *minimum request count* so that one failed call out of one doesn't register as a 100% failure rate and trip the breaker.
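
One way to see the difference between the two shapes is to replay the same outcome sequences through both. The functions below are hypothetical helpers, with N = 5 for the consecutive rule and the window, minimum count, and threshold chosen to match the numbers in the answer.

```python
from collections import deque


def trips_consecutive(outcomes, n=5):
    """True if `n` failures in a row ever occur (True = success)."""
    streak = 0
    for ok in outcomes:
        streak = 0 if ok else streak + 1
        if streak >= n:
            return True
    return False


def trips_rate(outcomes, window=20, min_calls=10, threshold=0.5):
    """True if the windowed failure rate ever reaches `threshold`."""
    recent = deque(maxlen=window)
    for ok in outcomes:
        recent.append(ok)
        if len(recent) >= min_calls:
            if recent.count(False) / len(recent) >= threshold:
                return True
    return False


# Failing 4 calls out of every 5: each success resets the streak.
degraded = [False, False, False, False, True] * 8
# A single failed call with no other traffic.
lone_failure = [False]
# One blip after 19 healthy calls.
blip = [True] * 19 + [False]

for name, seq in [("degraded", degraded), ("lone failure", lone_failure), ("blip", blip)]:
    print(name, trips_consecutive(seq), trips_rate(seq))
```

In this sketch the consecutive rule never trips on the degraded sequence even though 80% of calls fail, while the rate rule does. The minimum request count keeps the lone failure from tripping the rate rule.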

Q4. What's a fallback, and is it part of Circuit Breaker?

A fallback is what you return when the breaker is open. It's a separate pattern that pairs naturally — return a cached value, a default response, or a degraded experience instead of an error. Hystrix popularized "Circuit Breaker + Fallback" together; Netflix uses fallbacks aggressively (e.g. "trending now" if personalized recommendations time out).

Q5. Where does the breaker live in a microservice?

Three options — (1) per-call site in the calling service (Resilience4j, Polly), (2) in a sidecar/service mesh (Istio, Linkerd), (3) in an API gateway in front of the downstream. Sidecar/mesh is the modern default — language-agnostic, centrally configured, doesn't pollute app code. App-level is best when the fallback logic is rich and business-specific.
