Circuit Breaker — System Design

TL;DR

A Circuit Breaker wraps a remote call and tracks recent failures. When the failure rate crosses a threshold, it "opens": for a cooldown period, every subsequent call fails *immediately* without hitting the dependency. After the cooldown it goes "half-open" and lets a small number of trial calls through. If they succeed, it closes; if they fail, it opens again. This pattern prevents cascading failures, damps retry storms, and gives a struggling downstream room to recover.

When to use

In front of any remote dependency that can fail or slow down — databases, external APIs, payment processors, downstream microservices, third-party LLMs. Especially critical when many requests share a small set of dependencies — without a breaker, one slow downstream exhausts your thread pool and you go down too.

Why this pattern is the resilience primitive

In a microservice fleet, every service has dependencies — databases, caches, other services, third-party APIs. When one of those dependencies starts failing or slowing down, the failure spreads in a predictable, awful pattern:

Calls to the bad dependency start timing out (say, 30s).
Those threads in the calling service are stuck waiting.
New incoming requests pile up because there are no free threads.
The calling service starts rejecting requests or running out of memory.
Its callers now see it as the bad dependency. The pattern repeats up the call chain.

A circuit breaker interrupts this chain. By failing fast during the dependency’s bad period, the calling service:

Frees up threads (no more 30s timeouts holding them hostage).
Gives the bad dependency fewer requests to chew on.
Lets the calling service stay responsive to other (healthy) work.
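
The thread-exhaustion argument above can be checked with rough arithmetic. The sketch below applies Little's law (average concurrency = arrival rate × time per call). The request rate, pool size, and the 1 ms fail-fast cost are hypothetical numbers chosen for illustration; only the 30s timeout comes from the text.

```python
def threads_in_flight(requests_per_second: float, seconds_per_call: float) -> float:
    """Little's law: average number of threads held by calls to one dependency."""
    return requests_per_second * seconds_per_call


# Hypothetical service: 50 req/s go to the dependency, 200 worker threads total.
POOL_SIZE = 200
RATE = 50.0

healthy = threads_in_flight(RATE, 0.1)          # 100 ms calls: about 5 threads busy
timing_out = threads_in_flight(RATE, 30.0)      # every call waits out the 30s timeout
failing_fast = threads_in_flight(RATE, 0.001)   # breaker open: ~1 ms local rejection

# With 30s timeouts the demand (1500 threads) far exceeds the pool,
# so unrelated requests queue behind the stuck ones.
pool_exhausted = timing_out > POOL_SIZE
```

Under these assumptions, failing fast drops the dependency's share of the pool to well under one thread, which is why the rest of the service stays responsive.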

The three-state machine

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate ≥ threshold<br/>(e.g. 50% of last 20 calls)
    Open --> HalfOpen: cooldown elapsed<br/>(e.g. 10s)
    HalfOpen --> Closed: trial calls succeed
    HalfOpen --> Open: trial calls fail<br/>(reset cooldown, possibly longer)

    note right of Closed
        Calls pass through.
        Failures counted in
        sliding window.
    end note
    note right of Open
        Calls fail fast.
        No actual call made.
        Fallback returned.
    end note
    note right of HalfOpen
        Limited trial calls
        (e.g. 3) test recovery
        without flooding.
    end note
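
The state machine above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not how any particular library implements it. The class and parameter names (`CircuitBreaker`, `CircuitOpenError`, `trial_calls`, `clock`) are made up for this example, and it assumes that one failed trial call is enough to reopen the breaker.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, window=20, min_calls=10, failure_rate=0.5,
                 cooldown=10.0, trial_calls=3, clock=time.monotonic):
        self.window = deque(maxlen=window)  # True = failure, False = success
        self.min_calls = min_calls
        self.failure_rate = failure_rate
        self.cooldown = cooldown
        self.trial_calls = trial_calls
        self.clock = clock                  # injectable so tests can fake time
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("breaker open: failing fast")
            # Cooldown elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()  # assumption: a single failed trial reopens the breaker
            return
        self.window.append(True)
        enough_calls = len(self.window) >= self.min_calls
        if enough_calls and sum(self.window) / len(self.window) >= self.failure_rate:
            self._open()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.window.clear()  # start counting afresh
            return
        self.window.append(False)

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.window.clear()
```

A production breaker would also need locking and a cap on concurrent trial calls in the half-open state; both are left out here to keep the transitions readable.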

Why retries make things worse — without a breaker

flowchart TD
    subgraph Without [Without circuit breaker]
        D1[Bad downstream] -->|slow / fails| C1[1000 callers]
        C1 -->|3× retries each| D1
        D1 -->|now 3000 reqs/s<br/>was 1000| Dead[Total outage]
    end

    subgraph With [With circuit breaker]
        D2[Bad downstream] -->|fails| C2[1000 callers]
        C2 -->|breaker opens| Fast[Fail fast<br/>locally]
        Fast -.no traffic.-> D2
        D2 -->|recovery time| Heal[Recovers]
    end

    style D1 fill:#7e1d1d,stroke:#ef4444,color:#fff
    style C1 fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Dead fill:#7f1d1d,stroke:#f43f5e,color:#fff
    style D2 fill:#9a3412,stroke:#f97316,color:#fff
    style C2 fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Fast fill:#0e7490,stroke:#06b6d4,color:#fff
    style Heal fill:#365314,stroke:#84cc16,color:#fff
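
The amplification in the diagram can be reproduced with a toy count. The sketch below assumes a downstream that fails every call, reads "3× retries each" as three attempts per request (which matches the diagram's 3000), and models the breaker as a shared consecutive-failure counter with no cooldown. That is a deliberate simplification of the rate-based window described earlier, and the function name is hypothetical.

```python
from typing import Optional


def hits_on_downstream(requests: int, attempts_per_request: int,
                       trip_after_failures: Optional[int]) -> int:
    """Count calls that actually reach a downstream that fails every call.

    trip_after_failures=None means no breaker. Otherwise a shared breaker
    opens after that many consecutive failures and stays open.
    """
    hits = 0
    consecutive_failures = 0
    for _ in range(requests):
        for _ in range(attempts_per_request):
            breaker_open = (trip_after_failures is not None
                            and consecutive_failures >= trip_after_failures)
            if breaker_open:
                break  # fail fast locally; remaining attempts are skipped too
            hits += 1                  # the call reaches the downstream...
            consecutive_failures += 1  # ...and fails
    return hits


without_breaker = hits_on_downstream(1000, 3, None)  # 3000 calls hit the downstream
with_breaker = hits_on_downstream(1000, 3, 10)       # 10 calls, then nothing
```

In this toy model the breaker cuts traffic to the failing downstream from 3000 calls to the 10 it took to trip.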

Configuration shape (Resilience4j-style)

slidingWindowSize: 20             # last 20 calls
minimumNumberOfCalls: 10          # need this many before tripping
failureRateThreshold: 50          # open when ≥50% failures
waitDurationInOpenState: 10s      # cooldown
permittedNumberOfCallsInHalfOpenState: 3
slowCallDurationThreshold: 2s     # calls slower than this count as slow
slowCallRateThreshold: 50         # open when ≥50% of calls are slow
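
To make the interaction between these settings concrete, here is a rough sketch of the trip decision for a count-based window. It is not Resilience4j's implementation: the `Config` dataclass and `should_open` function are hypothetical, and it assumes a call counts as slow when its duration exceeds the threshold.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Config:
    sliding_window_size: int = 20
    minimum_number_of_calls: int = 10
    failure_rate_threshold: float = 50.0       # percent
    slow_call_duration_threshold: float = 2.0  # seconds
    slow_call_rate_threshold: float = 50.0     # percent


def should_open(outcomes, cfg: Config) -> bool:
    """outcomes: iterable of (failed, duration_seconds) pairs, oldest first."""
    recent = deque(outcomes, maxlen=cfg.sliding_window_size)  # keep the last N calls
    total = len(recent)
    if total < cfg.minimum_number_of_calls:
        return False  # too few calls to judge, whatever the rate
    failure_rate = 100.0 * sum(failed for failed, _ in recent) / total
    slow_rate = 100.0 * sum(
        duration > cfg.slow_call_duration_threshold for _, duration in recent
    ) / total
    # Either threshold alone is enough to trip the breaker.
    return (failure_rate >= cfg.failure_rate_threshold
            or slow_rate >= cfg.slow_call_rate_threshold)
```

With the defaults above, nine straight failures do not trip the breaker (below the minimum of 10 calls), while five failures out of ten do.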

Where the breaker lives

flowchart LR
    Caller[Calling service]

    subgraph A [Option A: in-process library]
        L[Resilience4j / Polly<br/><i>per call site</i>]
    end

    subgraph B [Option B: sidecar / service mesh]
        SC[Istio / Linkerd / Envoy<br/><i>outlier detection</i>]
    end

    subgraph C [Option C: API gateway]
        GW[Kong / NGINX / AWS API GW<br/><i>at the edge</i>]
    end

    Caller -.A: app code.-> L --> Down[Downstream]
    Caller -.B: language-agnostic.-> SC --> Down
    Caller -.C: edge boundary.-> GW --> Down

    style Caller fill:#1c2333,stroke:#475569,color:#e7eaf1
    style L fill:#0e7490,stroke:#06b6d4,color:#fff
    style SC fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style GW fill:#581c87,stroke:#a855f7,color:#fff
    style Down fill:#9a3412,stroke:#f97316,color:#fff

Real systems

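Netflix Hystrix: the pattern’s poster child, now in maintenance mode; its own README points new projects to Resilience4j.
Istio / Envoy: outlier detection ejects unhealthy upstream hosts from the load-balancer pool, which acts as a per-host Circuit Breaker enforced at the proxy. (Envoy’s feature named “circuit breaking” is a separate cap on connections and pending requests.)
AWS SDKs: the standard and adaptive retry modes draw retries from a token bucket (a retry quota), so retries stop when failures pile up; a circuit-breaker-like safeguard.
Kubernetes readiness probes: pods marked “not ready” are removed from service endpoints; same idea, different scope.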

Fallbacks — the natural pairing

A fallback is what you return when the breaker is open. It is not part of the Circuit Breaker pattern proper, but the two pair naturally:

Cached value — last known good response
Default response — empty list, “trending now” instead of personalized
Degraded experience — read-only mode, “service unavailable” with retry hint
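
The first two options can be combined in one small wrapper. This is an illustrative sketch under stated assumptions: `with_fallback`, `BreakerOpen`, and the plain dict cache are hypothetical names, and `fetch` stands in for whatever breaker-wrapped call the service makes.

```python
class BreakerOpen(Exception):
    """Hypothetical error raised by a breaker that is failing fast."""


def with_fallback(fetch, cache: dict, key, default):
    """Try fetch(key); on failure return the last known good value, else a default."""
    try:
        value = fetch(key)
    except Exception:  # covers BreakerOpen as well as real downstream errors
        return cache.get(key, default)
    cache[key] = value  # remember the last known good response
    return value


def failing_fetch(key):
    raise BreakerOpen("open")


cache = {}
fresh = with_fallback(lambda key: ["a", "b"], cache, "recs:42", [])  # live value, now cached
stale = with_fallback(failing_fetch, cache, "recs:42", [])           # served from cache
empty = with_fallback(failing_fetch, cache, "recs:99", [])           # nothing cached: default
```

Catching every exception is a simplification; real code would usually fall back only on errors it expects and let programming bugs surface.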

Trade-offs

Pros: Prevents cascading failures; lets struggling dependencies recover; keeps the calling service responsive; pairs perfectly with retries-with-backoff.