
💳

System Design

Payment System

How a tap on Pay travels through gateways, fraud checks, and settlement in under a second.

TL;DR

A modern payment is a tightly orchestrated sequence across an API Gateway, an idempotency key in Redis, a PSP like Stripe, a state machine driven by webhooks, a double-entry ledger, a transactional database write, a Kafka event, and a nightly reconciliation job. The buyer sees a result in under a second, and the money moves exactly once.

When to use

Use this architecture any time you're moving money or running any "must-not-double-execute" operation: refunds, transfers, subscription billing, in-app purchases, even point-redemption systems. The same primitives (idempotency keys, webhooks, ledgers, reconciliation) apply whenever **exactly-once effects** matter more than throughput.

The full story

You tap Pay on your phone. Within a second your screen shows a green tick. But behind that single tap, eight independent systems just coordinated to move money exactly once, audit it permanently, and keep your phone, the merchant, the bank, and a dozen downstream services in sync.

Here’s what actually happens.

The full architecture at a glance

flowchart TD
    U[Buyer's app] -->|POST /payments<br/>+ idempotency key| GW[API Gateway<br/><i>auth · rate-limit</i>]
    GW --> PS[Payment Service]
    PS <-->|"check / SET<br/>idempotency key"| Redis[(Redis<br/>idempotency cache<br/>24h TTL)]
    PS -->|"charge"| PSP[PSP<br/>Stripe / Adyen / Razorpay]
    PSP -.async.-> WH[Webhook receiver]
    WH --> PS
    PS --> DB[(Database<br/>payments table +<br/>ledger entries +<br/>outbox)]
    DB -.poll.-> OBP[Outbox publisher]
    OBP -->|payment.completed| K{{Kafka}}
    K --> N[Notifications]
    K --> F[Fraud]
    K --> A[Accounting]
    K --> An[Analytics]
    K --> L[Loyalty]
    K --> M[Merchant dash]
    Sweep[Stuck-state sweeper] -.poll PSP.-> PS
    Recon[Nightly reconciliation] -.compare.-> DB
    Recon -.compare.-> PSP

    style U fill:#1c2333,stroke:#475569,color:#e7eaf1
    style GW fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style PS fill:#0e7490,stroke:#06b6d4,color:#fff
    style Redis fill:#7e1d1d,stroke:#ef4444,color:#fff
    style PSP fill:#581c87,stroke:#a855f7,color:#fff
    style WH fill:#581c87,stroke:#a855f7,color:#fff
    style DB fill:#0f1320,stroke:#475569,color:#cdd3df
    style OBP fill:#9a3412,stroke:#f97316,color:#fff
    style K fill:#9a3412,stroke:#f97316,color:#fff
    style N fill:#365314,stroke:#84cc16,color:#fff
    style F fill:#365314,stroke:#84cc16,color:#fff
    style A fill:#365314,stroke:#84cc16,color:#fff
    style An fill:#365314,stroke:#84cc16,color:#fff
    style L fill:#365314,stroke:#84cc16,color:#fff
    style M fill:#365314,stroke:#84cc16,color:#fff
    style Sweep fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style Recon fill:#1e3a8a,stroke:#3b82f6,color:#fff

1. The request crosses the front door

The buyer’s app sends POST /payments to your API Gateway. Three things in that request matter more than the payment amount:

A bearer token identifying the buyer.
An idempotency key — a UUID generated on the client device.
The PSP-tokenized card (a reference, never the raw PAN).

The gateway authenticates, rate-limits the buyer, and forwards to the Payment Service. So far this looks like any HTTP request. The first interesting thing is the idempotency key.
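As a rough sketch of what the client side of this request might look like, the snippet below builds the request once and reuses the same idempotency key on a retry. The `Idempotency-Key` header name, the JSON field names, and the `build_payment_request` helper are illustrative assumptions, not something this design prescribes.

```python
import uuid
from typing import Optional


def build_payment_request(
    amount_minor: int,
    currency: str,
    card_token: str,
    bearer_token: str,
    idempotency_key: Optional[str] = None,
) -> dict:
    """Describe a POST /payments call (hypothetical field and header names).

    The key is generated once per payment attempt on the device. A retry of
    the same attempt must pass the original key back in.
    """
    key = idempotency_key or str(uuid.uuid4())
    return {
        "method": "POST",
        "path": "/payments",
        "headers": {
            "Authorization": f"Bearer {bearer_token}",
            "Idempotency-Key": key,  # assumed header name
        },
        "json": {
            "amount_minor": amount_minor,  # integer minor units, e.g. cents
            "currency": currency,
            "card_token": card_token,  # PSP token reference, never the raw PAN
        },
    }
```

A retry after a network error would call `build_payment_request(..., idempotency_key=first["headers"]["Idempotency-Key"])`, so the server sees the same key twice.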

2. The idempotency check — Redis as the bouncer

Before doing anything expensive, the Payment Service asks Redis a single question: has this idempotency key been seen before?

Three outcomes:

Redis says	Meaning	Action
`nil`	Fresh request	Atomically `SET key "PROCESSING" NX EX 86400` (24h TTL); proceed only if the set succeeds, so two concurrent retries cannot both pass
`"PROCESSING"`	Same request currently in flight	Wait briefly, then return 409 or poll
Result blob	Already completed	Return the cached response immediately
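The three outcomes in the table can be sketched as follows. This is a minimal illustration, not a production implementation: `InMemoryStore` is a stand-in that mimics the `get` and `set(nx=..., ex=...)` calls of a Redis client so the example runs without a server, and the `idem:` key prefix plus the `begin_request` and `finish_request` names are assumptions.

```python
import json
import time

TTL_SECONDS = 24 * 60 * 60  # the 24h TTL from the table
PROCESSING = "PROCESSING"


class InMemoryStore:
    """Stand-in for Redis with get/set(nx=, ex=) semantics (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and expires_at <= time.monotonic():
            del self._data[key]
            return None
        return value

    def set(self, key, value, nx=False, ex=None):
        if nx and self.get(key) is not None:
            return None  # NX refused: the key already exists
        expires_at = (time.monotonic() + ex) if ex is not None else None
        self._data[key] = (value, expires_at)
        return True


def begin_request(store, idempotency_key):
    """Return ("proceed", None), ("in_flight", None) or ("replay", cached)."""
    redis_key = f"idem:{idempotency_key}"
    # SET ... NX EX claims the key atomically, so two concurrent retries
    # cannot both see "fresh request".
    if store.set(redis_key, PROCESSING, nx=True, ex=TTL_SECONDS):
        return "proceed", None
    current = store.get(redis_key)
    if current is None or current == PROCESSING:
        # Still running (or it expired a moment ago): caller waits or gets 409.
        return "in_flight", None
    return "replay", json.loads(current)


def finish_request(store, idempotency_key, response):
    """Replace the PROCESSING marker with the final response blob."""
    store.set(f"idem:{idempotency_key}", json.dumps(response), ex=TTL_SECONDS)
```

With a real Redis client the values may come back as bytes unless the client is configured to decode responses, so the comparison against `PROCESSING` would need adjusting.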

3. The actual money movement — calling the PSP

Now the request goes to the Payment Service Provider — Stripe, Razorpay, Adyen, Braintree. This is the only step that touches real money. Everything else is bookkeeping.

The PSP call is the slowest and most failure-prone step in the whole flow:

It spans the public internet
It might involve 3D Secure (an extra browser redirect)
It can be queued behind issuer bank APIs
It can return after 60+ seconds
It can succeed at the bank while the response never reaches you
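One way to cope with these failure modes is to retry with the same idempotency key and to treat an exhausted retry budget as "outcome unknown" rather than "failed". The sketch below assumes a hypothetical `call_psp` callable and hypothetical `PspTimeout` and `PspUnavailable` exceptions; real PSP SDKs expose their own error types.

```python
import time
from typing import Callable


class PspTimeout(Exception):
    """Hypothetical: the PSP did not answer within our deadline."""


class PspUnavailable(Exception):
    """Hypothetical: the PSP answered with a 5xx style error."""


def charge_with_retries(
    call_psp: Callable[[str], dict],
    idempotency_key: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call the PSP, retrying transient failures with the SAME key.

    Reusing the key is what makes the retry safe: if an earlier attempt
    actually succeeded at the bank, the PSP can return that result
    instead of charging again.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_psp(idempotency_key)
        except (PspTimeout, PspUnavailable):
            if attempt < max_attempts:
                # Exponential backoff: 0.2s, 0.4s, 0.8s, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    # We cannot tell whether money moved. Leave the payment non-terminal
    # and let the webhook or a later status query settle it.
    return {"status": "PROCESSING", "reason": "outcome_unknown"}
```

Returning `PROCESSING` here, instead of `FAILED`, keeps the payment open so that a later signal from the PSP can still resolve it.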

4. The state machine and the webhook

Every payment moves through a strict state machine:

stateDiagram-v2
    [*] --> INITIATED
    INITIATED --> PROCESSING: PSP accepts
    PROCESSING --> AUTHORIZED: 3DS / issuer OK
    AUTHORIZED --> CAPTURED: capture call (sync or auto)
    CAPTURED --> SETTLED: end-of-day settlement
    INITIATED --> FAILED: PSP rejects
    PROCESSING --> FAILED: timeout / decline
    AUTHORIZED --> VOIDED: cancel before capture
    CAPTURED --> REFUNDED: refund call
    FAILED --> [*]
    VOIDED --> [*]
    REFUNDED --> [*]
    SETTLED --> [*]
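The diagram above maps naturally onto a transition table. The following is a minimal sketch of how the Payment Service might enforce it; the `ALLOWED` table mirrors the diagram, while the `advance` function and `IllegalTransition` exception are illustrative names.

```python
# Allowed transitions, copied from the state diagram.
ALLOWED = {
    "INITIATED": {"PROCESSING", "FAILED"},
    "PROCESSING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "VOIDED"},
    "CAPTURED": {"SETTLED", "REFUNDED"},
    # Terminal states have no outgoing edges.
    "FAILED": set(),
    "VOIDED": set(),
    "REFUNDED": set(),
    "SETTLED": set(),
}


class IllegalTransition(Exception):
    """Raised when an update would skip or reverse a step."""


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the diagram.

    Asking for the state the payment is already in is a no-op, which is
    what makes replayed updates harmless.
    """
    if current == target:
        return current
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current} -> {target} is not allowed")
    return target
```

For example, `advance("AUTHORIZED", "CAPTURED")` returns `"CAPTURED"`, while `advance("FAILED", "CAPTURED")` raises `IllegalTransition`.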

The PSP doesn’t always tell you the final state in the synchronous response. Many cards (especially with 3DS) only resolve asynchronously via a webhook — POST /webhooks/psp with the final outcome.

When the webhook arrives, the Payment Service:

Verifies the signature (HMAC of the body using the PSP’s secret).
Idempotently advances the state: replaying a webhook never moves the state machine a second time.
Writes the new state to the DB and emits an event.
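The signature check in the first step might look like the sketch below. It assumes HMAC-SHA256 with a hex-encoded signature; each PSP documents its own scheme (header name, encoding, and often a timestamp to block replays), so treat the details as assumptions.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check the signature against the raw, unparsed body.

    compare_digest runs in constant time, so an attacker cannot learn
    the expected signature byte by byte from response timing.
    """
    expected = sign(body, secret)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The check has to run on the exact bytes received. Parsing the JSON and re-serializing it before verification can change whitespace or key order and make a valid signature fail.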

5. The double-entry ledger

While the synchronous flow is happening, the Ledger is doing something accountants would recognize from 700 years ago. Every successful payment produces two equal-and-opposite postings:

Account	Debit	Credit
Buyer wallet	$20.00
Merchant wallet		$20.00

This isn’t redundant with the DB write; the ledger is the source of truth for money. A balance row in your DB can drift, get corrupted, or be silently wrong after a bug, whereas a balance derived from the journal can always be recomputed and audited.
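To make the idea concrete, here is a small sketch of a journal in which balances are derived by replaying postings rather than stored. The `Posting` type, the account names, and the convention that a wallet's balance is credits minus debits are assumptions for illustration. Amounts are integer minor units (cents) to avoid floating-point rounding.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Posting:
    payment_id: str
    account: str
    debit_minor: int = 0
    credit_minor: int = 0


def postings_for_payment(
    payment_id: str, buyer: str, merchant: str, amount_minor: int
) -> List[Posting]:
    """Two equal-and-opposite postings for one successful payment."""
    if amount_minor <= 0:
        raise ValueError("amount must be positive")
    return [
        Posting(payment_id, buyer, debit_minor=amount_minor),
        Posting(payment_id, merchant, credit_minor=amount_minor),
    ]


def is_balanced(journal: List[Posting]) -> bool:
    """Core invariant: total debits equal total credits."""
    debits = sum(p.debit_minor for p in journal)
    credits = sum(p.credit_minor for p in journal)
    return debits == credits


def balances(journal: List[Posting]) -> Dict[str, int]:
    """Derive every balance by replaying the journal (credits minus debits)."""
    totals: Dict[str, int] = defaultdict(int)
    for p in journal:
        totals[p.account] += p.credit_minor - p.debit_minor
    return dict(totals)
```

For the $20.00 payment in the table, the buyer wallet changes by -2000 and the merchant wallet by +2000, and the sum across all accounts stays at zero.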

6. The DB write and the Kafka event

The Payment Service now does a transactional write:

BEGIN;
  INSERT INTO payments (id, status, amount, ...);
  INSERT INTO ledger_entries (..., debit, ...);
  INSERT INTO ledger_entries (..., credit, ...);
  INSERT INTO outbox (event_type, payload);
COMMIT;

That last outbox insert is the transactional outbox pattern. A separate poller reads outbox rows and publishes them to Kafka.
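A compact sketch of the pattern, using SQLite from the Python standard library so it runs anywhere. The table layout, the `published` flag, and the `publish` callback (standing in for a Kafka producer) are assumptions; the ledger inserts are left out to keep the example short.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE payments (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            amount_minor INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def record_payment(conn: sqlite3.Connection, payment_id: str, amount_minor: int) -> None:
    """Write the payment and its event in ONE transaction."""
    payload = json.dumps({"payment_id": payment_id, "amount_minor": amount_minor})
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute(
            "INSERT INTO payments (id, status, amount_minor) VALUES (?, 'CAPTURED', ?)",
            (payment_id, amount_minor),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment.completed", payload),
        )


def publish_pending(conn: sqlite3.Connection, publish, batch_size: int = 100) -> int:
    """Poller: publish unpublished rows in order, then mark them done."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because the row is marked only after `publish` returns, a crash between the two steps leads to the event being published again on the next poll. The pattern therefore gives at-least-once delivery, not exactly-once.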

7. The fan-out via Kafka

Once payment.completed lands in Kafka, every downstream service reacts in parallel:

Notifications — SMS / email / push to buyer and merchant
Fraud — async review of the transaction in context with others
Accounting — adds to the daily settlement file
Analytics — feeds dashboards and ML training pipelines
Loyalty / rewards — credits points
Merchant dashboard — pushes a real-time row update
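Because a polling outbox publisher can deliver the same event more than once, each consumer in the list above generally needs to tolerate duplicates. The sketch below shows one way a loyalty consumer might do that. The `event_id` and `buyer_id` fields, the one-point-per-currency-unit rule, and the in-memory set are all illustrative assumptions; a real service would keep the processed IDs in durable storage.

```python
class LoyaltyConsumer:
    """Credits points for payment.completed events, at most once per event."""

    def __init__(self) -> None:
        self.processed_event_ids = set()
        self.points_by_buyer = {}

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_event_ids:
            return False  # redelivery: already credited, do nothing
        buyer = event["buyer_id"]
        earned = event["amount_minor"] // 100  # assumed rule: 1 point per unit
        self.points_by_buyer[buyer] = self.points_by_buyer.get(buyer, 0) + earned
        self.processed_event_ids.add(event_id)
        return True
```

Delivering the same event twice leaves the buyer with a single credit, which is the behaviour every consumer of `payment.completed` should aim for.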

8. The reconciliation job

Every night, a batch job pulls the previous day’s transactions from the PSP’s reporting API and compares them line-by-line against the local DB. Three buckets fall out:

Matched → archive, done
In PSP, not in our DB → we lost a webhook and the sweeper missed it (rare but real). Replay.
In our DB, not in PSP → we wrote a CAPTURED that the PSP doesn’t have. This is the scary one. Likely a bug. Page on-call.
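The three buckets reduce to set arithmetic over payment IDs. A minimal sketch, assuming both sides can be exported as collections of IDs; a fuller job would also compare amounts and statuses for the matched rows, which is not shown here.

```python
from typing import Dict, Iterable, List


def reconcile(psp_ids: Iterable[str], local_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Split one day's payments into the three reconciliation buckets."""
    psp, local = set(psp_ids), set(local_ids)
    return {
        "matched": sorted(psp & local),           # archive, done
        "in_psp_not_local": sorted(psp - local),  # lost webhook: replay
        "in_local_not_psp": sorted(local - psp),  # likely a bug: page on-call
    }
```

For instance, `reconcile(["p1", "p2", "p3"], ["p2", "p3", "p4"])` puts `p2` and `p3` in `matched`, `p1` in `in_psp_not_local`, and `p4` in `in_local_not_psp`.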

Why every piece is non-negotiable

A common pushback when designing this for the first time is: “Do I really need all eight services? Can’t I just call Stripe and write a row?”

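You can. And you’ll be fine until one of these happens: a client retry double-charges a buyer, a webhook never arrives, the PSP succeeds but your DB write fails, or a downstream service goes down in the middle of a payment.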

The architecture isn’t there because it’s pretty. It’s there because each piece neutralizes a specific class of failure that will happen at scale.

What changes at higher scale

The pattern stays the same. What scales:

Redis becomes a sharded cluster, with hot keys handled per shard
The Payment Service becomes regionally partitioned by buyer
The Ledger is sharded by account ID, with a global daily settlement
Kafka becomes multi-cluster, with MirrorMaker replicating topics for cross-region failover
Reconciliation runs hourly instead of nightly, with continuous streaming reconciliation on top

🎯 Common interview questions

Q1. What's an idempotency key and why is it critical for payments?

A unique client-generated ID attached to a payment request. The Payment Service stores `(key → result)` in a TTL-bounded store like Redis (24 hours in this design). If a retry arrives with the same key, the server returns the original outcome instead of charging again. It's the single most important defense against double charges in the face of network retries.

Q2. How do you handle a webhook that never arrives?

Don't trust the webhook as the only signal. Run a reconciliation job that periodically pulls transactions from the PSP's API and compares them against your database, plus a "stuck-state" sweeper that re-queries any payment that has been in a non-terminal state such as PROCESSING for more than N minutes. The webhook is a fast path; reconciliation is the truth.

Q3. Why use a double-entry ledger instead of just updating wallet balances?

A balance row can be wrong silently. Double-entry forces every event to produce two equal-and-opposite postings, which means you can audit any balance at any point in time by replaying the journal. A journal whose debits and credits stop matching also exposes bugs that would otherwise corrupt money unnoticed.

Q4. Where can the system fail and how do you recover?

Common failure modes: (1) the PSP returns a 500, so you retry with the same idempotency key; (2) the PSP succeeded but the webhook was lost, so the sweeper or reconciliation catches it; (3) the DB write fails after PSP success, so the payment stays in PROCESSING until the sweeper re-queries the PSP and retries the DB write; (4) a Kafka event is lost, so downstream consumers re-derive from a periodic snapshot of the DB.

Q5. What's the role of Kafka here? Why not call services synchronously?

Decoupling. Notifications, fraud scoring, accounting, analytics, and the merchant dashboard all want to react to a successful payment. If you call them synchronously, any one of them being slow or down makes the user wait — or worse, fails the payment after the money already moved. Kafka turns it into "fire and forget after the DB commits", which is the only safe boundary.
