Monitoring & Troubleshooting — Cosmos DB

TL;DR

Cosmos's metrics live in three places: the portal's Insights blade for a quick overview, Diagnostics Logs for per-request detail (you must turn this on), and the SDK's per-call Diagnostics for in-app instrumentation. Three signals matter most: **429 throttling rate**, **P99 latency**, and **normalized RU consumption**. Get those on a dashboard before anything else.

Key takeaways

▸ Turn on Diagnostic Settings → Send to Log Analytics. Without this, you can't query historical request data.
▸ 429s aren't failures by themselves: the service is rate-limiting a request, and the SDK retries it automatically. The signal is the *rate* of 429s relative to total requests, and whether the SDK's retry budget is being exhausted.
▸ Normalized RU consumption is the % of your provisioned RU/s used over a 1-min window. > 70% sustained is a scale-up signal; spikes to 100% are usually fine.
▸ The "Top Queries" view in Insights shows you which queries spend the most RUs. That's where indexing tuning starts.
▸ Per-request Diagnostics from the SDK is the gold-standard troubleshooting tool: partition-by-partition cost, retry timeline, network breakdown.

You can run a Cosmos workload for months without monitoring. You’ll never know when it starts to break — only that customers are complaining. Set up monitoring before you ship, and the operational story shifts from forensic to proactive.

The three places metrics live

1. Portal → Insights (free, instant, shallow): A built-in dashboard showing throughput, latency, request count, top queries, and top operations. Good for “is anything obviously wrong right now.” It is not queryable and keeps no history past 30 days.

2. Diagnostics Logs (per-request, queryable, historical): You must turn this on. Go to Settings → Diagnostic Settings → Add diagnostic setting, then send DataPlaneRequests, QueryRuntimeStatistics, and PartitionKeyStatistics to Log Analytics. After that, you can query everything in KQL.

3. SDK Diagnostics (per-call, in-app): Every Cosmos response includes a Diagnostics object with the RU breakdown per partition, the retry timeline, and sub-millisecond timings. Best for live troubleshooting and unit-level insight.

The three signals that matter

429 rate

A 429 means a request exceeded the RU/s available to it, so the service rejected it with a retry-after hint and the SDK is retrying. The metric to alarm on isn’t the 429 count; it’s the ratio of 429s to total requests, and whether the SDK's retry budget is being exhausted.

AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "DataPlaneRequests"
| summarize total = count(), throttled = countif(statusCode_s == "429") by bin(TimeGenerated, 1m)
| extend pct = 100.0 * throttled / total

> 5% sustained = under-provisioned. < 1% = fine, the SDK absorbs it transparently.
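The same ratio can be computed in application code, for example from request records your own middleware collects. The following is a minimal Python sketch, not a Cosmos API: the `(timestamp, status_code)` record shape and the function names `throttle_pct_by_minute` and `classify` are assumptions made for illustration, and the "watch" band between 1% and 5% is an interpretation of the thresholds above.

```python
from collections import defaultdict
from datetime import datetime


def throttle_pct_by_minute(requests):
    """Return {minute_start: percent of requests in that minute that were 429s}.

    `requests` is an iterable of (timestamp, status_code) pairs; this record
    shape is a hypothetical stand-in for whatever your app actually logs.
    """
    totals = defaultdict(int)
    throttled = defaultdict(int)
    for ts, status in requests:
        minute = ts.replace(second=0, microsecond=0)  # 1-minute bucket
        totals[minute] += 1
        if status == 429:
            throttled[minute] += 1
    return {m: 100.0 * throttled[m] / totals[m] for m in sorted(totals)}


def classify(pct):
    """Map a throttle percentage onto the bands described in the text."""
    if pct < 1.0:
        return "fine"
    if pct > 5.0:
        return "under-provisioned if sustained"
    return "watch"  # assumed middle band, not stated in the text


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 0, 5)
    sample = [(t, 200)] * 9 + [(t, 429)]
    for minute, pct in throttle_pct_by_minute(sample).items():
        print(minute, pct, classify(pct))
```

A single minute above 5% is not a verdict on its own; the classification only matters when the same band persists across consecutive buckets.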

P99 latency

P50 hides everything. The user pain is in the tail. For point reads, expect < 10 ms P99 in-region. For queries, < 50 ms single-partition. If P99 walks away from P50, something specific is slow — a hot partition, a missing index, or a cross-partition fan-out.
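To see how a healthy median can hide a painful tail, here is a small Python sketch that computes P50 and P99 with the nearest-rank method over a list of latency samples. The sample data and the helper names `percentile` and `tail_ratio` are illustrative assumptions; in practice the samples would come from your logs or the SDK.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of samples are <= it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p*n/100
    return ordered[rank - 1]


def tail_ratio(samples):
    """How far P99 has walked away from P50 (1.0 means no tail at all)."""
    return percentile(samples, 99) / percentile(samples, 50)


if __name__ == "__main__":
    # Hypothetical point-read latencies: 97 fast requests and 3 slow ones.
    latencies_ms = [4.0] * 97 + [180.0] * 3
    print("P50:", percentile(latencies_ms, 50))
    print("P99:", percentile(latencies_ms, 99))
    print("tail ratio:", tail_ratio(latencies_ms))
```

With only 3% of requests slow, the median does not move at all while P99 lands on the slow group, which is the pattern that points to a hot partition, a missing index, or a fan-out.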

Normalized RU consumption

This is the capacity signal: 0–100% of provisioned RU/s, reported per minute for each physical partition. Important: the headline number is the max across partitions, not an average. One hot partition at 100% drags it to 100% while every other partition sits at 5%.

AzureMetrics
| where MetricName == "NormalizedRUConsumption"
| summarize max(Maximum) by bin(TimeGenerated, 1m)

Sustained > 70% → consider autoscale, more RU/s, or partition-key surgery.
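The max-versus-average distinction is easy to check with arithmetic. This Python sketch assumes a container whose provisioned throughput is split evenly across its physical partitions; the partition ids, the usage numbers, and the function name `normalized_ru` are hypothetical.

```python
def normalized_ru(partition_usage_ru, provisioned_ru_per_partition):
    """Summarize per-partition RU usage as percentages of each partition's share.

    `partition_usage_ru` maps a partition id to the RU/s it consumed in the
    window; both arguments are illustrative inputs, not SDK objects.
    """
    pct = {
        pid: 100.0 * used / provisioned_ru_per_partition
        for pid, used in partition_usage_ru.items()
    }
    hottest = max(pct, key=pct.get)
    return {
        "max_pct": pct[hottest],                 # what the metric reports
        "avg_pct": sum(pct.values()) / len(pct), # what it does NOT report
        "hottest": hottest,
    }


if __name__ == "__main__":
    # 10,000 RU/s over 4 physical partitions gives 2,500 RU/s each.
    usage = {"0": 2500, "1": 125, "2": 125, "3": 125}
    print(normalized_ru(usage, 2500))
```

Here the metric reads 100% even though the container as a whole is using under 30% of what it pays for, which is why a high reading should send you to the per-partition view before you buy more RU/s.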

Reading SDK Diagnostics

In .NET — response.Diagnostics.ToString() returns a giant JSON object. The fields that matter:

TotalRequestCharge — RUs consumed.
ContactedReplicas — list of partition replicas touched.
RetryContext — every retry attempt, with reason and delay.
ClientSideRequestStatistics → StoreResponseStatistics — per-replica latency and status code.

If a request is slow, paste the Diagnostics JSON into a JSON viewer and expand the timeline. You’ll see exactly where time went: DNS, TLS, queueing, backend processing, retries.

Top queries view

Insights → “Top Queries by Average RU.” Sort the leaderboard. The top 10 are typically:

Queries missing the partition key (cross-partition fan-outs)
ORDER BY without a range/composite index
COUNT(*) over many docs
SELECT * on large documents

Each is fixable in lessons V02, V06, V07. Monitoring is what tells you which one to fix first.

What to alert on

Three alerts, no more:

429 rate > 5% for 5 minutes (warning), > 15% (page).
P99 latency > 200ms for 10 minutes for read operations.
Container availability < 99.9% (well below Cosmos’s published 99.99% availability SLA for single-region accounts; it drops this low only on real incidents).

Skip the rest. Alert fatigue is real, and Cosmos’s SLAs cover most of what you’d otherwise watch for.
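The three rules above can be written down as a small evaluator, which is handy for testing thresholds before wiring them into a real alerting system. This is a Python sketch under stated assumptions: inputs are per-minute series you have already computed, the 15% page rule is assumed to use the same 5-minute window as the warning, and the names `sustained` and `evaluate_alerts` plus the generic "alert" severity are made up for illustration.

```python
def sustained(values, threshold, minutes):
    """True if each of the last `minutes` per-minute values exceeds threshold."""
    if len(values) < minutes:
        return False
    return all(v > threshold for v in values[-minutes:])


def evaluate_alerts(throttle_pct, read_p99_ms, availability_pct):
    """Apply the three alert rules to per-minute series (oldest value first)."""
    alerts = []
    # 429 rate: > 15% pages, > 5% warns. The 5-minute window on the page
    # rule is an assumption; the text only states it for the warning.
    if sustained(throttle_pct, 15.0, 5):
        alerts.append(("429-rate", "page"))
    elif sustained(throttle_pct, 5.0, 5):
        alerts.append(("429-rate", "warning"))
    # Read P99 above 200 ms for 10 minutes.
    if sustained(read_p99_ms, 200.0, 10):
        alerts.append(("p99-latency", "alert"))
    # Availability below the 99.9% threshold.
    if availability_pct < 99.9:
        alerts.append(("availability", "alert"))
    return alerts


if __name__ == "__main__":
    print(evaluate_alerts([6.0] * 5, [50.0] * 10, 99.99))
    print(evaluate_alerts([20.0] * 5, [250.0] * 10, 99.5))
```

Requiring every minute in the window to breach the threshold is a deliberately strict reading of "for 5 minutes"; it keeps a single bad minute from triggering an alert, in line with the goal of avoiding alert fatigue.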

🎯 Common questions

Q1. My P99 latency spiked but my P50 is fine. What's the most likely cause?

Cross-partition queries on a small subset of requests, or hot-partition contention. Pull the Diagnostics for the slow requests — if they show many partitions touched, it's a query problem (lesson V06). If they show one partition with high `BackendLatency`, it's a hot key (lesson V02).

Q2. 100% normalized RU consumption — should I be alarmed?

A spike, no — that means you used what you provisioned. Sustained 100% with growing 429 rate, yes — you need autoscale, more RUs, or to find what's misbehaving. The metric to watch is "PhysicalPartitionThroughputInfo" — sometimes one partition is hot while the rest are idle.

Q3. How do I find the actual slow query?

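In Log Analytics, query CDBQueryRuntimeStatistics for the time window and join it to CDBDataPlaneRequests on `ActivityId`: the query text lives in the first table and the request charge in the second. (Both tables exist only when the diagnostic setting writes to resource-specific tables.) Sort by `RequestCharge` desc. The top 10 queries usually reveal the culprits: a forgotten cross-partition COUNT, a missing composite index, a SELECT * on a 200 KB document.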