Vector Search & AI — Cosmos DB

TL;DR

Cosmos now has native vector indexes — store embeddings in the same container as your operational data and run cosine-similarity search without standing up a separate vector DB. For most RAG-on-your-own-data scenarios, this is the simplest stack — your source of truth, vector index, and metadata filters all live in one place.

Key takeaways

▸Define a `vectorEmbeddingPolicy` and a vector index (`quantizedFlat` or `diskANN`) in your container's index policy.
▸Query with `VectorDistance(c.embedding, @queryVector)` — combine with normal `WHERE` filters in the same query.
▸Pre-filter beats post-filter at scale — use the WHERE clause to narrow before computing distances.
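▸DiskANN for fast approximate search at scale; quantizedFlat for smaller datasets, where a brute-force scan over compressed vectors stays fast and near-exact. Truly exact search needs the `flat` index type, which supports at most 505 dimensions.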
▸Generating embeddings is your job — Azure OpenAI's `text-embedding-3-small` is the common pick. Cosmos stores and indexes them; it doesn't generate them.

For two years the answer to “where do I store embeddings” was “a separate vector DB.” Cosmos changed that — its native vector indexing brings embedding storage and retrieval into the same container as your operational data. For most teams doing RAG, this is now the simplest path.

What you store

A typical RAG document in Cosmos:

{
  "id": "doc-42",
  "tenantId": "acme",
  "sourceDocId": "policy-handbook.pdf",
  "chunkIndex": 7,
  "text": "Employees may carry over up to 5 vacation days...",
  "embedding": [0.012, -0.045, ..., 0.018],  // 1536 floats for text-embedding-3-small
  "metadata": { "page": 12, "section": "Time Off" }
}

Partition the container by /tenantId (typical multi-tenant SaaS) or /sourceDocId (single-tenant doc retrieval). The embedding is just a JSON array of floats.

The index policy

{
  "vectorEmbeddingPolicy": {
    "vectorEmbeddings": [
      {
        "path": "/embedding",
        "dataType": "float32",
        "dimensions": 1536,
        "distanceFunction": "cosine"
      }
    ]
  },
  "indexingPolicy": {
    "vectorIndexes": [
      { "path": "/embedding", "type": "diskANN" }
    ],
    "includedPaths": [{ "path": "/*" }],
    "excludedPaths": [{ "path": "/embedding/*" }]
  }
}

Three things to notice:

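The path /embedding is declared as a vector with type, dimensions, and distance function. Match the dimensions to your embedding model (1536 for text-embedding-3-small, 3072 for text-embedding-3-large).
The vector index type: diskANN for approximate, sublinear search at scale (100K+ vectors); quantizedFlat for smaller sets, where a brute-force scan over compressed vectors is fast enough and near-exact. For truly exact results use flat, which supports at most 505 dimensions and so cannot hold a 1536-dimension embedding.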
Exclude /embedding/* from the regular index — otherwise Cosmos tries to range-index every dimension. That’s pure waste.
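
The same policy can be applied from code. A minimal sketch, assuming a recent `azure-cosmos` Python SDK whose `create_container_if_not_exists` accepts a `vector_embedding_policy` argument. The helper names (`build_vector_policies`, `create_embeddings_container`) and the container name `embeddings` are made up for illustration.

```python
def build_vector_policies(dimensions=1536, index_type="diskANN", path="/embedding"):
    """Return (vector_embedding_policy, indexing_policy) dicts for one vector path."""
    if index_type not in ("flat", "quantizedFlat", "diskANN"):
        raise ValueError(f"unknown vector index type: {index_type}")
    vector_embedding_policy = {
        "vectorEmbeddings": [
            {
                "path": path,
                "dataType": "float32",
                "dimensions": dimensions,  # must match the embedding model's output size
                "distanceFunction": "cosine",
            }
        ]
    }
    indexing_policy = {
        "includedPaths": [{"path": "/*"}],
        # Keep the raw float array out of the regular (range) index.
        "excludedPaths": [{"path": f"{path}/*"}],
        "vectorIndexes": [{"path": path, "type": index_type}],
    }
    return vector_embedding_policy, indexing_policy


def create_embeddings_container(database, container_id="embeddings"):
    """Create the container on an azure.cosmos DatabaseProxy (assumed SDK support)."""
    # Imported lazily so the pure policy builder above has no dependencies.
    from azure.cosmos import PartitionKey

    vector_embedding_policy, indexing_policy = build_vector_policies()
    return database.create_container_if_not_exists(
        id=container_id,
        partition_key=PartitionKey(path="/tenantId"),
        indexing_policy=indexing_policy,
        vector_embedding_policy=vector_embedding_policy,
    )
```

Building the two dicts in one place keeps the vector path, the vector index, and the excluded path in sync, which is the easiest part of this policy to get wrong by hand.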

Querying

SELECT TOP 5
  c.id, c.text, c.metadata,
  VectorDistance(c.embedding, @queryVector) AS similarity
FROM c
WHERE c.tenantId = @tenant
ORDER BY VectorDistance(c.embedding, @queryVector)

Three notes:

Always include the partition key in WHERE, same rule as ever (lesson V06). Filtering to a single tenant cuts the search space dramatically.
TOP N + ORDER BY VectorDistance is the canonical shape. Cosmos’s index uses this pattern to return top-K efficiently.
Combine with normal filters — WHERE c.tenantId = X AND c.docType = "policy" ORDER BY VectorDistance(...) works. The index handles pre-filtering.
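
Here is one way the query shape above might be issued from Python. This is a sketch, assuming a container client from the `azure-cosmos` SDK (its `query_items` method takes `query`, `parameters`, and `partition_key`) and a container partitioned by /tenantId. The function names are hypothetical, and `docType` is just the example filter field from the note above.

```python
def build_vector_query(query_vector, tenant, top_k=5, doc_type=None):
    """Return (sql, parameters) for a tenant-scoped top-K vector search."""
    top_k = int(top_k)
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    where = ["c.tenantId = @tenant"]
    parameters = [
        {"name": "@queryVector", "value": list(query_vector)},
        {"name": "@tenant", "value": tenant},
    ]
    if doc_type is not None:
        # Optional metadata filter, applied in the same query as the vector search.
        where.append("c.docType = @docType")
        parameters.append({"name": "@docType", "value": doc_type})
    sql = (
        f"SELECT TOP {top_k} c.id, c.text, c.metadata, "
        "VectorDistance(c.embedding, @queryVector) AS similarity "
        "FROM c "
        f"WHERE {' AND '.join(where)} "
        "ORDER BY VectorDistance(c.embedding, @queryVector)"
    )
    return sql, parameters


def search(container, query_vector, tenant, top_k=5, doc_type=None):
    """Run the query against a container partitioned by /tenantId."""
    sql, parameters = build_vector_query(query_vector, tenant, top_k, doc_type)
    # Passing partition_key keeps the query inside a single logical partition.
    return list(container.query_items(query=sql, parameters=parameters, partition_key=tenant))
```

Values that come from users (tenant, filter values, the query vector) go through parameters; only the validated integer `top_k` is formatted into the SQL text.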

Pre-filter vs post-filter

Pre-filter — narrow with WHERE before the distance compute. Cheap, scales to billions of docs as long as the filter is selective.

Post-filter — search vectors first, filter results. Misses relevant docs if the filter excludes most of the top-K. Avoid when possible.

The index supports both, but the SQL syntax encourages pre-filter, which is the right default.
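
To see why post-filtering can miss relevant docs, here is a toy in-memory comparison. It is a brute-force illustration in plain Python, not how Cosmos executes queries, and the six documents and their two-dimensional vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(docs, query, k):
    """Brute-force top-k by cosine similarity, most similar first."""
    return sorted(docs, key=lambda d: cosine(d["embedding"], query), reverse=True)[:k]


def pre_filter_search(docs, query, k, keep):
    # Filter first, then rank only the survivors.
    return top_k([d for d in docs if keep(d)], query, k)


def post_filter_search(docs, query, k, keep):
    # Rank everything, take the top k, then drop non-matching results.
    return [d for d in top_k(docs, query, k) if keep(d)]


def is_policy(doc):
    return doc["docType"] == "policy"


# Toy corpus: the three closest vectors are "faq" docs; the policies sit further away.
docs = [
    {"id": "faq-1", "docType": "faq", "embedding": [1.0, 0.0]},
    {"id": "faq-2", "docType": "faq", "embedding": [0.9, 0.1]},
    {"id": "faq-3", "docType": "faq", "embedding": [0.8, 0.2]},
    {"id": "pol-1", "docType": "policy", "embedding": [0.6, 0.4]},
    {"id": "pol-2", "docType": "policy", "embedding": [0.4, 0.6]},
    {"id": "pol-3", "docType": "policy", "embedding": [0.2, 0.8]},
]
query = [1.0, 0.0]

pre = [d["id"] for d in pre_filter_search(docs, query, 3, is_policy)]
post = [d["id"] for d in post_filter_search(docs, query, 3, is_policy)]
print(pre)   # ['pol-1', 'pol-2', 'pol-3']
print(post)  # []
```

With post-filtering, all three top results are FAQ docs, so nothing survives the filter even though three matching policies exist.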

Generating embeddings

Cosmos doesn’t generate embeddings — it stores and indexes them. You compute embeddings in your app:

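from openai import AzureOpenAI

# Pass your endpoint, credentials, and API version here.
client = AzureOpenAI(...)

# With Azure OpenAI, `model` is the name of your deployment.
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunk_text,
).data[0].embedding

# `container` is the azure-cosmos container client for the embeddings container.
container.create_item({
    "id": f"{source_id}-{chunk_idx}",
    "tenantId": tenant,
    "sourceDocId": source_id,
    "chunkIndex": chunk_idx,
    "text": chunk_text,
    "embedding": emb,
})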

For ongoing ingestion, combine this with Change Feed (lesson V11). Source docs land in a documents container; a Change Feed processor chunks, embeds, and writes to the embeddings container. New content becomes searchable within seconds.
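
The chunk, embed, and write step might look roughly like this. It is a sketch under several assumptions: how the batch of changed documents arrives is left out (in Python that could be the SDK's `query_items_change_feed` or an Azure Functions Cosmos DB trigger), `embed` is any callable that returns a vector for a string, and word-count chunking stands in for a real token-based chunker. All function names are hypothetical.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_chunk_docs(source_doc, embed, max_words=300):
    """Turn one source document into embedding-container documents."""
    docs = []
    for idx, chunk in enumerate(chunk_words(source_doc["text"], max_words)):
        docs.append({
            "id": f"{source_doc['id']}-{idx}",
            "tenantId": source_doc["tenantId"],
            "sourceDocId": source_doc["id"],
            "chunkIndex": idx,
            "text": chunk,
            "embedding": embed(chunk),
        })
    return docs


def handle_changes(changed_docs, embed, embeddings_container):
    """Process one batch of changed source documents; returns chunks written."""
    written = 0
    for source_doc in changed_docs:
        for doc in build_chunk_docs(source_doc, embed):
            # upsert_item makes reprocessing the same change idempotent.
            embeddings_container.upsert_item(doc)
            written += 1
    return written
```

Deterministic chunk ids plus upsert mean a replayed change overwrites rather than duplicates. The sketch does not delete stale chunks when a source document shrinks; a real pipeline would need to handle that.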

Cost economics

Embedding storage — vectors are large. 1536-dim float32 = ~6 KB per doc. 1M chunks = ~6 GB. Storage cost is real but bounded.
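
The arithmetic behind those figures, as a quick check. It counts only the raw vector payload at 4 bytes per float32 value; chunk text, metadata, and index overhead come on top.

```python
def vector_bytes(dimensions, bytes_per_value=4):
    """Raw size of one embedding; float32 uses 4 bytes per dimension."""
    return dimensions * bytes_per_value


per_doc = vector_bytes(1536)      # 6144 bytes per embedding
corpus = per_doc * 1_000_000      # raw vector bytes for 1M chunks

print(per_doc / 1024)             # 6.0  (KiB per doc)
print(round(corpus / 1e9, 2))     # 6.14 (GB for 1M chunks)
```

Swapping in 3072 dimensions for text-embedding-3-large doubles both numbers.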

Query cost — DiskANN searches are sub-linear. Typical cost — 5–20 RUs per top-K query, scaling more with K than with corpus size. Compare to a dedicated vector DB’s per-query pricing — Cosmos is competitive at most scales.

Index build — initial index of 1M vectors takes ~10 min on a 4000 RU/s container. Costs ~50K RUs (one-time).

When this beats a dedicated vector DB

You’re already on Cosmos for the operational data
You want metadata filters in the same query as the vector search
You don’t have 100M+ vectors with sub-100ms latency requirements
You don’t want to operate one more thing

When a dedicated vector DB still wins — extreme scale, hybrid sparse+dense retrieval, complex re-ranking pipelines, very tight latency budgets at high QPS.

🎯 Common questions

Q1. When would I still pick a dedicated vector DB over Cosmos?

At extreme scale (100M+ vectors) combined with extreme query volume, where a specialized engine like Pinecone, Weaviate, or Azure AI Search gives better cost per query. Also when your existing app already uses one. For the large majority of "RAG on company docs / customer data" scenarios, Cosmos's native vectors are good enough and one less moving part.

Q2. How big should my chunks be?

200–500 tokens for technical docs, 500–1000 for narrative content. Smaller chunks give tighter retrieval but need more chunks per question; larger chunks mean fewer chunks but less precise matches. Test with your real questions and tune. Store the chunk text *and* its embedding in the same Cosmos document, and give all chunks of a source document the same partition key.

Q3. Can I update an embedding without rewriting the whole doc?

With Patch (lesson V05), yes: update just the `embedding` field. This costs fewer RUs than replacing the whole document and avoids the read-modify-write race. It is useful when you re-embed periodically with a newer model.
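
A sketch of what that patch might look like in Python, assuming the `azure-cosmos` SDK's `patch_item` method on a container partitioned by /tenantId. The function names and the optional `embeddingModel` bookkeeping field are hypothetical additions for illustration.

```python
def build_embedding_patch(new_embedding, model_name=None):
    """Return patch operations that overwrite only the embedding field."""
    operations = [{"op": "set", "path": "/embedding", "value": list(new_embedding)}]
    if model_name is not None:
        # Hypothetical bookkeeping field recording which model produced the vector.
        operations.append({"op": "set", "path": "/embeddingModel", "value": model_name})
    return operations


def reembed(container, doc_id, tenant, new_embedding, model_name=None):
    """Patch one chunk document without reading or rewriting the rest of it."""
    return container.patch_item(
        item=doc_id,
        partition_key=tenant,
        patch_operations=build_embedding_patch(new_embedding, model_name),
    )
```

The new vector still has to match the `dimensions` declared in the container's vector embedding policy, so moving to a model with a different output size means more than a patch.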