AI Systems Engineer

The full stack of building serious AI systems: how LLMs are served at scale, how retrieval keeps them grounded, and how agents and tool use turn them into actual products.

3 modules · 6 concepts · ~90 min


What you'll be able to do at the end

✓ Pick the right inference optimization (KV cache, continuous batching, speculative decoding) for your workload.
✓ Design a RAG pipeline that cites sources and survives a 10× traffic spike.
✓ Decide when an agent should plan vs act vs critique, and how to keep it from looping.
✓ Sketch the AI gateway layer your platform team will eventually build.

The path

3 modules · in order

MODULE 01

Foundations of LLM Serving

How a single prompt becomes a streaming response — and what makes it fast or slow.

⚡ 15 MIN

LLM Inference at Scale

KV cache, continuous batching, speculative decoding: the techniques that make production LLM serving fast.


🌐 15 MIN

AI Gateway

The traffic cop in front of your LLM: routing, caching, fallbacks, rate limits, observability.

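To make the gateway's responsibilities concrete, here is a minimal sketch of one request path: check a response cache, apply a sliding-window rate limit, then try providers in order and fall back on failure. Everything named here (`Gateway`, `flaky_primary`, `steady_backup`, the log strings) is hypothetical and stands in for real provider clients; a production gateway would also need timeouts, per-tenant limits, and real metrics.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical provider signature: takes a prompt, returns a completion.
Provider = Callable[[str], str]


class Gateway:
    """Toy gateway: exact-match cache, rate limit, ordered fallbacks, event log."""

    def __init__(self, providers: List[Tuple[str, Provider]],
                 max_requests: int, window_s: float) -> None:
        self.providers = providers          # tried in order; later ones are fallbacks
        self.max_requests = max_requests    # upstream calls allowed per window
        self.window_s = window_s
        self.cache: Dict[str, str] = {}
        self.request_times: List[float] = []
        self.log: List[str] = []            # stand-in for observability

    def _allow(self, now: float) -> bool:
        # Sliding window: keep only timestamps that are still inside the window.
        self.request_times = [t for t in self.request_times
                              if now - t < self.window_s]
        if len(self.request_times) >= self.max_requests:
            return False
        self.request_times.append(now)
        return True

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:            # cache hits do not count against the limit
            self.log.append("cache_hit")
            return self.cache[prompt]
        if not self._allow(time.monotonic()):
            self.log.append("rate_limited")
            raise RuntimeError("rate limit exceeded")
        for name, call in self.providers:
            try:
                answer = call(prompt)
            except Exception:
                self.log.append(f"{name}:failed")
                continue                    # fall through to the next provider
            self.log.append(f"{name}:ok")
            self.cache[prompt] = answer
            return answer
        raise RuntimeError("all providers failed")


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def steady_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


gateway = Gateway([("primary", flaky_primary), ("backup", steady_backup)],
                  max_requests=2, window_s=60.0)
first = gateway.complete("hello")   # primary fails, backup answers
second = gateway.complete("hello")  # served from the cache
```

The cache here is exact-match on the prompt string, which is the simplest possible choice; semantic caching and streaming responses are deliberately left out of the sketch.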

MODULE 02

Retrieval & Memory

How models stay grounded and hallucinate less: the vector store, the retriever, and the prompt assembly.

📐 15 MIN

Vector DB Internals

HNSW, IVF, and product quantization: how vector databases run approximate nearest-neighbor search over billions of embeddings in milliseconds.


🔎 15 MIN

RAG Pipeline

Retrieval-augmented generation: how chatbots ground answers in retrieved documents and cite their sources, which reduces hallucination.

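As a rough illustration of the retrieve-then-assemble flow, the sketch below ranks a tiny in-memory corpus and builds a prompt with numbered sources the model can cite. It is a toy under stated assumptions: word overlap stands in for embedding similarity, and `DOCS`, `retrieve`, and `build_prompt` are hypothetical names, not part of any library.

```python
import re
from typing import List, Set, Tuple

# Hypothetical corpus of (source name, text) pairs.
DOCS: List[Tuple[str, str]] = [
    ("kv-cache.md", "The KV cache stores attention keys and values so earlier tokens are not recomputed."),
    ("batching.md", "Continuous batching admits new requests into a running batch as others finish."),
    ("hnsw.md", "HNSW builds a layered graph for approximate nearest neighbor search."),
]


def tokenize(text: str) -> Set[str]:
    # Lowercased alphanumeric words; a real retriever would embed the text instead.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    # Score each document by word overlap with the query (assumption: a
    # stand-in for vector similarity), keep the top k with a nonzero score.
    q = tokenize(query)
    scored = [(len(q & tokenize(body)), name, body) for name, body in docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(name, body) for score, name, body in scored[:k] if score > 0]


def build_prompt(query: str, sources: List[Tuple[str, str]]) -> str:
    # Number the sources so the model can cite them as [1], [2], ...
    numbered = "\n".join(
        f"[{i}] ({name}) {body}" for i, (name, body) in enumerate(sources, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {query}\n"
    )


question = "What does the KV cache store?"
sources = retrieve(question, DOCS)
prompt = build_prompt(question, sources)
```

Numbering sources in the prompt and telling the model to refuse when they are insufficient is what makes citations checkable; it lowers the risk of unsupported answers but does not eliminate it.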

MODULE 03

Tools & Agents

When the model decides to call code instead of just predicting words.

🔧 15 MIN

Function Calling & Tool Use

How LLMs decide when to call APIs, the schemas they emit, and the round-trip back to natural language.

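The round trip can be sketched without any provider SDK: the application advertises a JSON-schema description of each tool, the model emits a structured call, and the application executes it and returns the result as text for the next model turn. The schema shape, the `get_weather` tool, and the `dispatch` helper below are illustrative assumptions; real APIs differ in field names and message formats.

```python
import json
from typing import Any, Dict


def get_weather(city: str) -> Dict[str, Any]:
    # Hypothetical tool with canned data; a real one would call a weather API.
    return {"city": city, "temp_c": 21}


# Registry of callable tools, keyed by the name the model is told about.
TOOLS = {"get_weather": get_weather}

# What the model would be shown: a JSON-schema description of each tool.
TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Look up the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]


def dispatch(tool_call_json: str) -> str:
    """Run one model-emitted tool call and return a JSON string for the model."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        # Never execute names outside the registry; report the error instead.
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**args)
    except TypeError as exc:
        # Wrong or missing arguments: hand the error back so the model can retry.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


# Assumed model output: a structured call rather than free text.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
tool_message = dispatch(model_output)
```

Returning errors as data rather than raising lets the model see what went wrong and correct its next call, which is one common design choice among several.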

🤖 15 MIN

Agentic Workflows

Multi-agent orchestration: planner, executor, critic — and how they coordinate without falling over.
