Building with LLMs is mostly context engineering and evaluation, not model training. The model is a fixed black box; your job is to get the right tokens into the prompt, constrain the output, measure quality honestly, defend against injection, and keep cost/latency sane. This sheet covers the full applied stack — the capability ladder, prompting, RAG internals, agents, context limits, evaluation, safety, and LLMOps.
1. The capability ladder (cheapest first)
- Prompt engineering — instructions, examples, format. Always start here; instant, free.
- RAG — inject retrieved knowledge for facts/freshness/private data/citations.
- Tools / function calling — let the model act (search, code, APIs, DB).
- Agents — multi-step reason→act loops with memory + tools for open-ended tasks.
- Fine-tuning — last resort, for style/format/latency, not for facts.
don't fine-tune for knowledge
Fine-tuning teaches behaviour/format, not facts — and it bakes them in stale. For knowledge that
changes or must be cited, use RAG. Reach for fine-tuning to fix tone, structure, or to compress a
long prompt into the weights for latency/cost. Climb the ladder only when the cheaper rung fails.
2. Prompting that holds up
- Be explicit: role, task, constraints, output format. Show 1–3 examples (few-shot) for tricky formats; zero-shot for simple ones.
- Chain-of-thought / reasoning for multi-step problems. With reasoning models, ask for the answer and let them think internally; don't double-prompt CoT.
- Structured output — JSON-schema / grammar-constrained decoding so downstream parsing never breaks.
- Decomposition — break complex tasks into steps/sub-prompts; self-consistency (sample n, vote) for hard reasoning.
- Order for caching: put stable content (system prompt, instructions, few-shot) first (prefix-cacheable), variable content last.
3. Embeddings & vector search
- Embedding — dense vector where semantic similarity ≈ closeness (cosine/dot). Pick a model sized to your latency/quality (and language/domain); normalize for cosine.
- ANN index — exact search doesn't scale, so use approximate nearest neighbour: HNSW (graph, great recall/latency, more memory) or IVF-PQ (clustered + compressed, memory-efficient). Tune
ef_search/nprobefor recall vs speed. - Vector stores — pgvector (boring, correct default), Qdrant, Weaviate, Milvus, FAISS (local/in-process), Pinecone (managed).
- Metadata filtering — combine vector similarity with structured filters (tenant, date, source) — essential for multi-tenant and freshness.
4. RAG pipeline (in depth)
OFFLINE: ingest → chunk → embed → index (+ metadata)
ONLINE: query → [rewrite] → embed → retrieve (dense + BM25)
→ rerank (cross-encoder) → pack (budget) → generate → cite
EVAL: retrieval (recall@k, ctx precision) + generation (faithfulness)
- Chunking — structure/semantic-aware beats fixed size; self-contained chunks, overlap, metadata. Consider small-to-big / parent-document retrieval.
- Hybrid retrieval — dense (semantic) + sparse (BM25 keyword) fused (e.g. reciprocal rank fusion). Catches both meaning and exact terms/codes.
- Reranking — a cross-encoder scores (query, doc) jointly; reorder top-k (50→5). Retrieve-then-rerank consistently beats pure vector.
- Query transforms — rewriting, multi-query, HyDE (hypothetical doc embedding), step-back.
- GraphRAG — build an entity/relation graph for global/connected questions that chunk retrieval misses.
retrieval quality is the ceiling
A great model can't answer from bad context. Most "the LLM is dumb" bugs are retrieval bugs — wrong
chunks, no reranker, chunk too big/small, missing metadata filter. Evaluate retrieval (recall@k, hit
rate) separately from generation so you fix the real bottleneck.
5. Agents & tools
- Loop (ReAct): reason → choose tool → execute → observe result → repeat until done.
- Function calling — tools defined as JSON schemas; model emits a structured call, you run it and feed the result back.
- MCP (Model Context Protocol) — open standard connecting models to tools/data via a uniform client-server interface; the emerging integration layer ("USB-C for tools").
- Patterns — single-agent + tools (start here), planner/executor, reflection/critique, multi-agent (only when justified — coordination cost is real).
- Memory — short-term (conversation window/summary) + long-term (vector store of past facts).
- Bound everything — step limits, tool allow-lists, timeouts, cost caps, human-in-the-loop for risky actions.
- Frameworks: LangGraph (graphs/state), OpenAI Agents SDK, CrewAI, LlamaIndex, or a plain function-calling loop. Use the simplest that works.
6. Context window management
- Long context ≠ free: cost + latency grow with tokens, and models suffer "lost in the middle" — info mid-prompt gets under-used. Put critical content at the start or end.
- Retrieve less but better (strong reranking) instead of stuffing everything.
- Compress/summarize history; trim with windowing or rolling summaries; track a token budget per request.
- Test the same fact at different prompt positions to measure your model's effective usable context.
7. Evaluation (the part everyone skips)
- Build a golden eval set from real queries early. You can't improve what you don't measure.
- LLM-as-judge for open-ended quality — calibrate against human labels, watch for position/verbosity bias; use pairwise or rubric scoring.
- Deterministic checks where possible — exact match, schema validation, unit tests on tool outputs.
- RAG metrics (RAGAS-style): faithfulness (grounded in context?), answer relevance, context precision/recall.
- Run evals in CI on every prompt/model/index change — prompts are code and regress silently. Track cost + latency alongside quality.
- Tools: RAGAS, promptfoo, LangSmith, Braintrust, OpenAI Evals, DeepEval.
8. Safety & reliability
- Prompt injection is the top risk — untrusted content (retrieved docs, tool output, user input, fetched pages) can hijack instructions. Indirect injection via RAG/web data is especially dangerous in acting agents.
- Mitigations (defense in depth): separate data from instructions (clear delimiters), never execute retrieved text as commands, constrain tool permissions (least privilege), validate inputs/outputs, guardrail models/classifiers, human-in-the-loop for risky actions. Put real authorization in code, not the prompt.
- Hallucination control — ground with RAG, require citations, allow "I don't know", constrain output, verify with a second pass, lower temperature for facts.
- Guardrails — input/output validation, PII redaction, content filters, schema validation, jailbreak detection.
- Data privacy — PII handling, tenant isolation in the vector store, retention/zero-retention with the provider.
9. LLMOps — cost, latency, observability
- Cache — exact-match, semantic (near-duplicate queries), and prompt-prefix caching for shared system prompts/few-shot/RAG headers.
- Route / cascade — small cheap model for easy turns, big model only for hard ones; classify difficulty up front.
- Token discipline — trim history, retrieve fewer chunks, cap max output tokens, structured output to stop rambling.
- Stream tokens for perceived latency; measure TTFT + cost/request + tokens/request.
- Trace every call (prompt, retrieved context, tool calls, tokens, latency, cost) — you can't debug or cost-optimize a black box without traces.
- Govern — per-user budgets/rate limits; bound agent loops so they can't burn unbounded tokens; fallback/provider failover.
10. Common architectures
| Need | Pattern |
|---|---|
| Answer over private docs | RAG (hybrid + rerank) with citations |
| Exact data / transactions | Tool/function calling to APIs/SQL, not retrieval |
| Multi-step open task | Agent loop, bounded, with tools + memory |
| Global/aggregate questions | GraphRAG or map-reduce summarization |
| Consistent format/tone | Fine-tune (LoRA) + structured output |
| High volume, cost-sensitive | Model cascade + caching + small model default |
11. Quick reference
Order: prompt → RAG → tools → agents → fine-tune (last) RAG = chunk well + hybrid retrieval + rerank ; eval retrieval separately Output = constrained JSON/grammar so parsing never breaks Agents = bounded reason-act loops + function calling + MCP tools + memory Context = retrieve less but better ; mind lost-in-the-middle Eval = golden set + LLM-judge + RAGAS (faithfulness), run in CI Safety = treat retrieved/tool text as UNTRUSTED (prompt injection); authz in code Ops = cache + route + stream + trace ; cap tokens & agent loops
12. Interview Q&A
- RAG vs fine-tuning — when each?RAG for knowledge (facts, freshness, private, citable). Fine-tuning for behaviour (tone, format, latency). Knowledge changes — don't bake it into weights.
- Why add a reranker?Embedding retrieval (bi-encoder) is fast but coarse; a cross-encoder reranker scores query-doc pairs jointly and reorders top-k, sharply improving the context. Retrieve-then-rerank is standard.
- Bi-encoder vs cross-encoder?Bi-encoder embeds query and docs independently (fast, indexable); cross-encoder processes the pair together (accurate, slow). Use bi-encoder to retrieve, cross-encoder to rerank the shortlist.
- What is prompt injection and how do you mitigate it?Untrusted text containing instructions that hijack the model (esp. indirect, via RAG/tools in agents). Separate data from instructions, never execute retrieved text as commands, least-privilege tools, validate I/O, authz in code — defense in depth.
- How do you evaluate a RAG system?Separately measure retrieval (recall@k, context precision/recall) and generation (faithfulness, answer relevance) on a golden set (RAGAS). Run in CI.
- What is 'lost in the middle'?Models use start/end of long context better than the middle. Put critical content at the edges, retrieve less-but-better; bigger context ≠ better quality.
- How do agents decide what to do?A reason-act loop: reason → emit a structured tool call → execute → observe → repeat until done. Bound with step limits, allow-lists, timeouts.
- What is MCP?Model Context Protocol — open standard connecting LLMs to tools/data via a uniform client-server interface, replacing bespoke per-tool glue.
- How do you cut LLM cost and latency?Prompt/semantic/prefix caching, route easy requests to small models, stream, trim context, quantized serving, cap output + agent loops. Trace to find spend.
- How do you reduce hallucination?Ground in retrieved context, require citations, allow 'I don't know', constrain output, verify with a second pass, improve retrieval, lower temperature. Mostly a grounding problem.
- When NOT to use RAG?For behaviour/format (fine-tune), exact structured data (tools/SQL), global/aggregate reasoning (GraphRAG/map-reduce), or small stable knowledge (just put it in context).
- Why constrain output to a schema?Downstream code needs reliable structure; constrained decoding / JSON-schema / grammars guarantee parseable output instead of hoping the model formats correctly.