← Cheatsheets

CHEATSHEET · GENAI · APPLIED LLM SYSTEMS

GenAI Engineering — The Deep Applied Cheatsheet.

genai rag agents llm-apps
Building with LLMs is mostly context engineering and evaluation, not model training. The model is a fixed black box; your job is to get the right tokens into the prompt, constrain the output, measure quality honestly, defend against injection, and keep cost/latency sane. This sheet covers the full applied stack — the capability ladder, prompting, RAG internals, agents, context limits, evaluation, safety, and LLMOps.

1. The capability ladder (cheapest first)

  1. Prompt engineering — instructions, examples, format. Always start here; instant, free.
  2. RAG — inject retrieved knowledge for facts/freshness/private data/citations.
  3. Tools / function calling — let the model act (search, code, APIs, DB).
  4. Agents — multi-step reason→act loops with memory + tools for open-ended tasks.
  5. Fine-tuning — last resort, for style/format/latency, not for facts.
don't fine-tune for knowledge Fine-tuning teaches behaviour/format, not facts — and it bakes them in stale. For knowledge that changes or must be cited, use RAG. Reach for fine-tuning to fix tone, structure, or to compress a long prompt into the weights for latency/cost. Climb the ladder only when the cheaper rung fails.

2. Prompting that holds up

  • Be explicit: role, task, constraints, output format. Show 1–3 examples (few-shot) for tricky formats; zero-shot for simple ones.
  • Chain-of-thought / reasoning for multi-step problems. With reasoning models, ask for the answer and let them think internally; don't double-prompt CoT.
  • Structured output — JSON-schema / grammar-constrained decoding so downstream parsing never breaks.
  • Decomposition — break complex tasks into steps/sub-prompts; self-consistency (sample n, vote) for hard reasoning.
  • Order for caching: put stable content (system prompt, instructions, few-shot) first (prefix-cacheable), variable content last.

3. Embeddings & vector search

  • Embedding — dense vector where semantic similarity ≈ closeness (cosine/dot). Pick a model sized to your latency/quality (and language/domain); normalize for cosine.
  • ANN index — exact search doesn't scale, so use approximate nearest neighbour: HNSW (graph, great recall/latency, more memory) or IVF-PQ (clustered + compressed, memory-efficient). Tune ef_search/nprobe for recall vs speed.
  • Vector stores — pgvector (boring, correct default), Qdrant, Weaviate, Milvus, FAISS (local/in-process), Pinecone (managed).
  • Metadata filtering — combine vector similarity with structured filters (tenant, date, source) — essential for multi-tenant and freshness.

4. RAG pipeline (in depth)

OFFLINE: ingest → chunk → embed → index (+ metadata)
ONLINE:  query → [rewrite] → embed → retrieve (dense + BM25)
         → rerank (cross-encoder) → pack (budget) → generate → cite
EVAL:    retrieval (recall@k, ctx precision) + generation (faithfulness)
  • Chunking — structure/semantic-aware beats fixed size; self-contained chunks, overlap, metadata. Consider small-to-big / parent-document retrieval.
  • Hybrid retrieval — dense (semantic) + sparse (BM25 keyword) fused (e.g. reciprocal rank fusion). Catches both meaning and exact terms/codes.
  • Reranking — a cross-encoder scores (query, doc) jointly; reorder top-k (50→5). Retrieve-then-rerank consistently beats pure vector.
  • Query transforms — rewriting, multi-query, HyDE (hypothetical doc embedding), step-back.
  • GraphRAG — build an entity/relation graph for global/connected questions that chunk retrieval misses.
retrieval quality is the ceiling A great model can't answer from bad context. Most "the LLM is dumb" bugs are retrieval bugs — wrong chunks, no reranker, chunk too big/small, missing metadata filter. Evaluate retrieval (recall@k, hit rate) separately from generation so you fix the real bottleneck.

5. Agents & tools

  • Loop (ReAct): reason → choose tool → execute → observe result → repeat until done.
  • Function calling — tools defined as JSON schemas; model emits a structured call, you run it and feed the result back.
  • MCP (Model Context Protocol) — open standard connecting models to tools/data via a uniform client-server interface; the emerging integration layer ("USB-C for tools").
  • Patterns — single-agent + tools (start here), planner/executor, reflection/critique, multi-agent (only when justified — coordination cost is real).
  • Memory — short-term (conversation window/summary) + long-term (vector store of past facts).
  • Bound everything — step limits, tool allow-lists, timeouts, cost caps, human-in-the-loop for risky actions.
  • Frameworks: LangGraph (graphs/state), OpenAI Agents SDK, CrewAI, LlamaIndex, or a plain function-calling loop. Use the simplest that works.

6. Context window management

  • Long context ≠ free: cost + latency grow with tokens, and models suffer "lost in the middle" — info mid-prompt gets under-used. Put critical content at the start or end.
  • Retrieve less but better (strong reranking) instead of stuffing everything.
  • Compress/summarize history; trim with windowing or rolling summaries; track a token budget per request.
  • Test the same fact at different prompt positions to measure your model's effective usable context.

7. Evaluation (the part everyone skips)

  • Build a golden eval set from real queries early. You can't improve what you don't measure.
  • LLM-as-judge for open-ended quality — calibrate against human labels, watch for position/verbosity bias; use pairwise or rubric scoring.
  • Deterministic checks where possible — exact match, schema validation, unit tests on tool outputs.
  • RAG metrics (RAGAS-style): faithfulness (grounded in context?), answer relevance, context precision/recall.
  • Run evals in CI on every prompt/model/index change — prompts are code and regress silently. Track cost + latency alongside quality.
  • Tools: RAGAS, promptfoo, LangSmith, Braintrust, OpenAI Evals, DeepEval.

8. Safety & reliability

  • Prompt injection is the top risk — untrusted content (retrieved docs, tool output, user input, fetched pages) can hijack instructions. Indirect injection via RAG/web data is especially dangerous in acting agents.
  • Mitigations (defense in depth): separate data from instructions (clear delimiters), never execute retrieved text as commands, constrain tool permissions (least privilege), validate inputs/outputs, guardrail models/classifiers, human-in-the-loop for risky actions. Put real authorization in code, not the prompt.
  • Hallucination control — ground with RAG, require citations, allow "I don't know", constrain output, verify with a second pass, lower temperature for facts.
  • Guardrails — input/output validation, PII redaction, content filters, schema validation, jailbreak detection.
  • Data privacy — PII handling, tenant isolation in the vector store, retention/zero-retention with the provider.

9. LLMOps — cost, latency, observability

  • Cache — exact-match, semantic (near-duplicate queries), and prompt-prefix caching for shared system prompts/few-shot/RAG headers.
  • Route / cascade — small cheap model for easy turns, big model only for hard ones; classify difficulty up front.
  • Token discipline — trim history, retrieve fewer chunks, cap max output tokens, structured output to stop rambling.
  • Stream tokens for perceived latency; measure TTFT + cost/request + tokens/request.
  • Trace every call (prompt, retrieved context, tool calls, tokens, latency, cost) — you can't debug or cost-optimize a black box without traces.
  • Govern — per-user budgets/rate limits; bound agent loops so they can't burn unbounded tokens; fallback/provider failover.

10. Common architectures

NeedPattern
Answer over private docsRAG (hybrid + rerank) with citations
Exact data / transactionsTool/function calling to APIs/SQL, not retrieval
Multi-step open taskAgent loop, bounded, with tools + memory
Global/aggregate questionsGraphRAG or map-reduce summarization
Consistent format/toneFine-tune (LoRA) + structured output
High volume, cost-sensitiveModel cascade + caching + small model default

11. Quick reference

Order: prompt → RAG → tools → agents → fine-tune (last)
RAG = chunk well + hybrid retrieval + rerank ; eval retrieval separately
Output = constrained JSON/grammar so parsing never breaks
Agents = bounded reason-act loops + function calling + MCP tools + memory
Context = retrieve less but better ; mind lost-in-the-middle
Eval = golden set + LLM-judge + RAGAS (faithfulness), run in CI
Safety = treat retrieved/tool text as UNTRUSTED (prompt injection); authz in code
Ops = cache + route + stream + trace ; cap tokens & agent loops

12. Interview Q&A

  • RAG vs fine-tuning — when each?RAG for knowledge (facts, freshness, private, citable). Fine-tuning for behaviour (tone, format, latency). Knowledge changes — don't bake it into weights.
  • Why add a reranker?Embedding retrieval (bi-encoder) is fast but coarse; a cross-encoder reranker scores query-doc pairs jointly and reorders top-k, sharply improving the context. Retrieve-then-rerank is standard.
  • Bi-encoder vs cross-encoder?Bi-encoder embeds query and docs independently (fast, indexable); cross-encoder processes the pair together (accurate, slow). Use bi-encoder to retrieve, cross-encoder to rerank the shortlist.
  • What is prompt injection and how do you mitigate it?Untrusted text containing instructions that hijack the model (esp. indirect, via RAG/tools in agents). Separate data from instructions, never execute retrieved text as commands, least-privilege tools, validate I/O, authz in code — defense in depth.
  • How do you evaluate a RAG system?Separately measure retrieval (recall@k, context precision/recall) and generation (faithfulness, answer relevance) on a golden set (RAGAS). Run in CI.
  • What is 'lost in the middle'?Models use start/end of long context better than the middle. Put critical content at the edges, retrieve less-but-better; bigger context ≠ better quality.
  • How do agents decide what to do?A reason-act loop: reason → emit a structured tool call → execute → observe → repeat until done. Bound with step limits, allow-lists, timeouts.
  • What is MCP?Model Context Protocol — open standard connecting LLMs to tools/data via a uniform client-server interface, replacing bespoke per-tool glue.
  • How do you cut LLM cost and latency?Prompt/semantic/prefix caching, route easy requests to small models, stream, trim context, quantized serving, cap output + agent loops. Trace to find spend.
  • How do you reduce hallucination?Ground in retrieved context, require citations, allow 'I don't know', constrain output, verify with a second pass, improve retrieval, lower temperature. Mostly a grounding problem.
  • When NOT to use RAG?For behaviour/format (fine-tune), exact structured data (tools/SQL), global/aggregate reasoning (GraphRAG/map-reduce), or small stable knowledge (just put it in context).
  • Why constrain output to a schema?Downstream code needs reliable structure; constrained decoding / JSON-schema / grammars guarantee parseable output instead of hoping the model formats correctly.
← prev: LLM Training all cheatsheets →
© cvam — written in plaintext, served warm