GenAI Engineering Cheatsheet

Building with LLMs is mostly context engineering and evaluation, not model training. The model is a fixed black box; your job is to get the right tokens into the prompt, constrain the output, measure quality honestly, defend against injection, and keep cost/latency sane. This sheet covers the full applied stack — the capability ladder, prompting, RAG internals, agents, context limits, evaluation, safety, and LLMOps.

1. The capability ladder (cheapest first)

Prompt engineering — instructions, examples, format. Always start here; instant, free.
RAG — inject retrieved knowledge for facts/freshness/private data/citations.
Tools / function calling — let the model act (search, code, APIs, DB).
Agents — multi-step reason→act loops with memory + tools for open-ended tasks.
Fine-tuning — last resort, for style/format/latency, not for facts.

don't fine-tune for knowledge Fine-tuning teaches behaviour/format, not facts — and it bakes them in stale. For knowledge that changes or must be cited, use RAG. Reach for fine-tuning to fix tone, structure, or to compress a long prompt into the weights for latency/cost. Climb the ladder only when the cheaper rung fails.

2. Prompting that holds up

Be explicit: role, task, constraints, output format. Show 1–3 examples (few-shot) for tricky formats; zero-shot for simple ones.
Chain-of-thought / reasoning for multi-step problems. With reasoning models, ask for the answer and let them think internally; don't double-prompt CoT.
Structured output — JSON-schema / grammar-constrained decoding so downstream parsing never breaks.
Decomposition — break complex tasks into steps/sub-prompts; self-consistency (sample n, vote) for hard reasoning.
Order for caching: put stable content (system prompt, instructions, few-shot) first (prefix-cacheable), variable content last.

3. Embeddings & vector search

Embedding — dense vector where semantic similarity ≈ closeness (cosine/dot). Pick a model sized to your latency/quality (and language/domain); normalize for cosine.
ANN index — exact search doesn't scale, so use approximate nearest neighbour: HNSW (graph, great recall/latency, more memory) or IVF-PQ (clustered + compressed, memory-efficient). Tune ef_search/nprobe for recall vs speed.
Vector stores — pgvector (boring, correct default), Qdrant, Weaviate, Milvus, FAISS (local/in-process), Pinecone (managed).
Metadata filtering — combine vector similarity with structured filters (tenant, date, source) — essential for multi-tenant and freshness.

4. RAG pipeline (in depth)

OFFLINE: ingest → chunk → embed → index (+ metadata)
ONLINE:  query → [rewrite] → embed → retrieve (dense + BM25)
         → rerank (cross-encoder) → pack (budget) → generate → cite
EVAL:    retrieval (recall@k, ctx precision) + generation (faithfulness)

Chunking — structure/semantic-aware beats fixed size; self-contained chunks, overlap, metadata. Consider small-to-big / parent-document retrieval.
Hybrid retrieval — dense (semantic) + sparse (BM25 keyword) fused (e.g. reciprocal rank fusion). Catches both meaning and exact terms/codes.
Reranking — a cross-encoder scores (query, doc) jointly; reorder top-k (50→5). Retrieve-then-rerank consistently beats pure vector.
Query transforms — rewriting, multi-query, HyDE (hypothetical doc embedding), step-back.
GraphRAG — build an entity/relation graph for global/connected questions that chunk retrieval misses.

retrieval quality is the ceiling A great model can't answer from bad context. Most "the LLM is dumb" bugs are retrieval bugs — wrong chunks, no reranker, chunk too big/small, missing metadata filter. Evaluate retrieval (recall@k, hit rate) separately from generation so you fix the real bottleneck.

5. Agents & tools

Loop (ReAct): reason → choose tool → execute → observe result → repeat until done.
Function calling — tools defined as JSON schemas; model emits a structured call, you run it and feed the result back.
MCP (Model Context Protocol) — open standard connecting models to tools/data via a uniform client-server interface; the emerging integration layer ("USB-C for tools").
Patterns — single-agent + tools (start here), planner/executor, reflection/critique, multi-agent (only when justified — coordination cost is real).
Memory — short-term (conversation window/summary) + long-term (vector store of past facts).
Bound everything — step limits, tool allow-lists, timeouts, cost caps, human-in-the-loop for risky actions.
Frameworks: LangGraph (graphs/state), OpenAI Agents SDK, CrewAI, LlamaIndex, or a plain function-calling loop. Use the simplest that works.

6. Context window management

Long context ≠ free: cost + latency grow with tokens, and models suffer "lost in the middle" — info mid-prompt gets under-used. Put critical content at the start or end.
Retrieve less but better (strong reranking) instead of stuffing everything.
Compress/summarize history; trim with windowing or rolling summaries; track a token budget per request.
Test the same fact at different prompt positions to measure your model's effective usable context.

7. Evaluation (the part everyone skips)

Build a golden eval set from real queries early. You can't improve what you don't measure.
LLM-as-judge for open-ended quality — calibrate against human labels, watch for position/verbosity bias; use pairwise or rubric scoring.
Deterministic checks where possible — exact match, schema validation, unit tests on tool outputs.
RAG metrics (RAGAS-style): faithfulness (grounded in context?), answer relevance, context precision/recall.
Run evals in CI on every prompt/model/index change — prompts are code and regress silently. Track cost + latency alongside quality.
Tools: RAGAS, promptfoo, LangSmith, Braintrust, OpenAI Evals, DeepEval.

8. Safety & reliability

Prompt injection is the top risk — untrusted content (retrieved docs, tool output, user input, fetched pages) can hijack instructions. Indirect injection via RAG/web data is especially dangerous in acting agents.
Mitigations (defense in depth): separate data from instructions (clear delimiters), never execute retrieved text as commands, constrain tool permissions (least privilege), validate inputs/outputs, guardrail models/classifiers, human-in-the-loop for risky actions. Put real authorization in code, not the prompt.
Hallucination control — ground with RAG, require citations, allow "I don't know", constrain output, verify with a second pass, lower temperature for facts.
Guardrails — input/output validation, PII redaction, content filters, schema validation, jailbreak detection.
Data privacy — PII handling, tenant isolation in the vector store, retention/zero-retention with the provider.

9. LLMOps — cost, latency, observability

Cache — exact-match, semantic (near-duplicate queries), and prompt-prefix caching for shared system prompts/few-shot/RAG headers.
Route / cascade — small cheap model for easy turns, big model only for hard ones; classify difficulty up front.
Token discipline — trim history, retrieve fewer chunks, cap max output tokens, structured output to stop rambling.
Stream tokens for perceived latency; measure TTFT + cost/request + tokens/request.
Trace every call (prompt, retrieved context, tool calls, tokens, latency, cost) — you can't debug or cost-optimize a black box without traces.
Govern — per-user budgets/rate limits; bound agent loops so they can't burn unbounded tokens; fallback/provider failover.

10. Common architectures

Need	Pattern
Answer over private docs	RAG (hybrid + rerank) with citations
Exact data / transactions	Tool/function calling to APIs/SQL, not retrieval
Multi-step open task	Agent loop, bounded, with tools + memory
Global/aggregate questions	GraphRAG or map-reduce summarization
Consistent format/tone	Fine-tune (LoRA) + structured output
High volume, cost-sensitive	Model cascade + caching + small model default

11. Quick reference

Order: prompt → RAG → tools → agents → fine-tune (last)
RAG = chunk well + hybrid retrieval + rerank ; eval retrieval separately
Output = constrained JSON/grammar so parsing never breaks
Agents = bounded reason-act loops + function calling + MCP tools + memory
Context = retrieve less but better ; mind lost-in-the-middle
Eval = golden set + LLM-judge + RAGAS (faithfulness), run in CI
Safety = treat retrieved/tool text as UNTRUSTED (prompt injection); authz in code
Ops = cache + route + stream + trace ; cap tokens & agent loops

12. Interview Q&A

RAG vs fine-tuning — when each?RAG for knowledge (facts, freshness, private, citable). Fine-tuning for behaviour (tone, format, latency). Knowledge changes — don't bake it into weights.
Why add a reranker?Embedding retrieval (bi-encoder) is fast but coarse; a cross-encoder reranker scores query-doc pairs jointly and reorders top-k, sharply improving the context. Retrieve-then-rerank is standard.
Bi-encoder vs cross-encoder?Bi-encoder embeds query and docs independently (fast, indexable); cross-encoder processes the pair together (accurate, slow). Use bi-encoder to retrieve, cross-encoder to rerank the shortlist.
What is prompt injection and how do you mitigate it?Untrusted text containing instructions that hijack the model (esp. indirect, via RAG/tools in agents). Separate data from instructions, never execute retrieved text as commands, least-privilege tools, validate I/O, authz in code — defense in depth.
How do you evaluate a RAG system?Separately measure retrieval (recall@k, context precision/recall) and generation (faithfulness, answer relevance) on a golden set (RAGAS). Run in CI.
What is 'lost in the middle'?Models use start/end of long context better than the middle. Put critical content at the edges, retrieve less-but-better; bigger context ≠ better quality.
How do agents decide what to do?A reason-act loop: reason → emit a structured tool call → execute → observe → repeat until done. Bound with step limits, allow-lists, timeouts.
What is MCP?Model Context Protocol — open standard connecting LLMs to tools/data via a uniform client-server interface, replacing bespoke per-tool glue.
How do you cut LLM cost and latency?Prompt/semantic/prefix caching, route easy requests to small models, stream, trim context, quantized serving, cap output + agent loops. Trace to find spend.
How do you reduce hallucination?Ground in retrieved context, require citations, allow 'I don't know', constrain output, verify with a second pass, improve retrieval, lower temperature. Mostly a grounding problem.
When NOT to use RAG?For behaviour/format (fine-tune), exact structured data (tools/SQL), global/aggregate reasoning (GraphRAG/map-reduce), or small stable knowledge (just put it in context).
Why constrain output to a schema?Downstream code needs reliable structure; constrained decoding / JSON-schema / grammars guarantee parseable output instead of hoping the model formats correctly.

GenAI Engineering — The Deep Applied Cheatsheet.