GenAI Systems Interview Questions

Questions for AI-engineer, LLM-application, and GenAI-platform roles — graded easy → hard with full answers. Click to expand. Companion to the GenAI Engineering cheatsheet.

easy fundamentals medium applied LLM apps hard senior / systems

Easy — fundamentals

What is RAG and why use it? easy

Retrieval-Augmented Generation: at query time you retrieve relevant documents from a knowledge base and inject them into the prompt so the model answers from that context instead of only its parametric memory. Use it for facts, freshness, private/proprietary data, and citations — anything the base model doesn't reliably know or that changes over time. It reduces hallucination and lets you update knowledge by updating the index, not retraining the model.

What is an embedding? easy

A dense vector that represents the meaning of a piece of text (or image/audio) in a high-dimensional space, where semantic similarity ≈ vector closeness (cosine/dot product). An embedding model maps text → vector; you embed your documents once, store the vectors in a vector index, then embed the query and retrieve the nearest document vectors. It's the backbone of semantic search and RAG retrieval.

What is a system prompt? easy

A high-priority instruction block, set by the application (not the end user), that defines the model's role, tone, rules, capabilities, and constraints for the whole conversation. It's prepended ahead of user messages. Because it's stable across requests, it's a prime target for prefix caching. It should never contain secrets/credentials, and it can be partially overridden by clever user input (prompt injection), so don't rely on it alone for security.

What's the difference between prompt engineering and fine-tuning? easy

Prompt engineering shapes behavior at inference time via instructions, examples, and format — cheap, instant, no training. Fine-tuning changes the model's weights by training on examples — it bakes in style/format/behavior but costs compute and data and goes stale. Start with prompting (and RAG for knowledge); reach for fine-tuning only when prompting can't get the consistency/format/latency you need.

What is temperature in sampling? easy

A knob that scales the logits before softmax, controlling randomness. Low temperature (→0) makes the distribution peaky → near-deterministic, picks the most likely tokens (good for factual/code). High temperature flattens it → more diverse/creative but more error-prone. Often paired with top-p (nucleus) or top-k to truncate the tail. For reproducible/evaluation runs, set temperature 0 (greedy).

What is an embedding? easy

A dense vector for text/data where semantic similarity maps to vector closeness — powers search, clustering, retrieval.

What is RAG? easy

Retrieval-Augmented Generation: retrieve relevant docs (usually vector search) into the prompt so the model answers from grounded, current context.

What is a vector database? easy

A store for ANN similarity search over embeddings (HNSW/IVF) with metadata filtering — Pinecone, Weaviate, pgvector, Milvus.

What is chunking in RAG? easy

Splitting documents into passages for embedding/retrieval; chunk size + overlap strongly affect retrieval quality.

What is a system prompt? easy

The top-priority instruction defining the model's role, constraints, and behavior, separate from user input.

What is function/tool calling? easy

The model emits a structured call (name + JSON args) your code executes, returning results for the model to use.

What is an AI agent? easy

An LLM that plans and acts in a loop — choosing tools, observing results, iterating toward a goal — not single-shot.

What is MCP? easy

Model Context Protocol — an open standard connecting models/agents to external tools and data via a uniform interface.

What is a hallucination? easy

Fluent but false/unsupported output; mitigated with grounding (RAG), citations, lower temperature, and evaluation.

What is a reranker? easy

A cross-encoder that re-scores retrieved candidates by true relevance, reordering top-k more precisely than embedding similarity.

Medium — applied LLM apps

Walk through a production RAG pipeline end to end. medium

Offline (indexing): ingest documents → chunk them (structure/semantic-aware, with overlap + metadata) → embed each chunk → store vectors + metadata in a vector index. Online (query): embed the query → retrieve top-k by vector similarity, ideally hybrid (dense + BM25 keyword) → rerank the candidates with a cross-encoder → pack the best chunks into the prompt within the token budget → generate with the LLM → return the answer with citations. Around it: query rewriting, caching, and evaluation of both retrieval and generation. Most quality problems live in chunking and retrieval, not the model.

Why add a reranker, and how is it different from the embedding retriever? medium

The embedding retriever is a bi-encoder: it encodes query and documents independently into vectors and compares them — fast and scalable (you can pre-index millions), but coarse because it never sees query and document together. A reranker is a cross-encoder: it takes (query, document) as a pair and scores their relevance jointly with full attention — much more accurate but too slow to run over the whole corpus. So you use the cheap retriever to get top-k (say 50) candidates, then the expensive reranker to reorder and keep the best few. Retrieve-then-rerank consistently beats pure vector search.

How do you choose a chunking strategy? medium

Goal: chunks that are self-contained (answer a question on their own) but focused (not diluting the embedding). Fixed-size token chunking is the baseline; better is structure-aware (split on headings/paragraphs/code blocks) or semantic chunking (split where topic shifts). Add overlap so context isn't cut mid-thought, attach metadata (source, section, date) for filtering and citations, and consider small-to-big / parent-document retrieval (embed small chunks for precision, return the larger parent for context). Too big → diluted embeddings + wasted tokens + lost-in-the-middle; too small → fragmented context. Tune against a retrieval eval set.

What is function calling / tool use, and how does the loop work? medium

You give the model a set of tools described as JSON schemas (name, params). When the model decides a tool is needed, it emits a structured tool call (function name + arguments) instead of a text answer. Your application executes that function, returns the result back into the conversation, and the model continues — possibly calling more tools — until it produces a final answer. This is the basis of agents (a reason→act→observe loop). Keep tools well-described, validate arguments, constrain permissions, and bound the loop (step limits, timeouts).

How would you evaluate an LLM application? medium

Build a golden set of real queries with expected behavior early. Use the right check per output type: exact/structured assertions where possible, and LLM-as-judge (calibrated against human labels) for open-ended quality. For RAG, measure retrieval and generation separately: retrieval (context precision/recall, hit-rate@k) and generation (faithfulness — grounded in context?, answer relevance) à la RAGAS. Run evals in CI on every prompt/model/index change because prompts are code and regress silently. Track cost and latency alongside quality. Tools: RAGAS, promptfoo, LangSmith, Braintrust.

Walk through a production RAG pipeline. medium

Ingest→chunk→embed→index; query: rewrite→retrieve top-k (hybrid)→rerank→assemble prompt with context+citations→generate→guardrail→eval/log.

Hybrid vs pure vector search? medium

Hybrid fuses dense (semantic) + sparse/BM25 (lexical) so exact terms/names/codes aren't lost — usually beats either, combined via RRF.

How do you evaluate a RAG system? medium

Measure retrieval (recall@k, MRR) and generation (faithfulness, answer relevance) separately via LLM-as-judge + human review on a golden set.

Chunking strategies and trade-offs? medium

Fixed+overlap is simple but splits ideas; semantic/structural respects boundaries; small = precise but context-poor, large = noisier — often store small, retrieve with surrounding window.

How do you defend against prompt injection? medium

Treat retrieved/user text as untrusted data not instructions, enforce instruction hierarchy, sandbox tools least-privilege, validate I/O, confirm destructive actions.

RAG vs fine-tuning vs long context — when? medium

RAG for large/changing facts with citations; fine-tune for style/format/behavior; long context for whole-doc reasoning when it fits — often combine.

How do you manage context limits in long chats? medium

Summarize/compress history, retrieve only relevant turns, sliding window, keep system prompt cacheable — don't naively concatenate.

What is LLMOps beyond MLOps? medium

Versioning prompts/models/datasets, eval pipelines + regression gates, online monitoring (latency/cost/quality/drift/toxicity), guardrails, feedback, cost governance.

How do you control agent cost and reliability? medium

Cap iterations/tool calls, validate args, structured error feedback, caching, fallbacks/human handoff, full tracing.

What is a cross-encoder vs bi-encoder? medium

Bi-encoder embeds query and doc separately (fast, indexable); cross-encoder jointly encodes the pair (accurate, slow) — used as a reranker on top-k.

Hard — senior & systems

What is prompt injection, why is it hard to fully solve, and how do you mitigate it? hard

Prompt injection is when untrusted text — user input, a retrieved document, a web page, or tool output — contains instructions that the model follows, overriding your system prompt ("ignore previous instructions and…"). Indirect injection (malicious content inside retrieved/RAG data or a fetched page) is especially dangerous in agents that can act. It's hard to fully solve because, to an LLM, instructions and data are the same token stream — there's no hard privilege boundary like in classic injection. Mitigations are defense-in-depth: never treat retrieved/tool text as commands (clearly delimit data vs instructions), constrain tool permissions and require confirmation for risky actions, validate/sanitize inputs and outputs, use guardrail models/classifiers, apply least privilege and human-in-the-loop, and assume the model can be tricked — so put real authorization in your code, not in the prompt.

The model gives bad answers in your RAG app. How do you isolate whether it's retrieval or generation? hard

Decompose the pipeline and test each stage. First inspect what was retrieved for the failing query: if the right chunk isn't in the context, it's a retrieval bug — check chunking, embedding model, top-k, hybrid search, reranker, and metadata filters (measure recall@k against known-good chunks). If the right context was retrieved but the answer is still wrong, it's a generation/grounding bug — check the prompt (is the context actually used? lost-in-the-middle ordering?), the faithfulness (is it hallucinating despite context?), token budget/truncation, and the instruction to cite/abstain. The discipline: evaluate retrieval and generation separately so you fix the real ceiling — most "the LLM is dumb" reports are retrieval failures.

When should you NOT use RAG, and what are the alternatives? hard

RAG shines for factual lookup over a knowledge base, but it's the wrong tool when: (1) the task needs behavior/format/style change rather than knowledge → fine-tune; (2) the answer needs reasoning over the whole corpus or aggregations ("summarize all complaints this quarter") → retrieval of a few chunks misses the global picture; consider GraphRAG, structured queries over a database, or map-reduce summarization; (3) the knowledge is small and stable enough to fit in the context/system prompt → just include it (and use prefix caching); (4) you need exact structured data → query the database/API via tools, not fuzzy retrieval. Often the best system is hybrid: tools/SQL for precise data, RAG for unstructured docs, fine-tuning for format.

What is 'lost in the middle' and how do you design around it? hard

Empirically, LLMs use information at the beginning and end of a long context far more reliably than information in the middle — accuracy sags for facts buried mid-prompt. So a longer context window doesn't linearly buy you more usable knowledge. Design around it: retrieve less but better (strong reranking so the key chunk is in the top few), order the most relevant context at the start or end, compress/summarize rather than dumping everything, and don't assume "just stuff the whole doc in" works. Measure it: test the same fact at different prompt positions. It's also why context management and retrieval quality matter more than raw context length.

How do you control cost and latency for an LLM product at scale? hard

Layered approach. Caching: exact-match response cache, semantic cache for near-duplicate queries, and prompt prefix caching for shared system prompts / few-shot / RAG headers. Routing / cascades: send easy turns to a small cheap model and escalate only hard ones to the big model; classify difficulty up front. Token discipline: trim history (windowing/summarization), retrieve fewer chunks, cap max output tokens, use structured output to avoid rambling. Serving: quantized models, continuous batching, GQA models. Perceived latency: stream tokens. Governance: per-user budgets/rate limits, and bound agent loops so they can't burn unbounded tokens. And trace everything (prompt, retrieved context, tool calls, tokens, latency, cost) — you can't optimize spend you can't see.

What is MCP, and why does it matter for agent systems? hard

MCP (Model Context Protocol) is an open standard for connecting LLM applications to external tools and data sources through a uniform client-server interface, instead of writing bespoke glue for every integration. A model host (client) can discover and call any MCP server that exposes tools, resources, and prompts. Why it matters: it decouples the agent from specific integrations, so tools become reusable and composable across apps and vendors — the "USB-C for tools" idea. It also concentrates the security surface: because MCP servers can expose powerful actions and feed untrusted data into the context, you must scope permissions, vet servers, and treat their output as untrusted (indirect prompt-injection risk).

How do you reduce hallucination in a production system? hard

Treat it as mostly a grounding + verification problem, not a model-quality complaint. (1) Ground answers in retrieved context (RAG) and instruct the model to answer only from it. (2) Require citations to the source chunk so claims are traceable (and verifiable downstream). (3) Allow and reward "I don't know" / abstention when context is insufficient. (4) Constrain output (schemas, allowed values) to remove room for invention. (5) Verify with a second pass — a faithfulness check / LLM-judge that the answer is supported by the context, or self-consistency for reasoning. (6) Improve retrieval (the usual root cause). (7) Lower temperature for factual tasks. And measure faithfulness on a golden set so you know if it's improving.

How do you reduce hallucination measurably? hard

Ground with RAG + require citations, instruct abstention, lower temperature, add a faithfulness judge (is the claim supported?), and track a hallucination metric per change.

Design retrieval for multi-hop questions. hard

Query decomposition + iterative/agentic retrieval (retrieve→reason→retrieve), or a graph/structured index — single-shot top-k can't chain facts.

How do you evaluate and guard against jailbreaks? hard

Maintain an adversarial test set, red-team continuously, layer input/output filters + instruction hierarchy, and monitor production for new bypasses (it's an arms race).

How do you do semantic caching for LLM apps? hard

Cache by embedding-similarity of the query (not exact match) to reuse answers for paraphrases, with thresholds + TTL and care to avoid serving stale/personalized responses.

How do you choose an embedding model? hard

By domain fit (benchmark on your data/queries with recall@k), dimensionality vs cost, max sequence length, multilingual needs, and whether you can fine-tune it.

How do you handle structured output reliably? hard

Constrained/JSON-schema decoding or function-calling with validation + retry on parse failure; don't rely on prompt-only formatting for machine-consumed output.

How do you fine-tune for tool use / agents? hard

SFT on trajectories of (state→tool call→observation→next action), plus preference tuning on good vs bad tool selection; improves reliability over prompting alone.

How do you monitor a GenAI app in production? hard

Track latency/cost/token usage, quality via sampled LLM-judge + user feedback, drift in inputs/outputs, safety violations, and run eval regressions on every prompt/model change with rollback.

RAG vs long-context for a 200-page doc — decide. hard

If it fits the window and you need cross-document reasoning, long context is simpler; if cost/latency or many docs matter, RAG with good chunking + reranking scales better — sometimes hybrid (retrieve sections, long-context reason).

How do you prevent sensitive-data leakage in RAG? hard

Per-user access control on the index (filter by permissions at retrieval), PII redaction, don't embed secrets, and output filtering — retrieval must respect the same authz as the source systems.

Scenario-based

Your RAG system keeps returning irrelevant chunks. How do you fix retrieval? hard

Walk the pipeline. Chunking: too big/small or split mid-thought — try semantic/structured chunking with overlap. Embedding model: domain mismatch — use a better/domain-tuned embedder. Add a reranker (cross-encoder) on top-k to reorder by true relevance. Use hybrid search (BM25 + dense) so keyword matches aren't lost. Add metadata filtering and query rewriting/expansion (especially for multi-turn). Measure with retrieval metrics (recall@k) on a labeled set before/after each change.

The model confidently states wrong facts. How do you reduce hallucination? medium

Ground it: RAG so answers come from retrieved sources, require citations and instruct "say I don't know if not in context." Lower temperature for factual tasks. Add output validation/guardrails and a faithfulness check (LLM-as-judge: is the answer supported by the context?). Constrain scope in the system prompt. For high-stakes, add human review. Track a hallucination metric on an eval set so changes are measurable.

An autonomous agent gets stuck in loops or keeps mis-calling tools. What do you do? hard

Bound and observe it. Set a max-iterations / step budget and loop detection. Improve tool descriptions and schemas (most tool errors are ambiguous specs) and validate arguments before executing. Return clear, structured tool errors so the model can self-correct. Add fallbacks/human handoff on repeated failure, and full tracing (every thought/tool call) to debug. Constrain the action space and give few-shot examples of correct tool use.

User-supplied documents can carry prompt injection. How do you defend the app? hard

Treat retrieved/user content as untrusted data, never instructions. Keep a clear instruction hierarchy (system > developer > user > retrieved) and structurally separate them. Sandbox tools with least privilege (no raw DB/file/shell from model output without checks), require confirmation for destructive actions. Sanitize/validate inputs and outputs, scan for injection patterns, and constrain output formats. Assume injection will happen — limit blast radius rather than trusting the prompt alone.

Stakeholders ask: RAG or fine-tuning for our internal-knowledge bot? How do you decide? medium

Match tool to need. RAG for factual, frequently-changing, or large knowledge — you can update the index without retraining, and get citations. Fine-tuning for style/format/tone, domain phrasing, or compressing instructions — not for injecting fresh facts (it bakes them in and goes stale). Often both: fine-tune for behavior, RAG for knowledge. Start with RAG (cheaper, updatable); fine-tune later if the model can't follow format or domain style.

How would you evaluate the quality of a GenAI feature before and after changes? medium

Build a golden eval set of representative inputs with expected behavior. Combine automatic metrics (exact/semantic match, retrieval recall, faithfulness), LLM-as-judge for open-ended quality (with rubrics, and validated against humans), and periodic human eval. Run it as a regression test in CI on every prompt/model change, track scores over time, and segment by query type. Never ship a prompt/model swap on vibes — diff the eval.

Retrieval returns right docs but answer ignores them. Fix? hard

Instruct answer-from-context + cite, put context before the question, cut distractor chunks, and ensure context isn't truncated past the window.

Multi-hop questions RAG can't answer. What helps? hard

Query decomposition / multi-step (agentic) retrieval or a graph index; single-shot retrieval can't chain facts.

Latency too high from reranking everything. Optimize. medium

Retrieve a large candidate set cheaply (ANN), rerank only top ~20–50, cache results, use a faster cross-encoder.

Bot leaks its system prompt via crafted input. Response? hard

Assume it leaks — remove secrets/creds from the prompt, enforce instruction hierarchy, filter output, least-privilege any tools.

Eval scores good but users complain. Why? medium

Eval set unrepresentative — add real production queries, segment by type, add human eval + online feedback, check drift.

Embeddings retrieve poorly on domain jargon. Fix? medium

Domain-tuned embedding model, hybrid BM25 for exact jargon, metadata enrichment, query expansion.

Agent keeps calling the wrong tool. Change what? hard

Tighten tool names/descriptions/schemas, add few-shot examples, validate args, return structured errors, shrink the toolset.

Add citations users can trust. How? medium

Carry chunk→source metadata through retrieval, cite chunk IDs, and verify each cited claim is supported before display.

Costs spike — every query hits the biggest model. Optimize. medium

Route by difficulty (small model for easy), semantic-cache frequent/shared prefixes, cap max tokens; reserve the big model for hard cases.

Ship safely to production — what guardrails? hard

Input/output filtering, PII redaction, injection/jailbreak checks, rate limits, human review for high-risk actions, continuous eval + monitoring + rollback.

what industry actually asks

For AI-engineer / LLM-app roles (most GenAI hiring right now), the core is the medium/hard block: design a RAG pipeline, retrieval vs generation debugging, rerankers, agents + function calling + MCP, evaluation, prompt-injection, and cost/latency at scale. Expect a system-design round: "design a chatbot over our docs" or "build an agent that does X" — they grade whether you reach for RAG vs fine-tuning correctly, evaluate retrieval separately, handle prompt injection, and think about cost/latency/observability. For junior roles, the easy block (what's RAG, embeddings, temperature, prompt vs fine-tune) plus a small build task. Saying "I'd evaluate retrieval and generation separately, and I treat retrieved text as untrusted" signals seniority fast.

GenAI / LLM Systems — Interview Questions.

Easy — fundamentals

Medium — applied LLM apps

Hard — senior & systems

Scenario-based