AI Engineering — cvam.sight

Curated LLM/ML engineering stack — frameworks, serving, retrieval, eval, observability, agents, fine-tuning. Verdict tags: ★ default pick, solid, niche, commercial. Principle: thin, observable, eval-driven beats a heavy framework you can't debug.

App frameworks / orchestration

Tool	Verdict	Review
LlamaIndex	★ (RAG)	Best-in-class for retrieval/RAG data plumbing — loaders, indexes, query engines. Reach for it when the problem is "data → context".
LangChain / LangGraph	solid	Huge ecosystem; LangChain can be over-abstracted — prefer LangGraph for explicit, stateful agent graphs. Don't hide your prompts behind magic.
DSPy	niche/★	Program + compile prompts (optimize against a metric) instead of hand-tuning. Powerful for serious pipelines; steeper learning curve.
Plain SDK + functions	★ default	For simple apps, the provider SDK + your own code beats any framework — fewer abstractions to debug.

framework ≠ requirement: Many "LLM apps" are a prompt + a function call + a retriever. Reach for a framework when you need orchestration, state, or eval, not by default. You must always be able to see the exact prompt sent to the model.
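A minimal sketch of the "plain SDK + functions" approach. `call_model` is a hypothetical stub standing in for a real provider SDK call, and the prompt wording is only illustrative. The point is that the prompt is assembled in one visible place and logged before it is sent.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the full prompt in one place so it can be inspected."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a provider SDK call; swap in the real client."""
    return f"(stub answer, prompt was {len(prompt)} chars)"


def answer(question: str, context_chunks: list[str],
           model: Callable[[str], str] = call_model) -> str:
    prompt = build_prompt(question, context_chunks)
    log.info("exact prompt sent to model:\n%s", prompt)  # always visible
    return model(prompt)


if __name__ == "__main__":
    print(answer("What is pgvector?", ["pgvector is a Postgres extension."]))
```

Because the model is passed in as a plain callable, a test can capture the prompt without touching the network.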

Gateways & provider access

Tool	Verdict	Review
LiteLLM	★ default	One OpenAI-compatible API across 100+ providers + a proxy with keys, budgets, fallbacks, logging. The default abstraction layer.
Vercel AI Gateway / OpenRouter	solid	Hosted multi-provider routing + observability + fallbacks. Fast to adopt.
Vercel AI SDK	solid	TS-first streaming/tool-calling SDK for web apps; provider-agnostic via provider packages or the AI Gateway.
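The fallback and budget behaviour listed for gateways above can be sketched in plain Python. This is not LiteLLM's API: `Provider`, `Gateway`, the flat `cost_per_call`, and the two demo providers are all hypothetical names chosen to show the control flow only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Provider:
    name: str                      # hypothetical label, e.g. "primary"
    call: Callable[[str], str]     # takes a prompt, returns a completion
    cost_per_call: float           # flat illustrative cost, not real pricing


class NoProviderAvailable(RuntimeError):
    pass


@dataclass
class Gateway:
    """Try providers in order; skip any call that would exceed the budget."""
    providers: list[Provider]
    budget: float
    spent: float = 0.0
    log: list[tuple[str, str]] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        for p in self.providers:
            if self.spent + p.cost_per_call > self.budget:
                self.log.append((p.name, "skipped: over budget"))
                continue
            try:
                result = p.call(prompt)
            except Exception as exc:           # fall through to the next provider
                self.log.append((p.name, f"failed: {exc}"))
                continue
            self.spent += p.cost_per_call      # charge only successful calls
            self.log.append((p.name, "ok"))
            return result
        raise NoProviderAvailable("every provider failed or was over budget")


def flaky(prompt: str) -> str:
    raise TimeoutError("upstream timeout")     # simulated outage


def steady(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    gw = Gateway([Provider("primary", flaky, 0.02),
                  Provider("backup", steady, 0.01)], budget=0.05)
    print(gw.complete("hi"))
    print(gw.log)
```

Keeping the log next to the routing decision is the design choice worth copying: every fallback and budget skip leaves a record.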

Self-hosted serving / inference

Tool	Verdict	Review
vLLM	★ default	High-throughput serving (PagedAttention, continuous batching). The default for self-hosting open models at scale.
Ollama	★ (local)	Dead-simple local model running for dev/prototyping. Not for high-throughput prod.
TGI / TensorRT-LLM	solid	TGI is Hugging Face's Text Generation Inference server; TensorRT-LLM gives maximum performance on NVIDIA GPUs but is harder to operate.
llama.cpp	niche	CPU/edge/quantized (GGUF) inference. Great for laptops/embedded.

Vector stores / retrieval

Tool	Verdict	Review
pgvector (Postgres)	★ default	Vectors in the DB you already run — transactions, joins, one less system. Default until scale demands more.
Qdrant	★ solid	Fast, open, easy filtering + hybrid search. Best dedicated open vector DB.
Weaviate / Milvus	solid	Weaviate is feature-rich; Milvus is built for billion-scale collections. Both are more to operate.
Pinecone	commercial	Managed, zero-ops, scales well. Pay to not run infra.

start with pgvector: Don't add a dedicated vector DB on day one. pgvector in your existing Postgres handles most workloads, with transactions and metadata filtering included. Graduate to Qdrant/Milvus only at real scale.
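A sketch of what "metadata filtering + vectors in one query" looks like with pgvector. The `docs` table, its columns, and the 3-dimensional vectors are hypothetical; `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver. The function only builds the SQL and parameters, so it runs without a database.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical schema, shown for orientation only.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS docs (
    id        bigserial PRIMARY KEY,
    tenant_id text NOT NULL,
    body      text NOT NULL,
    embedding vector(3)          -- dimension must match your embedding model
);
"""

SEARCH_SQL = """
SELECT id, body, embedding <=> %s::vector AS cosine_distance
FROM docs
WHERE tenant_id = %s             -- ordinary metadata filter, same query
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def build_search(query_embedding: list[float], tenant_id: str, k: int = 5):
    """Return (sql, params) for a filtered nearest-neighbour search."""
    vec = to_vector_literal(query_embedding)
    return SEARCH_SQL, (vec, tenant_id, vec, k)


if __name__ == "__main__":
    sql, params = build_search([0.1, 0.2, 0.3], "acme", k=3)
    print(sql)
    print(params)
```

With a driver such as psycopg, the result would be passed to `cursor.execute(sql, params)`; the filter and the similarity ranking share one transaction and one query plan.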

Evaluation & testing

Tool	Verdict	Review
promptfoo	★ default	Declarative prompt/RAG eval + red-teaming in CI — compare prompts/models against test cases. The default open eval.
Ragas	solid	RAG-specific metrics (faithfulness, context precision/recall). Pair with promptfoo.
LangSmith / Langfuse evals	solid	Dataset + eval runs tied to traces. Langfuse is open-source.

no evals = flying blind: LLM changes (prompt, model, temperature) regress silently. Build an eval set early and run it in CI on every change; "looks good in the demo" is not a test.
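A minimal sketch of an eval gate for CI, written in plain Python rather than promptfoo's config format. The eval set, the substring check, `stub_model`, and the 0.9 threshold are all assumptions for illustration; a real suite would call your app's entry point and use richer assertions.

```python
from typing import Callable

# Hypothetical eval set: each case names substrings the answer must contain.
EVAL_SET = [
    {"input": "capital of France?", "must_contain": ["Paris"]},
    {"input": "2 + 2?", "must_contain": ["4"]},
    {"input": "capital of Japan?", "must_contain": ["Tokyo"]},
]


def stub_model(prompt: str) -> str:
    """Stand-in for the real model call; replace with your app's entry point."""
    canned = {"capital of France?": "Paris.", "2 + 2?": "The answer is 4."}
    return canned.get(prompt, "I don't know.")


def run_evals(model: Callable[[str], str], cases: list[dict]) -> float:
    """Run every case and return the fraction that passed."""
    passed = 0
    for case in cases:
        output = model(case["input"])
        ok = all(s in output for s in case["must_contain"])
        passed += int(ok)
        print(f"{'PASS' if ok else 'FAIL'}  {case['input']!r} -> {output!r}")
    return passed / len(cases)


def ci_exit_code(pass_rate: float, threshold: float = 0.9) -> int:
    """0 lets the CI job pass; any non-zero value fails it."""
    return 0 if pass_rate >= threshold else 1
```

In a CI script this would end with `sys.exit(ci_exit_code(run_evals(model, EVAL_SET)))`, so a prompt or model change that drops the pass rate blocks the merge instead of regressing silently.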

Observability & tracing

Tool	Verdict	Review
Langfuse	★ default	Open-source LLM tracing, costs, prompt management, evals. Self-host or cloud. Default for observability.
LangSmith	solid	Polished tracing/eval (LangChain-native, works standalone). Commercial.
Arize Phoenix	solid	Open tracing + eval + drift, OpenTelemetry-based.
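What these tracing tools capture per step can be sketched with a decorator. This is not the Langfuse, LangSmith, or Phoenix API: `traced`, the in-memory `TRACES` list, and the two pipeline functions are hypothetical and exist only to show the shape of a span (name, inputs, output or error, latency).

```python
import functools
import time

TRACES: list[dict] = []   # in-memory sink; a real setup would export these


def traced(name: str):
    """Record inputs, output or error, and latency for each call of a step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"name": name, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every call leaves a span.
                span["latency_s"] = time.perf_counter() - start
                TRACES.append(span)
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]              # hypothetical retriever


@traced("generate")
def generate(query: str, docs: list[str]) -> str:
    return f"answer to {query!r} using {len(docs)} doc(s)"   # hypothetical model call


if __name__ == "__main__":
    generate("pgvector", retrieve("pgvector"))
    for span in TRACES:
        print(span["name"], f"{span['latency_s']:.6f}s")
```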

Agents & tool use

Tool	Verdict	Review
LangGraph	★ default	Explicit state-machine/graph agents — controllable, debuggable, supports human-in-the-loop + persistence. Default for real agents.
CrewAI / AutoGen	solid	Multi-agent collaboration frameworks. Fast to demo; watch token cost + loops.
MCP (Model Context Protocol)	★ (standard)	Open standard for tool/data connectors to LLMs. Increasingly the way to expose tools to agents.
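The "explicit state-machine" idea can be sketched without any framework. This is not LangGraph's API: the node functions, the `GRAPH` dict, and the fixed two-step plan are hypothetical. Each node mutates shared state and names the next node, and a hard step cap guards against the runaway loops mentioned above.

```python
from typing import Callable

State = dict                       # shared state passed between nodes
Node = Callable[[State], str]      # a node mutates state, returns next node name

END = "END"


def plan(state: State) -> str:
    state["todo"] = ["lookup", "summarize"]    # hypothetical fixed plan
    return "act"


def act(state: State) -> str:
    step = state["todo"].pop(0)
    state.setdefault("done", []).append(step)  # a real node would call a tool here
    return "act" if state["todo"] else "review"


def review(state: State) -> str:
    state["approved"] = True                   # hook for a human-in-the-loop check
    return END


GRAPH: dict[str, Node] = {"plan": plan, "act": act, "review": review}


def run(graph: dict[str, Node], start: str, state: State,
        max_steps: int = 10) -> State:
    """Walk the graph from `start` until a node returns END or the cap is hit."""
    node = start
    state["path"] = []
    for _ in range(max_steps):                 # hard cap guards against loops
        state["path"].append(node)
        node = graph[node](state)
        if node == END:
            return state
    raise RuntimeError(f"step limit {max_steps} reached at node {node!r}")


if __name__ == "__main__":
    print(run(GRAPH, "plan", {}))
```

Because every transition is an explicit return value, the recorded `path` doubles as a debug trace of what the agent actually did.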

Fine-tuning & training

Tool	Verdict	Review
PEFT / LoRA (HF)	★ default	Parameter-efficient fine-tuning — adapt a model on one GPU. The default approach.
Axolotl / Unsloth	solid	Fine-tuning wrappers: Axolotl is config-driven; Unsloth advertises roughly 2× faster training with lower memory use.
Hugging Face (Transformers/Datasets/Hub)	★ default	The model + dataset hub and libraries everything builds on.
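Why LoRA is "parameter-efficient" comes down to simple arithmetic. For a weight matrix of shape d × k, LoRA freezes it and trains a low-rank update B·A with B of shape d × r and A of shape r × k, so the trainable count drops from d·k to r·(d + k). The dimensions below (4096 × 4096, rank 8) are illustrative and not tied to any specific model.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when fine-tuning the whole d x k matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update: B is d x r, A is r x k."""
    return r * (d + k)


if __name__ == "__main__":
    d = k = 4096   # illustrative layer size
    r = 8          # illustrative LoRA rank
    full = full_params(d, k)
    lora = lora_params(d, k, r)
    print(full, lora, f"{lora / full:.2%}")   # 16777216 65536 0.39%
```

At these sizes the adapter trains under half a percent of the layer's weights, which is what makes single-GPU adaptation practical.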

A sensible default stack

Access: LiteLLM (multi-provider, budgets, fallbacks).
RAG: LlamaIndex + pgvector; embeddings via the provider or a local model.
Agents: LangGraph + MCP tools.
Eval: promptfoo (+ Ragas for RAG) in CI on every change.
Observability: Langfuse (traces, cost, prompts).
Self-host (if needed): vLLM for prod, Ollama for local dev.
Fine-tune (if needed): PEFT/LoRA via Axolotl or Unsloth.
