← Toolboxes

TOOLBOX · AI ENGINEERING · CURATED + REVIEWED

The AI Engineering Toolbox.

ai-engineering tools reviews llm
Curated LLM/ML engineering stack — frameworks, serving, retrieval, eval, observability, agents, fine-tuning. Verdict tags: ★ default pick, solid, niche, commercial. Principle: thin, observable, eval-driven beats a heavy framework you can't debug.

App frameworks / orchestration

ToolVerdictReview
LlamaIndex★ (RAG)Best-in-class for retrieval/RAG data plumbing — loaders, indexes, query engines. Reach for it when the problem is "data → context".
LangChain / LangGraphsolidHuge ecosystem; LangChain can be over-abstracted — prefer LangGraph for explicit, stateful agent graphs. Don't hide your prompts behind magic.
DSPyniche/★Program + compile prompts (optimize against a metric) instead of hand-tuning. Powerful for serious pipelines; steeper learning curve.
Plain SDK + functions★ defaultFor simple apps, the provider SDK + your own code beats any framework — fewer abstractions to debug.
framework ≠ requirement Many "LLM apps" are a prompt + a function call + a retriever. Reach for a framework when you need orchestration/state/eval — not by default. You must always be able to see the exact prompt sent to the model.

Gateways & provider access

ToolVerdictReview
LiteLLM★ defaultOne OpenAI-compatible API across 100+ providers + a proxy with keys, budgets, fallbacks, logging. The default abstraction layer.
Vercel AI Gateway / OpenRoutersolidHosted multi-provider routing + observability + fallbacks. Fast to adopt.
Vercel AI SDKsolidTS-first streaming/tool-calling SDK for web apps; provider-agnostic via the gateway.

Self-hosted serving / inference

ToolVerdictReview
vLLM★ defaultHigh-throughput serving (PagedAttention, continuous batching). The default for self-hosting open models at scale.
Ollama★ (local)Dead-simple local model running for dev/prototyping. Not for high-throughput prod.
TGI / TensorRT-LLMsolidHF Text Generation Inference; TensorRT-LLM for max NVIDIA perf (harder to operate).
llama.cppnicheCPU/edge/quantized (GGUF) inference. Great for laptops/embedded.

Vector stores / retrieval

ToolVerdictReview
pgvector (Postgres)★ defaultVectors in the DB you already run — transactions, joins, one less system. Default until scale demands more.
Qdrant★ solidFast, open, easy filtering + hybrid search. Best dedicated open vector DB.
Weaviate / MilvussolidFeature-rich / built for billion-scale (Milvus). More to operate.
PineconecommercialManaged, zero-ops, scales well. Pay to not run infra.
start with pgvector Don't add a dedicated vector DB on day one. pgvector in your existing Postgres handles most workloads with transactions + metadata filtering. Graduate to Qdrant/Milvus only at real scale.

Evaluation & testing

ToolVerdictReview
promptfoo★ defaultDeclarative prompt/RAG eval + red-teaming in CI — compare prompts/models against test cases. The default open eval.
RagassolidRAG-specific metrics (faithfulness, context precision/recall). Pair with promptfoo.
LangSmith / Langfuse evalssolidDataset + eval runs tied to traces. Langfuse is open-source.
no evals = flying blind LLM changes (prompt, model, temperature) regress silently. Build an eval set early and run it in CI on every change — "looks good in the demo" is not a test.

Observability & tracing

ToolVerdictReview
Langfuse★ defaultOpen-source LLM tracing, costs, prompt management, evals. Self-host or cloud. Default for observability.
LangSmithsolidPolished tracing/eval (LangChain-native, works standalone). Commercial.
Arize PhoenixsolidOpen tracing + eval + drift, OpenTelemetry-based.

Agents & tool use

ToolVerdictReview
LangGraph★ defaultExplicit state-machine/graph agents — controllable, debuggable, supports human-in-the-loop + persistence. Default for real agents.
CrewAI / AutoGensolidMulti-agent collaboration frameworks. Fast to demo; watch token cost + loops.
MCP (Model Context Protocol)★ (standard)Open standard for tool/data connectors to LLMs. Increasingly the way to expose tools to agents.

Fine-tuning & training

ToolVerdictReview
PEFT / LoRA (HF)★ defaultParameter-efficient fine-tuning — adapt a model on one GPU. The default approach.
Axolotl / UnslothsolidConfig-driven (Axolotl) / 2× faster + low-mem (Unsloth) fine-tuning wrappers.
Hugging Face (Transformers/Datasets/Hub)★ defaultThe model + dataset hub and libraries everything builds on.

A sensible default stack

  • Access: LiteLLM (multi-provider, budgets, fallbacks).
  • RAG: LlamaIndex + pgvector; embeddings via the provider or a local model.
  • Agents: LangGraph + MCP tools.
  • Eval: promptfoo (+ Ragas for RAG) in CI on every change.
  • Observability: Langfuse (traces, cost, prompts).
  • Self-host (if needed): vLLM for prod, Ollama for local dev.
  • Fine-tune (if needed): PEFT/LoRA via Axolotl or Unsloth.
← prev: Security next: SRE →
© cvam — written in plaintext, served warm