Curated LLM/ML engineering stack — frameworks, serving, retrieval, eval, observability, agents,
fine-tuning. Verdict tags: ★ default pick, solid,
niche, commercial. Principle: thin, observable, eval-driven
beats a heavy framework you can't debug.
App frameworks / orchestration
| Tool | Verdict | Review |
| LlamaIndex | ★ (RAG) | Best-in-class for retrieval/RAG data plumbing — loaders, indexes, query engines. Reach for it when the problem is "data → context". |
| LangChain / LangGraph | solid | Huge ecosystem; LangChain can be over-abstracted — prefer LangGraph for explicit, stateful agent graphs. Don't hide your prompts behind magic. |
| DSPy | niche/★ | Program + compile prompts (optimize against a metric) instead of hand-tuning. Powerful for serious pipelines; steeper learning curve. |
| Plain SDK + functions | ★ default | For simple apps, the provider SDK + your own code beats any framework — fewer abstractions to debug. |
framework ≠ requirement
Many "LLM apps" are a prompt + a function call + a retriever. Reach for a framework when you
need orchestration/state/eval — not by default. You must always be able to see the exact prompt
sent to the model.
Gateways & provider access
| Tool | Verdict | Review |
| LiteLLM | ★ default | One OpenAI-compatible API across 100+ providers + a proxy with keys, budgets, fallbacks, logging. The default abstraction layer. |
| Vercel AI Gateway / OpenRouter | solid | Hosted multi-provider routing + observability + fallbacks. Fast to adopt. |
| Vercel AI SDK | solid | TS-first streaming/tool-calling SDK for web apps; provider-agnostic via the gateway. |
Self-hosted serving / inference
| Tool | Verdict | Review |
| vLLM | ★ default | High-throughput serving (PagedAttention, continuous batching). The default for self-hosting open models at scale. |
| Ollama | ★ (local) | Dead-simple local model running for dev/prototyping. Not for high-throughput prod. |
| TGI / TensorRT-LLM | solid | HF Text Generation Inference; TensorRT-LLM for max NVIDIA perf (harder to operate). |
| llama.cpp | niche | CPU/edge/quantized (GGUF) inference. Great for laptops/embedded. |
Vector stores / retrieval
| Tool | Verdict | Review |
| pgvector (Postgres) | ★ default | Vectors in the DB you already run — transactions, joins, one less system. Default until scale demands more. |
| Qdrant | ★ solid | Fast, open, easy filtering + hybrid search. Best dedicated open vector DB. |
| Weaviate / Milvus | solid | Feature-rich / built for billion-scale (Milvus). More to operate. |
| Pinecone | commercial | Managed, zero-ops, scales well. Pay to not run infra. |
start with pgvector
Don't add a dedicated vector DB on day one. pgvector in your existing Postgres handles most
workloads with transactions + metadata filtering. Graduate to Qdrant/Milvus only at real scale.
Evaluation & testing
| Tool | Verdict | Review |
| promptfoo | ★ default | Declarative prompt/RAG eval + red-teaming in CI — compare prompts/models against test cases. The default open eval. |
| Ragas | solid | RAG-specific metrics (faithfulness, context precision/recall). Pair with promptfoo. |
| LangSmith / Langfuse evals | solid | Dataset + eval runs tied to traces. Langfuse is open-source. |
no evals = flying blind
LLM changes (prompt, model, temperature) regress silently. Build an eval set early and run it in
CI on every change — "looks good in the demo" is not a test.
Observability & tracing
| Tool | Verdict | Review |
| Langfuse | ★ default | Open-source LLM tracing, costs, prompt management, evals. Self-host or cloud. Default for observability. |
| LangSmith | solid | Polished tracing/eval (LangChain-native, works standalone). Commercial. |
| Arize Phoenix | solid | Open tracing + eval + drift, OpenTelemetry-based. |
Agents & tool use
| Tool | Verdict | Review |
| LangGraph | ★ default | Explicit state-machine/graph agents — controllable, debuggable, supports human-in-the-loop + persistence. Default for real agents. |
| CrewAI / AutoGen | solid | Multi-agent collaboration frameworks. Fast to demo; watch token cost + loops. |
| MCP (Model Context Protocol) | ★ (standard) | Open standard for tool/data connectors to LLMs. Increasingly the way to expose tools to agents. |
Fine-tuning & training
| Tool | Verdict | Review |
| PEFT / LoRA (HF) | ★ default | Parameter-efficient fine-tuning — adapt a model on one GPU. The default approach. |
| Axolotl / Unsloth | solid | Config-driven (Axolotl) / 2× faster + low-mem (Unsloth) fine-tuning wrappers. |
| Hugging Face (Transformers/Datasets/Hub) | ★ default | The model + dataset hub and libraries everything builds on. |
A sensible default stack
- Access: LiteLLM (multi-provider, budgets, fallbacks).
- RAG: LlamaIndex + pgvector; embeddings via the provider or a local model.
- Agents: LangGraph + MCP tools.
- Eval: promptfoo (+ Ragas for RAG) in CI on every change.
- Observability: Langfuse (traces, cost, prompts).
- Self-host (if needed): vLLM for prod, Ollama for local dev.
- Fine-tune (if needed): PEFT/LoRA via Axolotl or Unsloth.