// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 11 min read · V1

vLLM — the inference engine that stopped wasting your GPU.

TL;DR: vLLM is a high-throughput LLM inference engine. Two ideas make it fast: PagedAttention (treat the KV cache like paged virtual memory, killing fragmentation) and continuous batching (slot new requests into the running batch instead of waiting for it to drain). The project's launch benchmarks report up to ~24× the throughput of serving with plain Hugging Face Transformers. Run it as a Python library or as a drop-in OpenAI-compatible server.

What it is

vLLM is an open-source library and server for fast LLM inference and serving. It loads a model (Hugging Face weights, many architectures) and serves completions at high throughput, exposing an OpenAI-compatible API so existing clients just point at it. Its re-architected V1 engine is the current core. In the AI Native landscape it's in AI Native Infra › Workload Runtime — and it's the engine inside many higher-level serving stacks.

Why it exists

Naive LLM serving leaves GPUs badly underused, for two reasons. First, the KV cache (per-request attention memory) is allocated in big contiguous chunks sized for the maximum length, so memory fragments and few concurrent requests fit. Second, static batching holds every slot until the slowest sequence in the batch finishes, so completed requests sit idle and new ones queue. vLLM attacks both, at the memory layer and at the scheduler layer, so the same GPU serves far more traffic.
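
To put rough numbers on the memory problem, here is a minimal sketch of the arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble an 8B Llama-style model, and the request lengths are made up for illustration; the function names are hypothetical, not part of vLLM.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_waste(actual_lens: list[int], max_len: int) -> float:
    """Fraction of reserved KV slots left unused when every request reserves max_len."""
    reserved = max_len * len(actual_lens)
    used = sum(actual_lens)
    return 1 - used / reserved


# Assumed shape, roughly an 8B Llama-style model with grouped-query attention.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB of KV cache per token")

# Hypothetical traffic: four requests, each reserving 4096 tokens up front.
lens = [200, 350, 1200, 90]
waste = contiguous_waste(lens, max_len=4096)
print(f"{waste:.0%} of the reserved KV memory is never used")
```

Under these assumptions each token costs 128 KiB, and close to nine tenths of the contiguous reservation goes unused, which is the gap the next two sections close.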

PagedAttention

The KV cache is split into small fixed-size blocks (pages) allocated on demand, like OS virtual memory — instead of one contiguous block per sequence. No pre-reserving for max length, almost no fragmentation, and blocks can even be shared across requests (prefix caching). The result: bigger batches in the same GPU memory.

[Figure: contiguous KV reserves space for the max length and wastes most of it (legend: used, reserved / wasted). PagedAttention uses small pages allocated as needed, so more sequences fit and batches get bigger.]

Fig 1 — Paging the KV cache removes the reserved-but-unused waste.
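
The paging idea can be sketched as a toy block allocator. This is an illustration of the bookkeeping only, not vLLM's implementation: the class name, method names, and the 16-token block size are assumptions, and no attention math happens here.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids not in use
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> its blocks, in order
        self.lengths: dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Record one more token, taking a new block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:     # allocated blocks are all full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


kv_cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    kv_cache.append_token("req-a")   # 20 tokens need 2 blocks
for _ in range(5):
    kv_cache.append_token("req-b")   # 5 tokens need 1 block
print(kv_cache.block_tables)
print(len(kv_cache.free_blocks), "blocks still free")
```

Because a sequence only ever holds the blocks it has actually filled (plus one partly filled block), the worst-case waste per sequence is under one block rather than most of a max-length reservation.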

Continuous batching

Rather than fixing a batch and waiting for every sequence to finish, vLLM's scheduler runs a rolling batch: it re-evaluates the batch at every decode step, so the moment one sequence finishes generating, a waiting request takes its slot. The GPU stays saturated, p50 latency drops, and bursty production traffic is absorbed smoothly, with reported gains of up to ~23× throughput vs static batching.
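
A small simulation makes the scheduling difference concrete. It is a sketch under simplifying assumptions: every decode step costs the same, prefill and memory limits are ignored, and the output lengths are hypothetical. The function names are illustrative and not vLLM APIs.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes; the next batch waits."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """After every decode step, finished sequences leave and waiting ones join."""
    waiting = deque(lengths)
    running: list[int] = []   # tokens still to generate for each running sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:          # refill free slots
            running.append(waiting.popleft())
        running = [remaining - 1 for remaining in running]    # one decode step for all
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (tokens to generate per request), batch size 2.
lengths = [100, 5, 5, 100, 5, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print("static:    ", static_steps, "decode steps")
print("continuous:", continuous_steps, "decode steps")
```

On this made-up workload the static scheduler needs 210 steps and the rolling one 115, because short requests no longer wait behind a long neighbour. Real gains depend on the length distribution and on how much KV memory is available.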

More that's in the box

  • OpenAI-compatible server: /v1/chat/completions & /v1/completions, streaming.
  • Tensor / pipeline parallelism — shard big models across multiple GPUs.
  • Quantization — FP8/INT8/AWQ/GPTQ to fit more in memory.
  • Prefix caching & chunked prefill — reuse shared prompt prefixes; interleave prefill with decode for steadier latency.
  • Speculative decoding & LoRA — faster decode; serve many adapters on one base model.
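
Prefix caching from the list above can be sketched as hashing KV blocks by their full token prefix, so two prompts that start the same way resolve to the same blocks. The hashing scheme, the 4-token block size, and the class and function names here are assumptions for illustration, not a description of vLLM's internals.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int = 4) -> list[str]:
    """Hash each full block together with everything before it.

    Equal hashes therefore mean equal prefixes, not just equal blocks.
    """
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size   # skip the trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache that remembers which block hashes already have KV computed."""

    def __init__(self) -> None:
        self.cached: set[str] = set()

    def admit(self, token_ids: list[int], block_size: int = 4) -> int:
        """Return how many leading tokens can reuse cached KV, then cache this prompt."""
        hashes = block_hashes(token_ids, block_size)
        hits = 0
        for block_hash in hashes:
            if block_hash not in self.cached:
                break
            hits += 1
        self.cached.update(hashes)
        return hits * block_size


system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]   # hypothetical shared prefix, 2 blocks
prefix_cache = PrefixCache()
first = prefix_cache.admit(system_prompt + [21, 22, 23, 24, 25])
second = prefix_cache.admit(system_prompt + [31, 32, 33, 34])
print(first, "tokens reused on the first request")
print(second, "tokens reused on the second request")
```

The first request finds nothing cached; the second skips prefill for the 8 shared tokens and only computes its own suffix.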

Quick start

Install, serve a model with one command, then call it like OpenAI:

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct      # OpenAI server on :8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hi"}]}'

Any OpenAI client works by setting base_url to the vLLM server. For multi-GPU, add --tensor-parallel-size N.
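
For a dependency-free version of the curl call, the sketch below builds the same request with Python's standard library. It assumes the server from the quick start is listening on localhost:8000 and that the response follows the OpenAI chat-completions shape; the helper names are hypothetical. Building the request is separated from sending it, so the first part runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumes `vllm serve` is running locally


def build_chat_request(prompt: str, model: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST the curl example sends, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str) -> str:
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("hi", model="meta-llama/Llama-3.1-8B-Instruct")
print(request.get_method(), request.full_url)
```

With the server up, `chat("hi", "meta-llama/Llama-3.1-8B-Instruct")` would return the model's reply as a string.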

When to use, when to skip

Use it when you self-host LLMs and care about throughput and GPU cost — high-concurrency inference, batch generation, or as the engine behind a serving platform. It's the default high-performance choice for production OSS inference.

Skip it for casual local use on a laptop — Ollama is friendlier. For full cluster lifecycle (autoscaling, canary, multi-model) put vLLM under KServe rather than running it bare. SGLang and TGI are close alternatives worth benchmarking for your workload.

Heads up: throughput vs latency is a tradeoff you tune. The gpu-memory-utilization, max-num-seqs, and chunked-prefill settings change the balance. The model weights and the KV cache must also actually fit in GPU memory, so quantize or shard with tensor parallelism for big models.
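
A back-of-the-envelope check for the "must actually fit" part might look like the sketch below. It is deliberately optimistic: it counts only weights, ignores activations and framework overhead, and assumes weights split evenly across tensor-parallel GPUs. The function name, the card sizes, and the 0.9 utilization figure are assumptions for illustration.

```python
def fits_on_gpu(params_billion: float, bytes_per_param: float, gpu_mem_gib: float,
                tensor_parallel: int = 1,
                gpu_memory_utilization: float = 0.9) -> tuple[bool, float]:
    """Rough check: do the weights fit, and how many GiB per GPU remain for KV cache?"""
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30 / tensor_parallel
    budget_gib = gpu_mem_gib * gpu_memory_utilization   # share of the card the engine may use
    kv_room_gib = budget_gib - weights_gib
    return kv_room_gib > 0, round(kv_room_gib, 1)


# Hypothetical hardware: 24 GiB and 80 GiB cards.
print(fits_on_gpu(8, 2, gpu_mem_gib=24))                       # 8B, 16-bit, one 24 GiB card
print(fits_on_gpu(70, 2, gpu_mem_gib=80))                      # 70B, 16-bit, one 80 GiB card
print(fits_on_gpu(70, 1, gpu_mem_gib=80))                      # 70B, 8-bit, one 80 GiB card
print(fits_on_gpu(70, 2, gpu_mem_gib=80, tensor_parallel=4))   # 70B, 16-bit, four 80 GiB cards
```

The second value in each result is the space left for KV cache per GPU. A model that "fits" with only a few GiB to spare will serve very few concurrent sequences, which is where the earlier per-token KV arithmetic comes back in.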

vs the alternatives

Tool   | Best for                                     | Trade-off
vLLM   | High-throughput self-hosted LLM serving      | You operate the GPU box/cluster
SGLang | Structured/agentic workloads, RadixAttention | Benchmark per workload
Ollama | Easy local/dev model running                 | Not built for high concurrency
KServe | Cluster serving platform (wraps vLLM)        | More infra to run

References

Verified against the official vLLM docs (docs.vllm.ai), May 2026.

© cvam — written in plaintext, served warm