// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 10 min read · v0.4+

SGLang — fast serving with a radix tree for a memory.

TL;DR: SGLang is a high-performance LLM/VLM serving framework whose signature trick is RadixAttention: it stores KV-cache prefixes in a radix tree so shared prefixes (system prompts, few-shot examples, multi-turn chats) are reused automatically across requests. Pair that with a near zero-overhead scheduler and structured-output support, and it routinely matches or beats other engines on prefix-heavy and agentic workloads. It also ships an OpenAI-compatible server.

What it is

SGLang is an open-source serving engine for large language and vision-language models. It has two halves: a fast runtime (the server that executes inference) and a frontend language (a Python DSL for expressing complex generation programs — branching, parallelism, structured output). In the AI Native landscape it sits in Inference › Framework, alongside vLLM and TensorRT-LLM, and is widely deployed at scale (notably for large frontier-model serving).

Why it exists

Real workloads repeat prefixes constantly: the same long system prompt on every call, the same few-shot block, the growing history of a chat. Many engines recompute those prefixes on every request or, at best, cache them in a limited way. SGLang was built to make prefix reuse a first-class, automatic feature, and to cut the per-step CPU scheduling overhead that throttles throughput at high request rates.

RadixAttention

The KV cache for every request is indexed in a radix tree (a compressed prefix trie). When a new request shares a prefix with something already cached, SGLang reuses those KV blocks instead of recomputing them, and it does so across different requests, not just within one. Common system prompts and chat histories become nearly free after the first hit, so throughput climbs sharply on prefix-heavy traffic.

[Figure: a shared "You are…" prefix node branching into user A, user B, and user C turns. The shared prefix is computed once and reused by every branch.]

Fig 1 — RadixAttention shares cached prefix KV across requests via a radix tree.
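
To make the idea in Fig 1 concrete, here is a minimal sketch of prefix matching over token IDs. It is a toy, not SGLang's implementation: it uses an uncompressed trie (a real radix tree merges single-child chains into one edge), it stores no KV tensors, and it never evicts anything. The names PrefixCache and match_and_insert are made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token id. A real radix tree would merge
    # single-child chains into one edge; this sketch skips that.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree standing in for a KV-cache index."""

    def __init__(self):
        self.root = Node()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: everything from here on is new work.
                child = Node()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused


cache = PrefixCache()
system = [1, 2, 3, 4]  # stand-in token ids for a shared system prompt
print(cache.match_and_insert(system + [10, 11]))      # 0: nothing cached yet
print(cache.match_and_insert(system + [20]))          # 4: system prompt reused
print(cache.match_and_insert(system + [10, 11, 12]))  # 6: prompt plus first turn reused
```

The returned count is the part of the prompt a server could skip during prefill; only the remaining tokens would need fresh computation.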

More that's in the box

  • Zero-overhead scheduler — overlaps CPU scheduling with GPU compute so the GPU rarely stalls between steps.
  • Structured / constrained output — fast JSON-schema and regex-constrained decoding (compressed FSM).
  • Continuous batching, tensor/pipeline parallelism, quantization (FP8, AWQ, GPTQ).
  • Speculative decoding, LoRA, multi-modal (vision-language models).
  • OpenAI-compatible API plus the SGLang frontend DSL for complex programs.
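
The structured-output bullet above is easier to picture with a toy. The sketch below builds a character-level state machine that accepts only a fixed set of strings, lets a stand-in "model" choose only among legal next characters, and skips the model entirely wherever a single continuation is legal, which is the intuition behind compressing the FSM. It is an illustration under simplifying assumptions (characters instead of tokens, a list of literal strings instead of a regex or JSON schema), and build_fsm, constrained_decode, and choose are hypothetical names, not SGLang APIs.

```python
def build_fsm(strings):
    """Build a character-level FSM (stored as a trie) accepting exactly `strings`."""
    transitions = {0: {}}
    accepting = set()
    for s in strings:
        state = 0
        for ch in s:
            nxt = transitions[state].get(ch)
            if nxt is None:
                nxt = len(transitions)
                transitions[nxt] = {}
                transitions[state][ch] = nxt
            state = nxt
        accepting.add(state)
    return transitions, accepting


def constrained_decode(transitions, accepting, choose):
    """Decode one string; `choose` plays the model and only sees legal options."""
    state, out, model_calls = 0, [], 0
    # Toy rule: keep going until a state has no outgoing transitions.
    while transitions[state]:
        options = transitions[state]
        if len(options) == 1:
            # Only one legal continuation: emit it without asking the model.
            ch = next(iter(options))
        else:
            ch = choose(sorted(options))
            model_calls += 1
        out.append(ch)
        state = options[ch]
    assert state in accepting
    return "".join(out), model_calls


fsm, accepting = build_fsm(['{"ok":true}', '{"ok":false}'])
text, calls = constrained_decode(fsm, accepting, choose=lambda opts: opts[0])
print(text, calls)  # {"ok":false} 1
```

Only one of the twelve output characters needed a model decision here; the rest were forced by the format.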

Quick start

Install SGLang, launch the OpenAI-compatible server, then send a test request:

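# Install SGLang with all optional dependencies
pip install "sglang[all]"

# Launch the OpenAI-compatible server on port 30000
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000

# From another terminal, send a test chat request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"hi"}]}'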

Add --tp 2 for tensor parallelism across 2 GPUs. Any OpenAI client works by pointing base_url at the server.
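
For a Python caller without extra dependencies, the curl request above can be sketched with the standard library alone. This assumes the quick-start server is reachable at http://localhost:30000/v1 and that the response follows the usual OpenAI chat-completions shape; build_chat_request and chat are helper names invented for this example.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="default"):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body with choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(base_url, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("http://localhost:30000/v1", "hi") should return the model's reply as a string.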

When to use, when to skip

Use it for prefix-heavy or agentic serving — shared system prompts, long multi-turn chats, structured/JSON output, RAG with repeated context — where RadixAttention pays off. Strong choice at high request rates.

Skip it for laptop/local casual use (Ollama is friendlier) or when you're locked to NVIDIA's fully-tuned stack (TensorRT-LLM). Benchmark against vLLM for your traffic — the winner depends on prefix-reuse rate.

Heads up: RadixAttention's gain scales with how much your prompts actually share. On all-unique prompts the advantage shrinks toward ordinary continuous batching, so measure your real prefix-hit rate before assuming the headline numbers.
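
One rough way to estimate that rate from a prompt log is sketched below. It is an approximation under stated assumptions: whitespace splitting stands in for the real tokenizer, the cache is treated as unbounded (no eviction), and prompts are replayed in log order. The function names are illustrative only.

```python
def shared_prefix_len(a, b):
    """Length of the common leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(prompts, tokenize=str.split):
    """Fraction of prompt tokens that repeat a prefix of some earlier prompt."""
    seen, reused, total = [], 0, 0
    for prompt in prompts:
        tokens = tokenize(prompt)
        total += len(tokens)
        # Best case for this prompt: the longest prefix shared with any earlier one.
        reused += max((shared_prefix_len(tokens, old) for old in seen), default=0)
        seen.append(tokens)
    return reused / total if total else 0.0


log = [
    "You are a helpful assistant. Summarize this ticket",
    "You are a helpful assistant. Translate this email",
    "Write a haiku about GPUs",
]
print(round(prefix_hit_rate(log), 2))  # 0.24 (5 of 21 tokens reused)
```

A value near zero suggests RadixAttention will add little on that traffic; a high value suggests most prefill work could be served from cache.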

vs the alternatives

Tool         | Best for                                  | Trade-off
SGLang       | Prefix-heavy, structured, agentic serving | Gain depends on prefix reuse
vLLM         | General high-throughput serving           | Benchmark per workload
TensorRT-LLM | Max speed on NVIDIA GPUs                  | Build step, NVIDIA-only
KServe       | Cluster serving platform (wraps engines)  | More infra

References

Verified against the official SGLang docs (docs.sglang.ai), June 2026.

© cvam — written in plaintext, served warm