// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 9 min read · TurboMind

LMDeploy — quantize hard, serve fast, from the InternLM team.

TL;DR: LMDeploy is a compression and serving toolkit from the InternLM / OpenMMLab community. Its TurboMind engine targets high request throughput through persistent batching, a blocked KV cache, and optimized CUDA kernels. Its 4-bit weight-only AWQ quantization (W4A16) cuts weight memory to roughly a quarter of FP16 at a small accuracy cost. An OpenAI-compatible server starts with one command.

What it is

LMDeploy is an open-source toolkit for compressing, deploying, and serving LLMs and vision-language models. It ships two backends: TurboMind (a high-performance C++/CUDA engine) and a PyTorch backend (broader model coverage, easier to extend). In the AI Native landscape it's Inference › Runtime, and it's especially strong where quantization matters.

Why it exists

It grew out of serving the InternLM model family at scale, where two things mattered most: squeezing models into less GPU memory without wrecking quality, and pushing maximum requests per second. LMDeploy pairs aggressive, accuracy-preserving quantization with a hand-tuned inference engine to hit both. It often ranks near the top of throughput comparisons on NVIDIA GPUs, though results vary by model and workload.

What's in the box

  • TurboMind engine — persistent (continuous) batching, blocked KV cache, fused CUDA kernels.
  • Quantization — 4-bit weight-only AWQ (W4A16) and KV-cache quantization; big memory savings, small quality hit.
  • Tensor parallelism — multi-GPU for large models.
  • OpenAI-compatible API server plus a Python pipeline API.
  • VLM support — serve vision-language models, not just text.

Fig 1: Quantize with AWQ, then serve through the TurboMind engine. Pipeline: FP16 model → AWQ W4A16 (4-bit weights) → TurboMind (persistent batching) → API.
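
To put a rough number on the memory savings from 4-bit weights, here is a back-of-the-envelope sketch. It counts weight storage only. The 8-billion parameter count is an assumption chosen for illustration, and the helper name weight_memory_gib is hypothetical, not part of LMDeploy. Real savings will be somewhat smaller, because quantization scales and zero points, any layers left unquantized, the KV cache, and activations are all ignored here.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: an 8-billion-parameter model, used only as an example size.
params = 8e9

fp16 = weight_memory_gib(params, 16)  # 16-bit weights
w4 = weight_memory_gib(params, 4)     # 4-bit weights (the "W4" in W4A16)

print(f"FP16 weights:  {fp16:.1f} GiB")   # FP16 weights:  14.9 GiB
print(f"W4A16 weights: {w4:.1f} GiB")     # W4A16 weights: 3.7 GiB
print(f"ratio: {fp16 / w4:.0f}x")         # ratio: 4x
```

The "A16" half of W4A16 means activations stay in 16-bit, so the estimate above applies to weights only and says nothing about runtime activation memory.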

Quick start

Install, then serve any supported model as an OpenAI-compatible endpoint:

pip install lmdeploy
lmdeploy serve api_server meta-llama/Llama-3.1-8B-Instruct   # listens on port 23333 by default

curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}]}'

# quantize to 4-bit AWQ first for big memory savings
lmdeploy lite auto_awq meta-llama/Llama-3.1-8B-Instruct --work-dir llama3-awq

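Add --tp 2 to the serve command for 2-GPU tensor parallelism. To run the quantized model, point lmdeploy serve api_server at the llama3-awq directory instead of the model ID; depending on your LMDeploy version you may also need to pass --model-format awq.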
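
The curl call above can also be made from Python with only the standard library. This is a minimal sketch, not LMDeploy's own client: the function names build_chat_request and chat are hypothetical, the model name is a placeholder you must fill in with whatever your server is serving, and the response parsing assumes the standard OpenAI chat-completions shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one chat request and return the assistant's reply text.

    Assumes the response follows the OpenAI chat-completions schema.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server from the quick start running locally
# (replace the placeholder with the model name your server reports):
#   print(chat("http://localhost:23333", "<served-model-name>", "hi"))
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the wire.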

When to use, when to skip

Use it when you serve on NVIDIA GPUs and want strong throughput plus aggressive, accuracy-preserving quantization to fit bigger models in less VRAM — or when serving InternLM / vision-language models it supports well.

Skip it for non-NVIDIA hardware or laptop-local use (llama.cpp / Ollama). Benchmark against vLLM and SGLang — the throughput leader shifts by model and workload.
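
For the benchmarking advice above, a small harness that treats each server as a black box is often enough for a first comparison. The sketch below is an assumption-laden starting point, not an official benchmark tool: measure_throughput and the send callable are hypothetical names, and it reports only completed requests per second, not token throughput or latency percentiles.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_throughput(
    send: Callable[[str], str],
    prompts: Sequence[str],
    concurrency: int = 8,
) -> dict:
    """Send every prompt through `send` using a thread pool and time the batch.

    `send` is any function that takes a prompt and returns a reply, for
    example a thin wrapper around an HTTP call to the server under test.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        replies = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return {
        "requests": len(replies),
        "seconds": elapsed,
        "requests_per_second": len(replies) / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in for a real client so the sketch runs without a server.
def fake_send(prompt: str) -> str:
    time.sleep(0.01)  # pretend the server took 10 ms
    return prompt.upper()


result = measure_throughput(fake_send, ["hello"] * 16, concurrency=4)
print(result["requests"], "requests at", round(result["requests_per_second"], 1), "req/s")
```

To compare engines fairly, run the same prompts, concurrency, and generation settings against each server, and swap only the send function.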

Heads up: TurboMind generally has the best performance but narrower model coverage than the PyTorch backend. If your model isn't TurboMind-supported, LMDeploy falls back to PyTorch, which is still fast but won't match the headline numbers. Check the support matrix first.

vs the alternatives

Tool         | Best for                                 | Trade-off
LMDeploy     | Quantized high-throughput NVIDIA serving | TurboMind model coverage
vLLM         | General high-throughput serving          | Benchmark per workload
TensorRT-LLM | NVIDIA peak performance                  | Build step
llama.cpp    | Local / CPU / edge                       | Low concurrency

Verified against the official LMDeploy docs (lmdeploy.readthedocs.io), June 2026.

© cvam — written in plaintext, served warm