TL;DR: LMDeploy is a compression and serving toolkit from the InternLM / OpenMMLab community. Its TurboMind engine targets high request throughput through persistent batching, a blocked KV cache, and optimized CUDA kernels. Its 4-bit AWQ (W4A16) weight quantization cuts memory use with only a small accuracy cost. An OpenAI-compatible server starts with one command.
What it is
LMDeploy is an open-source toolkit for compressing, deploying, and serving LLMs and vision-language models. It ships two backends: TurboMind (a high-performance C++/CUDA engine) and a PyTorch backend (broader model coverage, easier to extend). In the AI Native landscape it's Inference › Runtime, and it's especially strong where quantization matters.
Why it exists
It grew out of serving the InternLM model family at scale, where two things mattered most: squeezing models into less GPU memory without wrecking quality, and pushing maximum requests per second. LMDeploy pairs aggressive, accuracy-preserving quantization with a hand-tuned inference engine to hit both. It often tops throughput comparisons on NVIDIA GPUs, though the leader shifts by model and workload.
What's in the box
- TurboMind engine — persistent (continuous) batching, blocked KV cache, fused CUDA kernels.
- Quantization — 4-bit weight-only AWQ (W4A16) and KV-cache quantization; big memory savings, small quality hit.
- Tensor parallelism — multi-GPU for large models.
- OpenAI-compatible API server plus a Python pipeline API.
- VLM support — serve vision-language models, not just text.
Fig 1 — Quantize with AWQ, then serve through the TurboMind engine.
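The features listed above map onto options of the Python pipeline API. The following is a minimal sketch, assuming `lmdeploy` is installed on a machine with an NVIDIA GPU. `engine_options` and `build_pipeline` are hypothetical helper names used only for illustration; the LMDeploy names involved are `pipeline`, `TurbomindEngineConfig`, and its `tp`, `model_format`, and `quant_policy` fields.

```python
def engine_options(tp=1, awq=False, kv_bits=None):
    """Collect keyword arguments for lmdeploy's TurbomindEngineConfig.

    Hypothetical helper for illustration; only the returned keys
    (tp, model_format, quant_policy) are LMDeploy option names.
    """
    if kv_bits not in (None, 4, 8):
        raise ValueError("kv_bits must be None, 4, or 8")
    options = {"tp": tp}                   # number of GPUs for tensor parallelism
    if awq:
        options["model_format"] = "awq"    # weights are 4-bit AWQ (W4A16)
    if kv_bits is not None:
        options["quant_policy"] = kv_bits  # 4 or 8 selects int4 / int8 KV-cache quantization
    return options


def build_pipeline(model_path, **kwargs):
    """Create a TurboMind-backed pipeline (needs lmdeploy and a GPU at call time)."""
    # Imported lazily so the helper above can be used without lmdeploy installed.
    from lmdeploy import TurbomindEngineConfig, pipeline

    config = TurbomindEngineConfig(**engine_options(**kwargs))
    return pipeline(model_path, backend_config=config)
```

For example, `build_pipeline("meta-llama/Llama-3.1-8B-Instruct", tp=2)` would shard the model across two GPUs. Calling the returned pipeline with a list of prompts returns one response per prompt, with the generated text in each response's `text` attribute.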
Quick start
Install, then serve any supported model as an OpenAI-compatible endpoint:

    pip install lmdeploy

    # start an OpenAI-compatible server (default port 23333)
    lmdeploy serve api_server meta-llama/Llama-3.1-8B-Instruct

    curl http://localhost:23333/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"...","messages":[{"role":"user","content":"hi"}]}'

    # quantize to 4-bit AWQ first for big memory savings
    lmdeploy lite auto_awq meta-llama/Llama-3.1-8B-Instruct --work-dir llama3-awq

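Add --tp 2 to the serve command for 2-GPU tensor parallelism. To run the quantized model, pass the llama3-awq directory to lmdeploy serve api_server instead of the original model name. The LMDeploy quantization docs also pass --model-format awq when serving an AWQ checkpoint.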
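The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the quick start is listening on port 23333. `build_chat_request` and `chat` are illustrative helper names, not part of LMDeploy.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:23333", model_name, "hi")` returns the reply as a string. `model_name` has to match a model the server reports under `/v1/models`.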
When to use, when to skip
Use it when you serve on NVIDIA GPUs and want strong throughput plus aggressive, accuracy-preserving quantization to fit bigger models in less VRAM — or when serving InternLM / vision-language models it supports well.
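A back-of-the-envelope estimate shows why quantization changes what fits in VRAM. The sketch below is rough arithmetic, not a measurement. The model shape (8e9 parameters, 32 layers, 8 KV heads, head dimension 128) and the 24 GiB card are assumptions chosen to resemble an 8B Llama-style model on a single GPU, and the helper names are illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(n_params, bits_per_weight):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits=16):
    """KV-cache bytes that one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bits // 8


def tokens_that_fit(vram_gib, weights_gib, kv_bytes):
    """Tokens of KV cache that fit in the VRAM left after loading weights."""
    free_bytes = max(vram_gib - weights_gib, 0) * GIB
    return int(free_bytes // kv_bytes)


# Assumed model shape; replace with the values from the model's own config.
N_PARAMS, LAYERS, KV_HEADS, HEAD_DIM = 8e9, 32, 8, 128

fp16_weights = weight_gib(N_PARAMS, 16)  # roughly 14.9 GiB
awq_weights = weight_gib(N_PARAMS, 4)    # roughly 3.7 GiB, before scales and zero-points
kv_fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)          # 131072 bytes (128 KiB)
kv_int8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits=8)  # 65536 bytes (64 KiB)

print(f"weights: fp16 {fp16_weights:.1f} GiB, 4-bit {awq_weights:.1f} GiB")
print(f"fp16 weights + fp16 KV: {tokens_that_fit(24, fp16_weights, kv_fp16)} cached tokens")
print(f"4-bit weights + int8 KV: {tokens_that_fit(24, awq_weights, kv_int8)} cached tokens")
```

Real checkpoints come out somewhat larger than the 4-bit figure, because quantization scales, zero-points, and any layers left in 16-bit add overhead. Activations and runtime buffers also take VRAM that this estimate ignores.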
Skip it for non-NVIDIA hardware or laptop-local use (llama.cpp / Ollama). Benchmark against vLLM and SGLang — the throughput leader shifts by model and workload.
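Because vLLM and SGLang also expose OpenAI-compatible endpoints, one small harness can drive each server with the same prompts. The sketch below is one possible way to measure request and token throughput, not an official benchmark tool. `send` is a hypothetical callable supplied by the caller that issues one request and returns the number of generated tokens; `fake_send` stands in for it so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send, prompts, concurrency=8):
    """Call send(prompt) for every prompt using a fixed pool of worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, prompts))
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against division by zero
    total_tokens = sum(token_counts)
    return {
        "requests": len(token_counts),
        "tokens": total_tokens,
        "requests_per_s": len(token_counts) / elapsed,
        "tokens_per_s": total_tokens / elapsed,
    }


def fake_send(prompt):
    """Stand-in for a real HTTP call: pretends each word yields one token."""
    time.sleep(0.01)
    return len(prompt.split())


stats = measure_throughput(fake_send, ["hello there", "one two three", "hi"], concurrency=2)
print(stats)
```

For a fair comparison, keep the model, prompts, output length limits, and concurrency identical across servers, and replace `fake_send` with a function that posts to each server's `/v1/chat/completions` endpoint.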
vs the alternatives
| Tool | Best for | Trade-off |
|---|---|---|
| LMDeploy | Quantized high-throughput NVIDIA serving | TurboMind covers fewer models than the PyTorch backend |
| vLLM | General high-throughput serving | Throughput lead varies; benchmark per workload |
| TensorRT-LLM | NVIDIA peak performance | Build step |
| llama.cpp | Local / CPU / edge | Low concurrency |
References
- LMDeploy docs — official documentation.
- InternLM/lmdeploy — source.
- Get started — serve a model.
Extra reads
- AWQ W4A16 quantization — the memory win.
- Supported models matrix — TurboMind vs PyTorch.
Verified against the official LMDeploy docs (lmdeploy.readthedocs.io), June 2026.