TL;DR: TensorRT-LLM is NVIDIA's open-source library for high-performance LLM inference on NVIDIA GPUs. Instead of relying on general-purpose kernels selected at runtime, it compiles your model into an optimized TensorRT engine with fused kernels, low-precision formats (FP8 on Hopper and Blackwell, FP4 on Blackwell), and in-flight batching, aiming for the best latency and throughput the hardware can deliver. The trade: a build step and NVIDIA-only lock-in.
What it is
TensorRT-LLM is an open-source library that takes a model definition and builds a hardware-tuned inference engine for a specific NVIDIA GPU. It pairs with NVIDIA's serving stack (Triton Inference Server, or the newer trtllm-serve) to expose an API. In the AI Native landscape it's Inference › Framework — the option you reach for when you've standardized on NVIDIA hardware and want the absolute ceiling on performance.
Why it exists
General-purpose engines run the same model code across a range of GPUs. TensorRT-LLM specializes instead: it targets a specific GPU's tensor cores, memory hierarchy, and supported precisions, and compiles fused, hardware-specific kernels ahead of time. That ahead-of-time specialization is the source of the extra performance over engines that pick kernels at runtime, especially with the low-precision formats that only NVIDIA's newer GPUs support.
How it works — build then serve
The workflow has a distinct compile phase. You convert the weights and build an engine for your target GPU, precision, and batch settings, then load that engine to serve. The engine is GPU-specific: one built for an H100 may run suboptimally, or not at all, on a different architecture.
Fig 1 — Compile once for a target GPU, then serve the optimized engine.
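The GPU-specific nature of an engine can be made concrete with a small sketch. The helper below is hypothetical and not part of TensorRT-LLM: it derives a cache key from the build inputs named above (model, GPU, precision, batch settings), so an engine built for one target is never silently reused on another. The class, field, and function names, and the example batch size of 64, are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildTarget:
    """Inputs an engine build is specialized for (hypothetical field names)."""
    model: str
    gpu: str             # e.g. "H100"; the engine is tied to this GPU
    precision: str       # e.g. "fp8"
    max_batch_size: int


def engine_key(target: BuildTarget) -> str:
    """Return a stable short hash; changing any build input changes the key."""
    payload = json.dumps(asdict(target), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def needs_rebuild(built_for: BuildTarget, serving_on: BuildTarget) -> bool:
    """An existing engine is reusable only if every build input matches."""
    return engine_key(built_for) != engine_key(serving_on)


h100 = BuildTarget("meta-llama/Llama-3.1-8B-Instruct", "H100", "fp8", 64)
b200 = BuildTarget("meta-llama/Llama-3.1-8B-Instruct", "B200", "fp8", 64)
print(needs_rebuild(h100, h100))  # False: same target, engine can be reused
print(needs_rebuild(h100, b200))  # True: different GPU, build again
```

A key like this could name the directory where a built engine is stored, which keeps engines for different GPUs or precisions side by side in a deployment pipeline.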
What it brings
- Low precision: FP8 on Hopper and Blackwell and FP4 on Blackwell, plus INT8/INT4 weight quantization, for large memory and speed gains.
- In-flight (continuous) batching: via the Triton backend / executor.
- Kernel fusion and optimized attention: FlashAttention-class kernels, paged KV cache.
- Multi-GPU / multi-node: tensor and pipeline parallelism for very large models.
- Also: speculative decoding, LoRA, and broad model coverage (LLMs and multimodal models).
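To illustrate why in-flight batching from the list above helps throughput, here is a toy simulation. It is a simplified model, not TensorRT-LLM's actual scheduler: each request is reduced to a number of decode steps, and prefill cost, KV-cache memory, and scheduling overhead are ignored. The function names are made up for this sketch.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def inflight_batching_steps(lengths, slots):
    """In-flight batching: finished requests leave and waiting ones join
    between decode steps, so slots do not sit idle behind a long request."""
    waiting = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Every active request produces one token; drop the finished ones.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


requests = [8, 2, 2, 2]  # decode steps needed per request
print(static_batching_steps(requests, slots=2))    # 10
print(inflight_batching_steps(requests, slots=2))  # 8
```

In this example the long request occupies one slot for eight steps while the three short requests cycle through the other slot, so the whole workload finishes in eight steps instead of ten.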
Quick start
The high-level tensorrt_llm Python API hides much of the build; trtllm-serve exposes an OpenAI-compatible endpoint:
pip install tensorrt_llm # NVIDIA GPU + CUDA required
trtllm-serve meta-llama/Llama-3.1-8B-Instruct # builds + serves on :8000
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"...","messages":[{"role":"user","content":"hi"}]}'
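The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the server started above is listening on localhost:8000 and returns the usual OpenAI chat-completions response shape (`choices[0].message.content`). The helper names are hypothetical, and passing the Hugging Face model ID as the `model` field is an assumption about how the served model is named.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build (but do not send) a POST to the /v1/chat/completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the first choice's text.
    Assumes an OpenAI-style response body; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Assumption: the served model name matches the ID passed to trtllm-serve.
req = build_chat_request(
    "http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "hi"
)
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# print(send_chat(req))  # uncomment once the server is up
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.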
For production, the Triton Inference Server with the TensorRT-LLM backend adds metrics, dynamic batching, and multi-model hosting.
When to use, when to skip
Use it when you're all-in on NVIDIA GPUs and need maximum throughput or the lowest latency: large-scale production serving, latency SLAs, or workloads that can exploit FP8 (Hopper, Blackwell) or FP4 (Blackwell).
Skip it if you need portability, fast iteration, or support for non-NVIDIA hardware: the build step and GPU-specific engines add operational friction. For quick self-hosting, vLLM or SGLang are simpler and often come close in performance.
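One way to judge whether the low-precision formats are worth the build step is a back-of-envelope weight-memory estimate. The sketch below counts only raw weight storage (parameters times bits per weight); it ignores quantization scale factors, activations, and the KV cache, so real footprints will be higher. The function name and the round figure of 8 billion parameters are assumptions for illustration.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int8": 8, "fp4": 4, "int4": 4}


def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GB (10**9 bytes).
    Ignores quantization scales, activations, and KV cache."""
    bits = BITS_PER_WEIGHT[precision]
    return num_params * bits / 8 / 1e9


# Assumption: a model with roughly 8 billion parameters.
for precision in ("fp16", "fp8", "fp4"):
    print(precision, weight_memory_gb(8e9, precision))
# fp16 16.0
# fp8 8.0
# fp4 4.0
```

Halving the bits per weight halves the weight footprint, which frees GPU memory for larger batches or longer contexts.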
vs the alternatives
| Tool | Best for | Trade-off |
|---|---|---|
| TensorRT-LLM | Max speed on NVIDIA GPUs | Build step, NVIDIA-only |
| vLLM | Portable high-throughput serving | Slightly lower peak performance on NVIDIA GPUs |
| SGLang | Prefix-heavy / structured serving | Benchmark per workload |
| TGI | Hugging Face-native serving | Different performance profile |
Verified against NVIDIA's TensorRT-LLM docs, June 2026.