// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · advanced · 10 min read · NVIDIA

TensorRT-LLM — compile the model to squeeze every NVIDIA FLOP.

TL;DR: TensorRT-LLM is NVIDIA's library for high-performance LLM inference on NVIDIA GPUs. Unlike interpret-at-runtime engines, it compiles your model into an optimized TensorRT engine with fused kernels, low-precision formats (FP8 on Hopper and newer, FP4 on Blackwell), and in-flight batching, targeting top-tier latency and throughput. The trade-off: a build step and NVIDIA-only lock-in.

What it is

TensorRT-LLM is an open-source library that takes a model definition and builds a hardware-tuned inference engine for a specific NVIDIA GPU. It pairs with NVIDIA's serving stack (Triton Inference Server, or the newer trtllm-serve) to expose an API. In the AI Native landscape it's Inference › Framework — the option you reach for when you've standardized on NVIDIA hardware and want the absolute ceiling on performance.

Why it exists

Generic engines run the same model code on any GPU. TensorRT-LLM instead specializes: it knows the exact GPU's tensor cores, memory hierarchy, and supported precisions, and compiles fused, hardware-specific kernels ahead of time. That ahead-of-time specialization is where the extra performance over runtime-interpreted engines comes from — especially the low-precision formats only NVIDIA's newest silicon supports.

How it works — build then serve

The workflow has a distinct compile phase. You convert weights and build an engine for your target GPU + precision + batch settings, then load that engine to serve. The engine is GPU-specific — built for an H100, it won't run optimally (or at all) on a different architecture.

HF weights  ->  build engine (FP8, fuse, GPU-tuned)  ->  .engine file  ->  serve

Fig 1 — Compile once for a target GPU, then serve the optimized engine.
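The build-then-serve contract can be sketched as a small check: an engine carries the GPU and limits it was built for, and a deployment is valid only if the target fits them. This is a minimal sketch under stated assumptions. `EngineSpec` and `can_serve` are hypothetical names, the architecture labels are illustrative, and none of this is TensorRT-LLM's API (the real runtime does its own validation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineSpec:
    """Hypothetical record of what an engine was built for."""
    gpu_arch: str        # illustrative label, e.g. "hopper"
    precision: str       # e.g. "fp8"
    max_batch_size: int  # largest batch the engine was built to handle
    max_seq_len: int     # longest sequence the engine was built to handle


def can_serve(spec: EngineSpec, gpu_arch: str, batch_size: int, seq_len: int) -> bool:
    """Return True if a request shape on a given GPU fits the built engine."""
    if gpu_arch != spec.gpu_arch:
        return False  # engine was tuned for a different architecture
    return batch_size <= spec.max_batch_size and seq_len <= spec.max_seq_len


h100_engine = EngineSpec("hopper", "fp8", max_batch_size=64, max_seq_len=8192)
print(can_serve(h100_engine, "hopper", 32, 4096))      # True: same arch, within limits
print(can_serve(h100_engine, "blackwell", 32, 4096))   # False: different architecture
print(can_serve(h100_engine, "hopper", 128, 4096))     # False: batch exceeds build limit
```

A False result in this model corresponds to "rebuild the engine", not "retry the request".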

What it brings

  • Low precision: FP8 on Hopper and Blackwell, FP4 on Blackwell, plus INT8/INT4 weight quantization, which cut memory use and raise throughput.
  • In-flight (continuous) batching: via the Triton backend / executor.
  • Kernel fusion & optimized attention: FlashAttention-class kernels, paged KV cache.
  • Multi-GPU / multi-node: tensor and pipeline parallelism for huge models.
  • Speculative decoding, LoRA, broad model coverage (LLMs + multimodal).
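To put rough numbers on the low-precision and multi-GPU points above, here is a back-of-the-envelope sketch of weight memory per GPU. The helper `weight_gb` is a hypothetical name, and the estimate covers weights only: it ignores the KV cache, activations, and quantization scale factors, and it assumes tensor parallelism splits the weights evenly.

```python
def weight_gb(num_params: float, bits_per_weight: int, tp_size: int = 1) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / tp_size / 1e9


params_8b = 8e9  # an 8-billion-parameter model
for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {weight_gb(params_8b, bits):.1f} GB")
# FP16: 16.0 GB
# FP8: 8.0 GB
# FP4: 4.0 GB

# A 70B model in FP8 split across 2 GPUs with tensor parallelism
print(f"{weight_gb(70e9, 8, tp_size=2):.1f} GB per GPU")  # 35.0 GB per GPU
```

Halving the bits per weight halves the weight footprint, which is why the lower-precision formats matter most for models that would otherwise not fit on one GPU.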

Quick start

The high-level tensorrt_llm Python API hides much of the build; trtllm-serve exposes an OpenAI-compatible endpoint:

pip install tensorrt_llm     # NVIDIA GPU + CUDA required
trtllm-serve meta-llama/Llama-3.1-8B-Instruct   # builds + serves on :8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}]}'
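The same request can be issued from Python with only the standard library. This is a sketch, assuming the server follows the OpenAI chat-completions request and response shapes. The helper names `build_chat_request` and `send` are hypothetical, and the model name reuses the one passed to trtllm-serve above as an assumption.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]


req = build_chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "hi")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

With the server running, `send(req)` would return the assistant's reply as a string. Building and sending are kept separate so the payload can be inspected without a GPU.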

For production, the Triton Inference Server with the TensorRT-LLM backend adds metrics, dynamic batching, and multi-model hosting.
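For in-process use without a server, the high-level Python API can be called directly. The following is a minimal sketch, assuming a recent tensorrt_llm release that exposes `LLM` and `SamplingParams` at the package top level. The wrapper `generate_once` is a hypothetical helper, and actually running it needs an NVIDIA GPU plus access to the model weights.

```python
def generate_once(model: str, prompt: str, max_tokens: int = 32) -> str:
    """Load a model through the tensorrt_llm LLM API and generate one completion."""
    # Imported lazily so this file can be loaded on a machine without the library.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model=model)  # Hugging Face model ID or local path
    sampling = SamplingParams(max_tokens=max_tokens)
    outputs = llm.generate([prompt], sampling)
    # One result per prompt; each result holds one or more candidate completions.
    return outputs[0].outputs[0].text
```

A call might look like `generate_once("meta-llama/Llama-3.1-8B-Instruct", "hi")`. Loading the model is the expensive part, so a real service would create the `LLM` object once and reuse it across requests.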

When to use, when to skip

Use it when you're all-in on NVIDIA GPUs and need maximum throughput or the lowest latency: large-scale production serving, strict latency SLAs, or workloads that can exploit FP8 on Hopper or FP4 on Blackwell.

Skip it if you value portability, fast iteration, or non-NVIDIA hardware: the build step and GPU-specific engines add operational friction. For quick self-hosting, vLLM or SGLang are simpler to run and come close on performance for many workloads.

Heads up: engines are built for a specific GPU, precision, and (often) max batch/sequence shape. Change hardware or bump those limits and you rebuild. Bake the build into CI so deploys stay reproducible.
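One way to make that rebuild rule mechanical in CI is to derive a fingerprint from the build inputs and rebuild whenever it changes. This is a sketch only: `engine_fingerprint` and the config keys are hypothetical, not part of TensorRT-LLM, and a real pipeline would likely add the library version and weights revision to the inputs.

```python
import hashlib
import json


def engine_fingerprint(build_config: dict) -> str:
    """Return a short, order-independent hash of the engine build inputs."""
    canonical = json.dumps(build_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "gpu": "H100",
    "precision": "fp8",
    "max_batch_size": 64,
    "max_seq_len": 8192,
}
reordered = dict(reversed(list(base.items())))  # same settings, different key order
bumped = {**base, "max_seq_len": 16384}         # a raised limit

print(engine_fingerprint(base) == engine_fingerprint(reordered))  # True: reuse the engine
print(engine_fingerprint(base) == engine_fingerprint(bumped))     # False: rebuild
```

Using the fingerprint as part of the engine artifact's name lets a deploy step fetch a cached engine when nothing changed and trigger a build when anything did.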

vs the alternatives

Tool          Best for                           Trade-off
TensorRT-LLM  Max speed on NVIDIA GPUs           Build step, NVIDIA-only
vLLM          Portable high-throughput serving   Slightly less peak on NVIDIA
SGLang        Prefix-heavy / structured serving  Benchmark per workload
TGI           Hugging Face-native serving        Different perf profile

Verified against NVIDIA's TensorRT-LLM docs, June 2026.

© cvam — written in plaintext, served warm