TL;DR — KServe turns "serve this model" into one Kubernetes resource. InferenceService handles classic ML (scikit-learn, PyTorch, TF, XGBoost); the newer LLMInferenceService handles GenAI with OpenAI-compatible endpoints, token streaming, KV-cache-aware routing, and disaggregated prefill/decode via llm-d + vLLM. It wraps autoscaling, networking, canary, and health so you don't hand-build them.
What it is
KServe is a standardized, distributed inference platform for Kubernetes — for both predictive and generative AI. You declare an inference service; KServe encapsulates the messy parts (autoscaling including scale-to-zero, ingress/networking, health checks, canary rollout, multi-framework runtimes) behind that one resource. It's a CNCF project and sits in AI Native Infra › Workload Runtime.
Why it exists
Wrapping a model in a Flask app is easy; making it production-grade is not — you need autoscaling to GPU, zero-downtime canary deploys, request/response logging, batching, and a stable API. KServe standardizes all of that so every model is deployed the same way, regardless of framework, instead of each team reinventing a serving stack.
Two tracks: predictive vs generative
KServe now runs a dual-track model:
| Resource | For |
|---|---|
InferenceService | Predictive AI — traditional ML models (sklearn, XGBoost, PyTorch, TensorFlow) via prepackaged serving runtimes. |
LLMInferenceService | Generative AI — LLMs, purpose-built (KServe v0.16+): OpenAI-compatible APIs, streaming, prompt templating, native LLM-runtime integration. |
Fig 1 — One platform, two resources: predictive ML and purpose-built LLM serving.
The LLM serving features
- OpenAI-compatible endpoints —
/v1/chat/completionswith streaming + multi-turn, so OpenAI SDKs, LangChain, LlamaIndex, and GenAI gateways just work. - KV-cache-aware scheduling — route requests to the replica already holding the relevant cache.
- Disaggregated prefill/decode — via llm-d, split the compute-heavy prefill phase and latency-sensitive decode phase onto different GPU pools for better cost/latency.
- vLLM runtime — high-performance engine underneath (Hugging Face serving runtime or llm-d).
Quick start
Install KServe (Serverless or RawDeployment mode), then apply an InferenceService — KServe pulls the model and stands up an autoscaling endpoint:
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.16/hack/quick_install.sh" | bash
kubectl apply -f sklearn-isvc.yaml # an InferenceService
kubectl get inferenceservices # wait for READY, get the URL
For an LLM you apply an LLMInferenceService pointing at a Hugging Face model, then call its OpenAI-compatible /v1/chat/completions endpoint.
When to use, when to skip
Use it when you serve many models (mixed classic ML + LLMs) on Kubernetes and want one standardized, autoscaling, canary-capable platform — especially if you want scale-to-zero for spiky traffic or the advanced LLM features (KV-aware routing, prefill/decode split).
Skip it for a single model on a single box — running vLLM or Ollama directly is simpler. If your app is already a Ray app, Ray Serve may fit more naturally than adding KServe.
vs the alternatives
| Tool | Best for | Trade-off |
|---|---|---|
| KServe | Standardized multi-model serving (ML + LLM) on K8s | Stack to operate (Knative/llm-d) |
| vLLM | Raw high-throughput LLM engine | Not a full serving platform |
| Ray Serve | Serving inside a Ray app | Ties you to Ray |
| Ollama | Local / single-box model serving | Not cluster-scale |
References
- KServe docs — concepts & runtimes.
- LLMInferenceService — the GenAI track.
- Deploy your first LLM — getting started.
- kserve/kserve — source (CNCF).
Extra reads
- KServe + llm-d — optimized GenAI inference.
- vLLM + KServe — the engine integration.
- Text generation task — LLM serving walkthrough.
Verified against the official KServe docs (kserve.github.io), May 2026. Targets v0.16+.