// AI NATIVE STACK

AI Native › AI Native Infra › Workload Runtime › KServe

CRASH COURSE · AI-NATIVE · intermediate · 11 min read · v0.16

KServe — model serving as a single Kubernetes resource.

workload-runtime ai-native kserve inference kubernetes

TL;DR — KServe turns "serve this model" into one Kubernetes resource. InferenceService handles classic ML (scikit-learn, PyTorch, TF, XGBoost); the newer LLMInferenceService handles GenAI with OpenAI-compatible endpoints, token streaming, KV-cache-aware routing, and disaggregated prefill/decode via llm-d + vLLM. It wraps autoscaling, networking, canary, and health so you don't hand-build them.

What it is

KServe is a standardized, distributed inference platform for Kubernetes — for both predictive and generative AI. You declare an inference service; KServe encapsulates the messy parts (autoscaling including scale-to-zero, ingress/networking, health checks, canary rollout, multi-framework runtimes) behind that one resource. It's a CNCF project and sits in AI Native Infra › Workload Runtime.

Why it exists

Wrapping a model in a Flask app is easy; making it production-grade is not — you need autoscaling to GPU, zero-downtime canary deploys, request/response logging, batching, and a stable API. KServe standardizes all of that so every model is deployed the same way, regardless of framework, instead of each team reinventing a serving stack.

Two tracks: predictive vs generative

KServe now runs a dual-track model:

ResourceFor
InferenceServicePredictive AI — traditional ML models (sklearn, XGBoost, PyTorch, TensorFlow) via prepackaged serving runtimes.
LLMInferenceServiceGenerative AI — LLMs, purpose-built (KServe v0.16+): OpenAI-compatible APIs, streaming, prompt templating, native LLM-runtime integration.
KServeautoscale·canary InferenceService → ML LLMInferenceServiceOpenAI API · streaming llm-d + vLLMprefill / decode

Fig 1 — One platform, two resources: predictive ML and purpose-built LLM serving.

The LLM serving features

  • OpenAI-compatible endpoints/v1/chat/completions with streaming + multi-turn, so OpenAI SDKs, LangChain, LlamaIndex, and GenAI gateways just work.
  • KV-cache-aware scheduling — route requests to the replica already holding the relevant cache.
  • Disaggregated prefill/decode — via llm-d, split the compute-heavy prefill phase and latency-sensitive decode phase onto different GPU pools for better cost/latency.
  • vLLM runtime — high-performance engine underneath (Hugging Face serving runtime or llm-d).

Quick start

Install KServe (Serverless or RawDeployment mode), then apply an InferenceService — KServe pulls the model and stands up an autoscaling endpoint:

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.16/hack/quick_install.sh" | bash
kubectl apply -f sklearn-isvc.yaml        # an InferenceService
kubectl get inferenceservices             # wait for READY, get the URL

For an LLM you apply an LLMInferenceService pointing at a Hugging Face model, then call its OpenAI-compatible /v1/chat/completions endpoint.

When to use, when to skip

Use it when you serve many models (mixed classic ML + LLMs) on Kubernetes and want one standardized, autoscaling, canary-capable platform — especially if you want scale-to-zero for spiky traffic or the advanced LLM features (KV-aware routing, prefill/decode split).

Skip it for a single model on a single box — running vLLM or Ollama directly is simpler. If your app is already a Ray app, Ray Serve may fit more naturally than adding KServe.

heads up Classic KServe Serverless mode pulls in Knative + a network layer — real components to operate. RawDeployment mode is lighter if you don't need scale-to-zero. And the LLM path (LLMInferenceService/llm-d) is newer than the predictive path — check version support for the exact feature you need.

vs the alternatives

ToolBest forTrade-off
KServeStandardized multi-model serving (ML + LLM) on K8sStack to operate (Knative/llm-d)
vLLMRaw high-throughput LLM engineNot a full serving platform
Ray ServeServing inside a Ray appTies you to Ray
OllamaLocal / single-box model servingNot cluster-scale

References

Extra reads

Verified against the official KServe docs (kserve.github.io), May 2026. Targets v0.16+.

← AI Native Stack
© cvam — written in plaintext, served warm