// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 11 min read · v0.6

Envoy AI Gateway — one front door for all your LLM traffic.

TL;DR — Envoy AI Gateway puts a single, OpenAI-compatible endpoint in front of every LLM provider you use. Your apps talk to one URL; the gateway handles provider routing, automatic failover, credential injection, token-based rate limits, and cost visibility. It's a Kubernetes-native layer built on top of Envoy Gateway, driven by a handful of CRDs.

What it is

Envoy AI Gateway is an open-source AI gateway built on Envoy Proxy and Envoy Gateway. It sits between your applications and the GenAI services they call — OpenAI, Anthropic, AWS Bedrock, Google, or self-hosted models like vLLM — and gives clients one unified, OpenAI/Anthropic-compatible API to hit.

It's a CNCF-adjacent Envoy subproject (started by Tetrate and Bloomberg) and reached its first production-ready API surface in v0.6.0, with its CRDs served at v1beta1. In the AI Native landscape it lives in AI Native Infra › Gateway: the traffic-control plane for AI.

Why it exists

Once more than one app calls LLMs, the same problems show up everywhere: API keys scattered across services, no shared rate limits, no failover when a provider has an outage, no idea who's spending how much, and a rewrite every time you switch model providers.

A gateway centralizes all of that. Clients stop knowing or caring which provider answers — they hit one endpoint, and policy, security, and routing live in one place instead of in every codebase.

How it works

The gateway reads the model field from each incoming request body, copies it into a request header (x-ai-eg-model), and routes on that header. Credentials for the chosen backend are injected at the edge, so your app never holds provider keys. Token usage is parsed from the response (OpenAI schema) and fed into rate limits and cost metrics.

[Figure: apps → one endpoint → Envoy AI Gateway (route · failover, auth · rate limit, token + cost meter) → OpenAI, Anthropic · Bedrock, self-hosted vLLM]

Fig 1 — Apps hit one endpoint; the gateway routes, secures, and meters traffic to every provider.
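The routing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the gateway's implementation: the header name comes from the text, while the route table, the backend names, and the helper functions `tag_model` and `pick_backend` are hypothetical.

```python
import json

# Header name taken from the text above.
MODEL_HEADER = "x-ai-eg-model"

# Hypothetical route table: model name -> backend name.
ROUTES = {
    "gpt-4o-mini": "openai-backend",
    "claude-sonnet": "anthropic-backend",
    "llama-3-8b": "vllm-backend",
}


def tag_model(body: bytes, headers: dict[str, str]) -> dict[str, str]:
    """Copy the request body's `model` field into a header."""
    model = json.loads(body)["model"]
    return {**headers, MODEL_HEADER: model}


def pick_backend(headers: dict[str, str]) -> str:
    """Exact-match the model header against the route table."""
    model = headers[MODEL_HEADER]
    if model not in ROUTES:
        raise LookupError(f"no route for model {model!r}")
    return ROUTES[model]


body = b'{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}'
headers = tag_model(body, {"content-type": "application/json"})
backend = pick_backend(headers)
print(backend)  # openai-backend
```

In the real gateway the match rules live in an AIGatewayRoute rather than in a dictionary, but the shape of the decision is the same: one field from the body decides which backend answers.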

The core CRDs

You configure everything declaratively with Kubernetes resources. Five do the heavy lifting:

CRD                   | What it defines
AIGatewayRoute        | The unified API entry: match rules that route requests (by model, etc.) to backends.
AIServiceBackend      | A provider/model target (OpenAI, Bedrock, a vLLM service…).
BackendSecurityPolicy | Injects upstream credentials (API keys, cloud auth) into requests securely.
GatewayConfig         | Gateway-wide settings tying it to Envoy Gateway.
MCPRoute              | Routes Model Context Protocol traffic to MCP servers.

Because routes stay stable while backends are swapped behind them, you can add, combine, or fail over providers without touching client code.
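To make the failover idea concrete, here is a minimal Python sketch of trying backends in priority order behind a stable entry point. It assumes a simple "first success wins" policy; the `BackendUnavailable` exception, the stub backends, and `call_with_failover` are hypothetical names for illustration and do not reflect how the gateway itself is configured or implemented.

```python
from typing import Callable

# A backend is modeled as a callable that returns a response or raises.
Backend = Callable[[dict], dict]


class BackendUnavailable(Exception):
    """Raised by a backend stub to simulate a provider outage."""


def call_with_failover(
    request: dict, backends: list[tuple[str, Backend]]
) -> tuple[str, dict]:
    """Try backends in priority order; return the first successful response."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(request)
        except BackendUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise BackendUnavailable("all backends failed: " + "; ".join(errors))


def primary(request: dict) -> dict:
    # Simulated outage at the preferred provider.
    raise BackendUnavailable("503 from provider")


def fallback(request: dict) -> dict:
    return {"model": request["model"], "content": "hello"}


served_by, response = call_with_failover(
    {"model": "gpt-4o-mini"},
    [("primary", primary), ("fallback", fallback)],
)
print(served_by)  # fallback
```

The caller only ever sees `call_with_failover`; which backend answered is an implementation detail, which is the property the stable route gives client code.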

Key capabilities

  • Unified API — one OpenAI/Anthropic-compatible surface for all providers.
  • Routing & failover — model-aware routing with automatic failover across providers and self-hosted models.
  • Token-based rate limiting — limits on input/output/total tokens per model and per user, via Envoy Gateway's global rate-limit API.
  • Cost & usage visibility — token usage extracted from responses for analytics and budgeting.
  • Backend security — credentials injected at the edge; apps never hold provider keys.
  • MCP routing — first-class support for routing to MCP servers for agent tooling.
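Token-based rate limiting and usage extraction fit together, and a small Python sketch can show how. It assumes a fixed budget per (user, model) key with no window reset, and it reads the `usage` object of an OpenAI-style response. The `TokenBudget` class and the limit of 100 tokens are hypothetical; the gateway itself delegates enforcement to Envoy Gateway's global rate-limit API rather than to in-process code like this.

```python
from collections import defaultdict


class TokenBudget:
    """Token budget keyed by (user, model); window reset is not modeled."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used: dict[tuple[str, str], int] = defaultdict(int)

    def allow(self, user: str, model: str) -> bool:
        """Admit a request only while the key is still under budget."""
        return self.used[(user, model)] < self.limit

    def record(self, user: str, model: str, response: dict) -> int:
        """Charge the tokens reported in an OpenAI-style `usage` object."""
        total = response["usage"]["total_tokens"]
        self.used[(user, model)] += total
        return total


budget = TokenBudget(limit=100)
response = {
    "usage": {"prompt_tokens": 40, "completion_tokens": 30, "total_tokens": 70}
}

first = budget.allow("alice", "gpt-4o-mini")      # nothing used yet
budget.record("alice", "gpt-4o-mini", response)   # 70 tokens used
second = budget.allow("alice", "gpt-4o-mini")     # 70 < 100, still admitted
budget.record("alice", "gpt-4o-mini", response)   # 140 tokens used
third = budget.allow("alice", "gpt-4o-mini")      # over budget now
other = budget.allow("bob", "gpt-4o-mini")        # separate key, unaffected
print(first, second, third, other)  # True True False True
```

Because the token count is only known once the response arrives, a request that pushes a key over its budget still completes; the limit bites on the next request for that key.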

Quick start

It rides on Envoy Gateway, so install that first, then the AI Gateway control plane — both as Helm charts:

# 1. Envoy Gateway
helm install eg oci://docker.io/envoyproxy/gateway-helm -n envoy-gateway-system --create-namespace

# 2. Envoy AI Gateway
helm install aieg oci://docker.io/envoyproxy/ai-gateway-helm -n envoy-ai-gateway-system --create-namespace

Then apply the sample route + backend and send an OpenAI-style request at the gateway address:

kubectl apply -f basic.yaml          # AIGatewayRoute + AIServiceBackend + BackendSecurityPolicy
# GATEWAY_URL must hold the address where the gateway is reachable
curl "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'

The model name in the request body is what the route matches on — switch gpt-4o-mini to a Bedrock or self-hosted model and the gateway re-routes, no client change.
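The "no client change" point can be illustrated from the application side. The Python sketch below builds the same OpenAI-style request as the curl command using only the standard library, without sending it. The gateway address and the second model name (`my-vllm-model`) are placeholders, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(
    gateway_url: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{gateway_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


GATEWAY_URL = "http://localhost:8080"  # placeholder; use your gateway address

hosted = build_chat_request(GATEWAY_URL, "gpt-4o-mini", "hi")
self_hosted = build_chat_request(GATEWAY_URL, "my-vllm-model", "hi")

# Same URL and method; only the model field in the body differs.
print(hosted.full_url == self_hosted.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; the application code is identical for both, and the choice of backend is left entirely to the gateway's routes.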

When to use, when to skip

Use it when you run on Kubernetes, already use (or want) Envoy/Envoy Gateway, and need centralized routing, security, and token-level rate limiting across many apps and providers. It's the cloud-native, infra-team choice.

Skip it for a single app or a quick prototype — a library like LiteLLM gives you multi-provider routing in-process with far less setup. If you're not on Kubernetes, the operational overhead of Envoy Gateway probably isn't worth it yet.

Heads up: it's a young project. APIs only stabilized at v0.6 (v1beta1), and it supports a subset of the full OpenAI API. Check the supported-endpoints doc before assuming a specific route exists, and pin chart versions.

vs the alternatives

Tool             | Best for                                     | Trade-off
Envoy AI Gateway | K8s-native, Envoy shops, infra-grade policy  | Young; needs Envoy Gateway
LiteLLM          | In-process multi-provider routing, fast start | Library, not infra policy plane
kgateway         | Gateway-API-native, broader API gateway      | Less AI-specific tuning
Higress          | AI-native gateway with rich plugins          | Different ecosystem (Istio/Higress)

Verified against the official Envoy AI Gateway docs (aigateway.envoyproxy.io), May 2026. Targets v0.6+ (v1beta1 CRDs).

© cvam — written in plaintext, served warm