What Is an AI Gateway?
An AI gateway sits between your application and LLM providers, adding:
- Intelligent routing: send requests to the best model/provider
- Load balancing: distribute across GPU pools
- Fallback: automatic failover when a provider is down
- Rate limiting: protect backends from traffic spikes
- Cost tracking: per-request cost attribution
- Caching: semantic deduplication of similar queries
┌──────────┐      ┌─────────────────┐      ┌──────────────┐
│  Client  │─────▶│   AI Gateway    │─────▶│  vLLM Pool   │
│   Apps   │      │                 │─────▶│  NIM Pool    │
│          │      │  • Route        │─────▶│  OpenAI API  │
│          │      │  • Rate limit   │─────▶│  Anthropic   │
│          │      │  • Cache        │      └──────────────┘
│          │      │  • Track cost   │
└──────────┘      └─────────────────┘

Architecture on Kubernetes
Option 1: Envoy + AI-Specific Filters
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
spec:
  gatewayClassName: envoy
  listeners:
    - name: ai-api
      protocol: HTTPS
      port: 443
      # An HTTPS listener also requires a tls block (certificateRefs); omitted here.
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai-routes
spec:
  parentRefs:
    - name: ai-gateway
  rules:
    # Route to fast model for simple queries
    - matches:
        - headers:
            - name: x-model-tier
              value: fast
      backendRefs:
        - name: vllm-8b
          port: 8000
    # Route to quality model for complex queries
    - matches:
        - headers:
            - name: x-model-tier
              value: quality
      backendRefs:
        - name: vllm-70b
          port: 8000
    # Weighted traffic split (canary) for requests without a matching x-model-tier header
    - backendRefs:
        - name: vllm-70b
          port: 8000
          weight: 90
        - name: nim-70b
          port: 8000
          weight: 10

Option 2: LiteLLM Proxy
LiteLLM provides a unified OpenAI-compatible API across 100+ LLM providers:
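apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3
  selector:  # apps/v1 Deployments require a selector; the label value is an assumed name
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          args: ["--config", "/app/config.yaml"]  # load the mounted config file
          ports:
            - containerPort: 4000
          env:
            - name: LITELLM_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: master-key
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
      volumes:
        - name: config
          configMap:
            name: litellm-config

# litellm-config.yaml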
model_list:
  - model_name: "gpt-4-turbo"
    litellm_params:
      model: "openai/gpt-4-turbo"
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: "llama-70b"
    litellm_params:
      model: "openai/meta-llama/Llama-3.1-70B-Instruct"
      api_base: "http://vllm-70b.ai-inference:8000/v1"
      api_key: "dummy"  # placeholder; assumes the vLLM server does not enforce a key
  - model_name: "llama-70b"  # Second backend for load balancing
    litellm_params:
      model: "openai/meta-llama/Llama-3.1-70B-Instruct"
      api_base: "http://nim-70b.ai-inference:8000/v1"
      api_key: "os.environ/NGC_API_KEY"
router_settings:
  routing_strategy: "least-busy"
  num_retries: 3
  timeout: 60
  fallbacks:
    - llama-70b: ["gpt-4-turbo"]  # if llama-70b fails, fall back to gpt-4-turbo
general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"

Routing Strategies
| Strategy | Description | Use Case |
|---|---|---|
| `least-busy` | Route to least loaded backend | Even distribution |
| `simple-shuffle` | Random selection | Basic load balancing |
| `latency-based` | Route to fastest responding | Latency-sensitive |
| `cost-based` | Route to cheapest option first | Budget optimization |
| `usage-based` | Rotate based on token usage | Fair sharing |
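
To make the table concrete, here is a minimal sketch of how a router might choose a backend under each strategy. The `Backend` fields and the `pick_backend` function are hypothetical names for illustration, not LiteLLM's internals; a real gateway would update these counters from live traffic.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical per-backend stats that a gateway would keep up to date."""
    name: str
    in_flight: int = 0               # requests currently being served
    avg_latency_ms: float = 0.0      # rolling average response time
    cost_per_1k_tokens: float = 0.0  # blended price in dollars
    tokens_used: int = 0             # tokens served in the current window


def pick_backend(backends, strategy, rng=random):
    """Return the backend that the given strategy would select."""
    if not backends:
        raise ValueError("no backends available")
    if strategy == "least-busy":
        return min(backends, key=lambda b: b.in_flight)
    if strategy == "simple-shuffle":
        return rng.choice(backends)
    if strategy == "latency-based":
        return min(backends, key=lambda b: b.avg_latency_ms)
    if strategy == "cost-based":
        return min(backends, key=lambda b: b.cost_per_1k_tokens)
    if strategy == "usage-based":
        return min(backends, key=lambda b: b.tokens_used)
    raise ValueError(f"unknown strategy: {strategy}")
```

Every strategy except `simple-shuffle` reduces to "take the minimum of one metric", which is why the choice of strategy mostly comes down to which metric matters most for the workload.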
Rate Limiting
Per-User Request Limits (Envoy)
# Envoy rate limit config
apiVersion: v1
kind: ConfigMap
metadata:
  name: ratelimit-config
data:
  config.yaml: |
    domain: ai-gateway
    descriptors:
      # One user_id entry: a descriptor key may appear only once per level,
      # so the quality-tier limit is nested under it.
      - key: user_id
        rate_limit:
          unit: hour
          requests_per_unit: 100  # 100 requests/hour per user
        descriptors:
          - key: model_tier
            value: "quality"
            rate_limit:
              unit: hour
              requests_per_unit: 20  # Only 20 quality-tier requests/hour per user

Token-Based Rate Limiting (LiteLLM)
litellm_settings:
  max_budget: 100.0  # $100/month per API key
  budget_duration: "1mo"
  # Per-model limits
  model_max_budget:
    gpt-4-turbo: 50.0
    llama-70b: 30.0

Semantic Caching
Cache responses for semantically similar queries:
from qdrant_client import QdrantClient

cache_db = QdrantClient(host="qdrant-cache.ai-inference.svc")

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    # embed() is assumed to be an async helper defined elsewhere that
    # returns the embedding vector for the query text.
    query_embedding = await embed(query)
    # Note: this client call is blocking; AsyncQdrantClient avoids stalling the event loop.
    results = cache_db.search(
        collection_name="response_cache",
        query_vector=query_embedding,
        limit=1,
        score_threshold=threshold,  # Only return matches scoring at least 0.95
    )
    if results:
        return results[0].payload["response"]  # Cache hit
    return None  # Cache miss: proceed to the LLM

Impact: 20-40% of requests hit the cache in production (FAQs, repeated queries, similar phrasings).
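
The lookup above covers only the read path and delegates the similarity check to the vector database. As a rough illustration of what that threshold decision amounts to, the following sketch keeps entries in memory and compares vectors with cosine similarity. The class and method names are hypothetical, the vectors are toy values rather than real embeddings, and cosine is an assumed distance metric for the collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Toy stand-in for a vector database, for illustration only."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response):
        """Write path: remember the response under the query's embedding."""
        self._entries.append((list(embedding), response))

    def lookup(self, embedding):
        """Read path: return the best match if it clears the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached, response in self._entries:
            score = cosine_similarity(embedding, cached)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A gateway would call `lookup` before forwarding a request and `store` after a successful LLM response. Lowering the threshold raises the hit rate but also the chance of returning an answer to a different question.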
Cost Attribution
Track spend per team/project/user:
# Custom metrics for cost tracking
from prometheus_client import Counter, Histogram

tokens_used = Counter(
    'ai_gateway_tokens_total',
    'Total tokens consumed',
    ['model', 'team', 'direction']  # direction: input/output
)

request_cost = Histogram(
    'ai_gateway_request_cost_dollars',
    'Cost per request',
    ['model', 'team'],
    buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

Grafana Cost Dashboard
# Cost per team per day
sum by (team) (increase(ai_gateway_request_cost_dollars_sum[24h]))

# Top 5 most expensive models (dollars per hour)
topk(5, sum by (model) (rate(ai_gateway_request_cost_dollars_sum[1h])) * 3600)

# Token efficiency (output/input ratio)
sum(rate(ai_gateway_tokens_total{direction="output"}[1h])) /
sum(rate(ai_gateway_tokens_total{direction="input"}[1h]))

Failover Configuration
# Automatic failover chain
router_settings:
  fallbacks:
    - llama-70b: ["gpt-4-turbo"]  # Primary: self-hosted; fallback: OpenAI
  retry_policy:
    num_retries: 2
    retry_after_seconds: 1
    retry_on_status_codes: [429, 500, 502, 503]
  # Health check backends
  health_check:
    enabled: true
    interval_seconds: 30
    unhealthy_threshold: 3
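
The behavior these settings describe can be sketched in a few lines: retry a backend on retryable status codes, then move down the chain. This is an illustrative sketch, not the gateway's actual implementation; `BackendError` and `call_with_failover` are hypothetical names, and each backend is modeled as a plain callable.

```python
import time

RETRYABLE_STATUS = {429, 500, 502, 503}  # mirrors retry_on_status_codes


class BackendError(Exception):
    """Hypothetical error carrying the HTTP status returned by a backend."""

    def __init__(self, status):
        super().__init__(f"backend returned {status}")
        self.status = status


def call_with_failover(chain, request, num_retries=2,
                       retry_after_seconds=1.0, sleep=time.sleep):
    """Try each backend in order, retrying retryable errors before moving on."""
    if not chain:
        raise ValueError("failover chain is empty")
    last_error = None
    for backend in chain:
        for attempt in range(num_retries + 1):
            try:
                return backend(request)
            except BackendError as err:
                if err.status not in RETRYABLE_STATUS:
                    raise  # e.g. a 400: retrying or failing over will not help
                last_error = err
                if attempt < num_retries:
                    sleep(retry_after_seconds)
    raise last_error  # every backend in the chain was exhausted
```

With `num_retries=2`, the primary is attempted three times before the fallback is tried, so worst-case added latency is roughly the sum of the failed attempts plus the retry delays. That is worth checking against the client's own timeout.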