The Serving Stack Decision
Every production LLM deployment needs a serving framework. The three leaders:
| Framework | Creator | Best For | License |
|---|---|---|---|
| vLLM | UC Berkeley | Max throughput, open models | Apache 2.0 |
| Triton | NVIDIA | Multi-framework, multi-model | BSD |
| NIM | NVIDIA | Enterprise turnkey deployment | Proprietary |
vLLM: Maximum Open-Source Throughput
vLLM pioneered PagedAttention β treating KV-cache like virtual memory pages for near-optimal GPU utilization.
Deployment on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama70b
spec:
replicas: 2
template:
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.4
args:
- "--model"
- "meta-llama/Llama-3.1-70B-Instruct"
- "--tensor-parallel-size"
- "2"
- "--max-model-len"
- "8192"
- "--gpu-memory-utilization"
- "0.92"
- "--enable-prefix-caching"
- "--max-num-seqs"
- "64"
- "--dtype"
- "auto"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "2"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10vLLM Strengths
- PagedAttention β 24x higher throughput than HuggingFace Transformers
- Continuous batching β dynamic batch formation for optimal GPU use
- Tensor parallelism β split models across GPUs seamlessly
- OpenAI-compatible API β drop-in replacement
- Speculative decoding β use draft model for faster generation
- Prefix caching β share KV-cache for common system prompts
- Quantization β AWQ, GPTQ, FP8 native support
vLLM Limitations
- Single-framework (PyTorch only)
- No built-in model management
- No multi-model serving (one model per process)
- Limited ensemble support
Triton Inference Server: Multi-Model Platform
Triton serves any model from any framework with a unified API. Itβs an inference platform, not just an LLM server.
Architecture
ββββββββββββββββββββββββββββββββββββββββββββββ
β Triton Inference Server β
β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β Model A β β Model B β β Model C β β
β β(PyTorch) β β(TensorRT)β β (ONNX) β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββ β
β β Dynamic Batching Scheduler β β
β ββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββ β
β β Model Repository (S3/GCS/local) β β
β ββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββDeployment with vLLM Backend
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-llm
spec:
template:
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3
args:
- "tritonserver"
- "--model-repository=/models"
- "--model-control-mode=poll"
- "--repository-poll-secs=30"
volumeMounts:
- name: model-repo
mountPath: /models
resources:
limits:
nvidia.com/gpu: "2"Model Repository Structure
models/
βββ llama-70b/
β βββ config.pbtxt
β βββ 1/
β β βββ model.json
β βββ llama-70b-config.yaml
βββ embedding-model/
β βββ config.pbtxt
β βββ 1/
β βββ model.onnx
βββ reranker/
βββ config.pbtxt
βββ 1/
βββ model.ptTriton Strengths
- Multi-model β serve LLM + embeddings + reranker in one server
- Multi-framework β PyTorch, TensorRT, ONNX, TensorFlow, vLLM
- Model ensembles β chain models (embed β retrieve β rerank β generate)
- Dynamic batching β framework-agnostic request batching
- Model versioning β hot-swap models without downtime
- Metrics β Prometheus metrics out of the box
Triton Limitations
- Complex configuration β
config.pbtxtfiles for every model - Steeper learning curve β model repository structure is rigid
- LLM throughput β slightly lower than native vLLM (abstraction overhead)
- Resource overhead β heavier base footprint
NVIDIA NIM: Enterprise Turnkey
NIM packages optimized models as containerized microservices. Pull, run, serve.
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: nim-llama70b
spec:
template:
spec:
containers:
- name: nim
image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
name: ngc-secret
key: api-key
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "2"Thatβs it. No model download, no config files, no optimization tuning.
NIM Strengths
- Zero configuration β pre-optimized for each GPU type
- TensorRT-LLM backend β maximum inference performance
- Model profiles β automatic GPU memory/throughput optimization
- Enterprise support β NVIDIA AI Enterprise license includes SLA
- Security β signed containers, vulnerability scanning
NIM Limitations
- Proprietary β requires NVIDIA AI Enterprise license ($$$)
- Limited model selection β only NGC catalog models
- No customization β canβt modify serving behavior
- Vendor lock-in β NVIDIA GPUs only
Performance Benchmarks
Llama 3.1 70B on 2x A100 80GB, 32 concurrent requests:
| Metric | vLLM | Triton+vLLM | NIM |
|---|---|---|---|
| Throughput (tok/s) | 2,100 | 1,950 | 2,400 |
| TTFT P50 | 180ms | 210ms | 150ms |
| TTFT P99 | 450ms | 520ms | 380ms |
| ITL P50 | 28ms | 32ms | 24ms |
| Memory efficiency | 92% | 88% | 95% |
| Setup time | 10 min | 30 min | 2 min |
NIM wins on raw performance (TensorRT-LLM optimization), vLLM wins on flexibility and cost.
Decision Framework
Choose vLLM when:
- β Open-source models (Llama, Mistral, Qwen)
- β Maximum flexibility and customization
- β Budget-conscious (no license fees)
- β Need speculative decoding or experimental features
- β Self-hosted, full control
Choose Triton when:
- β Serving multiple model types (LLM + embedding + classifier)
- β Model ensemble pipelines (RAG, reranking)
- β Mixed framework environment (ONNX + PyTorch + TensorRT)
- β Need model versioning and A/B testing
- β Existing NVIDIA infrastructure
Choose NIM when:
- β Enterprise with NVIDIA AI Enterprise license
- β Fastest time-to-production
- β Team lacks ML infrastructure expertise
- β Need vendor support and SLAs
- β Compliance requires signed, scanned containers
Cost Comparison (Monthly, 2x A100 80GB)
| Component | vLLM | Triton | NIM |
|---|---|---|---|
| GPU compute | $4,745 | $4,745 | $4,745 |
| Software license | $0 | $0 | ~$4,500 (NVAIE) |
| Engineering time | 40h setup | 60h setup | 4h setup |
| Total Year 1 | $59,940 | $63,540 | $110,940 |
For teams with ML platform engineers, vLLM is the clear cost winner. For teams without, NIMβs premium pays for itself in engineering time savings.