Skip to main content
πŸŽ“ Claude Code Masterclass Learn AI-assisted development on Udemy β€” plus the companion book on Leanpub & Amazon. Start Learning
AI Model Serving on K8s: vLLM vs Triton vs NIM (2026)
AI

AI Model Serving on K8s: vLLM vs Triton vs NIM (2026)

Compare vLLM, Triton Inference Server, and NVIDIA NIM for serving LLMs on Kubernetes. Throughput benchmarks, deployment patterns, and production configuration.

LB
Luca Berton
Β· 3 min read

The Serving Stack Decision

Every production LLM deployment needs a serving framework. The three leaders:

FrameworkCreatorBest ForLicense
vLLMUC BerkeleyMax throughput, open modelsApache 2.0
TritonNVIDIAMulti-framework, multi-modelBSD
NIMNVIDIAEnterprise turnkey deploymentProprietary

vLLM: Maximum Open-Source Throughput

vLLM pioneered PagedAttention β€” treating KV-cache like virtual memory pages for near-optimal GPU utilization.

Deployment on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama70b
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.4
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--enable-prefix-caching"
            - "--max-num-seqs"
            - "64"
            - "--dtype"
            - "auto"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10

vLLM Strengths

  • PagedAttention β€” 24x higher throughput than HuggingFace Transformers
  • Continuous batching β€” dynamic batch formation for optimal GPU use
  • Tensor parallelism β€” split models across GPUs seamlessly
  • OpenAI-compatible API β€” drop-in replacement
  • Speculative decoding β€” use draft model for faster generation
  • Prefix caching β€” share KV-cache for common system prompts
  • Quantization β€” AWQ, GPTQ, FP8 native support

vLLM Limitations

  • Single-framework (PyTorch only)
  • No built-in model management
  • No multi-model serving (one model per process)
  • Limited ensemble support

Triton Inference Server: Multi-Model Platform

Triton serves any model from any framework with a unified API. It’s an inference platform, not just an LLM server.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Triton Inference Server          β”‚
β”‚                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Model A β”‚ β”‚  Model B β”‚ β”‚  Model C β”‚  β”‚
β”‚  β”‚(PyTorch) β”‚ β”‚(TensorRT)β”‚ β”‚ (ONNX)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚     Dynamic Batching Scheduler       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚     Model Repository (S3/GCS/local)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Deployment with vLLM Backend

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-llm
spec:
  template:
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3
          args:
            - "tritonserver"
            - "--model-repository=/models"
            - "--model-control-mode=poll"
            - "--repository-poll-secs=30"
          volumeMounts:
            - name: model-repo
              mountPath: /models
          resources:
            limits:
              nvidia.com/gpu: "2"

Model Repository Structure

models/
β”œβ”€β”€ llama-70b/
β”‚   β”œβ”€β”€ config.pbtxt
β”‚   β”œβ”€β”€ 1/
β”‚   β”‚   └── model.json
β”‚   └── llama-70b-config.yaml
β”œβ”€β”€ embedding-model/
β”‚   β”œβ”€β”€ config.pbtxt
β”‚   └── 1/
β”‚       └── model.onnx
└── reranker/
    β”œβ”€β”€ config.pbtxt
    └── 1/
        └── model.pt

Triton Strengths

  • Multi-model β€” serve LLM + embeddings + reranker in one server
  • Multi-framework β€” PyTorch, TensorRT, ONNX, TensorFlow, vLLM
  • Model ensembles β€” chain models (embed β†’ retrieve β†’ rerank β†’ generate)
  • Dynamic batching β€” framework-agnostic request batching
  • Model versioning β€” hot-swap models without downtime
  • Metrics β€” Prometheus metrics out of the box

Triton Limitations

  • Complex configuration β€” config.pbtxt files for every model
  • Steeper learning curve β€” model repository structure is rigid
  • LLM throughput β€” slightly lower than native vLLM (abstraction overhead)
  • Resource overhead β€” heavier base footprint

NVIDIA NIM: Enterprise Turnkey

NIM packages optimized models as containerized microservices. Pull, run, serve.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama70b
spec:
  template:
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: api-key
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"

That’s it. No model download, no config files, no optimization tuning.

NIM Strengths

  • Zero configuration β€” pre-optimized for each GPU type
  • TensorRT-LLM backend β€” maximum inference performance
  • Model profiles β€” automatic GPU memory/throughput optimization
  • Enterprise support β€” NVIDIA AI Enterprise license includes SLA
  • Security β€” signed containers, vulnerability scanning

NIM Limitations

  • Proprietary β€” requires NVIDIA AI Enterprise license ($$$)
  • Limited model selection β€” only NGC catalog models
  • No customization β€” can’t modify serving behavior
  • Vendor lock-in β€” NVIDIA GPUs only

Performance Benchmarks

Llama 3.1 70B on 2x A100 80GB, 32 concurrent requests:

MetricvLLMTriton+vLLMNIM
Throughput (tok/s)2,1001,9502,400
TTFT P50180ms210ms150ms
TTFT P99450ms520ms380ms
ITL P5028ms32ms24ms
Memory efficiency92%88%95%
Setup time10 min30 min2 min

NIM wins on raw performance (TensorRT-LLM optimization), vLLM wins on flexibility and cost.

Decision Framework

Choose vLLM when:

  • βœ… Open-source models (Llama, Mistral, Qwen)
  • βœ… Maximum flexibility and customization
  • βœ… Budget-conscious (no license fees)
  • βœ… Need speculative decoding or experimental features
  • βœ… Self-hosted, full control

Choose Triton when:

  • βœ… Serving multiple model types (LLM + embedding + classifier)
  • βœ… Model ensemble pipelines (RAG, reranking)
  • βœ… Mixed framework environment (ONNX + PyTorch + TensorRT)
  • βœ… Need model versioning and A/B testing
  • βœ… Existing NVIDIA infrastructure

Choose NIM when:

  • βœ… Enterprise with NVIDIA AI Enterprise license
  • βœ… Fastest time-to-production
  • βœ… Team lacks ML infrastructure expertise
  • βœ… Need vendor support and SLAs
  • βœ… Compliance requires signed, scanned containers

Cost Comparison (Monthly, 2x A100 80GB)

ComponentvLLMTritonNIM
GPU compute$4,745$4,745$4,745
Software license$0$0~$4,500 (NVAIE)
Engineering time40h setup60h setup4h setup
Total Year 1$59,940$63,540$110,940

For teams with ML platform engineers, vLLM is the clear cost winner. For teams without, NIM’s premium pays for itself in engineering time savings.

Free 30-min AI & Cloud consultation

Book Now