Three tools, three use cases. vLLM is a production inference engine. TGI is Hugging Face’s optimized server. Ollama is a desktop app for running models locally. They overlap, but they target different audiences.
Quick comparison
| Feature | vLLM | TGI | Ollama |
|---|---|---|---|
| Primary use case | Production inference at scale | Production inference (HF ecosystem) | Local development and experimentation |
| Key innovation | PagedAttention | Flash Attention, speculation | One-command model download |
| API | OpenAI-compatible | HF-compatible + OpenAI | OpenAI-compatible |
| GPU support | NVIDIA, AMD (ROCm) | NVIDIA, AMD, Intel (Gaudi) | NVIDIA, Apple Silicon (Metal), CPU |
| Multi-GPU | Tensor parallelism, pipeline parallel | Tensor parallelism | No |
| Quantization | AWQ, GPTQ, FP8, GGUF | AWQ, GPTQ, EETQ, FP8 | GGUF (native) |
| Continuous batching | Yes | Yes | No |
| Max throughput | Highest | High | Low-medium |
| Ease of use | Moderate (Python/Docker) | Moderate (Docker) | Very easy (desktop app) |
| Kubernetes | Helm charts, KServe | Helm charts, HF Inference Endpoints | Not designed for K8s |
| License | Apache 2.0 | Apache 2.0 | MIT |
Architecture
vLLM
vLLM’s core innovation is PagedAttention — managing KV cache like virtual memory pages:
```bash
# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```
- KV cache is stored in fixed-size, non-contiguous blocks, so very little memory is lost to fragmentation
- Continuous batching adds new requests without waiting for current batch
- Prefix caching for repeated system prompts
- Speculative decoding for faster generation
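
To make the virtual-memory analogy concrete, here is a minimal sketch of a paged KV cache allocator. It is a toy model, not vLLM's implementation: the `PagedKVCache` class, the block size of 16 tokens, and the request names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default claim)


@dataclass
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical block ids."""
    num_blocks: int
    free_blocks: list = field(init=False)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the request must wait")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("req-b")  # 5 tokens occupy 1 block
print(cache.block_tables)
cache.free("req-a")
print(len(cache.free_blocks))    # 7 of 8 blocks are free again
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, the only unused space is the tail of each sequence's last block, which is what lets a scheduler admit new requests mid-flight.
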
TGI (Text Generation Inference)
TGI is Hugging Face’s production server with Flash Attention and token streaming:
```bash
# Start the TGI server
docker run --gpus all -p 8080:80 \
  -v /models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --num-shard 2 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```
- Flash Attention 2 for memory-efficient attention
- Speculation for faster generation
- Grammar-constrained generation (JSON output)
- Watermarking support
- Native Hugging Face Hub integration
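
As an illustration of grammar-constrained generation, the sketch below builds a request body for TGI's `/generate` route with a JSON grammar attached. The `grammar` parameter shape follows the TGI guidance documentation as I understand it, so check it against the version you deploy. The schema, the prompt, and the `build_generate_payload` helper are hypothetical.

```python
import json

# Hypothetical schema for illustration; any JSON Schema object could be used.
schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "gpu_count": {"type": "integer"},
    },
    "required": ["tool", "gpu_count"],
}


def build_generate_payload(prompt: str, json_schema: dict, max_new_tokens: int = 128) -> dict:
    """Build the body for a POST to /generate that constrains output to json_schema."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "grammar": {"type": "json", "value": json_schema},
        },
    }


payload = build_generate_payload("Recommend a serving tool for two GPUs.", schema)
print(json.dumps(payload, indent=2))
```

Sending this body to `http://localhost:8080/generate` (the port mapped in the Docker command above) should return text that parses against the schema, assuming the server version supports the grammar parameter.
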
Ollama
Ollama is a desktop application that downloads and runs models with one command:
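```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model interactively (downloads it on first use)
ollama run llama3

# Start the API server (skip if the installer already runs Ollama as a service)
ollama serve

# Query the OpenAI-compatible endpoint from another terminal
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'
```
- GGUF quantization (runs large models on consumer hardware)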
- Apple Silicon Metal acceleration
- CPU-only inference for machines without GPUs
- Modelfile for custom model configurations
- Automatic model management (download, update, delete)
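
Since all three servers expose an OpenAI-compatible chat endpoint, one small client can talk to any of them. The sketch below uses only the Python standard library. The base URLs assume the default ports used in this article, and the `build_chat_request` and `chat` helpers are hypothetical names.

```python
import json
import urllib.request

# Ports taken from the commands in this article; adjust them for your setup.
BASE_URLS = {
    "vllm": "http://localhost:8000/v1",
    "tgi": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
}


def build_chat_request(server: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the chosen server."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BASE_URLS[server]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(server: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(server, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


request = build_chat_request("ollama", "llama3", "Hello")
print(request.get_method(), request.full_url)
```

With a server running, `chat("ollama", "llama3", "Hello")` returns the reply text. Switching to vLLM or TGI means changing only the server key and the model name.
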
Performance benchmarks
Measured on 1x A100 80GB, Llama 3 8B Instruct, 128 concurrent users:
| Metric | vLLM | TGI | Ollama |
|---|---|---|---|
| Throughput (tokens/s) | 4,200 | 3,800 | 120 |
| p50 TTFT | 45 ms | 52 ms | 890 ms |
| p99 TTFT | 180 ms | 210 ms | 3,200 ms |
| p50 ITL | 12 ms | 14 ms | 35 ms |
| Max concurrent | 256+ | 256+ | 1-4 |
| GPU memory efficiency | 95% | 90% | 70% |
TTFT = Time to First Token. ITL = Inter-Token Latency.
vLLM leads in throughput because PagedAttention wastes less KV cache memory, which lets more requests share each batch. Ollama is not designed for concurrent serving; it excels at single-user interactive use.
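
To show how TTFT and ITL are derived, here is a minimal sketch that computes both from per-token arrival times. The timestamps are made up for illustration and are not taken from the benchmark above. The helper names and the nearest-rank percentile method are assumptions.

```python
import math


def ttft_and_itl(request_sent_ms: float, token_times_ms: list) -> tuple:
    """Return (time to first token, list of gaps between consecutive tokens)."""
    ttft = token_times_ms[0] - request_sent_ms
    itl = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    return ttft, itl


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct percent of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic example: request sent at t=0 ms, four tokens arrive afterwards.
ttft, itl = ttft_and_itl(0, [50, 60, 75, 85])
print("TTFT (ms):", ttft)
print("ITL samples (ms):", itl)
print("p50 ITL (ms):", percentile(itl, 50))
print("p99 ITL (ms):", percentile(itl, 99))
```

In a real benchmark the same calculation runs over every streamed response, and the percentiles are taken across all requests rather than within a single one.
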
GPU requirements
Running Llama 3 8B
| Setup | vLLM | TGI | Ollama |
|---|---|---|---|
| FP16 | 1x GPU (16 GB+) | 1x GPU (16 GB+) | 1x GPU (16 GB+) |
| INT8/AWQ | 1x GPU (8 GB+) | 1x GPU (8 GB+) | N/A |
| Q4_K_M (GGUF) | Experimental (recent GGUF support) | N/A | CPU or 6 GB GPU |
Running Llama 3 70B
| Setup | vLLM | TGI | Ollama |
|---|---|---|---|
| FP16 | 2x A100 80GB | 2x A100 80GB | Not practical |
| AWQ/INT4 | 1x A100 80GB | 1x A100 80GB | 1x GPU (48 GB+) |
| Q4_K_M (GGUF) | Experimental (recent GGUF support) | N/A | 48 GB RAM (CPU, slow) |
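
The figures in these tables follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below reproduces that estimate. It counts weights only and treats 1 GB as 10^9 bytes. The `weight_memory_gb` helper and the precision labels are hypothetical.

```python
# Approximate storage cost per parameter. Real 4-bit formats such as Q4_K_M
# use somewhat more than 4 bits per weight, so treat int4 as a lower bound.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate weight memory in GB, excluding KV cache and activations."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (8, 70):
    for precision in BYTES_PER_PARAM:
        print(f"{size}B {precision}: ~{weight_memory_gb(size, precision):.0f} GB")
```

The KV cache and activations come on top of this, which is why 70B in FP16 (about 140 GB of weights) needs two 80 GB GPUs while a 4-bit variant (about 35 GB) fits on one.
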
Kubernetes deployment
vLLM on Kubernetes
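```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value here is illustrative.
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Meta-Llama-3-8B-Instruct
            - --tensor-parallel-size=1
            - --gpu-memory-utilization=0.9
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
```
TGI on Kubernetes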
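```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-llama3
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value here is illustrative.
  selector:
    matchLabels:
      app: tgi-llama3
  template:
    metadata:
      labels:
        app: tgi-llama3
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args:
            - --model-id=meta-llama/Meta-Llama-3-8B-Instruct
            - --num-shard=1
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 80
```
Both work well with NVIDIA NIM, KServe, and HPA for autoscaling.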
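
For a sense of how HPA would scale either Deployment, the sketch below applies the replica formula from the Kubernetes documentation: desired replicas equal the current count times the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The metric itself, in-flight requests per pod, is a hypothetical choice for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply the HPA scaling rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    return math.ceil(current_replicas * ratio)


# Hypothetical metric: average in-flight requests per pod, with a target of 32.
print(desired_replicas(2, 48, 32))  # load is 1.5x target, scale up to 3
print(desired_replicas(2, 33, 32))  # within tolerance, stay at 2
print(desired_replicas(2, 8, 32))   # load is 0.25x target, scale down to 1
```

GPU-backed pods start slowly because model weights must load first, so in practice the target is usually set conservatively to leave headroom while new replicas warm up.
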
Decision guide
Choose vLLM when:
- Production inference at scale — highest throughput, best memory efficiency
- You need multi-GPU serving (tensor/pipeline parallelism)
- OpenAI-compatible API is required (drop-in replacement)
- You are deploying on Kubernetes or OpenShift
- Maximum tokens per second per dollar is the goal
Choose TGI when:
- You are in the Hugging Face ecosystem (Hub, Inference Endpoints)
- You need grammar-constrained generation (guaranteed JSON output)
- Intel Gaudi hardware support is needed
- You want managed deployment via Hugging Face Inference Endpoints
- Token watermarking is a requirement
Choose Ollama when:
- Local development and experimentation on your laptop
- You want to run models on Apple Silicon (M1/M2/M3/M4) with Metal
- No GPU available — Ollama runs on CPU with GGUF quantization
- You want the simplest possible setup — one command to start
- Privacy — all inference runs locally, no data leaves your machine
- Building prototypes before deploying to production with vLLM/TGI