GPU Sharing on Kubernetes: MIG vs MPS vs Time-Slicing

Compare NVIDIA GPU sharing strategies for Kubernetes workloads: Multi-Instance GPU (MIG), Multi-Process Service (MPS), and time-slicing, with overhead estimates and use cases.

Luca Berton
· 3 min read

Why Share GPUs?

A single A100 80GB costs $3.25/hour. Many AI workloads don't need a full GPU:

  • Development/testing: needs 10-20% of an A100
  • Small model inference: 7B models use 16GB of 80GB
  • Batch preprocessing: CPU-bound with occasional GPU bursts
  • Multiple small models: each needs 10-20GB

Without sharing, you waste 60-80% of GPU capacity. Three technologies solve this:

Comparison Matrix

| Feature | MIG | MPS | Time-Slicing |
|---|---|---|---|
| Isolation | ✅ Hardware | ⚠️ Process-level | ❌ None |
| Memory isolation | ✅ Guaranteed | ❌ Shared | ❌ Shared |
| Fault isolation | ✅ Full | ❌ Shared failure domain | ❌ Shared |
| Supported GPUs | A100, H100, H200 | All NVIDIA | All NVIDIA |
| Max partitions | 7 (A100) | 48 processes | Unlimited |
| Overhead | ~0% | ~5% | ~10-15% |
| Use case | Production multi-tenant | Cooperative workloads | Dev/test |

Multi-Instance GPU (MIG)

MIG hardware-partitions a GPU into up to 7 isolated instances. Each instance has dedicated compute, memory, and cache, much like a separate physical GPU.

Supported GPU Profiles (A100 80GB)

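| Profile | GPU Memory | Compute (fraction of SMs) | Max Instances per GPU |
|---|---|---|---|
| 1g.10gb | 10GB | 1/7 | Up to 7 |
| 2g.20gb | 20GB | 2/7 | Up to 3 |
| 3g.40gb | 40GB | 3/7 | Up to 2 |
| 4g.40gb | 40GB | 4/7 | 1 |
| 7g.80gb | 80GB | 7/7 (full) | 1 |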
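
As a rough illustration of how the profile table constrains a layout, the sketch below adds up compute slices and memory for a requested set of instances and checks them against the A100 80GB budget. The function name `layout_fits` is hypothetical, and the check is only a necessary condition: real MIG placement rules are stricter than a simple sum.

```python
# Per-profile cost on an A100 80GB, taken from the profile table:
# (compute slices out of 7, memory in GB)
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

TOTAL_SLICES = 7
TOTAL_MEMORY_GB = 80


def layout_fits(layout):
    """Return True if the instance counts fit the slice and memory budget.

    `layout` maps a profile name to the number of instances requested.
    This is a necessary check only; actual placement rules are stricter.
    """
    slices = sum(PROFILES[name][0] * count for name, count in layout.items())
    memory = sum(PROFILES[name][1] * count for name, count in layout.items())
    return slices <= TOTAL_SLICES and memory <= TOTAL_MEMORY_GB


print(layout_fits({"1g.10gb": 7}))                                # seven small slices
print(layout_fits({"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}))    # 3 + 2 + 2 = 7 slices
print(layout_fits({"4g.40gb": 1, "3g.40gb": 1, "1g.10gb": 1}))    # 8 slices, too many
```

The first two layouts stay within 7 slices and 80GB; the third asks for 8 slices and is rejected.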

Enable MIG on Kubernetes

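# GPU Operator MIG Manager configuration (mig-parted format)
# Select a layout per node with the label nvidia.com/mig.config=<layout-name>, e.g. all-1g.10gb
# Pods can request nvidia.com/mig-<profile> resources when the operator runs with mig.strategy=mixed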
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - device-filter: ["0x20B210DE"]  # A100
          devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      mixed:
        - device-filter: ["0x20B210DE"]
          devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "2g.20gb": 1
            "1g.10gb": 2

Request MIG Slices in Pods

apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
    - name: model
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1  # Request 1 MIG slice (10GB)
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-large
spec:
  containers:
    - name: model
      image: vllm/vllm-openai:latest  # assumed: same image as the first Pod; a container needs an image
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1  # Request larger slice (40GB)

MIG Best Practices

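  • Use 1g.10gb for 7-8B models only when quantized (Llama 8B fits in quantized form); FP16 weights alone need about 14-16GB, so use 2g.20gb for unquantized 7B models
  • Use 3g.40gb for 13B models in FP16 (about 26GB of weights), or for models up to about 34B when quantized
  • Use 7g.80gb (full GPU) for quantized 70B models; 70B in FP16 (about 140GB of weights) needs more than one GPU
  • Profiles can be mixed on one GPU only in supported combinations (such as the mixed layout above), so plan layouts ahead
  • Reconfiguring MIG requires a GPU reset (brief downtime)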
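
The sizing guidance above can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a minimal estimate, not a capacity planner. The helper names and the 1.2x headroom factor for KV cache and activations are assumptions for illustration; real headroom depends on batch size and context length.

```python
# MIG profile memory on an A100 80GB (GB), smallest first.
# 4g.40gb is omitted because it offers the same memory as 3g.40gb.
PROFILE_MEMORY_GB = [
    ("1g.10gb", 10),
    ("2g.20gb", 20),
    ("3g.40gb", 40),
    ("7g.80gb", 80),
]


def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for model weights in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_param / 8


def smallest_profile(params_billion, bits_per_param, headroom=1.2):
    """Return the smallest profile whose memory covers weights plus headroom.

    `headroom` is an assumed multiplier for KV cache and activations.
    Returns None when even the full GPU is too small.
    """
    needed = weight_memory_gb(params_billion, bits_per_param) * headroom
    for name, memory_gb in PROFILE_MEMORY_GB:
        if needed <= memory_gb:
            return name
    return None


for params, bits in [(7, 16), (8, 4), (13, 16), (34, 4), (70, 4), (70, 16)]:
    print(f"{params}B at {bits}-bit -> {smallest_profile(params, bits)}")
```

Under these assumptions a 7B model in FP16 lands on 2g.20gb rather than 1g.10gb, and a 70B model in FP16 does not fit on a single 80GB GPU at all.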

Multi-Process Service (MPS)

MPS allows multiple CUDA processes to share a GPU simultaneously via a shared CUDA context. There is no hardware partitioning: processes cooperatively share compute.

How MPS Works

Without MPS:              With MPS:
┌─────────────┐           ┌─────────────┐
│   Process A │ ──time──▶ │ Process A+B │ ──parallel──▶
│   Process B │   slice   │  (shared)   │   execution
└─────────────┘           └─────────────┘
     GPU idle               GPU busy
   between procs           continuously

Enable MPS on Kubernetes

# NVIDIA device plugin config
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      mps:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 10  # Allow 10 pods to share each GPU

Request Shared GPU

apiVersion: v1
kind: Pod
metadata:
  name: preprocessing
spec:
  containers:
    - name: worker
      image: your-registry/preprocessing:latest  # hypothetical image name; replace with your own
      resources:
        limits:
          nvidia.com/gpu: 1  # Gets shared access via MPS
        requests:
          nvidia.com/gpu: 1

MPS Limitations

  • No memory isolation: one process can OOM and crash all
  • No fault isolation: a CUDA error affects all processes
  • Best for cooperative workloads you control (same team, same app)
  • Not recommended for multi-tenant production environments

Time-Slicing

The simplest approach: multiple pods take turns using the GPU. The NVIDIA device plugin advertises more GPUs than physically exist.

Configure Time-Slicing

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each GPU appears as 4 virtual GPUs

How It Works

  • GPU rapidly switches between processes (context switching)
  • Each pod "sees" a full GPU but shares time
  • No memory isolation β€” total memory is shared
  • ~10-15% overhead from context switching
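
To make the effect of `replicas` concrete, here is a small sketch of the arithmetic. The function names are hypothetical, the 80GB default assumes an A100 80GB, and the time fraction ignores the context-switching overhead mentioned above.

```python
def advertised_gpus(physical_gpus, replicas):
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    return physical_gpus * replicas


def per_pod_budget(replicas, gpu_memory_gb=80.0):
    """Rough per-pod share when every replica is busy at once.

    Nothing enforces the memory figure under time-slicing; it is a
    budget each workload has to respect on its own.
    """
    return {
        "time_fraction": 1 / replicas,
        "memory_gb": gpu_memory_gb / replicas,
    }


print(advertised_gpus(physical_gpus=8, replicas=4))
print(per_pod_budget(replicas=4))
```

With `replicas: 4`, an 8-GPU node advertises 32 schedulable GPUs, and four busy pods on one GPU each get about a quarter of the time and should budget about 20GB of memory.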

Best For

  • Development clusters: many developers sharing few GPUs
  • CI/CD pipelines: short GPU jobs that don't need full capacity
  • Non-critical workloads: where occasional latency spikes are acceptable

Decision Framework

Choose MIG when:

  • ✅ Production multi-tenant GPU sharing
  • ✅ Need guaranteed memory and compute isolation
  • ✅ Running different customers/teams on same hardware
  • ✅ Using A100/H100/H200 GPUs
  • ✅ Security requirements demand isolation

Choose MPS when:

  • ✅ Multiple cooperative processes from same application
  • ✅ Batch inference with many small models
  • ✅ Need higher GPU utilization without hardware partitioning
  • ✅ Processes are trusted (same team/namespace)
  • ✅ Using older GPUs without MIG support (V100, T4)

Choose Time-Slicing when:

  • ✅ Development and testing environments
  • ✅ CI/CD GPU access for unit tests
  • ✅ Budget constraints (any GPU works)
  • ✅ Workloads are bursty (idle most of the time)
  • ✅ Simplicity over performance
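
The checklists above can be condensed into a small decision helper. This is only one possible reading of the framework: the function name, the order of the checks, and the "no sharing" fallback for isolation requirements on GPUs without MIG are assumptions, not something the checklists state.

```python
# GPUs listed as MIG-capable in the comparison matrix
MIG_CAPABLE = {"A100", "H100", "H200"}


def choose_strategy(gpu_model, needs_isolation, trusted_processes):
    """Pick a sharing strategy from three yes/no inputs.

    needs_isolation:    multi-tenant or security requirements
    trusted_processes:  cooperating processes from the same team or app
    """
    if needs_isolation:
        # Only MIG offers hardware isolation; without it, do not share (assumption).
        return "MIG" if gpu_model in MIG_CAPABLE else "no sharing"
    if trusted_processes:
        return "MPS"
    # Dev/test, CI/CD, and bursty workloads that tolerate latency spikes
    return "time-slicing"


print(choose_strategy("A100", needs_isolation=True, trusted_processes=False))
print(choose_strategy("T4", needs_isolation=False, trusted_processes=True))
print(choose_strategy("T4", needs_isolation=False, trusted_processes=False))
```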

Cost Savings Example

Scenario: 4 teams each running a 7B model inference service

| Strategy | GPUs Needed | Monthly Cost (A100) |
|---|---|---|
| No sharing | 4x A100 | $9,490 |
| Time-slicing (4x) | 1x A100 | $2,373 |
| MPS (4 procs) | 1x A100 | $2,373 |
| MIG (4x 1g.10gb) | 1x A100 | $2,373 |

All three save 75%, but MIG gives production-grade isolation at the same cost.
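
The figures in the cost table follow from the $3.25/hour rate quoted at the start of the article. The sketch below reproduces them, assuming a 730-hour month (8,760 hours per year divided by 12); that month length is inferred from the table rather than stated in it.

```python
HOURLY_RATE_USD = 3.25   # A100 80GB rate quoted in the article
HOURS_PER_MONTH = 730    # assumption: 8,760 hours per year / 12


def monthly_cost(gpu_count):
    """Monthly cost in USD for GPUs running around the clock."""
    return gpu_count * HOURLY_RATE_USD * HOURS_PER_MONTH


no_sharing = monthly_cost(4)   # one A100 per team
shared = monthly_cost(1)       # four teams on one A100
savings = 1 - shared / no_sharing

print(f"No sharing: ${no_sharing:,.2f}")
print(f"Shared:     ${shared:,.2f}")
print(f"Savings:    {savings:.0%}")
```

The shared figure comes out at $2,372.50, which the table rounds up to $2,373.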
