AI Model Weight Distribution on Kubernetes

The AI Model Weight Problem

Managing and distributing 100 GB+ AI model weights is a hidden bottleneck. ORAS, Harbor, Dragonfly, and ModelPack bring software delivery practices to AI models.

Luca Berton

At 16-bit precision, a LLaMA-3 70B model weighs about 140 GB (70 billion parameters at 2 bytes each); even a 4-bit quantized copy is roughly 35 GB. Frontier multimodal models can exceed 1 TB. These are not files you git push, and the infrastructure gap between how we manage software containers and how we manage AI models is causing real production problems.
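
As a rough check on these sizes, the raw weight footprint is approximately the parameter count times the bytes per parameter. The sketch below assumes a nominal 70 billion parameters and ignores tokenizer files, metadata, and file-format overhead; the function name is illustrative.

```python
def weight_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Nominal parameter count, rounded for illustration (an assumption).
LLAMA3_70B_PARAMS = 70_000_000_000

for bits in (16, 8, 4):
    size = weight_size_gb(LLAMA3_70B_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.0f} GB")
```

With these assumptions, 16-bit weights come to about 140 GB and 4-bit weights to about 35 GB, so quantization shrinks the problem but does not remove it.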

The CNCF community is fixing this with projects like ORAS, Harbor, Dragonfly, and ModelPack.

The Gap

Software containers are pulled from OCI registries with full versioning, security scanning, and rollback support. Model weights, by contrast, are often:

  • Downloaded via ad hoc scripts
  • Copied manually between S3 buckets
  • Distributed through unsecured NFS shares
  • Version-controlled with folder names like model-v3-final-FINAL

This creates deployment fragility, security risks, and operational overhead.

Three Approaches (None Sufficient Alone)

Approach                    | Pros                                   | Cons
----------------------------|----------------------------------------|----------------------------------------------
Git LFS / Hugging Face Hub  | Version control, history               | Poor for cloud native, Git transport overhead
Object Storage (S3/MinIO)   | Cloud provider support, vLLM native    | No structured metadata, weak versioning
Distributed FS (NFS/CephFS) | POSIX compatible, low integration cost | No versioning, no access control

The Cloud Native Solution: OCI Registries for Models

The key insight: treat model weights like container images. Store them in OCI registries using ORAS (OCI Registry As Storage).

# Push a model to an OCI registry
oras push registry.example.com/models/llama3-70b:v1.0 \
  --artifact-type application/vnd.ai.model.v1 \
  model-weights/ \
  config.json

# Pull on a GPU node
oras pull registry.example.com/models/llama3-70b:v1.0
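
Tags such as v1.0 can be moved to a different artifact, so deployments that need exact reproducibility usually pull by digest instead. The sketch below only builds the digest-pinned reference and the argument list for `oras pull`; the helper names are hypothetical and the all-zero digest is a placeholder, not a real artifact.

```python
def pinned_reference(repository: str, digest: str) -> str:
    """Build a digest-pinned OCI reference of the form repo@sha256:<hex>."""
    algo, _, hex_part = digest.partition(":")
    is_hex = all(c in "0123456789abcdef" for c in hex_part)
    if algo != "sha256" or len(hex_part) != 64 or not is_hex:
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


def oras_pull_command(reference: str, output_dir: str) -> list[str]:
    """Return the argv for `oras pull <reference> -o <output_dir>`."""
    return ["oras", "pull", reference, "-o", output_dir]


# Placeholder digest for illustration only; use the digest your registry reports.
example_digest = "sha256:" + "0" * 64
ref = pinned_reference("registry.example.com/models/llama3-70b", example_digest)
print(" ".join(oras_pull_command(ref, "/models/llama3-70b")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a node where the `oras` CLI is installed and logged in to the registry.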

Benefits

  • Immutability: every artifact is addressed by a digest that never changes, and tags can be locked as well (Harbor supports tag immutability rules)
  • Security scanning: same tools that scan containers
  • Access control: registry authentication and RBAC
  • Content-addressable: SHA-256 integrity verification
  • Replication: multi-region distribution via Harbor
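
To make the content-addressable point concrete: an OCI digest is the SHA-256 hash of the blob's bytes, so any node can re-hash what it downloaded and compare the result with the digest it asked for. A minimal sketch, assuming the blob is a local file; the function names are illustrative, and registry clients such as ORAS already perform this check during a pull.

```python
import hashlib


def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def verify_blob(path: str, expected_digest: str) -> bool:
    """Return True only if the file's content matches the expected OCI-style digest."""
    return sha256_digest(path) == expected_digest
```

A single flipped byte anywhere in a 140 GB file changes the digest, which is why a digest-pinned pull doubles as an integrity check.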

Harbor for Model Management

Harbor (CNCF Graduated) manages model artifacts alongside container images:

  • Model versioning with tags
  • Vulnerability scanning of model artifacts
  • Replication across registries for multi-region deployment
  • Access control per model repository
  • Audit logging for compliance

Dragonfly for P2P Distribution

Distributing 140 GB to 100 GPU nodes from a single registry is a bandwidth nightmare. Dragonfly (CNCF Incubating) adds P2P distribution:

Registry → Node1 → Node2 → Node3 (P2P mesh)
                 ↘ Node4
           Node1 → Node5

Instead of 100 pulls from the registry, the model propagates through the cluster peer-to-peer: the first node pulls from the registry, and subsequent nodes fetch pieces from peers that already hold them. Depending on registry and node bandwidth, distribution time can drop from hours to minutes.
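
A back-of-the-envelope model shows where the hours-versus-minutes gap comes from. The sketch below uses the textbook lower bounds for client-server and peer-to-peer file distribution, not Dragonfly's actual scheduler, and the 10 Gbit/s link speeds are assumptions chosen for illustration.

```python
def registry_only_seconds(size_gbit, nodes, registry_up_gbps, node_down_gbps):
    """Every node downloads the full model through the registry's shared uplink."""
    return max(nodes * size_gbit / registry_up_gbps, size_gbit / node_down_gbps)


def p2p_lower_bound_seconds(size_gbit, nodes, registry_up_gbps,
                            node_up_gbps, node_down_gbps):
    """Idealized bound: peers re-upload pieces, so total upload capacity grows with the cluster."""
    total_upload = registry_up_gbps + nodes * node_up_gbps
    return max(
        size_gbit / registry_up_gbps,      # the registry must send one full copy
        size_gbit / node_down_gbps,        # each node must receive one full copy
        nodes * size_gbit / total_upload,  # all copies over the combined upload capacity
    )


SIZE_GBIT = 140 * 8  # 140 GB expressed in gigabits
NODES = 100
LINK_GBPS = 10       # assumed speed for the registry uplink and every node link

central = registry_only_seconds(SIZE_GBIT, NODES, LINK_GBPS, LINK_GBPS)
p2p = p2p_lower_bound_seconds(SIZE_GBIT, NODES, LINK_GBPS, LINK_GBPS, LINK_GBPS)
print(f"registry only:   {central / 3600:.1f} h")
print(f"P2P lower bound: {p2p / 60:.1f} min")
```

Under these assumptions the registry-only path takes about 3.1 hours, while the P2P bound is about 2 minutes. Real clusters land somewhere above the bound, since scheduling, piece overhead, and disk throughput all add time.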

ModelPack packages models as OCI artifacts with structured metadata:

  • Model architecture and hyperparameters
  • Training dataset provenance
  • Inference requirements (GPU memory, compute)
  • Compatibility matrix (vLLM, TensorRT, ONNX)

This is the model equivalent of a Dockerfile: it describes not just the artifact but how to run it.
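
To illustrate why structured metadata matters, the sketch below models the four kinds of information listed above as a small record and uses it for a scheduling-style check. The field names and values are hypothetical and are not the ModelPack schema; consult the ModelPack specification for the real format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelMetadata:
    """Hypothetical metadata record; field names are NOT the ModelPack schema."""
    name: str
    architecture: str
    parameters_billions: float
    training_datasets: list[str] = field(default_factory=list)
    min_gpu_memory_gb: int = 0
    runtimes: list[str] = field(default_factory=list)


def can_serve(meta: ModelMetadata, runtime: str, gpu_memory_gb: int) -> bool:
    """Check a runtime and a GPU memory budget against the declared requirements."""
    return runtime in meta.runtimes and gpu_memory_gb >= meta.min_gpu_memory_gb


meta = ModelMetadata(
    name="llama3-70b",
    architecture="transformer",
    parameters_billions=70,
    training_datasets=["example-dataset"],  # placeholder provenance entry
    min_gpu_memory_gb=160,                  # illustrative requirement
    runtimes=["vLLM", "TensorRT"],
)
print(json.dumps(asdict(meta), indent=2))
print(can_serve(meta, "vLLM", gpu_memory_gb=4 * 80))
```

With metadata in this shape, a platform can reject an incompatible placement before pulling 140 GB onto a node rather than after.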

Putting It Together on Kubernetes

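# Pipeline context: model stored in a Harbor OCI registry, distributed via
# Dragonfly P2P, cached locally with Fluid, and served with llm-d.
# This Pod shows only the minimal core: an init container pulls the model
# with ORAS, then vLLM serves it from a shared volume.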

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  initContainers:
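    - name: model-pull
      # Pin a specific oras release in production instead of :latest
      image: ghcr.io/oras-project/oras:latest
      command:
        - oras
        - pull
        - registry.internal/models/llama3-70b:v1.0
        - -o
        # Pull into the directory vLLM loads below; assumes the artifact
        # unpacks with config.json next to the weight files
        - /models/llama3-70b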
      volumeMounts:
        - name: model-storage
          mountPath: /models
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - --model=/models/llama3-70b
        - --tensor-parallel-size=4
      resources:
        limits:
          nvidia.com/gpu: "4"
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      emptyDir:
        sizeLimit: 200Gi

My Prediction

Within two years, the standard enterprise AI deployment will use:

  1. ORAS/Harbor for model storage and versioning
  2. Dragonfly for P2P distribution to GPU clusters
  3. Fluid for local caching and acceleration
  4. llm-d for inference-aware serving

The “model as OCI artifact” pattern will be as standard as “app as container image.”

About the Author

I am Luca Berton, AI and Cloud Advisor. I design model delivery pipelines for enterprise AI platforms.
