The AI Model Weight Problem

In 16-bit precision, Llama 3 70B weighs roughly 140 GB, and even 4-bit quantized builds run to tens of gigabytes. Frontier multimodal models can exceed 1 TB. These are not files you git push, and the infrastructure gap between how we manage software containers and how we manage AI models is causing real production problems.
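The 140 GB figure is what 70 billion parameters cost at 2 bytes each. As a rough sketch of that arithmetic (the function name and the dtype table are illustrative, and the estimate ignores tokenizer files and quantization metadata):

```python
# Rough weight-size arithmetic: parameters x bytes per parameter.
# Ignores tokenizer files, sharding overhead and quantization metadata.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate checkpoint size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"70B @ {dtype}: {weight_size_gb(70e9, dtype):.0f} GB")
```

Even the 4-bit case stays far above what Git-based workflows handle comfortably.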

The CNCF community is fixing this with projects like ORAS, Harbor, Dragonfly, and ModelPack.

The Gap

Software containers are pulled from OCI registries with full versioning, security scanning, and rollback support. Model weights, by contrast, are typically:

Downloaded via ad hoc scripts
Copied manually between S3 buckets
Distributed through unsecured NFS shares
Version-controlled with folder names like model-v3-final-FINAL

This creates deployment fragility, security risks, and operational overhead.

Three Approaches (None Sufficient Alone)

Approach	Pros	Cons
Git LFS / Hugging Face Hub	Version control, history	Poor for cloud native, Git transport overhead
Object Storage (S3/MinIO)	Cloud provider support, vLLM native	No structured metadata, weak versioning
Distributed FS (NFS/CephFS)	POSIX compatible, low integration cost	No versioning, only coarse file-level access control

The Cloud Native Solution: OCI Registries for Models

The key insight: treat model weights like container images. Store them in OCI registries using ORAS (OCI Registry As Storage).

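# Push a model directory to an OCI registry.
# Assumes llama3-70b/ holds the weight shards plus config.json,
# so a pull recreates one self-contained llama3-70b/ directory.
oras push registry.example.com/models/llama3-70b:v1.0 \
  --artifact-type application/vnd.ai.model.v1 \
  llama3-70b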

# Pull on a GPU node
oras pull registry.example.com/models/llama3-70b:v1.0
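What the registry stores for such a push is an OCI manifest that lists each blob by media type, digest, and size. The sketch below builds a manifest of that shape by hand to show the structure. The helper names are hypothetical, the layer bytes are a stand-in, and the media types and annotations that ORAS actually emits may differ.

```python
import hashlib
import json


def descriptor(media_type: str, blob: bytes) -> dict:
    """Build an OCI content descriptor: media type, sha256 digest, size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(artifact_type: str, layers: list[dict]) -> dict:
    """Assemble an OCI image manifest that carries a non-image artifact."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "artifactType": artifact_type,
        # Artifacts without a real config point at the empty JSON object "{}".
        "config": descriptor("application/vnd.oci.empty.v1+json", b"{}"),
        "layers": layers,
    }


if __name__ == "__main__":
    # Stand-in bytes; a real layer would be a multi-gigabyte weight shard.
    shard = descriptor("application/octet-stream", b"fake weight shard")
    manifest = build_manifest("application/vnd.ai.model.v1", [shard])
    print(json.dumps(manifest, indent=2))
```

Because every layer is referenced by digest, the manifest itself pins the exact bytes of the model.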

Benefits

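Immutability: a digest always resolves to the same bytes, and registries such as Harbor can also lock tags against overwrites
Security scanning: signing and policy tooling built for containers carries over, though image CVE scanners do not inspect weight files by default
Access control: registry authentication and RBAC
Content-addressable: SHA256 integrity verification
Replication: multi-region distribution via Harbor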
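The content-addressable point can be made concrete with a few lines of code. This is a minimal sketch of digest verification for a downloaded blob; the function names are illustrative, and registry clients such as ORAS already perform an equivalent check during a pull.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_digest(path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large shards never sit fully in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def verify_blob(path: Path, expected_digest: str) -> bool:
    """Return True only if the file's digest matches the one in the manifest."""
    return hmac.compare_digest(sha256_digest(path), expected_digest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blob = Path(tmp) / "shard-00001.bin"  # hypothetical file name
        blob.write_bytes(b"fake weight shard")
        expected = sha256_digest(blob)
        print("intact:", verify_blob(blob, expected))
        blob.write_bytes(b"tampered weight shard")
        print("after tampering:", verify_blob(blob, expected))
```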

Harbor for Model Management

Harbor (CNCF Graduated) manages model artifacts alongside container images:

Model versioning with tags
Vulnerability scanning through pluggable scanners (built for container images, so coverage of model artifacts depends on the scanner)
Replication across registries for multi-region deployment
Access control per model repository
Audit logging for compliance

Dragonfly for P2P Distribution

Distributing 140 GB to 100 GPU nodes from a single registry is a bandwidth nightmare. Dragonfly (CNCF Incubating) adds P2P distribution:

Registry → Node1 → Node2 → Node3 (P2P mesh)
                 ↘ Node4
         Node1 → Node5

Instead of 100 pulls from the registry, the model propagates through the cluster peer-to-peer: the first node pulls from the registry, and subsequent nodes pull from peers that already hold the data. On large clusters, distribution time can drop from hours to minutes.
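A back-of-the-envelope model shows where the hours-to-minutes gap comes from. The sketch below is a deliberately simplified assumption, not a description of Dragonfly's scheduler: it compares every node pulling from one registry against whole-file relaying where each holder serves one new node per round. The 10 Gbit/s link speeds are assumed values, and real P2P systems split files into pieces and usually do better than this.

```python
import math


def direct_pull_seconds(size_gb: float, nodes: int, registry_gbps: float) -> float:
    """All nodes pull from the registry and share its egress bandwidth."""
    return nodes * size_gb * 8 / registry_gbps


def p2p_doubling_seconds(size_gb: float, nodes: int, node_gbps: float) -> float:
    """Whole-file relay: every holder uploads to one new node per round.

    The registry counts as the first holder, so after r rounds 2**r - 1
    nodes have the model. Assumes the registry is at least as fast as a node.
    """
    rounds = math.ceil(math.log2(nodes + 1))
    return rounds * size_gb * 8 / node_gbps


if __name__ == "__main__":
    direct = direct_pull_seconds(140, 100, registry_gbps=10)
    p2p = p2p_doubling_seconds(140, 100, node_gbps=10)
    print(f"direct: {direct / 3600:.1f} h, p2p: {p2p / 60:.1f} min")
```

Under these assumptions the direct path is bounded by the registry's egress, while the relay path grows only logarithmically with cluster size.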

ModelPack: The Missing Link

ModelPack packages models as OCI artifacts with structured metadata:

Model architecture and hyperparameters
Training dataset provenance
Inference requirements (GPU memory, compute)
Compatibility matrix (vLLM, TensorRT, ONNX)

This is the model equivalent of a Dockerfile — it describes not just the artifact but how to run it.
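To illustrate why structured metadata matters, here is a hypothetical sketch of how a scheduler could use it to decide whether a node can serve a model. The field names and values are invented for illustration and do not follow the actual ModelPack schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetadata:
    # Hypothetical field names for illustration; not the ModelPack schema.
    name: str
    architecture: str
    parameters: int
    precision: str
    dataset_sources: tuple[str, ...] = ()
    min_total_gpu_memory_gb: int = 0
    min_gpus: int = 1
    runtimes: frozenset[str] = frozenset()


def can_serve(meta: ModelMetadata, runtime: str, gpus: int, gpu_memory_gb: int) -> bool:
    """Check a node's runtime and GPU capacity against the model's requirements."""
    return (
        runtime in meta.runtimes
        and gpus >= meta.min_gpus
        and gpus * gpu_memory_gb >= meta.min_total_gpu_memory_gb
    )


if __name__ == "__main__":
    meta = ModelMetadata(
        name="llama3-70b",
        architecture="decoder-only transformer",
        parameters=70_000_000_000,
        precision="bf16",
        dataset_sources=("placeholder-dataset",),  # placeholder, not real provenance
        min_total_gpu_memory_gb=160,  # assumed requirement
        min_gpus=2,
        runtimes=frozenset({"vllm"}),
    )
    print(can_serve(meta, "vllm", gpus=4, gpu_memory_gb=80))
    print(can_serve(meta, "onnx", gpus=4, gpu_memory_gb=80))
```

With plain object storage, the same decision would rely on naming conventions or out-of-band documentation.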

Putting It Together on Kubernetes

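# Minimal Pod showing only the pull and serve steps.
# In the full stack the model is stored in a Harbor OCI registry,
# distributed via Dragonfly P2P, cached locally with Fluid, and
# served with llm-d; those components are configured separately.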

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  initContainers:
    - name: model-pull
      image: ghcr.io/oras-project/oras:latest
      command:
        - oras
        - pull
        - registry.internal/models/llama3-70b:v1.0
        - -o
        - /models/
      volumeMounts:
        - name: model-storage
          mountPath: /models
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - --model=/models/llama3-70b
        - --tensor-parallel-size=4
      resources:
        limits:
          nvidia.com/gpu: "4"
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      emptyDir:
        sizeLimit: 200Gi
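Before applying a manifest like this, it is worth checking that the numbers fit together. The sketch below assumes 80 GB GPUs and a 20% memory reserve for KV cache and activations; neither value is stated in the manifest. It also treats tensor parallelism as an even split of the weights, which is only approximately true.

```python
GIB = 2**30  # the Kubernetes "Gi" suffix is binary
GB = 10**9   # model sizes in this article are decimal gigabytes


def volume_fits(model_gb: float, size_limit_gi: float) -> bool:
    """True if the model fits inside the emptyDir sizeLimit."""
    return model_gb * GB <= size_limit_gi * GIB


def weights_per_gpu_gb(model_gb: float, tensor_parallel: int) -> float:
    """Approximate weight shard per GPU, assuming an even split."""
    return model_gb / tensor_parallel


def gpus_fit(model_gb: float, tensor_parallel: int, gpu_memory_gb: float,
             reserve_fraction: float = 0.2) -> bool:
    """True if each shard fits after reserving memory for KV cache and activations."""
    usable = gpu_memory_gb * (1 - reserve_fraction)
    return weights_per_gpu_gb(model_gb, tensor_parallel) <= usable


if __name__ == "__main__":
    print("volume ok:", volume_fits(140, 200))
    print("per-GPU weights (GB):", weights_per_gpu_gb(140, 4))
    print("GPUs ok (80 GB each, assumed):", gpus_fit(140, 4, 80))
```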

My Prediction

Within two years, the standard enterprise AI deployment will use:

ORAS/Harbor for model storage and versioning
Dragonfly for P2P distribution to GPU clusters
Fluid for local caching and acceleration
llm-d for inference-aware serving

The “model as OCI artifact” pattern will be as standard as “app as container image.”

About the Author

I am Luca Berton, AI and Cloud Advisor. I design model delivery pipelines for enterprise AI platforms.
