A LLaMA-3 70B model in 16-bit precision weighs about 140 GB. Frontier multimodal models exceed 1 TB. These are not files you git push, and the infrastructure gap between how we manage software containers and how we manage AI models is causing real production problems.
The CNCF community is fixing this with projects like ORAS, Harbor, Dragonfly, and ModelPack.
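As a rough sanity check on these sizes, the weight footprint is approximately the parameter count times the bytes per parameter. The sketch below assumes 70 billion parameters and ignores quantization scales, tokenizer files, and other metadata, so real artifacts will differ somewhat.

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed precisions, for illustration only
for label, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>9}: {weight_size_gb(70e9, bits):.0f} GB")
```

Under these assumptions the 16-bit weights come to about 140 GB, and lower-precision variants shrink proportionally.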
The Gap
Container images are pulled from OCI registries with full versioning, security scanning, and rollback support. Model weights, by contrast, are often:
- Downloaded via ad hoc scripts
- Copied manually between S3 buckets
- Distributed through unsecured NFS shares
- Version-controlled with folder names like model-v3-final-FINAL
This creates deployment fragility, security risks, and operational overhead.
Three Approaches (None Sufficient Alone)
| Approach | Pros | Cons |
|---|---|---|
| Git LFS / Hugging Face Hub | Version control, history | Poor fit for cloud native workflows, Git transport overhead |
| Object Storage (S3/MinIO) | Cloud provider support, vLLM native | No structured metadata, weak versioning |
| Distributed FS (NFS/CephFS) | POSIX compatible, low integration cost | No versioning, no access control |
The Cloud Native Solution: OCI Registries for Models
The key insight: treat model weights like container images. Store them in OCI registries using ORAS (OCI Registry As Storage).
```shell
# Push a model to an OCI registry
oras push registry.example.com/models/llama3-70b:v1.0 \
  --artifact-type application/vnd.ai.model.v1 \
  model-weights/ \
  config.json

# Pull on a GPU node
oras pull registry.example.com/models/llama3-70b:v1.0
```
Benefits
- Immutability: content digests never change, and registries such as Harbor can also make tags immutable
- Security scanning: same tools that scan containers
- Access control: registry authentication and RBAC
- Content-addressable: SHA256 integrity verification
- Replication: multi-region distribution via Harbor
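To make the content-addressable point concrete, here is a minimal sketch of how a client could recompute a blob's digest and compare it with the value recorded in a manifest. The helper names `oci_digest` and `verify_blob` and the file `weights.bin` are hypothetical, chosen for illustration; registry clients such as ORAS perform this kind of check themselves.

```python
import hashlib
import tempfile
from pathlib import Path


def oci_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return an OCI-style digest ("sha256:<hex>"), reading the file in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return f"sha256:{hasher.hexdigest()}"


def verify_blob(path: Path, expected_digest: str) -> bool:
    """True if the file content still matches the expected digest."""
    return oci_digest(path) == expected_digest


with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "weights.bin"  # stand-in for a real weight shard
    blob.write_bytes(b"example weights")
    digest = oci_digest(blob)
    print(digest)
    print(verify_blob(blob, digest))  # content unchanged

    blob.write_bytes(b"tampered weights")
    print(verify_blob(blob, digest))  # content changed, digest no longer matches
```

Streaming in chunks matters here because weight files are far too large to read into memory in one call.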
Harbor for Model Management
Harbor (CNCF Graduated) manages model artifacts alongside container images:
- Model versioning with tags
- Vulnerability scanning of model artifacts
- Replication across registries for multi-region deployment
- Access control per model repository
- Audit logging for compliance
Dragonfly for P2P Distribution
Distributing 140 GB to 100 GPU nodes from a single registry means about 14 TB of registry egress, a bandwidth nightmare. Dragonfly (CNCF Incubating) adds P2P distribution:
```text
Registry → Node1 → Node2 → Node3   (P2P mesh)
                 → Node4
           Node1 → Node5
```
Instead of 100 pulls from the registry, the model propagates through the cluster peer-to-peer. The first node pulls from the registry; subsequent nodes pull from peers. Distribution time can drop from hours to minutes.
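A back-of-the-envelope calculation shows where the hours-versus-minutes difference comes from. This sketch assumes a 10 Gbit/s registry uplink (a made-up figure, not from any benchmark) and an idealized P2P case in which the registry only has to serve roughly one full copy; real results depend on peer bandwidth and piece scheduling.

```python
def transfer_seconds(gigabytes: float, gbit_per_s: float) -> float:
    """Time to move `gigabytes` (decimal GB) over a link of `gbit_per_s`."""
    return gigabytes * 8 / gbit_per_s


MODEL_GB = 140             # model size from the text
NODES = 100                # GPU nodes from the text
REGISTRY_UPLINK_GBIT = 10  # assumption: registry egress bandwidth

# Every node pulls directly: the registry serves the model 100 times.
direct_egress_gb = MODEL_GB * NODES
direct_hours = transfer_seconds(direct_egress_gb, REGISTRY_UPLINK_GBIT) / 3600

# Idealized P2P: the registry serves about one copy, peers share the rest.
p2p_seed_minutes = transfer_seconds(MODEL_GB, REGISTRY_UPLINK_GBIT) / 60

print(f"direct pulls: {direct_egress_gb / 1000:.0f} TB egress, ~{direct_hours:.1f} h")
print(f"P2P seeding:  {MODEL_GB} GB egress, ~{p2p_seed_minutes:.1f} min for the first copy")
```

The P2P figure is only a lower bound on the registry side, but it illustrates why the registry stops being the bottleneck.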
ModelPack: The Missing Link
ModelPack packages models as OCI artifacts with structured metadata:
- Model architecture and hyperparameters
- Training dataset provenance
- Inference requirements (GPU memory, compute)
- Compatibility matrix (vLLM, TensorRT, ONNX)
This is the model equivalent of a Dockerfile: it describes not just the artifact but how to run it.
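To illustrate what structured metadata buys you, the sketch below models the four categories above as a small record and uses it for a scheduling check. The class, field names, and values are hypothetical placeholders for illustration only; they are not the ModelPack schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelMetadata:
    """Hypothetical metadata record; NOT the actual ModelPack format."""
    name: str
    architecture: str
    hyperparameters: dict
    dataset_provenance: list
    min_gpu_memory_gb: int
    compatible_runtimes: list = field(default_factory=list)

    def can_run_on(self, runtime: str, gpu_memory_gb: int) -> bool:
        """True if the runtime is listed and the node has enough GPU memory."""
        return (runtime in self.compatible_runtimes
                and gpu_memory_gb >= self.min_gpu_memory_gb)


meta = ModelMetadata(
    name="llama3-70b",
    architecture="decoder-only transformer",
    hyperparameters={"parameters": 70_000_000_000},
    dataset_provenance=["<dataset reference>"],  # placeholder
    min_gpu_memory_gb=160,                       # placeholder value
    compatible_runtimes=["vllm"],
)

print(json.dumps(asdict(meta), indent=2))
print(meta.can_run_on("vllm", gpu_memory_gb=320))
print(meta.can_run_on("tensorrt", gpu_memory_gb=320))
```

With metadata like this attached to the artifact, a platform can reject an incompatible deployment before pulling hundreds of gigabytes.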
Putting It Together on Kubernetes
```yaml
# Model stored in Harbor OCI registry
# Distributed via Dragonfly P2P
# Cached locally with Fluid
# Served with llm-d
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  initContainers:
    # Pull the model artifact into the shared volume before the server starts
    - name: model-pull
      image: ghcr.io/oras-project/oras:latest
      command:
        - oras
        - pull
        - registry.internal/models/llama3-70b:v1.0
        - -o
        - /models/
      volumeMounts:
        - name: model-storage
          mountPath: /models
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        # Assumes the pulled artifact unpacks to /models/llama3-70b
        - --model=/models/llama3-70b
        - --tensor-parallel-size=4
      resources:
        limits:
          nvidia.com/gpu: "4"
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      emptyDir:
        sizeLimit: 200Gi
```
My Prediction
Within two years, the standard enterprise AI deployment will use:
- ORAS/Harbor for model storage and versioning
- Dragonfly for P2P distribution to GPU clusters
- Fluid for local caching and acceleration
- llm-d for inference-aware serving
The "model as OCI artifact" pattern will be as standard as "app as container image."
About the Author
I am Luca Berton, AI and Cloud Advisor. I design model delivery pipelines for enterprise AI platforms.