Why On-Prem MLOps?
"Just use SageMaker" doesn't work when:
- Patient data can't leave the hospital network (HIPAA/GDPR)
- Defense workloads require air-gapped environments
- GPU cloud costs at scale exceed hardware ownership
- Moving your 50TB training dataset out of cloud storage runs ~$5K/month in egress fees alone
Kubeflow on Kubernetes gives you SageMaker-class ML infrastructure on hardware you control.
Architecture
On-Prem Kubernetes Cluster
├── Kubeflow Central Dashboard
├── Kubeflow Pipelines (orchestration)
├── KServe (model serving)
├── Katib (hyperparameter tuning)
├── Training Operator (distributed training)
├── MinIO (artifact storage)
├── MySQL (metadata store)
└── GPU Nodes (NVIDIA A100/H100)

Installation with Kustomize
# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Install everything
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

For production, customize the installation:
# kustomization.yaml - production overrides
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- github.com/kubeflow/manifests//common/cert-manager
- github.com/kubeflow/manifests//common/istio
- github.com/kubeflow/manifests//apps/pipeline
- github.com/kubeflow/manifests//apps/kserve
- github.com/kubeflow/manifests//apps/training-operator
- github.com/kubeflow/manifests//apps/katib
patches:
- path: patches/minio-pvc.yaml # Use real storage, not emptyDir
- path: patches/mysql-ha.yaml # HA MySQL for metadata
- path: patches/gpu-nodepool.yaml # GPU scheduling config

Building an ML Pipeline
from kfp import dsl, compiler
@dsl.component(base_image="python:3.12")
def preprocess_data(input_path: str, output_path: dsl.OutputPath()):
    import pandas as pd
    df = pd.read_parquet(input_path)
    df = df.dropna().reset_index(drop=True)
    # Feature engineering...
    df.to_parquet(output_path)

@dsl.component(base_image="pytorch/pytorch:2.5-cuda12.4")
def train_model(data_path: str, model_path: dsl.OutputPath(), epochs: int = 10):
    import torch
    model = ...  # build the model and run the training loop...
    torch.save(model.state_dict(), model_path)

@dsl.component(base_image="python:3.12")
def evaluate_model(model_path: str, test_data: str) -> float:
    accuracy = ...  # evaluation...
    return accuracy

@dsl.component
def deploy_model(model_path: str, accuracy: float):
    if accuracy > 0.95:
        # Deploy to KServe
        pass

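The deploy step above is left as a stub. One way to sketch it is to build a KServe InferenceService manifest and apply it through the Kubernetes API; the resource name, namespace, and storage URI below are illustrative assumptions, not part of the pipeline above:

```python
def inference_service_manifest(name: str, storage_uri: str,
                               namespace: str = "models") -> dict:
    # Minimal KServe v1beta1 InferenceService with a PyTorch predictor.
    # Name, namespace, and storage URI are placeholders for your environment.
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": {"pytorch": {"storageUri": storage_uri}}},
    }

# Applying it requires in-cluster credentials and the kubernetes client:
# from kubernetes import client, config
# config.load_incluster_config()
# client.CustomObjectsApi().create_namespaced_custom_object(
#     group="serving.kserve.io", version="v1beta1", namespace="models",
#     plural="inferenceservices",
#     body=inference_service_manifest("churn-model", "s3://models/churn"),
# )
```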
@dsl.pipeline(name="training-pipeline")
def ml_pipeline(data_path: str):
    preprocess = preprocess_data(input_path=data_path)
    train = train_model(data_path=preprocess.output, epochs=20)
    evaluate = evaluate_model(
        model_path=train.output,
        test_data=data_path,
    )
    deploy_model(
        model_path=train.output,
        accuracy=evaluate.output,
    )

compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")
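Compiling produces pipeline.yaml, which still has to be submitted to the Pipelines API. A hedged sketch: the in-cluster service host and dataset path below are assumptions for a default Kubeflow install, and the argument keys must match ml_pipeline's parameter names:

```python
def run_arguments(data_path: str) -> dict:
    # Run parameters, keyed by the pipeline function's argument names.
    return {"data_path": data_path}

def submit(pipeline_file: str = "pipeline.yaml",
           host: str = "http://ml-pipeline.kubeflow.svc.cluster.local:8888"):
    # Requires the kfp SDK and network access to the KFP API server;
    # the host URL above is an assumed in-cluster service address.
    import kfp
    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(
        pipeline_file,
        arguments=run_arguments("s3://datasets/train.parquet"),
    )
```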
GPU Management

# NVIDIA GPU Operator for Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: registry.internal/ml-trainer:v2
    resources:
      limits:
        nvidia.com/gpu: 4  # Request 4 GPUs
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0,1,2,3"
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB

For multi-GPU distributed training:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: trainer
            image: registry.internal/trainer:v2
            resources:
              limits:
                nvidia.com/gpu: 4
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: trainer
            image: registry.internal/trainer:v2
            resources:
              limits:
                nvidia.com/gpu: 4
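Inside each replica, the Training Operator injects the standard PyTorch rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE), so the trainer entrypoint can initialize the process group with env:// rendezvous. A minimal sketch; the helper names are my own:

```python
import os

def replica_env(environ=os.environ) -> dict:
    # Rendezvous variables the Training Operator injects into each replica.
    return {
        "rank": int(environ.get("RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "master_addr": environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(environ.get("MASTER_PORT", "29500")),
    }

def setup_distributed():
    # Run inside the PyTorchJob container; needs torch built with NCCL.
    import torch
    import torch.distributed as dist
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    return replica_env()
```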
Infrastructure Automation

I deploy the entire Kubeflow stack with Ansible:
- name: Deploy Kubeflow on-prem
  hosts: k8s_ml_cluster
  roles:
    - role: nvidia-gpu-operator
    - role: minio-ha
    - role: mysql-ha
    - role: kubeflow
  vars:
    kubeflow_version: "1.9"
    storage_class: ceph-block
    gpu_scheduling: exclusive

Kubernetes cluster provisioning at Kubernetes Recipes. Ansible automation at Ansible Pilot. Infrastructure provisioning with Terraform at Terraform Pilot.
On-Prem vs Cloud Cost (3-Year TCO)
Workload: 8 GPUs, continuous training + serving
Cloud (AWS p4d.24xlarge):
On-demand: $32.77/hr × 8,760h × 3yr ≈ $861,000
Reserved: $19.22/hr × 8,760h × 3yr ≈ $505,000
On-Prem (8× A100 server):
Hardware: $250,000
Colocation: $2,000/mo × 36mo = $72,000
Power: $1,500/mo × 36mo = $54,000
Ops (0.5 FTE): $75,000/yr × 3 = $225,000
Total: $601,000

On-prem beats on-demand pricing at sustained high utilization, though three-year reserved instances close most of the gap. Cloud wins for bursty workloads. Hybrid (on-prem base + cloud burst) is often optimal.
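The totals are straightforward arithmetic on the figures above (8,760 hours/year of continuous use); a quick sanity check:

```python
HOURS_3YR = 8_760 * 3  # hours in three years of continuous operation

def cloud_tco(hourly_rate: float) -> float:
    # Continuous usage billed at the given hourly rate for three years.
    return hourly_rate * HOURS_3YR

def onprem_tco(hardware=250_000, colo_per_month=2_000, power_per_month=1_500,
               ops_per_year=75_000, months=36, years=3) -> float:
    # Defaults taken from the cost breakdown above.
    return hardware + (colo_per_month + power_per_month) * months + ops_per_year * years

on_demand = cloud_tco(32.77)  # ≈ $861,196
reserved = cloud_tco(19.22)   # ≈ $505,102
on_prem = onprem_tco()        # $601,000
```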
The Regulated Industry Advantage
For healthcare, defense, and financial services, on-prem MLOps isn't just about cost: it's about compliance. Data never leaves your network, model provenance is fully auditable, and you control the entire stack. Kubeflow makes this enterprise-grade without building everything from scratch.
