Building an ML platform on Kubernetes usually means choosing Kubeflow, MLflow, or a combination of the two. They solve different problems, so many production setups use elements of both.
Kubeflow vs MLflow: Different Tools for Different Problems
| Capability | Kubeflow | MLflow |
|---|---|---|
| Primary focus | ML workflows on K8s | Experiment tracking & model registry |
| Pipeline orchestration | Native (KFP) | Via integrations |
| Experiment tracking | Limited | Excellent |
| Model registry | Basic | Excellent |
| Serving | KServe, Seldon | MLflow Serving (basic) |
| Notebook environment | Kubeflow Notebooks (Jupyter) | Not included |
| Kubernetes native | Yes (designed for K8s) | No (runs anywhere) |
| Complexity | High | Low-medium |
| Setup time | Days-weeks | Hours |
Architecture: Kubeflow + MLflow Together
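A common production layout uses both: Kubeflow Pipelines orchestrates the workflow, MLflow tracks experiments and registers models, and KServe serves the resulting models: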
```text
┌───────────── ML Platform on Kubernetes ──────────────┐
│                                                      │
│  ┌───────────┐   ┌───────────┐   ┌────────────────┐  │
│  │  Jupyter  │   │ Kubeflow  │   │     MLflow     │  │
│  │ Notebooks │ → │ Pipelines │ → │   Tracking +   │  │
│  │           │   │   (KFP)   │   │ Model Registry │  │
│  └───────────┘   └───────────┘   └────────────────┘  │
│                        │                             │
│                 ┌──────────────┐                     │
│                 │    KServe    │                     │
│                 │    (Model    │                     │
│                 │   Serving)   │                     │
│                 └──────────────┘                     │
│                                                      │
│  Storage: MinIO (artifacts) + PostgreSQL (metadata)  │
└──────────────────────────────────────────────────────┘
```

Kubeflow Setup on Kubernetes
Installation with kustomize
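```shell
# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install (requires kustomize). The first passes usually fail because the
# CRDs are not registered yet, so retry until everything applies cleanly.
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done
```

Kubeflow Pipeline Example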
```python
from kfp import compiler, dsl
from kfp.dsl import Dataset, Input, Output


# s3fs lets pandas read the s3:// default path; credentials must be configured separately.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "s3fs"])
def preprocess(data_path: str, processed: Output[Dataset]):
    import pandas as pd

    df = pd.read_csv(data_path)
    cleaned = df.dropna().reset_index(drop=True)
    # Write to the artifact path so KFP hands the file to the next step.
    # Each component runs in its own container, so a /tmp path would be lost.
    cleaned.to_csv(processed.path, index=False)


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["mlflow", "scikit-learn", "pandas"],
)
def train(data: Input[Dataset], model_name: str) -> str:
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier

    mlflow.set_tracking_uri("http://mlflow-server:5000")
    with mlflow.start_run():
        # Train model
        model = RandomForestClassifier(n_estimators=100)
        # ... training code (read data.path, then fit the model) ...
        mlflow.sklearn.log_model(model, model_name)
    return model_name


@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline(data_path: str = "s3://data/train.csv"):
    preprocess_task = preprocess(data_path=data_path)
    train_task = train(
        data=preprocess_task.outputs["processed"],
        model_name="fraud-detector",
    )


compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```

MLflow Setup on Kubernetes
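```yaml
# Expose this Deployment through a Service named mlflow-server on port 5000
# so that the pipeline's tracking URI (http://mlflow-server:5000) resolves.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          # The stock image may lack the PostgreSQL and S3 client libraries
          # (psycopg2, boto3); if so, build a derived image that installs them.
          image: ghcr.io/mlflow/mlflow:2.14.0
          command: ["mlflow", "server"]
          args:
            # Example credentials only; load the real password from a Secret.
            - "--backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow"
            - "--default-artifact-root=s3://mlflow-artifacts/"
            - "--host=0.0.0.0"
            - "--port=5000"
          ports:
            - containerPort: 5000
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: access-key
            # MinIO also needs AWS_SECRET_ACCESS_KEY and MLFLOW_S3_ENDPOINT_URL.
```

Model Serving with KServe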
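```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    model:
      modelFormat:
        name: mlflow
      # KServe's MLflow examples pair this model format with the V2 protocol.
      protocolVersion: v2
      # Placeholder path: MLflow writes run artifacts under
      # <experiment_id>/<run_id>/artifacts/, so point this at the model
      # directory you actually want to serve.
      storageUri: "s3://mlflow-artifacts/fraud-detector/latest"
      resources:
        limits:
          # scikit-learn models run on CPU; keep the GPU limit only for
          # frameworks that actually use one.
          nvidia.com/gpu: 1
```

Production Checklist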
- Centralized experiment tracking (MLflow)
- Reproducible pipelines (Kubeflow Pipelines)
- Model versioning and registry
- A/B testing infrastructure for model comparison
- GPU scheduling and resource quotas
- Data versioning (DVC or lakeFS)
- Monitoring for model drift
- Automated retraining triggers
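
The last two checklist items can be tied together with a simple statistical check. The sketch below is one possible approach, not something Kubeflow or MLflow provides: it computes the Population Stability Index (PSI) between a reference sample of a feature and a live sample, and flags retraining when the index passes a threshold. The function names `psi` and `should_retrain`, the 10-bin layout, and the 0.2 threshold (a common rule of thumb) are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(
    expected: Sequence[float], actual: Sequence[float], threshold: float = 0.2
) -> bool:
    """Hypothetical trigger: True when drift exceeds the assumed threshold."""
    return psi(expected, actual) > threshold


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print("retrain on identical data:", should_retrain(reference, reference))
    print("retrain on shifted data:", should_retrain(reference, shifted))
```

In a setup like the one described here, a scheduled job could run such a check per feature against recent inference inputs and start a new run of the training pipeline when it returns `True`.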