ML Platform on Kubernetes with Kubeflow and MLflow

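Building an ML platform on Kubernetes usually comes down to choosing Kubeflow, MLflow, or a combination of the two. They solve different problems: Kubeflow orchestrates ML workloads on the cluster, while MLflow tracks experiments and manages model versions, so many production setups end up using elements of both.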

Kubeflow vs MLflow: Different Tools for Different Problems

Capability	Kubeflow	MLflow
Primary focus	ML workflows on K8s	Experiment tracking & model registry
Pipeline orchestration	Native (KFP)	Via integrations
Experiment tracking	Limited	Excellent
Model registry	Basic	Excellent
Serving	KServe, Seldon	MLflow Serving (basic)
Notebook environment	Kubeflow Notebooks (Jupyter)	Not included
Kubernetes native	Yes (designed for K8s)	No (runs anywhere)
Complexity	High	Low-medium
Setup time	Days-weeks	Hours

Architecture: Kubeflow + MLflow Together

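A common pattern is to run both: Kubeflow Pipelines orchestrates training, MLflow records runs and versions models, and KServe serves the result: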

┌─────────── ML Platform on Kubernetes ───────────┐
│                                                   │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ Jupyter  │  │ Kubeflow │  │ MLflow        │  │
│  │ Notebooks│→ │ Pipelines│→ │ Tracking +    │  │
│  │          │  │ (KFP)    │  │ Model Registry│  │
│  └──────────┘  └──────────┘  └───────────────┘  │
│                      ↓                            │
│              ┌──────────────┐                     │
│              │ KServe       │                     │
│              │ (Model       │                     │
│              │  Serving)    │                     │
│              └──────────────┘                     │
│                                                   │
│  Storage: MinIO (artifacts) + PostgreSQL (metadata)│
└───────────────────────────────────────────────────┘

Kubeflow Setup on Kubernetes

Installation with kustomize

# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

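# Install (requires kustomize and kubectl access to the cluster).
# The loop retries because some resources depend on CRDs that are
# not yet registered on the first pass.
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done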
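
Once the apply loop exits, the components can still take several minutes to become ready. The following is a minimal sketch for checking that, assuming the manifests installed into the `kubeflow` namespace and that `kubectl` is on the PATH. The function names `unready_pods` and `check_namespace` are hypothetical helpers, not part of Kubeflow.

```python
import json
import subprocess


def unready_pods(pod_list: dict) -> list[str]:
    """Return names of pods that are neither completed nor fully ready."""
    unready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods do not need to be "ready"
        containers = status.get("containerStatuses", [])
        # A pod with no container statuses yet is still being scheduled
        if not containers or not all(c.get("ready") for c in containers):
            unready.append(pod["metadata"]["name"])
    return unready


def check_namespace(namespace: str = "kubeflow") -> list[str]:
    """Ask kubectl for the pods in a namespace and report the unready ones."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return unready_pods(json.loads(out))
```

Calling `check_namespace()` repeatedly until it returns an empty list is one way to gate the next setup step on a healthy install.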

Kubeflow Pipeline Example

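from kfp import dsl, compiler
from kfp.dsl import Dataset, Input, Output

# Each component runs in its own container, so its dependencies must be
# installed explicitly and data must be passed as artifacts, not local paths.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "s3fs"])
def preprocess(data_path: str, processed: Output[Dataset]):
    import pandas as pd
    # Reading an s3:// path needs s3fs plus credentials available in the pod
    df = pd.read_csv(data_path)
    cleaned = df.dropna().reset_index(drop=True)
    # Write to the artifact path managed by KFP; a file under /tmp would
    # disappear when this container exits
    cleaned.to_csv(processed.path, index=False)

@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["mlflow", "scikit-learn", "pandas"],
)
def train(data: Input[Dataset], model_name: str) -> str:
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(data.path)
    mlflow.set_tracking_uri("http://mlflow-server:5000")
    with mlflow.start_run():
        # Train model
        model = RandomForestClassifier(n_estimators=100)
        # ... training code ...
        mlflow.sklearn.log_model(model, model_name)
    return model_name

@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline(data_path: str = "s3://data/train.csv"):
    preprocess_task = preprocess(data_path=data_path)
    train_task = train(
        data=preprocess_task.outputs["processed"],
        model_name="fraud-detector"
    )

# Compile the pipeline to a YAML package that can be uploaded to KFP
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")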
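
The compiled `pipeline.yaml` still has to be submitted to the Kubeflow Pipelines API before anything runs. Below is a rough sketch of one way to do that with the KFP SDK client. The host URL, the run name, and the helper names `run_arguments` and `submit` are assumptions for illustration; a full Kubeflow install behind an auth proxy typically needs extra credentials that are not shown here.

```python
def run_arguments(data_path: str) -> dict[str, str]:
    """Map pipeline parameter names to values.

    The keys must match the parameters of training_pipeline.
    """
    return {"data_path": data_path}


def submit(
    host: str,
    package_path: str = "pipeline.yaml",
    data_path: str = "s3://data/train.csv",
):
    """Upload the compiled package and start one run on the KFP API at `host`."""
    import kfp  # imported lazily so run_arguments works without kfp installed

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(
        package_path,
        arguments=run_arguments(data_path),
        run_name="ml-training-pipeline-manual",  # hypothetical run name
    )
```

With the pipeline API port-forwarded locally, a call might look like `submit("http://localhost:8080")`, where the address is again an assumption about the local setup.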

MLflow Setup on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
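          # The stock image may lack the PostgreSQL and S3 drivers
          # (psycopg2, boto3); if so, build a derived image that adds them.
          image: ghcr.io/mlflow/mlflow:2.14.0
          command: ["mlflow", "server"]
          args:
            # Password shown inline for brevity; in production, inject it from a Secret
            - "--backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow"
            - "--default-artifact-root=s3://mlflow-artifacts/"
            - "--host=0.0.0.0"
            - "--port=5000"
          ports:
            - containerPort: 5000
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: access-key
            # Assumption: the same Secret holds the secret key under "secret-key"
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: secret-key
            # Assumed MinIO Service address; without an endpoint override
            # the S3 client targets AWS rather than MinIO
            - name: MLFLOW_S3_ENDPOINT_URL
              value: "http://minio:9000"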
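
Because the server is started with an `s3://` default artifact root, clients that log artifacts write to the bucket themselves and need the same S3 settings as the server. The sketch below shows one way a training job could configure that and log a small test run. It assumes a Service named `mlflow-server` exposes the Deployment on port 5000; the experiment name, the MinIO endpoint, and the helper names are hypothetical.

```python
import os


def minio_env(endpoint_url: str, access_key: str, secret_key: str) -> dict[str, str]:
    """Environment variables read by MLflow's S3 artifact client."""
    return {
        "MLFLOW_S3_ENDPOINT_URL": endpoint_url,
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }


def configure_artifact_access(endpoint_url: str, access_key: str, secret_key: str) -> None:
    """Apply the MinIO settings to the current process."""
    os.environ.update(minio_env(endpoint_url, access_key, secret_key))


def log_smoke_test_run(tracking_uri: str = "http://mlflow-server:5000") -> str:
    """Log a parameter and a small artifact, returning the run ID."""
    import mlflow  # imported lazily so the helpers above work without mlflow

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("platform-smoke-test")  # hypothetical experiment name
    with mlflow.start_run() as run:
        mlflow.log_param("n_estimators", 100)
        # Writing an artifact exercises the MinIO path, not just PostgreSQL
        mlflow.log_text("ok", "smoke.txt")
        return run.info.run_id
```

A job would call `configure_artifact_access(...)` with values mounted from the `minio-credentials` Secret before calling `log_smoke_test_run()`. If the run appears in the UI but the artifact is missing, the S3 settings are the first thing to check.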

Model Serving with KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
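    model:
      modelFormat:
        name: mlflow
      # KServe's MLflow runtime (MLServer) speaks the V2 inference protocol
      protocolVersion: v2
      # Must point at the logged model directory (the one containing the MLmodel file)
      storageUri: "s3://mlflow-artifacts/fraud-detector/latest"
      resources:
        limits:
          # Only needed for GPU-backed models; a scikit-learn model runs on CPU
          nvidia.com/gpu: 1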

Production Checklist

Centralized experiment tracking (MLflow)
Reproducible pipelines (Kubeflow Pipelines)
Model versioning and registry
A/B testing infrastructure for model comparison
GPU scheduling and resource quotas
Data versioning (DVC or LakeFS)
Monitoring for model drift
Automated retraining triggers
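
For the drift-monitoring item, one simple starting point is to compare the distribution of a feature at training time with what the served model sees. The sketch below computes a Population Stability Index over equal-width bins using only the standard library. PSI is a technique chosen here for illustration, not something the platform components above provide, and the bin count and smoothing constant are arbitrary assumptions.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample and a live sample of one numeric feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    low, high = min(expected), max(expected)
    if high == low:
        raise ValueError("reference sample has no spread to bin")
    width = (high - low) / bins

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Clamp so values outside the reference range land in the edge bins
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0)
        return [max(count / len(values), eps) for count in counts]

    ref, live = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(ref, live))
```

A value of zero means the binned distributions match; larger values mean the live data has moved further from the reference. What counts as large enough to trigger retraining is a policy decision for each model.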
