
Building an ML Platform on Kubernetes

An ML platform is more than deploying Kubeflow. Integrate experiment tracking, feature stores, model registries, and serving infrastructure on Kubernetes.

Luca Berton · 1 min read

Building an ML platform on Kubernetes usually starts with a choice: Kubeflow, MLflow, or a combination of the two. The tools solve different problems, so most production setups end up using elements of both.

Kubeflow vs MLflow: Different Tools for Different Problems

| Capability             | Kubeflow               | MLflow                               |
|------------------------|------------------------|--------------------------------------|
| Primary focus          | ML workflows on K8s    | Experiment tracking & model registry |
| Pipeline orchestration | Native (KFP)           | Via integrations                     |
| Experiment tracking    | Limited                | Excellent                            |
| Model registry         | Basic                  | Excellent                            |
| Serving                | KServe, Seldon         | MLflow Serving (basic)               |
| Notebook environment   | JupyterHub integration | Not included                         |
| Kubernetes native      | Yes (designed for K8s) | No (runs anywhere)                   |
| Complexity             | High                   | Low-medium                           |
| Setup time             | Days-weeks             | Hours                                |

Architecture: Kubeflow + MLflow Together

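A common production design combines them: Kubeflow Pipelines orchestrates the workflow, MLflow handles experiment tracking and the model registry, and KServe serves the resulting models: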

┌───────────── ML Platform on Kubernetes ─────────────┐
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐      │
│  │ Jupyter  │  │ Kubeflow │  │ MLflow        │      │
│  │ Notebooks│→ │ Pipelines│→ │ Tracking +    │      │
│  │          │  │ (KFP)    │  │ Model Registry│      │
│  └──────────┘  └──────────┘  └───────────────┘      │
│                      ↓                              │
│              ┌──────────────┐                       │
│              │ KServe       │                       │
│              │ (Model       │                       │
│              │  Serving)    │                       │
│              └──────────────┘                       │
│                                                     │
│  Storage: MinIO (artifacts) + PostgreSQL (metadata) │
└─────────────────────────────────────────────────────┘

Kubeflow Setup on Kubernetes

Installation with kustomize

# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install (requires kustomize)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

Kubeflow Pipeline Example

from kfp import dsl, compiler

# Each component runs in its own container, so data moves between steps
# as KFP artifacts rather than as local file paths.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "s3fs"])
def preprocess(data_path: str, processed: dsl.Output[dsl.Dataset]):
    import pandas as pd
    # Reading an s3:// URI needs s3fs plus credentials available in the pod
    df = pd.read_csv(data_path)
    cleaned = df.dropna().reset_index(drop=True)
    cleaned.to_csv(processed.path, index=False)

@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["mlflow", "scikit-learn", "pandas"],
)
def train(data: dsl.Input[dsl.Dataset], model_name: str) -> str:
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    mlflow.set_tracking_uri("http://mlflow-server:5000")
    df = pd.read_csv(data.path)  # output of the preprocess step
    with mlflow.start_run():
        # Train model
        model = RandomForestClassifier(n_estimators=100)
        # ... training code ...
        mlflow.sklearn.log_model(model, model_name)
    return model_name

@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline(data_path: str = "s3://data/train.csv"):
    preprocess_task = preprocess(data_path=data_path)
    train_task = train(
        data=preprocess_task.outputs["processed"],
        model_name="fraud-detector"
    )

# Compile the pipeline into a package that can be uploaded to Kubeflow Pipelines
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
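
Once compiled, the package can be submitted to a Kubeflow Pipelines endpoint. The following is a minimal sketch, assuming the KFP v2 SDK is installed and the API is reachable without extra authentication. The host URL, the `make_run_name` helper, and the run-name prefix are illustrative assumptions, not part of the original setup.

```python
from datetime import datetime, timezone


def make_run_name(prefix: str, now: datetime) -> str:
    """Build a sortable run name from a prefix and a timestamp."""
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


def submit_pipeline(host: str, package_path: str, data_path: str):
    """Submit a compiled pipeline package to a KFP endpoint (needs cluster access)."""
    # Imported lazily so make_run_name stays usable without the SDK installed.
    from kfp import Client

    client = Client(host=host)  # host is an assumed, deployment-specific URL
    return client.create_run_from_pipeline_package(
        package_path,
        arguments={"data_path": data_path},  # overrides the pipeline default
        run_name=make_run_name("fraud-detector", datetime.now(timezone.utc)),
    )
```

A call such as `submit_pipeline("http://localhost:8080", "pipeline.yaml", "s3://data/train.csv")` would start a run, where the localhost address stands in for a port-forwarded KFP API. Multi-user Kubeflow installations typically require additional credentials and a namespace.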

MLflow Setup on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
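          # Note: the stock image may not include the PostgreSQL and S3 client
          # libraries (psycopg2, boto3); a derived image may be needed.
          image: ghcr.io/mlflow/mlflow:2.14.0
          command: ["mlflow", "server"]
          args:
            # Demo only: keep real database credentials in a Secret, not in the manifest.
            - "--backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow"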
            - "--default-artifact-root=s3://mlflow-artifacts/"
            - "--host=0.0.0.0"
            - "--port=5000"
          ports:
            - containerPort: 5000
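          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: access-key
            # The S3 client also needs the secret key and, for MinIO, the endpoint.
            # The key name and endpoint URL below are assumptions; adjust to your setup.
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: secret-key
            - name: MLFLOW_S3_ENDPOINT_URL
              value: "http://minio:9000"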
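
The training step in the pipeline reaches the tracking server at http://mlflow-server:5000, and that name only resolves if a Service called mlflow-server fronts the Deployment. Below is a minimal sketch of such a Service, built as a Python dict and printed as JSON, which `kubectl apply -f -` accepts. It assumes the pipeline pods run in the same namespace as MLflow; the ClusterIP type and the port name are also assumptions.

```python
import json


def mlflow_service(name: str = "mlflow-server", port: int = 5000) -> dict:
    """Return a Service manifest that routes to pods labelled app=mlflow."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",  # assumption: in-cluster access only
            "selector": {"app": "mlflow"},  # matches the Deployment's pod labels
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    # Emit JSON so the output can be piped straight into kubectl.
    print(json.dumps(mlflow_service(), indent=2))
```

Saved under a hypothetical name such as `mlflow_service.py`, the script can be applied with `python mlflow_service.py | kubectl apply -f -`. If pipelines run in a different namespace, the tracking URI needs the fully qualified form `mlflow-server.<namespace>.svc.cluster.local`.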

Model Serving with KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
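    model:
      modelFormat:
        name: mlflow
      # The MLflow runtime is served over the V2 inference protocol.
      protocolVersion: v2
      # Must point at the directory that contains the logged model's MLmodel file.
      storageUri: "s3://mlflow-artifacts/fraud-detector/latest"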
      resources:
        limits:
          nvidia.com/gpu: 1
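
Once the InferenceService is ready, clients send prediction requests over HTTP. The sketch below assumes the model is exposed through the V2 (Open Inference Protocol) REST API. The base URL, the tensor name `input-0`, the FP32 datatype, and the feature values are illustrative assumptions that depend on the deployment and on the model's signature.

```python
import json
import urllib.request


def build_v2_request(rows: list[list[float]], name: str = "input-0") -> dict:
    """Pack a rectangular batch of feature rows into a V2 inference payload."""
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be a non-empty rectangular batch")
    return {
        "inputs": [
            {
                "name": name,  # assumed tensor name
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",  # assumed; must match the model's input type
                "data": [value for row in rows for value in row],  # row-major
            }
        ]
    }


def predict(base_url: str, model: str, rows: list[list[float]]) -> dict:
    """POST the payload to <base_url>/v2/models/<model>/infer (needs a live endpoint)."""
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model}/infer",
        data=json.dumps(build_v2_request(rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping the payload builder separate from the HTTP call makes the request shape easy to unit test without a running cluster.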

Production Checklist

  • Centralized experiment tracking (MLflow)
  • Reproducible pipelines (Kubeflow Pipelines)
  • Model versioning and registry
  • A/B testing infrastructure for model comparison
  • GPU scheduling and resource quotas
  • Data versioning (DVC or LakeFS)
  • Monitoring for model drift
  • Automated retraining triggers
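
As one illustration of the drift-monitoring item, the sketch below computes the Population Stability Index (PSI) between a training sample and a live sample of a single numeric feature. PSI is an example technique chosen here, not something the checklist prescribes. The bin count, the smoothing constant, and any alert threshold (0.2 is a commonly quoted rule of thumb) are assumptions to tune per feature.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to `expected`."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range live values land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # smoothing so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    # Each term is non-negative: (a - e) and log(a / e) share the same sign.
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A scheduled job could compute this per feature on recent inference inputs and use a threshold breach as one possible retraining trigger.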
