MLOps on Kubernetes: Building an Enterprise

End-to-end MLOps pipeline on Kubernetes. Kubeflow, MLflow, feature stores, model registry, CI/CD for ML, automated retraining, A/B testing, and model.

Luca Berton
· 1 min read

Data scientists build models in notebooks. The platform team’s job is to get those models to production reliably, repeatedly, and at scale.

MLOps on Kubernetes gives you the infrastructure to automate the entire lifecycle: data ingestion, feature engineering, training, validation, deployment, monitoring, and retraining. Here is the architecture that works for enterprises.

The MLOps Stack on Kubernetes

┌────────────────────────────────────────────────┐
│              Orchestration Layer               │
│  Kubeflow Pipelines / Argo Workflows           │
├──────────────┬─────────────┬───────────────────┤
│  Data        │ Training    │ Serving           │
│  ─────────   │ ─────────   │ ─────────         │
│  Feature     │ Distributed │ vLLM / NIM        │
│  Store       │ Training    │ KServe            │
│  (Feast)     │ (PyTorch)   │ Seldon            │
├──────────────┼─────────────┼───────────────────┤
│  Experiment  │ Model       │ Monitoring        │
│  Tracking    │ Registry    │ ─────────         │
│  (MLflow)    │ (MLflow)    │ Evidently AI      │
│              │             │ Prometheus        │
├──────────────┴─────────────┴───────────────────┤
│              Kubernetes Platform               │
│  GPU Operator │ Karpenter │ ArgoCD │ Vault     │
└────────────────────────────────────────────────┘

Pipeline Definition

A Kubeflow pipeline for model training and deployment:

from kfp import compiler, dsl

# pandas and pyarrow are not in the python:3.11 base image, so install them at runtime
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "pyarrow"])
def preprocess_data(input_path: str, output_path: dsl.Output[dsl.Dataset]):
    """Feature engineering and data validation."""
    import pandas as pd
    df = pd.read_parquet(input_path)
    # Validate schema, check for drift, engineer features
    df.to_parquet(output_path.path)

# GPU resources are requested on the task inside the pipeline, not in the decorator
@dsl.component(base_image="pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime")
def train_model(
    dataset: dsl.Input[dsl.Dataset],
    model: dsl.Output[dsl.Model],
    epochs: int = 10,
):
    """Distributed training with PyTorch on 4 GPUs."""
    import torch
    # Training logic here; placeholder state until a real model is trained
    model_state = {}
    torch.save(model_state, model.path)

@dsl.component(base_image="python:3.11")
def evaluate_model(
    model: dsl.Input[dsl.Model],
    test_data: dsl.Input[dsl.Dataset],
) -> float:
    """Evaluate model quality. Returns accuracy score."""
    # Evaluation logic; placeholder value until real evaluation is added
    accuracy = 0.0
    return accuracy

@dsl.component(base_image="python:3.11")
def deploy_model(model: dsl.Input[dsl.Model], accuracy: float):
    """Deploy to KServe if accuracy exceeds threshold."""
    if accuracy >= 0.95:
        # Create/update KServe InferenceService
        pass

@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline(input_data: str):
    preprocess = preprocess_data(input_path=input_data)
    train = train_model(dataset=preprocess.outputs["output_path"])
    # Request 4 NVIDIA GPUs for the training task
    train.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(4)
    evaluate = evaluate_model(
        model=train.outputs["model"],
        test_data=preprocess.outputs["output_path"],
    )
    deploy_model(
        model=train.outputs["model"],
        accuracy=evaluate.output,
    )

# Compile to a pipeline spec that can be uploaded to Kubeflow Pipelines
compiler.Compiler().compile(training_pipeline, package_path="training_pipeline.yaml")
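
The deploy step leaves the KServe call as a placeholder. A minimal sketch of the manifest it might submit is shown here, assuming the trained model sits at a storage URI that KServe can read. The function name, service name, namespace, and bucket path are hypothetical, and applying the manifest to the cluster (for example with the Kubernetes Python client) is left out:

```python
ACCURACY_THRESHOLD = 0.95  # same gate as deploy_model in the pipeline


def build_inference_service(name, storage_uri, accuracy, namespace="ml-platform"):
    """Return a KServe InferenceService manifest, or None if the gate fails.

    name, storage_uri and namespace are illustrative values (assumptions).
    """
    if accuracy < ACCURACY_THRESHOLD:
        return None  # model is not good enough to serve
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": "pytorch"},
                    "storageUri": storage_uri,
                }
            }
        },
    }


# Hypothetical model location; one call passes the gate, one does not.
manifest = build_inference_service(
    "recommendation", "s3://ml-platform/models/recommendation", accuracy=0.96
)
rejected = build_inference_service(
    "recommendation", "s3://ml-platform/models/recommendation", accuracy=0.90
)
```

Returning None below the threshold keeps the gate decision in one place, so the component only has to apply a manifest when one is returned.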

Feature Store

Feast on Kubernetes provides consistent features for training and serving:

# feature_store.yaml
project: enterprise_ml
provider: local
registry: s3://ml-platform/feast/registry.db
online_store:
  type: redis
  connection_string: "redis.ml-platform.svc.cluster.local:6379"
offline_store:
  type: file  # or bigquery, redshift, spark
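
Consistency between training and serving largely comes down to reading each feature as it was known at the time of the event. The following is a stdlib-only sketch of that point-in-time lookup. It is not Feast's API, and the feature name, values, and timestamps are made up for illustration:

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history, event_time):
    """Return the latest feature value recorded at or before event_time.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value was known yet, which avoids leaking
    future data into a training row.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history for one user: purchases in the last 7 days.
purchases_7d = [
    (datetime(2026, 1, 1), 2),
    (datetime(2026, 1, 8), 5),
    (datetime(2026, 1, 15), 1),
]

training_value = point_in_time_value(purchases_7d, datetime(2026, 1, 10))
too_early = point_in_time_value(purchases_7d, datetime(2025, 12, 31))
serving_value = point_in_time_value(purchases_7d, datetime(2026, 2, 1))
```

A training row dated January 10 sees the value written on January 8, while a serving request on February 1 sees the most recent value. Both use the same lookup rule, which is the property a feature store is meant to guarantee.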

Model Registry

MLflow tracks experiments and versions models:

import mlflow

mlflow.set_tracking_uri("http://mlflow.ml-platform.svc.cluster.local:5000")

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.96, "f1_score": 0.94})
    # `model` is the trained torch.nn.Module produced by the training step
    mlflow.pytorch.log_model(model, "model", registered_model_name="recommendation-v2")

Automated Retraining

A daily CronJob checks the served model for drift and calls the pipeline API to trigger retraining when drift exceeds the threshold:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-drift-check
spec:
  schedule: "0 6 * * *"  # Daily at 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # Job pods must set Never or OnFailure
          containers:
            - name: drift-detector
              image: ml-platform/drift-detector:latest
              env:
                - name: MODEL_ENDPOINT
                  value: "http://model-serving:8080"
                - name: DRIFT_THRESHOLD
                  value: "0.05"
                - name: RETRAIN_PIPELINE_URL
                  value: "http://kubeflow-pipelines:8888/apis/v2beta1/runs"  # KFP v2 REST path; v1 uses /apis/v1beta1/runs
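
The manifest does not show what the drift detector computes. One common choice is the Population Stability Index (PSI) over binned feature or prediction distributions. The sketch below assumes that metric, reads the same DRIFT_THRESHOLD variable as the CronJob, and only decides whether to retrain. The bin proportions are made up, and the HTTP call to the pipeline endpoint is omitted:

```python
import math
import os


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two lists of bin proportions over the same bins."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        psi += (a - e) * math.log(a / e)
    return psi


def should_retrain(expected, actual, threshold=None):
    """Compare PSI with DRIFT_THRESHOLD (default 0.05, as in the CronJob)."""
    if threshold is None:
        threshold = float(os.environ.get("DRIFT_THRESHOLD", "0.05"))
    return population_stability_index(expected, actual) > threshold


reference = [0.25, 0.25, 0.25, 0.25]  # training-time bin proportions
stable = [0.25, 0.25, 0.25, 0.25]     # production looks the same
shifted = [0.10, 0.20, 0.30, 0.40]    # production has moved
```

With these inputs, `should_retrain(reference, stable)` stays below the threshold and `should_retrain(reference, shifted)` exceeds it. In a real detector the positive case would be followed by a POST to RETRAIN_PIPELINE_URL.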

About the Author

I am Luca Berton, AI and Cloud Advisor. I build MLOps platforms for enterprises running machine learning at scale. Book a consultation.
