MLOps: Where Ansible Meets Kubeflow
MLOps pipelines need two things: reproducible ML workflows (Kubeflow) and reproducible infrastructure (Ansible). Together, they automate the entire lifecycle from data preparation to model serving.
Architecture
Ansible (Infrastructure Layer)
├── Provision GPU nodes
├── Install Kubeflow
├── Configure storage (S3/Ceph)
└── Set up monitoring

Kubeflow (ML Layer)
├── Data preparation pipeline
├── Model training
├── Evaluation & validation
└── Model deployment (KServe)

Ansible: The Infrastructure Layer
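The plays below target a `k8s_control_plane` group and a `gpu_nodes` variable. A minimal inventory sketch to make that concrete (hostnames and grouping are illustrative, not prescriptive):

```yaml
# inventory.yml - example layout; adapt hostnames to your environment
all:
  children:
    k8s_control_plane:
      hosts:
        cp1.example.internal:
    gpu_workers:
      hosts:
        gpu1.example.internal:
        gpu2.example.internal:
  vars:
    # Consumed by the GPU-labeling task below
    gpu_nodes:
      - gpu1.example.internal
      - gpu2.example.internal
```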
Install Kubeflow
---
- name: Deploy Kubeflow on Kubernetes
  hosts: k8s_control_plane
  tasks:
    - name: Add Kubeflow manifests
      ansible.builtin.git:
        repo: https://github.com/kubeflow/manifests
        dest: /opt/kubeflow-manifests
        version: v1.9.0

    - name: Install Kubeflow with kustomize
      kubernetes.core.k8s:
        state: present
        src: "{{ item }}"
      loop:
        - /opt/kubeflow-manifests/common/cert-manager/
        - /opt/kubeflow-manifests/common/istio/
        - /opt/kubeflow-manifests/apps/pipeline/

    - name: Label GPU nodes for scheduling
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Node
          metadata:
            name: "{{ item }}"
            labels:
              accelerator: nvidia-a100
      loop: "{{ gpu_nodes }}"

    - name: Install NVIDIA device plugin
      kubernetes.core.helm:
        name: nvidia-device-plugin
        chart_ref: nvidia/k8s-device-plugin
        release_namespace: kube-system

Kubeflow: The ML Layer
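One gap worth closing in the install play above: `kubernetes.core.k8s` returns as soon as the manifests are applied, but nothing waits for Kubeflow to actually come up. A hedged sketch of a readiness check (the namespace and deployment name are my assumptions about a default Kubeflow install):

```yaml
- name: Wait for Kubeflow Pipelines API to become ready
  kubernetes.core.k8s_info:
    kind: Deployment
    namespace: kubeflow
    name: ml-pipeline
  register: kfp_deploy
  # Retry until the deployment reports at least one ready replica
  until: >-
    kfp_deploy.resources | length > 0 and
    (kfp_deploy.resources[0].status.readyReplicas | default(0)) > 0
  retries: 30
  delay: 20
```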
Training Pipeline
from kfp import dsl


@dsl.component(base_image="python:3.11")
def prepare_data(dataset_path: str, output_path: dsl.Output[dsl.Dataset]):
    import pandas as pd

    df = pd.read_parquet(dataset_path)
    df_clean = df.dropna().drop_duplicates()
    df_clean.to_parquet(output_path.path)


@dsl.component(base_image="pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime")
def train_model(dataset: dsl.Input[dsl.Dataset], model: dsl.Output[dsl.Model]):
    import torch

    trained_model = ...  # Training logic here; produces a torch.nn.Module
    torch.save(trained_model.state_dict(), model.path)


@dsl.component
def evaluate_model(model: dsl.Input[dsl.Model]) -> float:
    accuracy = ...  # Evaluation logic here; computes a float metric
    return accuracy


@dsl.pipeline(name="training-pipeline")
def training_pipeline(dataset_path: str):
    data = prepare_data(dataset_path=dataset_path)
    model = train_model(dataset=data.outputs["output_path"])
    evaluation = evaluate_model(model=model.outputs["model"])

Automated Retraining with Ansible + Cron
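The cron half of this pairing can itself be managed by Ansible, so the schedule is versioned alongside everything else. A sketch with `ansible.builtin.cron` (the schedule, playbook path, and log path are hypothetical):

```yaml
- name: Schedule the drift check every six hours
  ansible.builtin.cron:
    name: "model-drift-check"
    minute: "0"
    hour: "*/6"
    job: "ansible-playbook /opt/mlops/retrain.yml >> /var/log/retrain.log 2>&1"
```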
---
- name: Trigger model retraining
  hosts: localhost
  tasks:
    - name: Check model drift
      ansible.builtin.uri:
        url: "http://monitoring.internal/api/v1/query"
        method: POST
        body_format: json
        body: '{"query": "model_accuracy_score < 0.85"}'
      register: drift_check

    - name: Trigger Kubeflow pipeline
      ansible.builtin.uri:
        url: "http://kubeflow.internal/pipeline/apis/v2beta1/runs"
        method: POST
        body_format: json
        body:
          display_name: "Automated retrain - {{ ansible_date_time.iso8601 }}"
          pipeline_version_reference:
            pipeline_id: "training-pipeline"
      when: drift_check.json.data.result | length > 0

Key Practices
- Version everything: data, code, models, and infrastructure
- Ansible for infra, Kubeflow for ML: don't mix concerns
- Automate retraining: trigger on drift, not on schedule
- Test in staging: full pipeline dry runs before production
- Track lineage: every model should trace back to its data and code
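The drift-triggered retraining rule boils down to one predicate: did the monitoring query return any matching series? A standalone sketch of that decision, mirroring the playbook's `when: drift_check.json.data.result | length > 0` condition (the response shape follows the standard Prometheus instant-query API; the sample values are illustrative):

```python
def should_retrain(prom_response: dict) -> bool:
    """Return True if the drift query matched any series.

    Equivalent to the playbook condition:
    drift_check.json.data.result | length > 0
    """
    result = prom_response.get("data", {}).get("result", [])
    return len(result) > 0


# One model dropped below the accuracy threshold -> retrain
drifted = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model": "churn-v3"}, "value": [1700000000, "0.81"]}
        ],
    },
}

# Empty result vector -> no model matched the drift condition
healthy = {"status": "success", "data": {"resultType": "vector", "result": []}}

assert should_retrain(drifted) is True
assert should_retrain(healthy) is False
```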
Building MLOps pipelines? I help teams automate the full ML lifecycle with Ansible and Kubeflow. Let's connect.
