
Disaster Recovery for Kubernetes: Enterprise

Kubernetes disaster recovery beyond Velero backups: active-active, active-passive, and pilot light patterns with RTO/RPO targets and automated failover.

Luca Berton

“We have Velero backups” is not a disaster recovery plan. DR means you can recover your entire platform to a defined state within a defined time, and you have tested it.

DR Patterns

Pattern 1: Active-Passive

Region A (Active)           Region B (Passive)
┌──────────────┐           ┌──────────────┐
│ Production   │  Velero   │ Standby      │
│ Cluster      │──backup──▶│ Cluster      │
│ Traffic: 100%│  hourly   │ Traffic: 0%  │
│              │           │ Scaled down  │
└──────────────┘           └──────────────┘
      │                          │
      └────── Global LB ─────────┘
              (DNS failover)

RTO: 15-60 minutes (scale up standby + restore state)
RPO: up to 1 hour (set by the hourly backup frequency)
Cost: ~30% of production (standby runs minimal resources)

Pattern 2: Active-Active

Region A                    Region B
┌──────────────┐           ┌──────────────┐
│ Production   │◀── sync ─▶│ Production   │
│ Cluster      │           │ Cluster      │
│ Traffic: 50% │           │ Traffic: 50% │
└──────────────┘           └──────────────┘
      │                          │
      └────── Global LB ─────────┘
              (weighted routing)

RTO: under 1 minute (traffic shifts automatically)
RPO: near-zero (synchronous or near-synchronous replication)
Cost: ~100% of production (both regions run full capacity)
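To make "traffic shifts automatically" concrete, here is a minimal sketch of how a weighted-routing layer might redistribute traffic when one region fails its health checks. The function name `rebalance_weights` and the region names are hypothetical; a real global load balancer does this internally and exposes it through its own configuration, not through code like this.

```python
def rebalance_weights(weights, healthy):
    """Redistribute traffic across healthy regions, keeping their relative share.

    weights: region name -> configured weight (hypothetical input format)
    healthy: region name -> True if the region passes its health checks
    Returns region name -> percentage of traffic.
    """
    live = {
        region: w
        for region, w in weights.items()
        if healthy.get(region, False) and w > 0
    }
    total = sum(live.values())
    if total == 0:
        # No region can take traffic: this is a full outage, not a failover.
        raise RuntimeError("no healthy region with a non-zero weight")
    return {region: 100.0 * live.get(region, 0) / total for region in weights}


configured = {"region-a": 50, "region-b": 50}
after_failure = rebalance_weights(
    configured, {"region-a": False, "region-b": True}
)
print(after_failure)  # {'region-a': 0.0, 'region-b': 100.0}
```

The surviving region absorbs 100% of the traffic, which is why both regions have to be sized for full production load in this pattern.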

Pattern 3: Pilot Light

Region A (Active)           Region B (Pilot Light)
┌──────────────┐           ┌──────────────┐
│ Production   │  ArgoCD   │ Control plane│
│ Full stack   │──sync────▶│ running      │
│ Traffic: 100%│           │ No workers   │
│              │           │ Data synced  │
└──────────────┘           └──────────────┘

RTO: 5-15 minutes (add worker nodes via Cluster API/Karpenter)
RPO: minutes (continuous data replication)
Cost: ~10-15% of production (control plane + storage only)
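One way to compare the three patterns is to encode their figures and pick the cheapest one that meets a given RTO/RPO target. The sketch below uses the worst-case end of each range quoted above. Two values are my assumptions, not figures from the text: pilot light's "minutes" RPO is modeled as 5 minutes, and active-active's "near-zero" RPO as 0. The names `Pattern` and `cheapest_pattern` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    rto_minutes: float  # worst case of the quoted RTO range
    rpo_minutes: float  # worst case of the quoted RPO
    cost_pct: float     # upper end of the quoted cost, as % of production


PATTERNS = [
    Pattern("active-passive", rto_minutes=60, rpo_minutes=60, cost_pct=30),
    # "under 1 minute" and "near-zero" modeled as 1 and 0 (assumption).
    Pattern("active-active", rto_minutes=1, rpo_minutes=0, cost_pct=100),
    # "minutes" of RPO modeled as 5 (assumption).
    Pattern("pilot-light", rto_minutes=15, rpo_minutes=5, cost_pct=15),
]


def cheapest_pattern(max_rto_minutes, max_rpo_minutes):
    """Return the lowest-cost pattern meeting both targets, or None."""
    candidates = [
        p for p in PATTERNS
        if p.rto_minutes <= max_rto_minutes
        and p.rpo_minutes <= max_rpo_minutes
    ]
    return min(candidates, key=lambda p: p.cost_pct, default=None)


print(cheapest_pattern(30, 15).name)  # pilot-light
print(cheapest_pattern(5, 1).name)    # active-active
print(cheapest_pattern(0.5, 0))       # None
```

On these numbers alone, pilot light is both cheaper and faster than active-passive, so real decisions also weigh factors the table does not capture, such as the operational cost of continuous data replication.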

Velero Backup Strategy

# Scheduled backup for production namespace
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-hourly
  namespace: velero  # Velero only watches its own install namespace (velero by default)
spec:
  schedule: "0 * * * *"
  template:
    includedNamespaces:
      - production
      - monitoring
      - ingress-system
    includedResources:
      - deployments
      - services
      - configmaps
      - secrets
      - persistentvolumeclaims
    storageLocation: s3-cross-region
    volumeSnapshotLocations:
      - ebs-cross-region
    ttl: 720h  # 30 days retention
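A schedule only protects you if backups actually complete on time. As a rough sketch, the two checks below derive how many backups the schedule above retains and whether the latest completed backup is still within the RPO. The function names and the 10-minute grace period (to allow for backup runtime) are assumptions for illustration. In practice the completion timestamp would come from the Backup object's status, which this sketch does not query.

```python
from datetime import datetime, timedelta, timezone

BACKUP_INTERVAL = timedelta(hours=1)  # schedule: "0 * * * *"
BACKUP_TTL = timedelta(hours=720)     # ttl: 720h


def retained_backups(interval=BACKUP_INTERVAL, ttl=BACKUP_TTL):
    """Approximate number of backups kept at steady state."""
    return ttl // interval


def backup_is_fresh(last_completed, now, rpo=BACKUP_INTERVAL,
                    grace=timedelta(minutes=10)):
    """True if the newest completed backup is within RPO plus a grace period."""
    return now - last_completed <= rpo + grace


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical timestamps
print(retained_backups())                                     # 720
print(backup_is_fresh(now - timedelta(minutes=55), now))      # True
print(backup_is_fresh(now - timedelta(hours=2), now))         # False
```

A stale result here means the effective RPO is already worse than the one you promised, before any disaster has happened.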

DR Runbook Template

# Disaster Recovery Runbook: Region A Failure

## Trigger Criteria
- Region A unreachable for > 5 minutes
- Multiple service health checks failing
- Cloud provider status page confirms outage

## Step 1: Confirm Outage (2 minutes)
- [ ] Verify from multiple monitoring points
- [ ] Check cloud provider status page
- [ ] Notify incident commander

## Step 2: Activate DR (5 minutes)
- [ ] Scale up Region B worker nodes: `kubectl scale ...`
- [ ] Restore latest Velero backup: `velero restore create ...`
- [ ] Verify pods are running: `kubectl get pods -A`

## Step 3: Switch Traffic (2 minutes)
- [ ] Update DNS/Global LB to route to Region B
- [ ] Verify traffic is flowing to Region B
- [ ] Monitor error rates for 5 minutes

## Step 4: Communicate
- [ ] Notify stakeholders: "Failover complete, RTO: X minutes"
- [ ] Update status page

## Rollback (when Region A recovers)
- [ ] Sync data from Region B back to Region A
- [ ] Gradually shift traffic back (10% → 50% → 100%)
- [ ] Scale down Region B to standby
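The first two trigger criteria in the runbook can be evaluated mechanically before a human confirms the outage. This is a minimal sketch, assuming each monitoring probe records when it first saw Region A fail. The function name, the probe names, and the "more than half of the probes" quorum are hypothetical choices. The provider status page check stays manual.

```python
from datetime import datetime, timedelta, timezone


def should_trigger_failover(first_failure_by_probe, total_probes, now,
                            min_outage=timedelta(minutes=5), quorum=0.5):
    """Suggest failover when most probes have seen a sustained outage.

    first_failure_by_probe: probe name -> time it first saw Region A fail.
    Probes that currently see Region A as healthy are simply absent.
    """
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    sustained = [
        probe for probe, since in first_failure_by_probe.items()
        if now - since > min_outage
    ]
    return len(sustained) / total_probes > quorum


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical timestamps
failures = {
    "probe-eu": now - timedelta(minutes=7),
    "probe-us": now - timedelta(minutes=6),
    "probe-ap": now - timedelta(minutes=1),
}
print(should_trigger_failover(failures, total_probes=3, now=now))  # True
print(should_trigger_failover(
    {"probe-eu": now - timedelta(minutes=10)}, total_probes=3, now=now
))  # False
```

Requiring agreement from several vantage points guards against failing over because a single monitoring location lost connectivity. The result is an input to the incident commander's decision, not a replacement for it.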

Testing DR

If you have not tested it, it does not work. Schedule quarterly DR tests:

  1. Tabletop exercise: walk through the runbook verbally
  2. Component test: restore a single namespace from backup
  3. Full failover test: simulate region failure, execute full runbook
  4. Chaos engineering: inject failures randomly (Chaos Mesh, Litmus)
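A test only counts if you measure it against the target. The sketch below records a drill and checks two things: whether the measured RTO met the target, and whether the next quarterly test is overdue. The `Drill` class, the 92-day quarter length, and the example timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Drill:
    kind: str             # e.g. "full-failover"
    started: datetime     # when the simulated outage began
    recovered: datetime   # when traffic was confirmed healthy again
    rto_target: timedelta

    @property
    def measured_rto(self):
        return self.recovered - self.started

    @property
    def passed(self):
        return self.measured_rto <= self.rto_target


def overdue(last_drill, today, max_gap=timedelta(days=92)):
    """True if more than roughly one quarter has passed since the last drill."""
    return today - last_drill.recovered > max_gap


drill = Drill(
    kind="full-failover",
    started=datetime(2026, 3, 1, 10, 0),
    recovered=datetime(2026, 3, 1, 10, 22),
    rto_target=timedelta(minutes=15),
)
print(drill.measured_rto)                     # 0:22:00
print(drill.passed)                           # False
print(overdue(drill, datetime(2026, 7, 1)))   # True
```

A failed drill like this one is the useful outcome: it shows the runbook takes 22 minutes, not 15, while there is still time to fix it.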

About the Author

I am Luca Berton, AI and Cloud Advisor. I design disaster recovery architectures for enterprise Kubernetes platforms.
