βWe have Velero backupsβ is not a disaster recovery plan. DR means you can recover your entire platform to a defined state within a defined time, and you have tested it.
DR Patterns
Pattern 1: Active-Passive
Region A (Active) Region B (Passive)
ββββββββββββββββ ββββββββββββββββ
β Production β Velero β Standby β
β Cluster βββbackupβββΆβ Cluster β
β Traffic: 100%β hourly β Traffic: 0% β
β β β Scaled down β
ββββββββββββββββ ββββββββββββββββ
β β
βββββββ Global LB βββββββββ
(DNS failover)RTO: 15-60 minutes (scale up standby + restore state) RPO: 1 hour (backup frequency) Cost: ~30% of production (standby runs minimal resources)
Pattern 2: Active-Active
Region A Region B
ββββββββββββββββ ββββββββββββββββ
β Production ββββ sync ββΆβ Production β
β Cluster β β Cluster β
β Traffic: 50% β β Traffic: 50% β
ββββββββββββββββ ββββββββββββββββ
β β
βββββββ Global LB βββββββββ
(weighted routing)RTO: under 1 minute (traffic shifts automatically) RPO: Near-zero (synchronous or near-synchronous replication) Cost: ~100% of production (both regions run full capacity)
Pattern 3: Pilot Light
Region A (Active) Region B (Pilot Light)
ββββββββββββββββ ββββββββββββββββ
β Production β ArgoCD β Control planeβ
β Full stack βββsyncβββββΆβ running β
β Traffic: 100%β β No workers β
β β β Data synced β
ββββββββββββββββ ββββββββββββββββRTO: 5-15 minutes (add worker nodes via Cluster API/Karpenter) RPO: Minutes (continuous data replication) Cost: ~10-15% of production (control plane + storage only)
Velero Backup Strategy
# Scheduled backup for production namespace
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-hourly
spec:
schedule: "0 * * * *"
template:
includedNamespaces:
- production
- monitoring
- ingress-system
includedResources:
- deployments
- services
- configmaps
- secrets
- persistentvolumeclaims
storageLocation: s3-cross-region
volumeSnapshotLocations:
- ebs-cross-region
ttl: 720h # 30 days retentionDR Runbook Template
# Disaster Recovery Runbook: Region A Failure
## Trigger Criteria
- Region A unreachable for > 5 minutes
- Multiple service health checks failing
- Cloud provider status page confirms outage
## Step 1: Confirm Outage (2 minutes)
- [ ] Verify from multiple monitoring points
- [ ] Check cloud provider status page
- [ ] Notify incident commander
## Step 2: Activate DR (5 minutes)
- [ ] Scale up Region B worker nodes: `kubectl scale ...`
- [ ] Restore latest Velero backup: `velero restore create ...`
- [ ] Verify pods are running: `kubectl get pods -A`
## Step 3: Switch Traffic (2 minutes)
- [ ] Update DNS/Global LB to route to Region B
- [ ] Verify traffic is flowing to Region B
- [ ] Monitor error rates for 5 minutes
## Step 4: Communicate
- [ ] Notify stakeholders: "Failover complete, RTO: X minutes"
- [ ] Update status page
## Rollback (when Region A recovers)
- [ ] Sync data from Region B back to Region A
- [ ] Gradually shift traffic back (10% β 50% β 100%)
- [ ] Scale down Region B to standbyTesting DR
If you have not tested it, it does not work. Schedule quarterly DR tests:
- Tabletop exercise β walk through the runbook verbally
- Component test β restore a single namespace from backup
- Full failover test β simulate region failure, execute full runbook
- Chaos engineering β inject failures randomly (Chaos Mesh, Litmus)
Related Resources
- Kubernetes Multi-Cluster Management
- Enterprise Kubernetes Security
- Enterprise Observability Stack
- Velero vs Kasten
- ArgoCD Cheat Sheet
About the Author
I am Luca Berton, AI and Cloud Advisor. I design disaster recovery architectures for enterprise Kubernetes platforms. Book a consultation.
