etcd Backup and Maintenance for Production Kubernetes Clusters

Protect your cluster state: automated etcd snapshots, defragmentation schedules, performance tuning, and disaster recovery procedures.

Luca Berton

Why etcd Matters

etcd stores everything in your Kubernetes cluster: every Pod, Service, Secret, ConfigMap, and RBAC rule. Lose etcd, lose your cluster.

Kubernetes API Server → etcd (single source of truth)
                         │
                         ├── All resource definitions
                         ├── All secrets (encrypted at rest only if encryption is configured)
                         ├── All RBAC policies
                         ├── All CRD instances
                         └── Lease objects (leader election)

Automated Backup

CronJob Approach

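apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # Required: Job pods cannot use the default "Always"
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - effect: NoSchedule
              operator: Exists
          initContainers:
            # Step 1: take the snapshot with etcdctl from the etcd image
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.15-0
              command:
                - etcdctl
                - snapshot
                - save
                - /backup/etcd-latest.db
                - --endpoints=https://127.0.0.1:2379
                - --cacert=/etc/kubernetes/pki/etcd/ca.crt
                - --cert=/etc/kubernetes/pki/etcd/server.crt
                - --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-dir
                  mountPath: /backup
          containers:
            # Step 2: timestamp, upload, and prune. The etcd image has no AWS CLI,
            # so this step runs in a separate image (assumed here: amazon/aws-cli).
            # S3 credentials are assumed to come from the node's IAM role.
            - name: upload
              image: amazon/aws-cli
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M%S).db
                  mv /backup/etcd-latest.db "$SNAPSHOT"
                  # Upload to S3 (only the new snapshot; aws s3 cp takes a single source)
                  aws s3 cp "$SNAPSHOT" s3://cluster-backups/etcd/
                  # Keep only last 10 local
                  ls -t /backup/etcd-*.db | tail -n +11 | while read -r f; do rm -f "$f"; done
              volumeMounts:
                - name: backup-dir
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-dir
              hostPath:
                path: /var/lib/etcd-backups
                type: DirectoryOrCreate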
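
A snapshot is only useful if it can be read back, so it is worth checking each file before relying on it. The following is a minimal sketch, assuming `etcdctl` 3.5 is installed on the host; the function name `verify_snapshot` is illustrative, and newer etcd releases move this subcommand to `etcdutl snapshot status`.

```shell
#!/bin/sh
# Returns 0 only if the snapshot file exists, is non-empty, and etcdctl can read it.
# Argument: <path to snapshot file>
verify_snapshot() {
  snapshot=$1
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Prints the snapshot's hash, revision, total keys, and size.
  etcdctl snapshot status "$snapshot" --write-out=table
}
```

For example, `verify_snapshot /var/lib/etcd-backups/etcd-20240101-000000.db` (a hypothetical file name) could run as the last step of the backup job or from a separate check.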

Restore Procedure

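# 1. Stop the API server and etcd on all control plane nodes.
#    On kubeadm clusters both run as static pods, so stopping kubelet alone
#    leaves their containers running. Move the manifests out of the way instead.
mkdir -p /etc/kubernetes/manifests-stopped
mv /etc/kubernetes/manifests/kube-apiserver.yaml \
   /etc/kubernetes/manifests/etcd.yaml \
   /etc/kubernetes/manifests-stopped/
# Wait until `crictl ps` no longer lists etcd or kube-apiserver.

# 2. Restore snapshot (on every member: same snapshot file, that member's
#    own --name and peer URL, and the full --initial-cluster list)
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=control-plane-1 \
  --initial-cluster=control-plane-1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380

# 3. Replace etcd data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd

# 4. Restart: kubelet recreates the static pods once the manifests are back
mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/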
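
The restore command above covers a single member. In a multi-member cluster, every member is restored from the same snapshot with its own `--name` and peer URL but an identical `--initial-cluster` list. The sketch below only prints the commands so they can be reviewed before running; the helper `restore_command` and the three-member layout with its names and IP addresses are hypothetical placeholders.

```shell
#!/bin/sh
# Print the restore command for one member of a multi-member cluster.
# Arguments: <snapshot> <member_name> <member_peer_url> <initial_cluster>
restore_command() {
  printf 'etcdctl snapshot restore %s --data-dir=/var/lib/etcd-restored --name=%s --initial-cluster=%s --initial-advertise-peer-urls=%s\n' \
    "$1" "$2" "$4" "$3"
}

# Hypothetical three-member layout; the list is the same for every member.
CLUSTER="control-plane-1=https://10.0.0.1:2380,control-plane-2=https://10.0.0.2:2380,control-plane-3=https://10.0.0.3:2380"

# One command per member, to be run on that member's node.
restore_command snapshot.db control-plane-1 https://10.0.0.1:2380 "$CLUSTER"
restore_command snapshot.db control-plane-2 https://10.0.0.2:2380 "$CLUSTER"
restore_command snapshot.db control-plane-3 https://10.0.0.3:2380 "$CLUSTER"
```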

Defragmentation

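etcd accumulates dead space over time: compacting old revisions and deleting keys frees pages inside the database file, but the file itself does not shrink until it is defragmented. Defragment periodically: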

# Check fragmentation (on TLS clusters, add the --cacert/--cert/--key flags from the backup command)
etcdctl endpoint status --write-out=table
# Look at DB SIZE vs DB SIZE IN USE

# Defragment (one member at a time!)
etcdctl defrag --endpoints=https://etcd-0:2379
etcdctl defrag --endpoints=https://etcd-1:2379
etcdctl defrag --endpoints=https://etcd-2:2379

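Schedule: Weekly defrag during maintenance windows. Never defrag all members simultaneously: a member blocks reads and writes while it defragments, so running it everywhere at once stalls the whole cluster.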
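
The DB SIZE versus DB SIZE IN USE comparison can be reduced to a single percentage. This is a minimal sketch, assuming the two byte counts have already been read from `etcdctl endpoint status` (the `dbSize` and `dbSizeInUse` fields of its JSON output). The function names and the 30% default threshold are illustrative choices, not etcd defaults.

```shell
#!/bin/sh
# Percentage of the database file that is reclaimable dead space.
# Arguments: <db_size_bytes> <db_size_in_use_bytes>
frag_percent() {
  db_size=$1
  in_use=$2
  if [ "$db_size" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( (db_size - in_use) * 100 / db_size ))
}

# Succeeds (exit status 0) when fragmentation meets or exceeds the threshold.
# Arguments: <db_size_bytes> <db_size_in_use_bytes> [threshold_percent, default 30]
needs_defrag() {
  threshold=${3:-30}
  [ "$(frag_percent "$1" "$2")" -ge "$threshold" ]
}
```

A member reporting a 2 GiB file with 1 GiB in use is 50% fragmented, so `needs_defrag 2147483648 1073741824` succeeds and that member is a candidate for the next maintenance window.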

Performance Tuning

# /etc/kubernetes/manifests/etcd.yaml
spec:
  containers:
    - command:
        - etcd
        - --quota-backend-bytes=8589934592    # 8GB (default 2GB)
        - --auto-compaction-retention=8h       # Compact history older than 8h
        - --auto-compaction-mode=periodic
        - --snapshot-count=10000               # Raft snapshot every 10K applied entries
        - --heartbeat-interval=250             # 250ms (latency-sensitive)
        - --election-timeout=2500              # 2.5s
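
The numbers in this manifest follow from simple arithmetic, which is worth re-checking whenever they change. The sketch below uses two illustrative helpers (both names are hypothetical): one converts GiB to the byte value that `--quota-backend-bytes` expects, and the other computes the ratio of election timeout to heartbeat interval. The values above keep the same 10:1 ratio as etcd's defaults of 100 ms and 1000 ms.

```shell
#!/bin/sh
# Convert GiB to bytes for --quota-backend-bytes.
# Argument: <size in GiB>
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Ratio of election timeout to heartbeat interval (both in milliseconds).
# Arguments: <heartbeat_interval_ms> <election_timeout_ms>
timeout_ratio() {
  echo $(( $2 / $1 ))
}

gib_to_bytes 8          # value used for --quota-backend-bytes above
timeout_ratio 250 2500  # ratio for the heartbeat and election settings above
```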

Monitoring

Metric                        Warning        Critical
DB size                       over 4GB       over 6GB
Leader changes                over 3/hour    over 10/hour
WAL fsync duration            over 50ms      over 100ms
Proposal failures             over 0         over 5/min
gRPC request duration (P99)   over 100ms     over 500ms

# PrometheusRule
- alert: EtcdDBSizeHigh
  expr: etcd_mvcc_db_total_size_in_bytes > 6e9
  for: 5m
  labels:
    severity: critical

Disaster Scenarios

Scenario               Recovery
1 of 3 members lost    Cluster keeps running (quorum intact); replace the failed member
2 of 3 members lost    Restore from snapshot (quorum lost)
All members lost       Restore from latest backup
Data corruption        Restore from backup + reconcile
Disk full              Emergency compaction + defrag
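
The "Disk full" row can be sketched as a short sequence, assuming it refers to the backend quota being exceeded and etcd raising a NOSPACE alarm. The function names are illustrative and TLS flags are omitted for brevity. If the filesystem itself has no free space, defragmentation may fail because it needs room to rewrite the database, so free disk space first in that case.

```shell
#!/bin/sh
# Extract the current store revision from `etcdctl endpoint status --write-out=json`
# output supplied on stdin.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Emergency space recovery for one member.
# Argument: <endpoint URL>
recover_space() {
  endpoint=$1
  rev=$(etcdctl --endpoints="$endpoint" endpoint status --write-out=json | extract_revision)
  if [ -z "$rev" ]; then
    echo "could not read current revision from $endpoint" >&2
    return 1
  fi
  etcdctl --endpoints="$endpoint" compact "$rev"  # drop history older than the current revision
  etcdctl --endpoints="$endpoint" defrag          # hand freed pages back to the filesystem
  etcdctl --endpoints="$endpoint" alarm disarm    # clear the alarm so writes are accepted again
}
```

Run `recover_space` against one member at a time, for the same reason regular defragmentation is done member by member.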
