Everything fails, all the time
Werner Vogels said it. Every infrastructure engineer has lived it. Disks fail, networks partition, entire availability zones go dark (as we saw with AWS Dubai). The question is not if your infrastructure will fail, but how it behaves when it does.
Resiliency is not about preventing failure; it is about designing systems that continue to function despite failure. The difference between a 5-minute blip and a 12-hour outage is almost always architectural.
The resiliency hierarchy
Think of resiliency as layers, listed here from outermost (organizational) to innermost (compute):
Layer 5: Organizational - Runbooks, on-call, post-mortems
Layer 4: Platform - Multi-AZ, auto-scaling, self-healing
Layer 3: Service - Circuit breakers, retries, bulkheads
Layer 2: Data - Replication, backups, consistency
Layer 1: Compute - Health checks, restarts, redundancy

You need all five. Most teams focus on Layers 1 and 4, neglect Layer 3, and ignore Layer 5 until an incident forces it.
Pattern 1: Circuit breakers
When a downstream service is failing, stop calling it. Hammering a dead service makes everything worse:
// Resilience4j circuit breaker
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
    return paymentClient.charge(request);
}

public PaymentResult paymentFallback(PaymentRequest request, Exception e) {
    log.warn("Payment service unavailable, queueing for retry");
    paymentQueue.enqueue(request);
    return PaymentResult.queued(request.getId());
}

Circuit breaker states:
CLOSED: Normal operation, requests flow through
  If failure rate exceeds threshold → OPEN
OPEN: All requests immediately fail/fallback
  After wait duration → HALF_OPEN
HALF_OPEN: Allow limited requests through
  If successful → CLOSED
  If failing → OPEN

# Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 100
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 10
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80

Pattern 2: Bulkhead isolation
Prevent one failing component from consuming all resources:
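// Thread pool bulkhead: isolate payment calls to 10 threads max
@Bulkhead(name = "paymentService", type = Bulkhead.Type.THREADPOOL)
public CompletableFuture<PaymentResult> processPayment(PaymentRequest req) {
    // The THREADPOOL bulkhead already runs this method on its own bounded pool,
    // so the blocking call is made here instead of on a second executor.
    return CompletableFuture.completedFuture(paymentClient.charge(req));
}

# A THREADPOOL bulkhead is configured under thread-pool-bulkhead.
# (The default semaphore type uses bulkhead.instances.<name>.maxConcurrentCalls
# and maxWaitDuration instead.)
resilience4j:
  thread-pool-bulkhead:
    instances:
      paymentService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 10
        queueCapacity: 20  # illustrative value; calls beyond the queue are rejected

In Kubernetes, bulkheads map to resource limits and namespace isolation: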
# Namespace-level resource quotas as bulkheads
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"

Team Alpha's runaway pod cannot starve the cluster. The blast radius is contained to their namespace.
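The same isolation can be sketched in plain Java without a library. The snippet below is a minimal illustration, not Resilience4j's implementation: `SimpleBulkhead` is a hypothetical class that caps concurrent calls with a `java.util.concurrent.Semaphore` and rejects callers that cannot get a slot within a wait limit.

```java
import java.time.Duration;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of Resilience4j.
final class SimpleBulkhead {
    private final Semaphore permits;
    private final Duration maxWait;

    SimpleBulkhead(int maxConcurrentCalls, Duration maxWait) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWait = maxWait;
    }

    // Runs the call if a permit frees up within maxWait; otherwise rejects it.
    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWait.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for bulkhead", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full, call rejected");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A slow dependency wrapped this way can hold at most `maxConcurrentCalls` threads; every other caller fails fast after `maxWait` instead of piling up behind it.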
Pattern 3: Retry with exponential backoff and jitter
Retries are essential but dangerous. Naive retries create thundering herds:
// WRONG: immediate retry storms
for (int i = 0; i < 3; i++) {
    try { return callService(); }
    catch (Exception e) { /* retry immediately */ }
}

// RIGHT: exponential backoff with jitter
@Retry(name = "inventoryService")
public InventoryStatus checkStock(String sku) {
    return inventoryClient.getStatus(sku);
}

resilience4j:
  retry:
    instances:
      inventoryService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true  # Jitter!
        randomizedWaitFactor: 0.5
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.BusinessValidationException

The jitter is critical. Without it, all retries from all clients hit the recovering service at the exact same intervals, creating synchronized waves of load.
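To make the effect of jitter concrete, here is a minimal sketch of how such a delay schedule could be computed. It illustrates the arithmetic and is not Resilience4j's internal code: `BackoffSchedule` and `delayMillis` are hypothetical names, and the sketch assumes the randomization factor spreads each delay uniformly within ±factor of the exponential value.

```java
import java.util.Random;

// Hypothetical helper for illustration; not a Resilience4j class.
final class BackoffSchedule {
    // Delay before retry number `attempt` (1-based), in milliseconds.
    static long delayMillis(int attempt, long baseMillis, double multiplier,
                            double jitterFactor, Random random) {
        // Exponential part: base, base * multiplier, base * multiplier^2, ...
        double exponential = baseMillis * Math.pow(multiplier, attempt - 1);
        // Jitter part: pick uniformly within +/- jitterFactor of that value
        double delta = exponential * jitterFactor;
        double min = exponential - delta;
        double max = exponential + delta;
        return Math.round(min + random.nextDouble() * (max - min));
    }
}
```

With a 500 ms base wait, a multiplier of 2, and a factor of 0.5, the first retry waits somewhere between 250 and 750 ms and the second between 500 and 1500 ms, so clients that failed at the same instant no longer retry at the same instant.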
Pattern 4: Timeout budgets
Every call should have a timeout, but individual timeouts are not enough. You also need a budget for the entire request:
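// Request budget: 3 seconds total
// If auth takes 1s and inventory takes 1.5s,
// payment only gets 0.5s before we fail the whole request
import java.time.Duration;
import java.time.Instant;

public final class RequestBudget {
    // Absolute deadline for the request handled by the current thread
    private static final ThreadLocal<Instant> deadline = new ThreadLocal<>();

    public static void start(Duration budget) {
        deadline.set(Instant.now().plus(budget));
    }

    public static Duration remaining() {
        Instant d = deadline.get();
        if (d == null) {
            return Duration.ofSeconds(30); // no budget started: fall back to a default
        }
        Duration left = Duration.between(Instant.now(), d);
        return left.isNegative() ? Duration.ZERO : left; // never report a negative budget
    }

    // Call when the request ends so pooled threads do not inherit a stale deadline
    public static void clear() {
        deadline.remove();
    }
}

// The final call is capped by whatever budget the earlier calls left over
public OrderResult createOrder(OrderRequest request) {
    RequestBudget.start(Duration.ofSeconds(3));
    try {
        User user = authService.validate(request.getToken());
        // Budget: ~2s remaining if auth took 1s
        InventoryStatus stock = inventoryService.check(request.getSku());
        // Budget: ~0.5s remaining if inventory took 1.5s
        PaymentResult payment = paymentService.charge(
            request.getAmount(),
            RequestBudget.remaining() // Use whatever is left
        );
        return new OrderResult(user, stock, payment);
    } finally {
        RequestBudget.clear();
    }
}

Pattern 5: Multi-AZ and multi-region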
Infrastructure-level resiliency:
# Kubernetes: spread pods across availability zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api-server
                topologyKey: kubernetes.io/hostname
      # containers omitted for brevity

This keeps the per-zone replica counts within one pod of each other (a hard scheduling constraint) and prefers, without requiring, that replicas land on different nodes.
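To see what `maxSkew: 1` permits, the sketch below reproduces the arithmetic as the Kubernetes documentation describes it: a zone's skew is its count of matching pods minus the smallest count in any eligible zone, and a placement is allowed only if the resulting skew stays within `maxSkew`. This is an illustration, not scheduler code; `ZoneSkew` and the zone names are hypothetical.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper for illustration; not part of any Kubernetes client.
final class ZoneSkew {
    // Skew the target zone would have once one more pod is placed in it:
    // (pods already there + the new pod) minus the current minimum across zones.
    static int skewAfterPlacing(Map<String, Integer> podsPerZone, String targetZone) {
        int globalMin = Collections.min(podsPerZone.values());
        return podsPerZone.get(targetZone) + 1 - globalMin;
    }

    // With whenUnsatisfiable: DoNotSchedule, a placement over maxSkew is refused.
    static boolean canPlace(Map<String, Integer> podsPerZone, String targetZone, int maxSkew) {
        return skewAfterPlacing(podsPerZone, targetZone) <= maxSkew;
    }
}
```

For a distribution of 3, 2, and 2 pods across three zones, a new pod may go to either zone holding 2 but not to the zone holding 3, because that would leave it two pods ahead of the least loaded zone.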
Pattern 6: Health checks that mean something
# Kubernetes: three types of health checks
livenessProbe:
  httpGet:
    path: /healthz  # Is the process alive?
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3  # Kill after 3 failures
readinessProbe:
  httpGet:
    path: /ready  # Can it serve traffic?
    port: 8080
  periodSeconds: 5
  failureThreshold: 2  # Remove from service after 2 failures
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30  # Allow 5 minutes to start
  periodSeconds: 10

The /ready endpoint should check actual dependencies:
@GetMapping("/ready")
public ResponseEntity<Map<String, String>> ready() {
    Map<String, String> checks = new LinkedHashMap<>();
    checks.put("database", checkDatabase() ? "UP" : "DOWN");
    checks.put("cache", checkRedis() ? "UP" : "DOWN");
    checks.put("queue", checkKafka() ? "UP" : "DOWN");
    boolean allUp = checks.values().stream().allMatch("UP"::equals);
    return ResponseEntity
        .status(allUp ? 200 : 503)
        .body(checks);
}

Chaos engineering: proving resiliency works
You do not know if your resiliency patterns work until you test them under failure. Chaos engineering does this systematically:
# Chaos Mesh: kill a random pod every hour
# (inline scheduler is Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  scheduler:
    cron: "@every 1h"

# Network chaos: add latency between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "500ms"
    jitter: "100ms"
  duration: "5m"

For application-level chaos, Toxiproxy lets you inject failures in your test suite.
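Toxiproxy works at the network level, between your test and a real dependency. The same idea can be tried one level up, inside the JVM, with a wrapper that injects failures into a call. The sketch below is that hand-rolled stand-in and does not use Toxiproxy's API; `FailFirst` is a hypothetical test helper that fails the first N calls with an `IOException` and then delegates to the real call.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical test helper for illustration; not part of Toxiproxy.
final class FailFirst<T> implements Callable<T> {
    private final Callable<T> delegate;
    private final AtomicInteger failuresLeft;

    FailFirst(int failures, Callable<T> delegate) {
        this.failuresLeft = new AtomicInteger(failures);
        this.delegate = delegate;
    }

    @Override
    public T call() throws Exception {
        // Consume one injected failure if any remain, otherwise pass through
        if (failuresLeft.getAndUpdate(n -> n > 0 ? n - 1 : 0) > 0) {
            throw new IOException("injected failure");
        }
        return delegate.call();
    }
}
```

Wrapping a client call in `new FailFirst<>(2, ...)` lets a test assert that a retry policy with three attempts succeeds on the last one, and that a policy with two attempts falls back.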
Measuring resiliency with SLOs
Define what "reliable enough" means with Service Level Objectives:
slos:
  api_availability:
    description: "API returns successful responses"
    target: 99.9%  # ~43 minutes of downtime per 30-day window (~8.8 hours per year)
    window: 30d
    sli: |
      sum(rate(http_requests_total{code=~"2.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  api_latency:
    description: "API responds within acceptable time"
    target: 99%
    window: 30d
    sli: |
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      < 0.5  # p95 under 500ms
  data_durability:
    description: "No data loss events"
    target: 99.999%
    window: 365d

The error budget is what makes SLOs actionable: if you have a 0.1% error budget (99.9% target) and you have used 80% of it this month, slow down deployments. If you have plenty of budget left, ship faster.
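The budget arithmetic is simple enough to sketch. The example below is a minimal illustration for a time-based availability SLO; `ErrorBudget` and its methods are hypothetical names, and it assumes downtime is measured as wall-clock time within the window.

```java
import java.time.Duration;

// Hypothetical helper for illustration.
final class ErrorBudget {
    // Total allowed downtime in the window for a target such as 0.999.
    static Duration allowedDowntime(double target, Duration window) {
        return Duration.ofMillis(Math.round(window.toMillis() * (1.0 - target)));
    }

    // Fraction of the budget already spent; 0.8 means 80% used.
    static double consumed(Duration downtimeSoFar, double target, Duration window) {
        return (double) downtimeSoFar.toMillis() / allowedDowntime(target, window).toMillis();
    }
}
```

For a 99.9% target over 30 days, the budget comes to 43 minutes 12 seconds. After 35 minutes of downtime, consumption is about 81%, which is past the 80% mark where deployments should slow down.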
The resiliency checklist
Apply these in order of impact:
- Health checks on every service (liveness + readiness)
- Timeouts on every external call (no infinite waits)
- Retries with backoff and jitter on idempotent operations
- Circuit breakers on critical dependencies
- Bulkheads between tenants and between service pools
- Multi-AZ deployment with topology spread
- Automated failover for databases and stateful services
- Chaos testing in staging, then production
- SLOs and error budgets for every user-facing service
- Runbooks for every alert that pages someone