Reliability & Operations | 90 days | 2-3 hours/day | Updated 2026-06-01

Resilience Engineering 90-Day Learning Path

Master Resilience Engineering in 90 days — fault tolerance patterns, chaos engineering, disaster recovery, and adaptive capacity. Build systems that survive the unexpected and recover fast.

What Resilience Engineering means

Resilience Engineering focuses on designing and operating systems that can absorb and recover from unexpected failures. It draws on chaos engineering, fault tolerance patterns, and adaptive capacity theory to build systems that survive hardware failures, network partitions, cascading failures, and human errors. Resilience engineers proactively explore failure modes rather than waiting for production incidents to reveal them.

Who should follow this path

SREs and reliability engineers designing fault tolerance
Architects evaluating system resilience trade-offs
Platform engineers building self-healing infrastructure
DevOps engineers preparing systems for chaos testing

Prerequisites

Production experience with distributed systems
Understanding of SLOs and reliability targets
Familiarity with cloud infrastructure services
Basic programming or scripting for automation

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

Resilience engineering theory and origins (Hollnagel)
Failure mode and effects analysis (FMEA)
Distributed systems failure taxonomy
The CAP theorem and partition tolerance
Cascading failure patterns and blast radius

Outcome: Conduct an FMEA for a distributed system and identify the top five failure modes.
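An FMEA usually ranks failure modes by a risk priority number (RPN), the product of severity, occurrence, and detection scores. The sketch below assumes the common 1-10 scale for each score. The failure modes and their scores are hypothetical examples, not data from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (almost never detected)

    @property
    def rpn(self) -> int:
        # Risk priority number: higher means "look at this first".
        return self.severity * self.occurrence * self.detection


def top_failure_modes(modes, n=5):
    """Return the n failure modes with the highest RPN, highest first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)[:n]


# Hypothetical scores for an imaginary distributed system.
MODES = [
    FailureMode("primary database node loss", 9, 3, 2),
    FailureMode("network partition between zones", 8, 4, 5),
    FailureMode("retry storm after dependency timeout", 7, 5, 6),
    FailureMode("expired TLS certificate", 8, 2, 7),
    FailureMode("misconfigured deploy", 6, 6, 4),
    FailureMode("disk full on log volume", 5, 4, 3),
]

if __name__ == "__main__":
    for mode in top_failure_modes(MODES):
        print(f"{mode.rpn:4d}  {mode.name}")
```

RPN is only a prioritization aid: a mode with a low RPN but maximum severity may still deserve attention, so treat the ranking as a starting point for discussion.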

Days 16–30: Core concepts

Fault tolerance patterns: circuit breakers, retry, bulkhead
Timeout and deadline propagation
Idempotency and at-least-once delivery
Fallback and degraded mode design
Health check and readiness probe design

Outcome: Implement circuit breaker and bulkhead patterns in a microservices application.
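Before reaching for a library such as Resilience4j, it can help to see both patterns in their smallest form. The following is a minimal, single-threaded sketch and not the Resilience4j API: the class names, the consecutive-failure threshold, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import threading
import time


class CircuitOpenError(Exception):
    """The breaker rejected the call without trying the dependency."""


class BulkheadFullError(Exception):
    """All slots reserved for this dependency are in use."""


class CircuitBreaker:
    """Opens after N consecutive failures, allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call (half-open) or too many failures (re)opens.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust all workers."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

The breaker here counts consecutive failures; production libraries typically use a failure rate over a sliding window and add locking for concurrent callers. The bulkhead fails fast rather than waiting, which keeps one slow dependency from tying up every worker.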

Days 31–45: Tools and workflows

Chaos engineering principles and safety practices
Chaos Monkey and Chaos Toolkit
Litmus Chaos for Kubernetes workloads
Steady-state hypothesis definition
Chaos experiment documentation and review

Outcome: Design and execute a safe chaos experiment with steady-state verification.
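The shape of a safe experiment can be sketched in a few lines: verify the steady state first, abort if it does not hold, inject the fault, check the hypothesis again, and always roll back. This is an illustrative sketch, not the Chaos Toolkit or Litmus API; the `system` dict and the three probe/action functions are hypothetical stand-ins for real metrics and fault injection.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    ran: bool            # False if the experiment was aborted before injection
    steady_before: bool
    steady_after: bool


def run_experiment(steady_state, inject, rollback):
    """Run one fault injection guarded by a steady-state hypothesis."""
    # Safety gate: never inject faults into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(ran=False, steady_before=False, steady_after=False)
    try:
        inject()
        steady_after = steady_state()  # does the hypothesis survive the fault?
    finally:
        rollback()                     # always undo the fault, even on error
    return ExperimentResult(ran=True, steady_before=True, steady_after=steady_after)


# Hypothetical system under test: a dict standing in for real metrics.
system = {"replicas": 3, "error_rate": 0.001}


def error_rate_below_one_percent():
    return system["error_rate"] < 0.01


def kill_one_replica():
    system["replicas"] -= 1
    # Pretend the service tolerates losing one replica but not two.
    system["error_rate"] = 0.001 if system["replicas"] >= 2 else 0.2


def restore_replicas():
    system["replicas"] = 3
    system["error_rate"] = 0.001


if __name__ == "__main__":
    print(run_experiment(error_rate_below_one_percent, kill_one_replica, restore_replicas))
```

A result with `steady_after=False` is a finding, not a failure of the experiment: it means the hypothesis was wrong and the system needs work.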

Days 46–60: Hands-on projects

Disaster recovery tiers: RTO and RPO targets
Multi-region active-active and active-passive patterns
Database replication and failover automation
Backup verification and restore testing
DR runbook design and tabletop exercises

Outcome: Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks.
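Validating RTO and RPO targets means measuring what a drill actually achieved. As a rough sketch: achieved RPO is the gap between the last recoverable write and the failure, and achieved RTO is the gap between the failure and restored service. The function names and all timestamps below are hypothetical.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_write: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: writes after the last recoverable point are lost."""
    return failure_time - last_recoverable_write


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long users waited for service to come back."""
    return service_restored - failure_time


def drill_report(last_recoverable_write, failure_time, service_restored,
                 rpo_target, rto_target):
    """Compare one DR drill against its targets."""
    rpo = achieved_rpo(last_recoverable_write, failure_time)
    rto = achieved_rto(failure_time, service_restored)
    return {
        "rpo": rpo,
        "rto": rto,
        "rpo_met": rpo <= rpo_target,
        "rto_met": rto <= rto_target,
    }


# Hypothetical drill: last replicated write 09:55, failure 10:00, restored 10:42.
REPORT = drill_report(
    last_recoverable_write=datetime(2026, 1, 10, 9, 55),
    failure_time=datetime(2026, 1, 10, 10, 0),
    service_restored=datetime(2026, 1, 10, 10, 42),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=60),
)

if __name__ == "__main__":
    print(REPORT)
```

In this made-up drill the achieved RPO is 5 minutes and the achieved RTO is 42 minutes, so both targets are met. Recording these numbers for every drill shows whether recovery is getting faster or slower over time.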

Days 61–75: Advanced practices

Adaptive capacity and system boundaries
Load shedding and backpressure patterns
Graceful degradation and feature flag kill switches
Observability for resilience (SLO burn rates, latency percentiles)
Game days for resilience validation

Outcome: Implement load shedding and graceful degradation for high-traffic failure scenarios.
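One simple way to combine the two ideas is an in-flight request limit that reserves some capacity for critical traffic and serves a cheaper degraded response to whatever it sheds. The sketch below is one possible design under those assumptions; `LoadShedder`, `critical_reserve`, and `handle` are hypothetical names, not a library API.

```python
import threading


class LoadShedder:
    """Limits in-flight requests and keeps a reserve for critical traffic."""

    def __init__(self, max_in_flight, critical_reserve=0):
        if not 0 <= critical_reserve < max_in_flight:
            raise ValueError("critical_reserve must be in [0, max_in_flight)")
        self.max_in_flight = max_in_flight
        self.critical_reserve = critical_reserve
        self.shed_count = 0
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical=False):
        # Non-critical requests may not use the reserved slots.
        limit = self.max_in_flight if critical else self.max_in_flight - self.critical_reserve
        with self._lock:
            if self._in_flight >= limit:
                self.shed_count += 1
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            if self._in_flight == 0:
                raise RuntimeError("release without matching acquire")
            self._in_flight -= 1


def handle(shedder, work, degraded, critical=False):
    """Run the full handler if admitted, otherwise a cheap degraded response."""
    if not shedder.try_acquire(critical):
        return degraded()  # e.g. cached data or a static page
    try:
        return work()
    finally:
        shedder.release()
```

For example, with `LoadShedder(max_in_flight=3, critical_reserve=1)` the third concurrent non-critical request is shed while a critical one is still admitted. Rejecting early and cheaply is the point: work that is shed at the door never consumes the resources the remaining requests need.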

Days 76–90: Portfolio, interview & certification prep

Portfolio: resilience engineering case study
Resilience engineering interview preparation
AWS Certified Solutions Architect – Professional exam prep
Building a chaos engineering program roadmap
Presenting resilience posture to stakeholders

Outcome: Present a resilience engineering portfolio demonstrating chaos experiments and DR validation.

Phase outcomes at a glance

Phase	Outcome
Days 1–15	Conduct an FMEA for a distributed system and identify the top five failure modes.
Days 16–30	Implement circuit breaker and bulkhead patterns in a microservices application.
Days 31–45	Design and execute a safe chaos experiment with steady-state verification.
Days 46–60	Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks.
Days 61–75	Implement load shedding and graceful degradation for high-traffic failure scenarios.
Days 76–90	Present a resilience engineering portfolio demonstrating chaos experiments and DR validation.

Tools to learn

Chaos Toolkit
Litmus Chaos
Gremlin
Resilience4j
Istio
AWS Fault Injection Service (FIS, formerly AWS Fault Injection Simulator)
Prometheus
Grafana
Envoy
k6

Labs to practice

Mini projects

Design and run a chaos engineering program covering five failure modes for a microservices app
Implement circuit breaker and bulkhead patterns with Resilience4j in a Java service
Build a multi-region active-passive DR setup with automated failover validation

Interview questions to prepare

What is the difference between fault tolerance and resilience?
Explain the circuit breaker pattern and when it should open.
What is a steady-state hypothesis in chaos engineering?
How do RTO and RPO differ and how do you measure them?
Describe a cascading failure you have seen or studied — what were the contributing factors?
What is load shedding and how does it protect a system under overload?
How do you safely run chaos experiments in production?

Certification suggestions

AWS Certified Solutions Architect – Professional — AWS
Certified Kubernetes Administrator (CKA) — CNCF
Google Professional Cloud Architect — Google Cloud

