Reliability & Operations | 90 days | 2-3 hours/day | updated 2026-06-01
Resilience Engineering 90-Day Learning Path
Master Resilience Engineering in 90 days — fault tolerance patterns, chaos engineering, disaster recovery, and adaptive capacity. Build systems that survive the unexpected and recover fast.
What Resilience Engineering means
Resilience Engineering focuses on designing and operating systems that can absorb and recover from unexpected failures. It draws on chaos engineering, fault tolerance patterns, and adaptive capacity theory to build systems that survive hardware failures, network partitions, cascading failures, and human errors. Resilience engineers proactively explore failure modes rather than waiting for production incidents to reveal them.
Who should follow this path
- SREs and reliability engineers designing fault tolerance
- Architects evaluating system resilience trade-offs
- Platform engineers building self-healing infrastructure
- DevOps engineers preparing systems for chaos testing
Prerequisites
- Production experience with distributed systems
- Understanding of SLOs and reliability targets
- Familiarity with cloud infrastructure services
- Basic programming or scripting for automation
The 90-day plan
Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.
Days 1–15: Foundation
- Resilience engineering theory and origins (Hollnagel)
- Failure mode and effects analysis (FMEA)
- Distributed systems failure taxonomy
- The CAP theorem and partition tolerance
- Cascading failure patterns and blast radius
Outcome: Conduct an FMEA for a distributed system and identify the top five failure modes.
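One common way to rank failure modes in an FMEA is the Risk Priority Number: score each mode from 1 to 10 for severity, occurrence, and difficulty of detection, then multiply the three. The sketch below assumes that scoring scheme; the failure modes and their scores are hypothetical examples, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: product of the three scores."""
        return self.severity * self.occurrence * self.detection


def top_failure_modes(modes, n=5):
    """Return the n failure modes with the highest RPN, highest first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)[:n]


# Hypothetical scores for an imaginary order-processing system.
MODES = [
    FailureMode("primary database failover stalls", 9, 3, 4),
    FailureMode("retry storm after dependency timeout", 8, 5, 6),
    FailureMode("expired TLS certificate", 7, 2, 3),
    FailureMode("message queue backlog", 5, 6, 3),
    FailureMode("network partition between zones", 8, 2, 5),
    FailureMode("bad configuration push", 9, 4, 5),
]

if __name__ == "__main__":
    for mode in top_failure_modes(MODES):
        print(f"{mode.rpn:4d}  {mode.name}")
```

The RPN is only a prioritization aid. A mode with very high severity may deserve attention even when its product is low, so treat the ranking as a starting point for discussion.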
Days 16–30: Core concepts
- Fault tolerance patterns: circuit breakers, retry, bulkhead
- Timeout and deadline propagation
- Idempotency and at-least-once delivery
- Fallback and degraded mode design
- Health check and readiness probe design
Outcome: Implement circuit breaker and bulkhead patterns in a microservices application.
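To make the circuit breaker pattern concrete, here is a minimal single-threaded sketch. It assumes a consecutive-failure threshold and a fixed reset timeout; the class name, parameters, and state names are illustrative and do not mirror the API of Resilience4j or any other library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock      # injectable so tests can control time
        self._failures = 0       # consecutive failures seen while closed
        self._opened_at = None   # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"   # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a remote call looks like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function. A production breaker would also need thread safety and usually a failure-rate window rather than a simple consecutive count.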
Days 31–45: Tools and workflows
- Chaos engineering principles and safety practices
- Chaos Monkey and Chaos Toolkit
- Litmus Chaos for Kubernetes workloads
- Steady-state hypothesis definition
- Chaos experiment documentation and review
Outcome: Design and execute a safe chaos experiment with steady-state verification.
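The shape of a safe experiment can be sketched as a small harness: verify the steady state first, inject the fault, verify again, and always roll back. This is a tool-agnostic illustration, not the Chaos Toolkit or Litmus API; `run_experiment` and the callables passed to it are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool              # False if the system was unhealthy before injection
    hypothesis_held: bool  # steady state still true while the fault was active
    rolled_back: bool


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Safety gate: never inject a fault into a system that is already degraded.
    if not steady_state():
        return ExperimentResult(ran=False, hypothesis_held=False, rolled_back=False)
    try:
        inject()
        held = steady_state()
    finally:
        # Roll back even if the injection or the probe raised.
        rollback()
    return ExperimentResult(ran=True, hypothesis_held=held, rolled_back=True)


if __name__ == "__main__":
    # Simulated system: the hypothesis is "at least two replicas are serving".
    system = {"replicas": 3}
    result = run_experiment(
        steady_state=lambda: system["replicas"] >= 2,
        inject=lambda: system.update(replicas=system["replicas"] - 1),
        rollback=lambda: system.update(replicas=3),
    )
    print(result)
```

In a real experiment the steady-state probe would query a metric such as success rate or latency, and the harness would also abort early when that metric crosses a stop threshold.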
Days 46–60: Hands-on projects
- Disaster recovery tiers: RTO and RPO targets
- Multi-region active-active and active-passive patterns
- Database replication and failover automation
- Backup verification and restore testing
- DR runbook design and tabletop exercises
Outcome: Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks.
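Validating RTO and RPO targets comes down to comparing timestamps from a recovery drill. The sketch below assumes three recorded moments: the last recoverable point (latest backup or replicated write), the failure, and the moment service was restored. The function name and the drill data are hypothetical.

```python
from datetime import datetime, timedelta


def measure_drill(last_recoverable_point, failure_at, restored_at, rto_target, rpo_target):
    """Compare achieved recovery time and data-loss window against targets."""
    if not last_recoverable_point <= failure_at <= restored_at:
        raise ValueError("timestamps must be ordered: recoverable point, failure, restore")
    achieved_rpo = failure_at - last_recoverable_point  # data written in this window is lost
    achieved_rto = restored_at - failure_at             # time the service was unavailable
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }


if __name__ == "__main__":
    # Hypothetical drill: backup at 09:55, failure at 10:00, restored at 10:42.
    report = measure_drill(
        last_recoverable_point=datetime(2026, 1, 10, 9, 55),
        failure_at=datetime(2026, 1, 10, 10, 0),
        restored_at=datetime(2026, 1, 10, 10, 42),
        rto_target=timedelta(minutes=30),
        rpo_target=timedelta(minutes=15),
    )
    for key, value in report.items():
        print(f"{key}: {value}")
```

A target counts as validated only when a drill like this meets it repeatedly; a single passing run mostly shows that the runbook worked once.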
Days 61–75: Advanced practices
- Adaptive capacity and system boundaries
- Load shedding and backpressure patterns
- Graceful degradation and feature flag kill switches
- Observability for resilience (SLO burn rates, latency percentiles)
- Game days for resilience validation
Outcome: Implement load shedding and graceful degradation for high-traffic failure scenarios.
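Load shedding can be as simple as capping concurrent requests and reserving part of that capacity for critical traffic. The following sketch assumes an in-flight request limit as the overload signal; `LoadShedder`, `Overloaded`, and the `critical` flag are illustrative names, and real systems often shed on queue depth or latency instead.

```python
import threading
from contextlib import contextmanager


class Overloaded(RuntimeError):
    """Raised when a request is shed instead of being processed."""


class LoadShedder:
    def __init__(self, max_in_flight, critical_reserve):
        if not 0 <= critical_reserve < max_in_flight:
            raise ValueError("critical_reserve must be in [0, max_in_flight)")
        self._max = max_in_flight
        self._reserve = critical_reserve  # slots only critical requests may use
        self._in_flight = 0
        self._lock = threading.Lock()

    @property
    def in_flight(self):
        with self._lock:
            return self._in_flight

    @contextmanager
    def admit(self, critical=False):
        # Non-critical traffic is rejected earlier, keeping headroom for critical calls.
        limit = self._max if critical else self._max - self._reserve
        with self._lock:
            if self._in_flight >= limit:
                raise Overloaded("at capacity; shedding request")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1
```

A request handler would wrap its work in `with shedder.admit(critical=is_checkout):` and translate `Overloaded` into a fast rejection such as HTTP 503 with a `Retry-After` header, so that clients back off instead of queueing.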
Days 76–90: Portfolio, interview & certification prep
- Portfolio: resilience engineering case study
- Resilience engineering interview preparation
- AWS Certified Solutions Architect – Professional exam prep
- Building a chaos engineering program roadmap
- Presenting resilience posture to stakeholders
Outcome: Present a resilience engineering portfolio demonstrating chaos experiments and DR validation.
Phase outcomes at a glance
| Phase | Outcome |
|---|---|
| Days 1–15 | Conduct an FMEA for a distributed system and identify the top five failure modes. |
| Days 16–30 | Implement circuit breaker and bulkhead patterns in a microservices application. |
| Days 31–45 | Design and execute a safe chaos experiment with steady-state verification. |
| Days 46–60 | Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks. |
| Days 61–75 | Implement load shedding and graceful degradation for high-traffic failure scenarios. |
| Days 76–90 | Present a resilience engineering portfolio demonstrating chaos experiments and DR validation. |
Tools to learn
- Chaos Toolkit
- Litmus Chaos
- Gremlin
- Resilience4j
- Istio
- AWS Fault Injection Service (formerly AWS Fault Injection Simulator)
- Prometheus
- Grafana
- Envoy
- k6
Labs to practice
- Chaos Engineering Experiment
- Prometheus and Grafana Monitoring
- Kubernetes Deployment with Helm
- OpenTelemetry Tracing
Mini projects
- Design and run a chaos engineering program covering five failure modes for a microservices app
- Implement circuit breaker and bulkhead patterns with Resilience4j in a Java service
- Build a multi-region active-passive DR setup with automated failover validation
Interview questions to prepare
- What is the difference between fault tolerance and resilience?
- Explain the circuit breaker pattern and when it should open.
- What is a steady-state hypothesis in chaos engineering?
- How do RTO and RPO differ and how do you measure them?
- Describe a cascading failure you have seen or studied — what were the contributing factors?
- What is load shedding and how does it protect a system under overload?
- How do you safely run chaos experiments in production?
Certification suggestions
- AWS Certified Solutions Architect – Professional — AWS
- Certified Kubernetes Administrator (CKA) — CNCF
- Google Professional Cloud Architect — Google Cloud
Free resources
- Chaos Engineering Book (free online)
- Litmus Chaos Documentation
- AWS Fault Injection Service (formerly Fault Injection Simulator) Docs
- Netflix Chaos Engineering Blog
- Resilience4j Documentation
Related tool categories
- Chaos Engineering Tools
- SRE Tools
- Monitoring Tools
- Observability Tools
- Backup and Disaster Recovery Tools