Reliability & Operations | 90 days | 2-3 hours/day | updated 2026-06-01

Resilience Engineering 90-Day Learning Path

Master Resilience Engineering in 90 days — fault tolerance patterns, chaos engineering, disaster recovery, and adaptive capacity. Build systems that survive the unexpected and recover fast.

What Resilience Engineering means

Resilience Engineering focuses on designing and operating systems that can absorb and recover from unexpected failures. It draws on chaos engineering, fault tolerance patterns, and adaptive capacity theory to build systems that survive hardware failures, network partitions, cascading failures, and human errors. Resilience engineers proactively explore failure modes rather than waiting for production incidents to reveal them.

Who should follow this path

  • SREs and reliability engineers designing fault tolerance
  • Architects evaluating system resilience trade-offs
  • Platform engineers building self-healing infrastructure
  • DevOps engineers preparing systems for chaos testing

Prerequisites

  • Production experience with distributed systems
  • Understanding of SLOs and reliability targets
  • Familiarity with cloud infrastructure services
  • Basic programming or scripting for automation

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

  • Resilience engineering theory and origins (Hollnagel)
  • Failure mode and effects analysis (FMEA)
  • Distributed systems failure taxonomy
  • The CAP theorem and partition tolerance
  • Cascading failure patterns and blast radius

Outcome: Conduct an FMEA for a distributed system and identify the top five failure modes.
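
One common way to rank failure modes in an FMEA is the risk priority number (RPN), the product of severity, occurrence, and detection scores. The sketch below assumes 1 to 10 scales; the failure modes and their scores are invented for illustration and should be replaced with values from your own analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (effectively undetectable)

    @property
    def rpn(self) -> int:
        # Risk priority number: severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection


def top_failure_modes(modes: List[FailureMode], n: int = 5) -> List[FailureMode]:
    """Return the n failure modes with the highest RPN, highest first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)[:n]


# Hypothetical scores for an imaginary service; not measured data.
MODES = [
    FailureMode("primary database failover stalls", 9, 3, 4),
    FailureMode("retry storm after cache eviction", 7, 5, 6),
    FailureMode("expired TLS certificate", 8, 2, 3),
    FailureMode("network partition between zones", 8, 3, 5),
    FailureMode("message queue consumer lag", 5, 6, 4),
    FailureMode("bad configuration push", 9, 4, 5),
]

if __name__ == "__main__":
    for mode in top_failure_modes(MODES):
        print(f"{mode.rpn:4d}  {mode.name}")
```

RPN is only a prioritization aid. A high-severity mode with a low RPN may still deserve attention, so review the ranked list rather than applying it mechanically.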

Days 16–30: Core concepts

  • Fault tolerance patterns: circuit breakers, retry, bulkhead
  • Timeout and deadline propagation
  • Idempotency and at-least-once delivery
  • Fallback and degraded mode design
  • Health check and readiness probe design

Outcome: Implement circuit breaker and bulkhead patterns in a microservices application.
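
As a starting point for this outcome, here is a minimal circuit breaker sketch in plain Python. It is not the Resilience4j API; the class name, thresholds, and the consecutive-failure rule are assumptions chosen for brevity. The breaker opens after a number of consecutive failures, fails fast while open, and allows a trial call once the reset timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open, or too many consecutive
            # failures while closed, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

This sketch is not thread-safe and lets every caller through while half-open. Production libraries typically add locking, a sliding failure-rate window, and a limit on trial calls.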

Days 31–45: Tools and workflows

  • Chaos engineering principles and safety practices
  • Chaos Monkey and Chaos Toolkit
  • Litmus Chaos for Kubernetes workloads
  • Steady-state hypothesis definition
  • Chaos experiment documentation and review

Outcome: Design and execute a safe chaos experiment with steady-state verification.
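
The shape of a safe experiment can be sketched as: verify the steady state, inject the fault, verify again, and always roll back. The code below is a tool-agnostic illustration, not the Chaos Toolkit or Litmus API. The `probe`, `inject`, and `rollback` callables and the tolerance values are hypothetical placeholders.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


def within_tolerance(metrics: Metrics) -> bool:
    # Hypothetical steady-state definition: at most 1% errors
    # and a p99 latency of at most 500 ms.
    return metrics["error_rate"] <= 0.01 and metrics["p99_latency_ms"] <= 500


def run_experiment(
    probe: Callable[[], Metrics],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    hypothesis: Callable[[Metrics], bool] = within_tolerance,
) -> str:
    # Safety check: never inject a fault into a system that is already unhealthy.
    if not hypothesis(probe()):
        return "aborted"
    try:
        inject()
        held = hypothesis(probe())
    finally:
        # Roll back even if injection or probing raised an exception.
        rollback()
    return "hypothesis held" if held else "hypothesis violated"
```

A "hypothesis violated" result is a finding, not a failed experiment: it shows where the system does not yet tolerate the injected fault.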

Days 46–60: Hands-on projects

  • Disaster recovery tiers: RTO and RPO targets
  • Multi-region active-active and active-passive patterns
  • Database replication and failover automation
  • Backup verification and restore testing
  • DR runbook design and tabletop exercises

Outcome: Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks.
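
To validate targets, compare what a drill actually achieved against them. The sketch below assumes you record three timestamps during a restore test: the last write that survived recovery, the start of the outage, and the moment service was restored. The function name and the sample timestamps are illustrative, not taken from a real drill.

```python
from datetime import datetime, timedelta


def evaluate_drill(
    last_recoverable_write: datetime,
    outage_start: datetime,
    service_restored: datetime,
    rto_target: timedelta,
    rpo_target: timedelta,
) -> dict:
    # Achieved RPO: how much data (measured in time) was lost.
    achieved_rpo = outage_start - last_recoverable_write
    # Achieved RTO: how long the service was unavailable.
    achieved_rto = service_restored - outage_start
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }


# Hypothetical drill: 4 minutes of data lost, 42 minutes of downtime.
report = evaluate_drill(
    last_recoverable_write=datetime(2026, 1, 10, 11, 56),
    outage_start=datetime(2026, 1, 10, 12, 0),
    service_restored=datetime(2026, 1, 10, 12, 42),
    rto_target=timedelta(minutes=30),
    rpo_target=timedelta(minutes=5),
)
```

In this invented example the RPO target is met but the RTO target is not, which would send the runbook back for another iteration.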

Days 61–75: Advanced practices

  • Adaptive capacity and system boundaries
  • Load shedding and backpressure patterns
  • Graceful degradation and feature flag kill switches
  • Observability for resilience (SLO burn rates, latency percentiles)
  • Game days for resilience validation
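
For the SLO burn rate item above, a small worked calculation may help. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO window. The function names and the 30-day window are assumptions for this sketch.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# 144 failed requests out of 10,000 against a 99.9% SLO:
# error rate 1.44%, budget 0.1%, so the burn rate is about 14.4.
rate = burn_rate(bad_events=144, total_events=10_000, slo_target=0.999)
```

At that rate a 30-day budget would last roughly 50 hours, which is why high burn rates over short windows are a common paging signal.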

Outcome: Implement load shedding and graceful degradation for high-traffic failure scenarios.
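
One simple form of load shedding is a concurrency limit with priorities: ordinary requests are rejected once a soft limit is reached, while critical requests are admitted up to a higher hard limit. The sketch below is one possible design under that assumption; the class name, the two-tier priority model, and the limits are illustrative.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when a request is shed instead of being queued."""


class LoadShedder:
    def __init__(self, soft_limit: int, hard_limit: int):
        if soft_limit > hard_limit:
            raise ValueError("soft_limit must not exceed hard_limit")
        self.soft_limit = soft_limit   # cap for ordinary requests
        self.hard_limit = hard_limit   # cap for critical requests
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def admit(self, critical: bool = False):
        limit = self.hard_limit if critical else self.soft_limit
        with self._lock:
            if self._in_flight >= limit:
                # Reject immediately; queueing would only add latency under overload.
                raise Overloaded("request shed: too many requests in flight")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1
```

A handler would wrap its work in `with shedder.admit(): ...` and translate `Overloaded` into a fast rejection such as HTTP 503, ideally with a hint to the client to back off before retrying.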

Days 76–90: Portfolio, interview & certification prep

  • Portfolio: resilience engineering case study
  • Resilience engineering interview preparation
  • AWS Certified Solutions Architect – Professional exam prep
  • Building a chaos engineering program roadmap
  • Presenting resilience posture to stakeholders

Outcome: Present a resilience engineering portfolio demonstrating chaos experiments and DR validation.

Phase outcomes at a glance

Phase: Outcome
Days 1–15: Conduct an FMEA for a distributed system and identify the top five failure modes.
Days 16–30: Implement circuit breaker and bulkhead patterns in a microservices application.
Days 31–45: Design and execute a safe chaos experiment with steady-state verification.
Days 46–60: Design a disaster recovery plan with validated RTO/RPO targets and tested runbooks.
Days 61–75: Implement load shedding and graceful degradation for high-traffic failure scenarios.
Days 76–90: Present a resilience engineering portfolio demonstrating chaos experiments and DR validation.

Tools to learn

  • Chaos Toolkit
  • Litmus Chaos
  • Gremlin
  • Resilience4j
  • Istio
  • AWS Fault Injection Service (formerly AWS Fault Injection Simulator)
  • Prometheus
  • Grafana
  • Envoy
  • k6

Labs to practice

Mini projects

  • Design and run a chaos engineering program covering five failure modes for a microservices app
  • Implement circuit breaker and bulkhead patterns with Resilience4j in a Java service
  • Build a multi-region active-passive DR setup with automated failover validation

Interview questions to prepare

  1. What is the difference between fault tolerance and resilience?
  2. Explain the circuit breaker pattern and when it should open.
  3. What is a steady-state hypothesis in chaos engineering?
  4. How do RTO and RPO differ and how do you measure them?
  5. Describe a cascading failure you have seen or studied — what were the contributing factors?
  6. What is load shedding and how does it protect a system under overload?
  7. How do you safely run chaos experiments in production?

Certification suggestions

  • AWS Certified Solutions Architect – Professional — AWS
  • Certified Kubernetes Administrator (CKA) — CNCF
  • Google Professional Cloud Architect — Google Cloud

Browse the full certification registry for exam details and official links.

Free resources

Prefer live, guided training with mentors and certification support? DevOpsSchool.com runs paid instructor-led programs that pair well with this free path.