Reliability & Operations · 90 days · 2-3 hours/day · Updated 2026-06-01

Chaos Engineering 90-Day Learning Path

Master Chaos Engineering in 90 days — experiment design, blast radius control, steady-state hypothesis testing, and chaos tooling. Build confidence in system resilience through controlled failure injection.

What Chaos Engineering means

Chaos Engineering is the discipline of intentionally injecting failures into systems to uncover weaknesses that would otherwise only emerge during real incidents. Practitioners define a steady-state hypothesis, inject targeted failures, observe system behavior, and compare against the hypothesis. The result is empirical confidence in resilience — not just assumed reliability.

Who should follow this path

SREs validating system resilience assumptions
Platform engineers testing infrastructure reliability
DevOps engineers hardening microservices architectures
Architects verifying distributed system fault tolerance

Prerequisites

Experience with production distributed systems
Basic understanding of SLOs and reliability targets
Familiarity with Kubernetes and containerized workloads
Understanding of monitoring and observability tools

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

Chaos engineering principles (Netflix origins and Principles of Chaos)
Steady-state hypothesis definition and measurement
Blast radius control and safety controls
Chaos experiment lifecycle (hypothesis, inject, observe, conclude)
Chaos program maturity model

Outcome: Write a chaos experiment plan with a defined steady-state hypothesis and safety controls.
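
The experiment lifecycle listed above (hypothesis, inject, observe, conclude) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SteadyState class, the run_experiment function, and the simulated error_rate metric are hypothetical names invented here, and a real experiment would read the metric from a monitoring system rather than from a dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """A measurable claim about normal behavior (hypothetical helper)."""
    description: str
    measure: Callable[[], float]  # returns the current value of the metric
    max_allowed: float            # tolerance: higher values violate the hypothesis

    def holds(self) -> bool:
        return self.measure() <= self.max_allowed


def run_experiment(hypothesis: SteadyState,
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Hypothesis, inject, observe, conclude; rollback runs even on errors."""
    # Safety control: never inject into a system that is already unhealthy.
    if not hypothesis.holds():
        return "aborted: steady state not met before injection"
    try:
        inject()
        # Observe: re-check the same metric while the fault is active.
        survived = hypothesis.holds()
    finally:
        rollback()
    return "passed: hypothesis held" if survived else "failed: weakness found"


# Simulated system (assumption): the error rate jumps when a dependency is down.
state = {"dependency_up": True}


def error_rate() -> float:
    return 0.002 if state["dependency_up"] else 0.08


hypothesis = SteadyState("error rate stays at or below 1%", error_rate, 0.01)
result = run_experiment(
    hypothesis,
    inject=lambda: state.update(dependency_up=False),
    rollback=lambda: state.update(dependency_up=True),
)
print(result)  # prints "failed: weakness found"
```

A "failed" result is the useful one here: it means the experiment found a weakness (no fallback for the dependency) before a real incident did.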

Days 16–30: Core concepts

Chaos Monkey and the (now retired) Simian Army tools
Chaos Toolkit framework and experiment YAML
Litmus Chaos operator for Kubernetes
Gremlin platform overview
AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)

Outcome: Run a pod termination chaos experiment on Kubernetes using Chaos Toolkit or Litmus.
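
As a starting point for the pod termination outcome, the sketch below builds a Chaos Toolkit experiment document in Python and prints it as JSON (Chaos Toolkit accepts both JSON and YAML). The module and function names (chaosk8s.pod.probes.pods_in_phase and chaosk8s.pod.actions.terminate_pods) reflect my understanding of the chaostoolkit-kubernetes extension and should be checked against the version you install. The label app=checkout and the namespace staging are placeholders, not values from this path.

```python
import json

# Placeholder scope (assumption): adjust to a non-production workload you own.
TARGET_LABEL = "app=checkout"
TARGET_NAMESPACE = "staging"

experiment = {
    "version": "1.0.0",
    "title": "Service keeps its pods Running when one pod is terminated",
    "description": "Terminate one matching pod and re-check pod health.",
    # Chaos Toolkit evaluates the hypothesis before and after the method.
    "steady-state-hypothesis": {
        "title": "Application pods are Running",
        "probes": [
            {
                "type": "probe",
                "name": "app-pods-running",
                "tolerance": True,  # the probe must return true
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.pod.probes",
                    "func": "pods_in_phase",
                    "arguments": {
                        "label_selector": TARGET_LABEL,
                        "phase": "Running",
                        "ns": TARGET_NAMESPACE,
                    },
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                # qty=1 keeps the blast radius to a single pod.
                "arguments": {
                    "label_selector": TARGET_LABEL,
                    "ns": TARGET_NAMESPACE,
                    "qty": 1,
                    "rand": True,
                },
            },
            # Give the Deployment time to reschedule before the re-check.
            "pauses": {"after": 30},
        }
    ],
    "rollbacks": [],
}

rendered = json.dumps(experiment, indent=2)
print(rendered)
```

Saving the output as experiment.json and running it with "chaos run experiment.json" should execute it, assuming chaostoolkit and chaostoolkit-kubernetes are installed and your kubeconfig points at a test cluster.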

Days 31–45: Tools and workflows

Network chaos: latency, packet loss, and partitions
Resource chaos: CPU throttling and memory pressure
Application chaos: exception injection and latency
Infrastructure chaos: node termination and AZ failure
Blast radius scoping with labels and namespaces

Outcome: Execute network latency and resource exhaustion experiments with controlled blast radius.
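
Blast radius scoping with labels and namespaces comes down to two filters plus a cap. The sketch below shows the idea on in-memory data; the Pod class and select_targets function are hypothetical, and tools such as Litmus or Chaos Toolkit provide their own selectors for the same purpose.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Minimal stand-in for a Kubernetes pod (hypothetical)."""
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


def select_targets(pods, namespace, selector, max_fraction, seed=None):
    """Pick fault targets in one namespace that match every label in selector."""
    eligible = [
        p for p in pods
        if p.namespace == namespace
        and all(p.labels.get(k) == v for k, v in selector.items())
    ]
    # Cap the blast radius; rounding down means the cap is never exceeded.
    limit = math.floor(len(eligible) * max_fraction)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(eligible, limit)


pods = [
    Pod("checkout-1", "staging", {"app": "checkout"}),
    Pod("checkout-2", "staging", {"app": "checkout"}),
    Pod("checkout-3", "staging", {"app": "checkout"}),
    Pod("checkout-4", "staging", {"app": "checkout"}),
    Pod("checkout-prod-1", "prod", {"app": "checkout"}),
    Pod("cart-1", "staging", {"app": "cart"}),
]

targets = select_targets(pods, "staging", {"app": "checkout"},
                         max_fraction=0.25, seed=7)
print([p.name for p in targets])
```

With four eligible pods and a 25% cap, exactly one staging checkout pod is selected; the prod pod and the cart pod can never be chosen. Rounding down is a deliberately conservative choice: a small replica set may yield zero targets, which is safer than silently exceeding the cap.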

Days 46–60: Hands-on projects

Observability during chaos experiments
Correlating chaos events with SLO burn rates
Automated experiment analysis and pass/fail gates
Regression testing with continuous chaos
Chaos experiment result documentation

Outcome: Automate chaos experiment execution and link results to SLO impact metrics.
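
To link an experiment to SLO impact, a common measure is the error-budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below computes it for one experiment window and applies a pass/fail gate. The request counts, the 99.9% target, and the 2.0 threshold are illustrative assumptions, not recommendations; in practice the counts would come from a metrics backend such as Prometheus.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic observed, so no budget was consumed
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def experiment_passes(failed: int, total: int,
                      slo_target: float, max_burn_rate: float) -> bool:
    """Gate: pass only if the fault burned budget no faster than allowed."""
    return burn_rate(failed, total, slo_target) <= max_burn_rate


# Illustrative numbers for the window in which the fault was active.
window = {"failed": 50, "total": 10_000}

rate = burn_rate(window["failed"], window["total"], slo_target=0.999)
passed = experiment_passes(window["failed"], window["total"],
                           slo_target=0.999, max_burn_rate=2.0)
print(f"burn rate {rate:.1f}x, gate {'passed' if passed else 'failed'}")
# prints "burn rate 5.0x, gate failed"
```

In an automated run, the script could exit with a non-zero status when the gate fails, so the scheduler or pipeline marks the run as failed.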

Days 61–75: Advanced practices

GameDay design and facilitation
Chaos engineering program governance
Chaos experiments in CI/CD pipelines
Cross-team chaos coordination
Security chaos: credential rotation and IAM disruption

Outcome: Run a company-wide GameDay and integrate chaos experiments into a staging pipeline.

Days 76–90: Portfolio, interview & certification prep

Portfolio: chaos engineering program case study
Chaos engineering interview preparation
AWS Certified Solutions Architect Professional exam prep
Contributing to Litmus Chaos open source
Building a chaos engineering charter and roadmap

Outcome: Present a mature chaos engineering program and demonstrate quantified resilience improvements.

Phase outcomes at a glance

Phase	Outcome
Days 1–15	Write a chaos experiment plan with a defined steady-state hypothesis and safety controls.
Days 16–30	Run a pod termination chaos experiment on Kubernetes using Chaos Toolkit or Litmus.
Days 31–45	Execute network latency and resource exhaustion experiments with controlled blast radius.
Days 46–60	Automate chaos experiment execution and link results to SLO impact metrics.
Days 61–75	Run a company-wide GameDay and integrate chaos experiments into a staging pipeline.
Days 76–90	Present a mature chaos engineering program and demonstrate quantified resilience improvements.

Tools to learn

Chaos Toolkit
Litmus Chaos
Gremlin
AWS Fault Injection Service (formerly Fault Injection Simulator)
Chaos Monkey
Prometheus
Grafana
Kubernetes
k6
Envoy

Labs to practice

Mini projects

Design and execute a GameDay covering five failure modes for a Kubernetes microservices app
Build a continuous chaos testing pipeline that runs nightly in staging
Create a chaos experiment library covering network, resource, and application fault categories

Interview questions to prepare

What is the steady-state hypothesis and why is it central to chaos engineering?
How do you control blast radius when running chaos experiments in production?
What is the difference between Chaos Toolkit and Litmus Chaos?
How do you decide which failure modes to prioritize in a new chaos program?
What safety controls do you put in place before running chaos experiments?
How do you automate chaos experiments in a CI/CD pipeline safely?
Describe a chaos experiment that revealed an unexpected system weakness.

Certification suggestions

AWS Certified Solutions Architect – Professional — AWS
Certified Kubernetes Administrator (CKA) — CNCF
Google Professional Cloud Architect — Google Cloud

