Reliability & Operations | 90 days | 2-3 hours/day | updated 2026-06-01

Chaos Engineering 90-Day Learning Path

Master Chaos Engineering in 90 days — experiment design, blast radius control, steady-state hypothesis testing, and chaos tooling. Build confidence in system resilience through controlled failure injection.

What Chaos Engineering means

Chaos Engineering is the discipline of intentionally injecting failures into systems to uncover weaknesses that would otherwise emerge only during real incidents. Practitioners define a steady-state hypothesis, inject targeted failures, observe system behavior, and compare the observations against the hypothesis. The result is empirical confidence in resilience rather than assumed reliability.
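
The four steps described above (hypothesis, inject, observe, compare) can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `run_experiment`, `ExperimentResult`, and the in-memory `system` dictionary are hypothetical names invented for this example, and a real steady-state check would query monitoring data instead of a local variable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool  # hypothesis held before the fault was injected
    steady_after: bool   # hypothesis still held while the fault was active

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check the hypothesis, inject a fault, re-check, and always roll back."""
    # Do not inject into a system that is already outside its steady state.
    if not steady_state():
        return ExperimentResult(steady_before=False, steady_after=False)
    try:
        inject()
        steady_after = steady_state()
    finally:
        rollback()  # runs even if injection or observation raises
    return ExperimentResult(steady_before=True, steady_after=steady_after)


# Hypothetical stand-in for a real service: a counter of healthy replicas.
system = {"healthy_replicas": 3}

result = run_experiment(
    steady_state=lambda: system["healthy_replicas"] >= 2,
    inject=lambda: system.update(healthy_replicas=system["healthy_replicas"] - 1),
    rollback=lambda: system.update(healthy_replicas=3),
)
print("hypothesis held:", result.hypothesis_held)
```

The pre-check matters as much as the post-check: if the system is unhealthy before injection, the experiment says nothing about resilience and should not run.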

Who should follow this path

  • SREs validating system resilience assumptions
  • Platform engineers testing infrastructure reliability
  • DevOps engineers hardening microservices architectures
  • Architects verifying distributed system fault tolerance

Prerequisites

  • Experience with production distributed systems
  • Basic understanding of SLOs and reliability targets
  • Familiarity with Kubernetes and containerized workloads
  • Understanding of monitoring and observability tools

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

  • Chaos engineering principles (Netflix origins and Principles of Chaos)
  • Steady-state hypothesis definition and measurement
  • Blast radius control and safety controls
  • Chaos experiment lifecycle (hypothesis, inject, observe, conclude)
  • Chaos program maturity model

Outcome: Write a chaos experiment plan with a defined steady-state hypothesis and safety controls.

Days 16–30: Core concepts

  • Chaos Monkey and Simian Army tools
  • Chaos Toolkit framework and experiment YAML
  • Litmus Chaos operator for Kubernetes
  • Gremlin platform overview
  • AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)

Outcome: Run a pod termination chaos experiment on Kubernetes using Chaos Toolkit or Litmus.
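
As a rough starting point for this outcome, the sketch below builds a Chaos Toolkit experiment document in Python and prints it as JSON (Chaos Toolkit accepts experiments in JSON or YAML). The namespace `staging`, the label `app=checkout`, and the health URL are assumptions made up for illustration. The `chaosk8s.pod.actions` module and its `terminate_pods` arguments come from the chaostoolkit-kubernetes extension as best recalled here, so verify them against the version you install before running anything.

```python
import json

# Assumed for illustration: namespace "staging", label "app=checkout",
# and the health endpoint URL. Replace with values from your own cluster.
experiment = {
    "version": "1.0.0",
    "title": "Checkout survives the loss of one pod",
    "description": "Terminate one random checkout pod and verify the service still responds.",
    "steady-state-hypothesis": {
        "title": "Checkout health endpoint returns 200",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-is-healthy",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://checkout.staging.svc.cluster.local/healthz",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-checkout-pod",
            "provider": {
                "type": "python",
                # Module and function names should be checked against the
                # installed chaostoolkit-kubernetes release.
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "app=checkout",
                    "ns": "staging",
                    "qty": 1,      # blast radius: a single pod
                    "rand": True,  # pick the pod at random
                },
            },
            "pauses": {"after": 10},  # give the cluster time to reschedule
        }
    ],
    "rollbacks": [],
}

print(json.dumps(experiment, indent=2))
```

Saved to a file such as `experiment.json`, a document like this would be executed with `chaos run experiment.json`. Chaos Toolkit evaluates the steady-state hypothesis before and after the method, which mirrors the lifecycle covered in the Foundation phase.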

Days 31–45: Tools and workflows

  • Network chaos: latency, packet loss, and partitions
  • Resource chaos: CPU throttling and memory pressure
  • Application chaos: exception injection and latency
  • Infrastructure chaos: node termination and AZ failure
  • Blast radius scoping with labels and namespaces

Outcome: Execute network latency and resource exhaustion experiments with controlled blast radius.
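
Scoping by labels and namespaces boils down to a filter plus a cap. The sketch below shows one way to express that idea, independent of any particular chaos tool. `Pod`, `select_targets`, and the `max_fraction` parameter are hypothetical names for this example, and the pod list is hard-coded where a real run would read it from the Kubernetes API.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


def select_targets(pods, namespace, selector, max_fraction=0.25, seed=None):
    """Pick a random subset of matching pods, capped at max_fraction of the matches."""
    if not selector:
        # An empty selector would match every pod in the namespace.
        raise ValueError("refusing to run with an empty label selector")
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    matching = [
        p for p in pods
        if p.namespace == namespace
        and all(p.labels.get(k) == v for k, v in selector.items())
    ]
    if not matching:
        return []
    # Round the cap down, but keep at least one target so the experiment can run.
    limit = max(1, math.floor(len(matching) * max_fraction))
    return random.Random(seed).sample(matching, limit)


# Hypothetical inventory: 8 checkout pods in staging, plus pods that must not be touched.
pods = [Pod(f"checkout-{i}", "staging", {"app": "checkout"}) for i in range(8)]
pods += [Pod(f"checkout-prod-{i}", "production", {"app": "checkout"}) for i in range(4)]
pods += [Pod("cart-0", "staging", {"app": "cart"})]

targets = select_targets(pods, "staging", {"app": "checkout"}, max_fraction=0.25, seed=7)
print(len(targets), "of 8 matching pods selected")
```

Rejecting an empty selector is a deliberate safety control: a missing label should stop the experiment, not widen it to the whole namespace.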

Days 46–60: Hands-on projects

  • Observability during chaos experiments
  • Correlating chaos events with SLO burn rates
  • Automated experiment analysis and pass/fail gates
  • Regression testing with continuous chaos
  • Chaos experiment result documentation

Outcome: Automate chaos experiment execution and link results to SLO impact metrics.
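
One common way to link an experiment to SLO impact is the burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 means the budget is being spent exactly at the sustainable pace. The sketch below computes it and applies a pass/fail gate. The function names, the request counts, and the threshold of 10 are assumptions for illustration; in practice the counts would come from your metrics system and the threshold from your own alerting policy.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def experiment_passes(errors: int, total: int, slo_target: float, max_burn_rate: float) -> bool:
    """Gate: pass only if the burn rate during the experiment stays within the limit."""
    return burn_rate(errors, total, slo_target) <= max_burn_rate


# Hypothetical numbers observed during the experiment window:
# 120 failed requests out of 40,000 against a 99.9% availability SLO.
rate = burn_rate(errors=120, total=40_000, slo_target=0.999)
print(f"burn rate: {rate:.2f}")  # error ratio 0.003 / budget 0.001, about 3.00
print("gate passed:", experiment_passes(120, 40_000, 0.999, max_burn_rate=10.0))
```

A gate like this can run at the end of an automated experiment and return a non-zero exit code on failure, which is enough for most pipelines to block a promotion.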

Days 61–75: Advanced practices

  • GameDay design and facilitation
  • Chaos engineering program governance
  • Chaos experiments in CI/CD pipelines
  • Cross-team chaos coordination
  • Security chaos: credential rotation and IAM disruption

Outcome: Run a company-wide GameDay and integrate chaos experiments into a staging pipeline.

Days 76–90: Portfolio, interview & certification prep

  • Portfolio: chaos engineering program case study
  • Chaos engineering interview preparation
  • AWS Certified Solutions Architect Professional exam prep
  • Contributing to Litmus Chaos open source
  • Building a chaos engineering charter and roadmap

Outcome: Present a mature chaos engineering program and demonstrate quantified resilience improvements.

Phase outcomes at a glance

Phase         Outcome
Days 1–15     Write a chaos experiment plan with a defined steady-state hypothesis and safety controls.
Days 16–30    Run a pod termination chaos experiment on Kubernetes using Chaos Toolkit or Litmus.
Days 31–45    Execute network latency and resource exhaustion experiments with controlled blast radius.
Days 46–60    Automate chaos experiment execution and link results to SLO impact metrics.
Days 61–75    Run a company-wide GameDay and integrate chaos experiments into a staging pipeline.
Days 76–90    Present a mature chaos engineering program and demonstrate quantified resilience improvements.

Tools to learn

  • Chaos Toolkit
  • Litmus Chaos
  • Gremlin
  • AWS Fault Injection Service (formerly Fault Injection Simulator)
  • Chaos Monkey
  • Prometheus
  • Grafana
  • Kubernetes
  • k6
  • Envoy

Labs to practice

Mini projects

  • Design and execute a GameDay covering five failure modes for a Kubernetes microservices app
  • Build a continuous chaos testing pipeline that runs nightly in staging
  • Create a chaos experiment library covering network, resource, and application fault categories

Interview questions to prepare

  1. What is the steady-state hypothesis and why is it central to chaos engineering?
  2. How do you control blast radius when running chaos experiments in production?
  3. What is the difference between Chaos Toolkit and Litmus Chaos?
  4. How do you decide which failure modes to prioritize in a new chaos program?
  5. What safety controls do you put in place before running chaos experiments?
  6. How do you automate chaos experiments in a CI/CD pipeline safely?
  7. Describe a chaos experiment that revealed an unexpected system weakness.

Certification suggestions

  • AWS Certified Solutions Architect – Professional (AWS)
  • Certified Kubernetes Administrator, CKA (The Linux Foundation and CNCF)
  • Google Professional Cloud Architect (Google Cloud)
