advanced ~90 min updated 2026-06-01

Chaos Engineering Experiment

Run your first chaos experiment with Chaos Mesh on a kind cluster: define a steady state, kill pods of a live service with a PodChaos manifest, and verify the system self-heals.

Objective

Install Chaos Mesh, run a pod-kill experiment against a replicated NGINX service under load, and prove availability holds via a steady-state check. The experiment follows the classic chaos loop: define steady state, inject failure, observe, and learn.

Prerequisites

  • Docker installed and running
  • kind, kubectl, and Helm v3 installed
  • Solid Kubernetes fundamentals: Deployments, Services, readiness probes
  • Two terminal windows, or tmux with two panes

Architecture

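Chaos Mesh installs a controller-manager and per-node daemons that can inject faults. A PodChaos custom resource selects target pods by namespace and label; with action pod-kill and mode one, it kills one randomly chosen matching pod. Repeating the kill on a fixed interval, as the diagram shows, is the job of a scheduling rule (in Chaos Mesh 2.x, a separate Schedule resource). The target is a 3-replica NGINX Deployment behind a ClusterIP Service; a load generator pod continuously requests the Service so you can measure the error rate during the experiment.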

+------------------------- kind cluster --------------------------+
| chaos-mesh ns: controller-manager + chaos-daemon (per node)     |
|        |                                                        |
|        | PodChaos: kill 1 pod / 30s (label app=web)             |
|        v                                                        |
| target ns: Deployment web (3 replicas) <-- Service web          |
|                         ^                       ^               |
|              ReplicaSet recreates pods    loadgen pod (curl)    |
+------------------------------------------------------------------+

Steps

1. Create the cluster and the target workload

kind create cluster --name chaos-lab
kubectl create namespace target
kubectl create deployment web --image=nginx:1.27-alpine --replicas=3 -n target
kubectl expose deployment web --port=80 -n target
kubectl wait --for=condition=available deployment/web -n target --timeout=120s

2. Install Chaos Mesh

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --wait
kubectl get pods -n chaos-mesh

Note: kind uses containerd, so the runtime/socket overrides above are required.
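To confirm which runtime the nodes actually report before trusting those overrides, you can read it from the node status. The sketch below is only an illustration: runtime_name and node_runtime are hypothetical helper names, and it assumes the containerRuntimeVersion value has the usual name://version shape.

```shell
# runtime_name: hypothetical helper. Strips the version from a value such as
# "containerd://1.7.18" and prints only the runtime name.
runtime_name() {
  printf '%s\n' "${1%%://*}"
}

# node_runtime: hypothetical helper. Reads the runtime reported by the first
# node in the current kubectl context and prints its name.
node_runtime() {
  runtime_name "$(kubectl get nodes \
    -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}')"
}
```

On the kind cluster from step 1, node_runtime should print containerd, which is the value passed to chaosDaemon.runtime above.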

3. Establish the steady state with a load generator

kubectl run loadgen -n target --image=busybox --restart=Never -- \
  sh -c 'ok=0; fail=0; while true; do
    if wget -q -T 2 -O /dev/null http://web.target.svc; then ok=$((ok+1)); else fail=$((fail+1)); fi;
    echo "ok=$ok fail=$fail"; sleep 0.5; done'
sleep 15
kubectl logs loadgen -n target --tail=3

Steady-state hypothesis: the error count stays at or near zero while pods are killed, because the Service routes traffic to the remaining healthy replicas of the 3-replica Deployment.
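"At or near zero" is easier to judge as a percentage than as a raw counter. As a minimal sketch, assuming the loadgen line keeps the ok=N fail=M format printed by the loop above, a hypothetical error_rate helper could convert one line into a failure percentage:

```shell
# error_rate: hypothetical helper. Takes one loadgen line such as "ok=327 fail=1"
# and prints failures as a percentage of all requests, to two decimals.
error_rate() {
  printf '%s\n' "$1" | awk '{
    split($1, o, "="); split($2, f, "=")   # o[2] = ok count, f[2] = fail count
    total = o[2] + f[2]
    if (total == 0) { print "0.00"; exit } # no requests yet: avoid dividing by zero
    printf "%.2f\n", 100 * f[2] / total
  }'
}
```

For a baseline before injecting any chaos, call it as error_rate "$(kubectl logs loadgen -n target --tail=1)".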

4. Define the PodChaos experiment

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - target
    labelSelectors:
      app: web
  duration: "3m"
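  # The Chaos Mesh 1.x "scheduler" field is intentionally omitted: 2.x removed it,
  # and repeated runs are defined with a separate Schedule resource instead.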

Apply it, then watch pods being killed and recreated:

kubectl apply -f pod-kill.yaml
kubectl get pods -n target -w
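The architecture diagram describes one kill every 30 seconds. In Chaos Mesh 2.x, pod-kill is a one-shot action, so a recurring cadence is normally expressed with a Schedule resource that creates a new PodChaos on each tick. The following is a sketch under assumptions, not a tested part of this lab: it assumes Chaos Mesh 2.x, and the file name, resource name, historyLimit value, and the "@every 30s" expression are illustrative choices.

```shell
# Writes a Schedule manifest that creates a fresh pod-kill PodChaos every 30 seconds.
# Assumed, not from the original lab: file name, resource name, cadence, historyLimit.
cat > pod-kill-schedule.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-pod-kill-every-30s
  namespace: chaos-mesh
spec:
  schedule: "@every 30s"     # cadence taken from the architecture diagram
  historyLimit: 5            # number of finished runs to keep around
  concurrencyPolicy: Allow   # runs may overlap; each pod-kill is one-shot
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - target
      labelSelectors:
        app: web
EOF
```

If you try it, apply it with kubectl apply -f pod-kill-schedule.yaml, list the PodChaos objects it spawns with kubectl get podchaos -n chaos-mesh (their names are generated, so describe commands need the generated name), and stop the kills with kubectl delete -f pod-kill-schedule.yaml.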

5. Observe during the experiment

In a second terminal:

kubectl describe podchaos web-pod-kill -n chaos-mesh | tail -20
kubectl logs loadgen -n target --tail=5
kubectl get events -n target --sort-by=.lastTimestamp | tail -10

6. Conclude: verify the hypothesis

# After the 3 minute duration elapses:
kubectl logs loadgen -n target --tail=1
kubectl get deployment web -n target

If fail stayed at or near 0 while pods restarted, the hypothesis holds. Then lower the Deployment to a single replica (kubectl scale deployment web --replicas=1 -n target) and re-run the experiment to watch the steady state break. That contrast is the lesson.
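Because the loadgen counters are cumulative, the failures attributable to the experiment are the difference between a line captured before the chaos starts and one captured after it ends. A minimal sketch of that comparison, with hypothetical helper names and a failure budget you pick yourself:

```shell
# new_failures: hypothetical helper. Given a loadgen line captured before the
# experiment and one captured after it, prints the number of failures added in between.
new_failures() {
  before=${1##*fail=}   # keep only the number after "fail="
  after=${2##*fail=}
  echo $((after - before))
}

# hypothesis_holds: succeeds when the added failures stay within the budget given
# as the third argument (default 0). The budget is an assumption, not a lab value.
hypothesis_holds() {
  [ "$(new_failures "$1" "$2")" -le "${3:-0}" ]
}
```

One way to use it: save before=$(kubectl logs loadgen -n target --tail=1) just before kubectl apply, save after=$(kubectl logs loadgen -n target --tail=1) once the experiment is over, then run hypothesis_holds "$before" "$after" 2 && echo "steady state held".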

Expected output

$ kubectl get pods -n target -w
NAME                   READY   STATUS              AGE
web-5f7d8c6b9-abc12    1/1     Running             4m
web-5f7d8c6b9-def34    1/1     Terminating         4m   <- killed by chaos
web-5f7d8c6b9-ghi56    1/1     Running             4m
web-5f7d8c6b9-jkl78    0/1     ContainerCreating   1s
web-5f7d8c6b9-jkl78    1/1     Running             4s

$ kubectl logs loadgen -n target --tail=1
ok=327 fail=1

$ kubectl describe podchaos web-pod-kill -n chaos-mesh | tail -3
  Type    Reason   Message
  Normal  Applied  Successfully apply chaos for target/web-5f7d8c6b9-def34

Troubleshooting

  • chaos-daemon pods CrashLoopBackOff on kind: wrong container runtime socket. Reinstall with --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock.
  • PodChaos created but nothing dies: the selector matched zero pods. Check the labels with kubectl get pods -n target --show-labels. Note that kubectl create deployment sets app=web.
  • admission webhook denied the request: Chaos Mesh controller is not Ready yet. Wait for kubectl get pods -n chaos-mesh to show all Running, then re-apply.
  • loadgen shows many failures even before chaos: the Service has unready endpoints or DNS is failing. Verify kubectl get endpoints web -n target lists 3 addresses.
  • Experiment never ends: a missing duration makes some chaos types run until deleted. Delete it manually: kubectl delete podchaos web-pod-kill -n chaos-mesh.
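The selector check from the list above can be run before applying any experiment. As a small sketch (matching_pods is a hypothetical helper name), this counts the pods a given namespace and label pair would match:

```shell
# matching_pods: hypothetical helper. Prints how many pods in namespace $1 carry
# label $2, i.e. how many targets a PodChaos selector with that label would see.
matching_pods() {
  # "No resources found" goes to stderr, so an empty match counts as 0 lines.
  kubectl get pods -n "$1" -l "$2" --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

For the Deployment created in step 1, matching_pods target app=web should print 3; a result of 0 means the PodChaos selector has nothing to kill.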

Cleanup

kubectl delete -f pod-kill.yaml --ignore-not-found
kubectl delete pod loadgen -n target --ignore-not-found
helm uninstall chaos-mesh -n chaos-mesh
kubectl delete namespace chaos-mesh target
kind delete cluster --name chaos-lab
rm -f pod-kill.yaml