advanced | ~90 min | updated 2026-06-01
Chaos Engineering Experiment
Run your first chaos experiment with Chaos Mesh on a kind cluster: define a steady state, kill pods of a live service with a PodChaos manifest, and verify the system self-heals.
Objective
Install Chaos Mesh, run a pod-kill experiment against a replicated NGINX service under load, and use a steady-state check to test whether availability holds. The experiment follows the classic chaos loop: define steady state, inject failure, observe, and learn.
Prerequisites
- Docker installed and running
- kind, kubectl, and Helm v3 installed
- Solid Kubernetes fundamentals: Deployments, Services, readiness probes
- Two terminal windows (or two tmux panes)
Architecture
Chaos Mesh installs a controller-manager and per-node daemons that can inject faults. A PodChaos custom resource selects target pods by namespace and label; with action pod-kill and mode one it kills a single, randomly chosen matching pod. In Chaos Mesh 2.x one PodChaos performs the kill once, and repeating it on a fixed interval is done with a separate Schedule resource. The target is a 3-replica NGINX Deployment behind a ClusterIP Service; a load generator pod requests the Service in a loop with wget so you can measure the error rate during the experiment.
+--------------------------- kind cluster ---------------------------+
|  chaos-mesh ns: controller-manager + chaos-daemon (per node)       |
|        |                                                           |
|        |  PodChaos: kill 1 pod (label app=web)                     |
|        v                                                           |
|  target ns: Deployment web (3 replicas)  <--  Service web          |
|        ^                                           ^               |
|  ReplicaSet recreates pods                 loadgen pod (wget)      |
+--------------------------------------------------------------------+
Steps
1. Create the cluster and the target workload
kind create cluster --name chaos-lab
kubectl create namespace target
kubectl create deployment web --image=nginx:1.27-alpine --replicas=3 -n target
kubectl expose deployment web --port=80 -n target
kubectl wait --for=condition=available deployment/web -n target --timeout=120s
2. Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --wait
kubectl get pods -n chaos-mesh
Note: kind uses containerd, so the runtime/socket overrides above are required.
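To confirm the runtime before relying on those overrides, you can read it from the Node status. A minimal sketch; the helper name node_runtimes is illustrative and not part of kubectl or Chaos Mesh:

```shell
# Print each node's name and container runtime, one node per line.
# status.nodeInfo.containerRuntimeVersion is a standard Node status field.
node_runtimes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}

# Usage (needs the chaos-lab cluster from step 1):
#   node_runtimes
# On kind, the second column is expected to start with "containerd://".
```

If the runtime shown is not containerd, the runtime and socketPath values passed to Helm would need to match whatever is reported instead.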
3. Establish the steady state with a load generator
kubectl run loadgen -n target --image=busybox --restart=Never -- \
  sh -c 'ok=0; fail=0; while true; do
    if wget -q -T 2 -O /dev/null http://web.target.svc; then ok=$((ok+1)); else fail=$((fail+1)); fi;
    echo "ok=$ok fail=$fail"; sleep 0.5; done'
sleep 15
kubectl logs loadgen -n target --tail=3
Steady state hypothesis: error count stays at or near zero while pods are killed, because 3 replicas plus the Service spread traffic across healthy endpoints.
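"At or near zero" is easier to judge once it is a number. The sketch below turns a loadgen line into a failure percentage and compares it with a tolerance. The helper name check_steady_state and the 1% default threshold are assumptions made for illustration, not values from this lab:

```shell
# Parse a loadgen line such as "ok=327 fail=1" and compare the failure
# percentage with a threshold (assumed default: 1 percent).
# Exit status: 0 = within threshold, 1 = exceeded, 2 = unusable input.
check_steady_state() {
  line="$1"
  max_pct="${2:-1}"
  ok=$(printf '%s\n' "$line" | sed -n 's/.*ok=\([0-9][0-9]*\).*/\1/p')
  fail=$(printf '%s\n' "$line" | sed -n 's/.*fail=\([0-9][0-9]*\).*/\1/p')
  [ -n "$ok" ] && [ -n "$fail" ] || { echo "unparseable: $line" >&2; return 2; }
  total=$((ok + fail))
  [ "$total" -gt 0 ] || { echo "no requests recorded" >&2; return 2; }
  awk -v f="$fail" -v t="$total" -v m="$max_pct" 'BEGIN {
    pct = 100 * f / t
    printf "fail_pct=%.2f threshold=%s\n", pct, m
    exit ((pct <= m) ? 0 : 1)
  }'
}

# Against the live pod (needs the cluster from the steps above):
#   check_steady_state "$(kubectl logs loadgen -n target --tail=1)" 1
```

Called with "ok=327 fail=1" and a threshold of 1, it prints fail_pct=0.30 threshold=1 and returns success. Choosing the threshold before injecting any failure keeps the hypothesis falsifiable.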
4. Define the PodChaos experiment
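# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one            # pick one matching pod at random
  selector:
    namespaces:
      - target
    labelSelectors:
      app: web
  duration: "3m"
  # Chaos Mesh 2.x has no spec.scheduler field on PodChaos;
  # recurring runs are defined with a separate Schedule resource.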
Apply it, then watch pods being killed and recreated:
kubectl apply -f pod-kill.yaml
kubectl get pods -n target -w
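To kill a pod repeatedly (for example every 30 seconds) rather than once, Chaos Mesh 2.x uses a Schedule resource that wraps the PodChaos spec. The following is a sketch under that assumption. The file name pod-kill-schedule.yaml and the resource name web-pod-kill-every-30s are illustrative, and the "@every 30s" macro should be checked against the documentation for the installed version; a standard five-field cron expression such as "* * * * *" runs once per minute instead.

```shell
# Write a Schedule manifest that creates a pod-kill PodChaos every 30 seconds.
# Assumes Chaos Mesh 2.x; file and resource names are illustrative.
cat > pod-kill-schedule.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-pod-kill-every-30s
  namespace: chaos-mesh
spec:
  schedule: "@every 30s"     # assumption: the @every macro is accepted
  type: PodChaos
  historyLimit: 5            # keep the last 5 generated PodChaos objects
  concurrencyPolicy: Forbid  # skip a run while the previous one is active
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - target
      labelSelectors:
        app: web
EOF

# Apply it in place of pod-kill.yaml, and delete it to stop the kills:
#   kubectl apply -f pod-kill-schedule.yaml
#   kubectl delete -f pod-kill-schedule.yaml
```

A Schedule keeps firing until it is deleted, so the length of the experiment is set by when you remove it. Remember to delete the extra file during cleanup.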
5. Observe during the experiment
In a second terminal:
kubectl describe podchaos web-pod-kill -n chaos-mesh | tail -20
kubectl logs loadgen -n target --tail=5
kubectl get events -n target --sort-by=.lastTimestamp | tail -10
6. Conclude: verify the hypothesis
# After the 3 minute duration elapses:
kubectl logs loadgen -n target --tail=1
kubectl get deployment web -n target
If fail stayed at or near 0 while pods restarted, the hypothesis holds. Try lowering replicas to 1 and re-running the experiment to see the steady state break — that contrast is the lesson.
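The single-replica comparison can be scripted roughly as follows. This is a sketch: the function name rerun_with_single_replica and the OBSERVE_SECONDS variable are made up for illustration, and it assumes pod-kill.yaml from step 4 is in the current directory.

```shell
# Shrink the Deployment to one replica, recreate the experiment, read the
# loadgen counters after a pause, then restore three replicas.
rerun_with_single_replica() {
  kubectl scale deployment web -n target --replicas=1 &&
  kubectl rollout status deployment/web -n target --timeout=120s &&
  kubectl delete -f pod-kill.yaml --ignore-not-found &&
  kubectl apply -f pod-kill.yaml &&
  sleep "${OBSERVE_SECONDS:-30}" &&
  kubectl logs loadgen -n target --tail=1
  # Always restore the original replica count, even if a step above failed.
  kubectl scale deployment web -n target --replicas=3
}

# Usage (needs the cluster and the running loadgen pod):
#   kubectl logs loadgen -n target --tail=1   # note fail= before the run
#   rerun_with_single_replica
```

The loadgen counters are cumulative, so compare the fail value before and after the run rather than reading the final number on its own.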
Expected output
$ kubectl get pods -n target -w
NAME                  READY   STATUS              RESTARTS   AGE
web-5f7d8c6b9-abc12   1/1     Running             0          4m
web-5f7d8c6b9-def34   1/1     Terminating         0          4m   <- killed by chaos
web-5f7d8c6b9-ghi56   1/1     Running             0          4m
web-5f7d8c6b9-jkl78   0/1     ContainerCreating   0          1s
web-5f7d8c6b9-jkl78   1/1     Running             0          4s
$ kubectl logs loadgen -n target --tail=1
ok=327 fail=1
$ kubectl describe podchaos web-pod-kill -n chaos-mesh | tail -3
Type    Reason   Message
Normal  Applied  Successfully apply chaos for target/web-5f7d8c6b9-def34
Troubleshooting
- chaos-daemon pods CrashLoopBackOff on kind: wrong container runtime socket. Reinstall with --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock.
- PodChaos created but nothing dies: the selector matched zero pods. Check labels with kubectl get pods -n target --show-labels (kubectl create deployment sets app=web).
- admission webhook denied the request: Chaos Mesh controller is not Ready yet. Wait for kubectl get pods -n chaos-mesh to show all Running, then re-apply.
- loadgen shows many failures even before chaos: the Service has unready endpoints or DNS is failing. Verify that kubectl get endpoints web -n target lists 3 addresses.
- Experiment never ends: a missing duration makes some chaos types run until deleted. Delete it manually: kubectl delete podchaos web-pod-kill -n chaos-mesh.
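For the "nothing dies" case, counting the pods a selector matches is a quick check. A small sketch; count_targets is a hypothetical helper name:

```shell
# Count the pods in a namespace that match a label selector.
# Zero means an experiment using that selector has nothing to act on.
count_targets() {
  ns="$1"
  selector="$2"
  kubectl get pods -n "$ns" -l "$selector" --no-headers 2>/dev/null | grep -c .
}

# Usage (needs the cluster): count_targets target app=web
# For this lab the expected count is 3 while all replicas exist.
```

The namespace and label passed here should mirror the namespaces and labelSelectors fields of the PodChaos manifest exactly.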
Cleanup
kubectl delete -f pod-kill.yaml --ignore-not-found
kubectl delete pod loadgen -n target --ignore-not-found
helm uninstall chaos-mesh -n chaos-mesh
kubectl delete namespace chaos-mesh target
kind delete cluster --name chaos-lab
rm -f pod-kill.yaml