Reliability & Operations | 90 days | 2-3 hours/day | Updated 2026-06-01
Observability Engineering 90-Day Learning Path
Master Observability Engineering in 90 days — the three pillars (metrics, logs, traces), OpenTelemetry instrumentation, cardinality management, and SLO-driven alerting. Know your systems inside out.
What Observability Engineering means
Observability Engineering is the practice of instrumenting systems so their internal state can be inferred from external outputs. The three pillars — metrics, logs, and distributed traces — provide complementary views of system behavior. Modern observability centers on OpenTelemetry as the vendor-neutral standard, with high-cardinality event data enabling fast debugging of novel failure modes.
Who should follow this path
- SREs building observability platforms
- DevOps engineers instrumenting applications and infrastructure
- Software engineers adding telemetry to their services
- Platform engineers designing observability pipelines
Prerequisites
- Experience with at least one monitoring tool (Prometheus, Datadog, etc.)
- Basic understanding of distributed systems
- Familiarity with application development in any language
- Understanding of CI/CD and deployment pipelines
The 90-day plan
Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.
Days 1–15: Foundation
- The three pillars of observability: metrics, logs, and traces
- Observability vs monitoring distinction
- Structured logging principles and JSON log formats
- Metrics types: counters, gauges, histograms, and summaries
- Distributed tracing concepts: spans, traces, and context propagation
Outcome: Explain the three observability pillars and instrument a simple service with all three.
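As a starting point for the structured-logging topic above, here is a minimal sketch that emits one JSON object per log line using only the Python standard library. The field names `trace_id`, `span_id`, and `http_status`, and the helper `build_logger`, are illustrative assumptions, not part of any standard schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical correlation fields; pick names that match your own schema.
    EXTRA_FIELDS = ("trace_id", "span_id", "http_status")

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger("checkout").info("order placed", extra={"trace_id": "abc123"})` writes a single JSON line that carries the trace ID, which is what later allows logs to be joined to traces.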
Days 16–30: Core concepts
- OpenTelemetry architecture: API, SDK, and Collector
- Auto-instrumentation vs manual instrumentation
- OTLP (OpenTelemetry Protocol) and exporter configuration
- OpenTelemetry Collector pipeline design
- Cardinality: what it is and why it matters for cost
Outcome: Instrument a microservice with OpenTelemetry and route telemetry through the Collector.
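To make the cardinality topic concrete: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below works through that arithmetic; the label names and counts are invented for illustration.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single request-count metric.
base = {"service": 20, "endpoint": 50, "status_code": 5}
with_user_id = {**base, "user_id": 100_000}

print(max_series(base))          # 20 * 50 * 5 = 5000
print(max_series(with_user_id))  # 5000 * 100000 = 500000000
```

Adding a single unbounded label such as a user ID multiplies the series count by the number of users, which is why such identifiers usually belong in logs or trace attributes and not in metric labels.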
Days 31–45: Tools and workflows
- Prometheus metrics collection and PromQL fundamentals
- Grafana dashboard design for observability
- Loki for log aggregation and LogQL queries
- Jaeger and Tempo for distributed trace storage
- Exemplars: connecting metrics to traces
Outcome: Build a full observability stack (metrics, logs, traces) using Prometheus, Loki, and Tempo.
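When studying PromQL and histograms, it helps to see how a quantile is estimated from cumulative buckets. The sketch below follows the linear-interpolation idea behind PromQL's `histogram_quantile` for classic histograms. It is a simplified illustration, not the Prometheus implementation, and the bucket data is made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    Assumes the lowest bucket starts at 0, as with typical latency histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable when the last bucket is +Inf


# Hypothetical request-duration buckets in seconds.
latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p90 = histogram_quantile(0.9, latency)  # falls in the 0.25 to 0.5 bucket
```

The estimate can only be as precise as the bucket boundaries, so bucket layout is a design decision worth making around the latency thresholds you care about.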
Days 46–60: Hands-on projects
- SLO-driven alerting design
- Burn rate alerts and multi-window multi-burn-rate rules
- Alert fatigue reduction strategies
- Correlation across pillars for faster MTTR
- High-cardinality event exploration with Honeycomb
Outcome: Design SLO burn rate alerts and use cross-pillar correlation to resolve incidents faster.
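The burn-rate topics above can be sketched in a few lines. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window rule fires only when both a long and a short window exceed the threshold. The window pairs and thresholds below follow the commonly cited values from the Google SRE Workbook for a 30-day SLO; the `BurnRule` type and the input format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# 14.4x burn for 1h spends 2% of a 30-day budget; 6x for 6h spends 5%.
RULES = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def firing(error_ratios: dict[str, float], slo: float) -> list[tuple[str, str]]:
    """Return (severity, long_window) for every rule whose two windows both exceed it.

    `error_ratios` maps a window name such as "1h" to the error ratio observed
    over that window (hypothetical input; in practice this comes from queries).
    """
    alerts = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo)
        short_burn = burn_rate(error_ratios[rule.short_window], slo)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            alerts.append((rule.severity, rule.long_window))
    return alerts
```

Requiring the short window as well means the alert stops firing soon after the error rate recovers, which helps reduce alert fatigue.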
Days 61–75: Advanced practices
- Observability platform architecture and scalability
- Thanos and Cortex for long-term Prometheus storage
- OpenTelemetry at scale: sampling strategies
- Cost management for observability data (cardinality and retention)
- eBPF-based observability with Cilium and Pixie
Outcome: Scale an observability platform with long-term storage, tail sampling, and cost controls.
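The difference between head and tail sampling can be sketched as follows. Head sampling decides from the trace ID alone, before anything is known about the outcome. Tail sampling waits for the finished trace and can keep every error or slow request. The `Span` type, the 500 ms threshold, and the 1% baseline are illustrative assumptions, not OpenTelemetry APIs or defaults.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str       # 32 lowercase hex characters
    duration_ms: float
    is_error: bool


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic decision from the trace ID, so every service agrees."""
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < ratio * 2**64


def tail_sample(
    spans: list[Span],
    latency_threshold_ms: float = 500.0,
    baseline_ratio: float = 0.01,
) -> bool:
    """Decide after the whole trace is buffered: keep errors, slow traces, and a baseline."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Otherwise fall back to a small probabilistic share of normal traffic.
    return head_sample(spans[0].trace_id, baseline_ratio)
```

The trade-off is cost: tail sampling needs all spans of a trace buffered in one place until the decision is made, while head sampling is cheap but may drop the rare traces you most want to see.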
Days 76–90: Portfolio, interview & certification prep
- Portfolio: observability platform for a microservices app
- Observability engineering interview preparation
- OpenTelemetry Certified Associate (OTCA) exam prep
- Contributing to OpenTelemetry open source
- Writing observability engineering design documents
Outcome: Present a production-grade observability platform and lead observability architecture discussions.
Phase outcomes at a glance
| Phase | Outcome |
|---|---|
| Days 1–15 | Explain the three observability pillars and instrument a simple service with all three. |
| Days 16–30 | Instrument a microservice with OpenTelemetry and route telemetry through the Collector. |
| Days 31–45 | Build a full observability stack (metrics, logs, traces) using Prometheus, Loki, and Tempo. |
| Days 46–60 | Design SLO burn rate alerts and use cross-pillar correlation to resolve incidents faster. |
| Days 61–75 | Scale an observability platform with long-term storage, tail sampling, and cost controls. |
| Days 76–90 | Present a production-grade observability platform and lead observability architecture discussions. |
Tools to learn
- Prometheus
- Grafana
- OpenTelemetry
- Jaeger
- Tempo
- Loki
- Datadog
- Honeycomb
- Thanos
- Elastic Stack
Labs to practice
Mini projects
- Build a full observability stack for a microservices app with metrics, logs, and traces
- Implement OpenTelemetry auto-instrumentation across five polyglot services
- Design SLO burn rate alerts with cross-pillar correlation dashboards
Interview questions to prepare
- Explain the difference between monitoring and observability.
- What is cardinality and why does high cardinality cause problems?
- How does OpenTelemetry context propagation work across service boundaries?
- What is a burn rate alert and when would you use it?
- Describe the trade-offs between head sampling and tail sampling in tracing.
- How do exemplars connect metrics to traces?
- What is eBPF and how does it enable zero-instrumentation observability?
- How would you reduce observability costs by 50 percent without losing signal?
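For the context-propagation question above, it can help to practice with the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. The sketch below parses and re-emits a version `00` header using only the standard library. The `SpanContext`, `extract`, and `inject` names are hypothetical stand-ins, not the OpenTelemetry API, and the parser is deliberately strict and simplified.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (16 bytes) - parent-id (8 bytes) - flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def extract(headers: dict) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers (keys assumed lowercase)."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def inject(parent: SpanContext, headers: dict) -> SpanContext:
    """Start a child span: keep the trace ID, mint a new span ID, write the header."""
    child = SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
    flags = "01" if child.sampled else "00"
    headers["traceparent"] = f"00-{child.trace_id}-{child.span_id}-{flags}"
    return child
```

Each service calls `extract` on the incoming request and `inject` on every outgoing one, so the trace ID stays constant end to end while the span ID changes at each hop.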
Certification suggestions
- Grafana Certified Associate — Grafana Labs
- Google Professional Cloud DevOps Engineer — Google Cloud
- Certified Kubernetes Administrator (CKA) — CNCF
Free resources
- OpenTelemetry Documentation
- Prometheus Documentation
- Grafana Documentation
- Google SRE Workbook: Alerting
- roadmap.sh Observability Roadmap