Reliability & Operations · 90 days · 2-3 hours/day · Updated 2026-06-01

Observability Engineering 90-Day Learning Path

Master Observability Engineering in 90 days — the three pillars (metrics, logs, traces), OpenTelemetry instrumentation, cardinality management, and SLO-driven alerting. Know your systems inside out.

What Observability Engineering means

Observability Engineering is the practice of instrumenting systems so their internal state can be inferred from external outputs. The three pillars — metrics, logs, and distributed traces — provide complementary views of system behavior. Modern observability centers on OpenTelemetry as the vendor-neutral standard, with high-cardinality event data enabling fast debugging of novel failure modes.

Who should follow this path

SREs building observability platforms
DevOps engineers instrumenting applications and infrastructure
Software engineers adding telemetry to their services
Platform engineers designing observability pipelines

Prerequisites

Experience with at least one monitoring tool (Prometheus, Datadog, etc.)
Basic understanding of distributed systems
Familiarity with application development in any language
Understanding of CI/CD and deployment pipelines

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

The three pillars of observability: metrics, logs, and traces
Observability vs monitoring distinction
Structured logging principles and JSON log formats
Metric types: counters, gauges, histograms, and summaries
Distributed tracing concepts: spans, traces, and context propagation

Outcome: Explain the three observability pillars and instrument a simple service with all three.
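
As a rough illustration of two topics from this phase, structured JSON logging and trace context propagation, here is a minimal sketch that uses only the Python standard library. It assumes the W3C Trace Context `traceparent` layout (version, 32-hex trace ID, 16-hex span ID, flags). The log field names (`ts`, `trace_id`, `span_id`) and the `checkout` logger are hypothetical choices for this example, not a required schema, and a real service would normally let an OpenTelemetry SDK generate and propagate the context.

```python
import json
import logging
import secrets


def new_traceparent() -> str:
    """Build a W3C Trace Context `traceparent` value for a new root span."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, "sampled" flag set


def parse_traceparent(header: str) -> dict:
    """Split a `traceparent` value into its four dash-separated fields."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying trace context."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical field names; they let a log line be joined to its trace.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


if __name__ == "__main__":
    ctx = parse_traceparent(new_traceparent())
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.error("payment failed",
              extra={"trace_id": ctx["trace_id"], "span_id": ctx["span_id"]})
```

Because every log line carries the same `trace_id` as the spans of the request that produced it, a log search result can be followed straight to the matching trace.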

Days 16–30: Core concepts

OpenTelemetry architecture: API, SDK, and Collector
Auto-instrumentation vs manual instrumentation
OTLP (OpenTelemetry Protocol) and exporter configuration
OpenTelemetry Collector pipeline design
Cardinality: what it is and why it matters for cost

Outcome: Instrument a microservice with OpenTelemetry and route telemetry through the Collector.
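
To make the cardinality topic concrete: the number of time series a metric can produce is bounded by the product of the distinct values of each of its labels. The sketch below computes that upper bound. The label names and counts are made-up numbers for illustration, and real cardinality is usually lower because not every combination is observed.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a request counter.
bounded = {"service": 20, "endpoint": 50, "status_code": 5}
unbounded = {**bounded, "user_id": 10_000}  # adding a per-user label

if __name__ == "__main__":
    print(max_series(bounded))    # 20 * 50 * 5
    print(max_series(unbounded))  # the same, multiplied by 10,000 users
```

A single unbounded label such as a user or request ID multiplies the whole series count, which is why such identifiers usually belong in traces or logs rather than metric labels.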

Days 31–45: Tools and workflows

Prometheus metrics collection and PromQL fundamentals
Grafana dashboard design for observability
Loki for log aggregation and LogQL queries
Jaeger and Tempo for distributed trace storage
Exemplars: connecting metrics to traces

Outcome: Build a full observability stack (metrics, logs, traces) using Prometheus, Loki, and Tempo.

Days 46–60: Hands-on projects

SLO-driven alerting design
Burn rate alerts and multi-window multi-burn-rate rules
Alert fatigue reduction strategies
Correlation across pillars for lower MTTR
High-cardinality event exploration with Honeycomb

Outcome: Design SLO burn rate alerts and use cross-pillar correlation to resolve incidents faster.
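
Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below evaluates multi-window, multi-burn-rate rules, where a rule fires only when both its long and its short window exceed the threshold. The window pairs and thresholds (14.4, 6, 1) follow the commonly cited values from the Google SRE Workbook and are assumed here as defaults. The input error ratios are made-up numbers, and in practice they would come from queries against a metrics backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Assumed defaults, based on commonly cited SRE Workbook values.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[BurnRule]:
    """Return the rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # Hypothetical error ratios per window for a 99.9% availability SLO.
    ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
    for rule in firing_rules(ratios, slo_target=0.999):
        print(rule.severity, rule.long_window, rule.short_window)
```

Requiring the short window as well means the alert stops firing soon after the error rate recovers, which helps with alert fatigue.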

Days 61–75: Advanced practices

Observability platform architecture and scalability
Thanos and Cortex for long-term Prometheus storage
OpenTelemetry at scale: sampling strategies
Cost management for observability data (cardinality and retention)
eBPF-based observability with Cilium and Pixie

Outcome: Scale an observability platform with long-term storage, tail sampling, and cost controls.
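
The difference between head and tail sampling can be sketched in a few lines. Head sampling decides when the trace starts, using only the trace ID. Tail sampling decides after the spans have been collected, so it can always keep traces that contain errors or slow spans. The functions, span fields, and thresholds below are hypothetical and only illustrate the idea. They are not the OpenTelemetry SDK or Collector implementation.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head sampling: treat the low 64 bits of the trace ID as a
    uniform random number and keep the trace if it falls below the ratio."""
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(spans: list[dict], slow_ms: float, baseline_ratio: float) -> bool:
    """Tail sampling over a completed trace: always keep errors and slow traces,
    and keep a small baseline of everything else."""
    if not spans:
        return False
    if any(span["error"] for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    return head_sample(spans[0]["trace_id"], baseline_ratio)


if __name__ == "__main__":
    trace_id = "f" * 32  # hypothetical trace ID that a 10% head sampler drops
    failed = [{"trace_id": trace_id, "duration_ms": 12.0, "error": True}]
    print(head_sample(trace_id, 0.10))                               # dropped at the head
    print(tail_sample(failed, slow_ms=500.0, baseline_ratio=0.10))   # kept at the tail
```

The trade-off is cost: head sampling is cheap because dropped traces are never exported, while tail sampling must buffer every span of a trace until the decision can be made.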

Days 76–90: Portfolio, interview & certification prep

Portfolio: observability platform for a microservices app
Observability engineering interview preparation
OpenTelemetry Certified Associate (OTCA) exam prep
Contributing to OpenTelemetry open source
Writing observability engineering design documents

Outcome: Present a production-grade observability platform and lead observability architecture discussions.

Phase outcomes at a glance

Phase	Outcome
Days 1–15	Explain the three observability pillars and instrument a simple service with all three.
Days 16–30	Instrument a microservice with OpenTelemetry and route telemetry through the Collector.
Days 31–45	Build a full observability stack (metrics, logs, traces) using Prometheus, Loki, and Tempo.
Days 46–60	Design SLO burn rate alerts and use cross-pillar correlation to resolve incidents faster.
Days 61–75	Scale an observability platform with long-term storage, tail sampling, and cost controls.
Days 76–90	Present a production-grade observability platform and lead observability architecture discussions.

Tools to learn

Prometheus
Grafana
OpenTelemetry
Jaeger
Tempo
Loki
Datadog
Honeycomb
Thanos
Elastic Stack

Labs to practice

Mini projects

Build a full observability stack for a microservices app with metrics, logs, and traces
Implement OpenTelemetry auto-instrumentation across five polyglot services
Design SLO burn rate alerts with cross-pillar correlation dashboards

Interview questions to prepare

Explain the difference between monitoring and observability.
What is cardinality and why does high cardinality cause problems?
How does OpenTelemetry context propagation work across service boundaries?
What is a burn rate alert and when would you use it?
Describe the trade-offs between head sampling and tail sampling in tracing.
How do exemplars connect metrics to traces?
What is eBPF and how does it enable zero-instrumentation observability?
How would you reduce observability costs by 50 percent without losing signal?

Certification suggestions

Grafana Certified Associate — Grafana Labs
Google Professional Cloud DevOps Engineer — Google Cloud
Certified Kubernetes Administrator (CKA) — CNCF

