{"id":1170,"date":"2026-02-22T10:49:36","date_gmt":"2026-02-22T10:49:36","guid":{"rendered":"https:\/\/devopsschool.org\/blog\/uncategorized\/capacity-planning\/"},"modified":"2026-02-22T10:49:36","modified_gmt":"2026-02-22T10:49:36","slug":"capacity-planning","status":"publish","type":"post","link":"https:\/\/devopsschool.org\/blog\/capacity-planning\/","title":{"rendered":"What is Capacity Planning? Meaning, Examples, Use Cases, and How to use it?"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition<\/h2>\n\n\n\n<p>Capacity planning is the process of forecasting, provisioning, and validating the resources (compute, storage, network, and human processes) needed to meet demand reliably and cost-effectively over time.<\/p>\n\n\n\n<p>Analogy: Think of capacity planning as a stadium manager predicting attendance, assigning seats, arranging staff, and ensuring exits, bathrooms, and concessions scale with crowd size so every event runs safely and profitably.<\/p>\n\n\n\n<p>Formally: Capacity planning combines historical telemetry, workload models, service-level objectives, and cost constraints to produce actionable provisioning and autoscaling decisions that maintain SLO compliance while minimizing wasted capacity.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is Capacity Planning?<\/h2>\n\n\n\n<p>What it is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A discipline that forecasts demand, maps demand to resource needs, and prescribes provisioning, autoscaling, and runbook actions.<\/li>\n<li>In practice it blends data engineering, SRE practices, financial modeling, and architecture.<\/li>\n<\/ul>\n\n\n\n<p>What it is NOT:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Not just buying more servers or raising quotas without data.<\/li>\n<li>Not only a one-time sizing exercise; it\u2019s continuous and feedback-driven.<\/li>\n<li>Not identical to 
cost optimization, though closely related.<\/li>\n<\/ul>\n\n\n\n<p>Key properties and constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Time horizon: short-term (minutes\u2013hours autoscaling), mid-term (days\u2013weeks deployments), long-term (months\u2013years architecture capacity).<\/li>\n<li>Granularity: per-service, per-cluster, per-region, per-tenant.<\/li>\n<li>Constraints: budget, quotas, regulatory residency, security boundaries, vendor SLAs.<\/li>\n<li>Uncertainty: demand variance, traffic spikes, dependency failures, release changes.<\/li>\n<\/ul>\n\n\n\n<p>Where it fits in modern cloud\/SRE workflows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Inputs: telemetry, deployment plans, marketing events, product roadmaps, vendor quotas.<\/li>\n<li>Outputs: autoscaling policies, capacity reservations, infrastructure-as-code changes, runbooks, budget forecasts.<\/li>\n<li>Interfaces: product managers, finance, platform engineering, security, on-call SREs, Dev teams.<\/li>\n<\/ul>\n\n\n\n<p>Diagram description (text-only):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Visualize a pipeline left-to-right: Inputs (Telemetry, Roadmap, Events) -&gt; Modeling Engine (Forecasting, Workload Profiles) -&gt; Constraints Layer (Budget, Quotas, Security) -&gt; Decision Engine (Provisioning, Autoscale Policies, Runbooks) -&gt; Execution (IaaS\/PaaS\/K8s\/serverless) -&gt; Feedback Loop (Observability -&gt; Incident\/Postmortem -&gt; Model update).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Capacity Planning in one sentence<\/h3>\n\n\n\n<p>Capacity planning is the continuous process of matching expected service demand to available resources while enforcing SLOs, budget, and operational constraints.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Capacity Planning vs related terms<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from Capacity Planning<\/th>\n<th>Common 
confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>Autoscaling<\/td>\n<td>Reactive scaling mechanism, not the forecasting process<\/td>\n<td>People assume autoscaling removes planning<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>Cost optimization<\/td>\n<td>Focuses on cost reduction rather than meeting demand<\/td>\n<td>Mistaken as identical to capacity planning<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>Capacity management<\/td>\n<td>Broader ITIL term focusing on the asset lifecycle<\/td>\n<td>Often used interchangeably with planning<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Performance engineering<\/td>\n<td>Focuses on software behavior under load, not resource forecasting<\/td>\n<td>Believed to replace planning<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Incident response<\/td>\n<td>Reactive troubleshooting, not proactive provisioning<\/td>\n<td>Assumed to be the same as mitigation planning<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>Demand forecasting<\/td>\n<td>Component of planning focused on prediction only<\/td>\n<td>Confused as full capacity planning<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Resource allocation<\/td>\n<td>Operational assignment of resources, not long-term planning<\/td>\n<td>Treated as the whole problem<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>Right-sizing<\/td>\n<td>Optimization activity within planning but narrower<\/td>\n<td>Seen as full strategy rather than a tactic<\/td>\n<\/tr>\n<tr>\n<td>T9<\/td>\n<td>Load testing<\/td>\n<td>Tests capacity limits but not ongoing forecasting<\/td>\n<td>Mistaken as continuous planning<\/td>\n<\/tr>\n<tr>\n<td>T10<\/td>\n<td>SLO management<\/td>\n<td>Defines targets but doesn&#8217;t produce provisioning decisions<\/td>\n<td>Assumed to be sufficient for capacity decisions<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does Capacity Planning matter?<\/h2>\n\n\n\n<p>Business impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Revenue: downtime or throttling during peak events translates directly into lost transactions and customer churn.<\/li>\n<li>Trust: consistent performance maintains customer trust and reduces SLA penalty exposure.<\/li>\n<li>Risk: under-provisioning invites outages; over-provisioning wastes capital and slows product investment.<\/li>\n<\/ul>\n\n\n\n<p>Engineering impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incident reduction: proactive capacity planning avoids many load-related incidents.<\/li>\n<li>Velocity: predictable infra reduces emergency work and unplanned rollbacks.<\/li>\n<li>Cost balance: prevents over-allocation while providing buffer for unpredictable demand.<\/li>\n<\/ul>\n\n\n\n<p>SRE framing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs\/SLOs: SLOs drive capacity thresholds; capacity planning focuses on ensuring SLOs are met.<\/li>\n<li>Error budgets: capacity planning uses error budget consumption to decide on safety margins and release windows.<\/li>\n<li>Toil\/on-call: better capacity reduces manual scaling toil and noisy on-call alerts.<\/li>\n<\/ul>\n\n\n\n<p>What breaks in production \u2014 realistic examples:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Global marketing campaign triggers a 20x traffic spike; the caching tier is exhausted, causing high latency and errors.<\/li>\n<li>A scheduled batch job floods DB connections at midnight, causing timeouts for interactive users.<\/li>\n<li>Autoscaler misconfiguration scales too slowly during burst traffic, producing increased 5xx rates.<\/li>\n<li>Region quota exhaustion after cluster autoscaler launches many instances, preventing failover setup.<\/li>\n<li>Unexpected third-party API rate limiting causes backlog growth and memory pressure on worker services.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" 
\/>\n\n\n\n<h2 class=\"wp-block-heading\">Where is Capacity Planning used?<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How Capacity Planning appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge and CDN<\/td>\n<td>Cache sizing, PoP capacity, and origin load<\/td>\n<td>cache hit ratio, edge latency, origin traffic<\/td>\n<td>CDN dashboards, logs<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Network<\/td>\n<td>Bandwidth and load balancer capacity planning<\/td>\n<td>throughput, packet loss, ELB 5xx<\/td>\n<td>Network observability tools<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Service \/ API<\/td>\n<td>Concurrency, threads, connection pools<\/td>\n<td>p95 latency, qps, error rates<\/td>\n<td>APM, tracing<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Application<\/td>\n<td>Memory and CPU per process sizing<\/td>\n<td>memory rss, cpu usage, gc pause<\/td>\n<td>APM, metrics<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Data \/ Storage<\/td>\n<td>IOPS, storage throughput, partitioning<\/td>\n<td>iops, latency, queue depth<\/td>\n<td>DB monitoring tools<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>Kubernetes<\/td>\n<td>Pod density, node sizing, cluster autoscaler<\/td>\n<td>pod pending, node utilization<\/td>\n<td>K8s metrics-server, Prometheus<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>Serverless<\/td>\n<td>Concurrency limits and cold starts<\/td>\n<td>invocations, concurrency, cold start rate<\/td>\n<td>Serverless platform metrics<\/td>\n<\/tr>\n<tr>\n<td>L8<\/td>\n<td>CI\/CD<\/td>\n<td>Runner capacity and pipeline throughput<\/td>\n<td>job queue length, runner utilization<\/td>\n<td>CI dashboards<\/td>\n<\/tr>\n<tr>\n<td>L9<\/td>\n<td>Incident response<\/td>\n<td>Runbook execution capacity and TTR<\/td>\n<td>incident count, MTTR, on-call load<\/td>\n<td>Pager, incident systems<\/td>\n<\/tr>\n<tr>\n<td>L10<\/td>\n<td>Security<\/td>\n<td>Capacity for 
logging, SIEM, scanning<\/td>\n<td>log ingestion rate, scan throughput<\/td>\n<td>SIEM, logging pipeline<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use Capacity Planning?<\/h2>\n\n\n\n<p>When it\u2019s necessary:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Before major marketing events or product launches.<\/li>\n<li>Before architectural changes affecting capacity (new caching, auth, database shard).<\/li>\n<li>When SLIs approach SLO thresholds regularly.<\/li>\n<li>When forecasting budget or negotiating cloud discounts.<\/li>\n<\/ul>\n\n\n\n<p>When it\u2019s optional:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Small features with negligible resource impact.<\/li>\n<li>Early-stage prototypes where speed to iterate matters more than exact sizing.<\/li>\n<\/ul>\n\n\n\n<p>When NOT to use \/ overuse it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid micromanaging autoscaling minute-by-minute; rely on proven autoscalers for short-term needs.<\/li>\n<li>Don\u2019t over-plan for extremely low-probability events at the cost of innovation.<\/li>\n<\/ul>\n\n\n\n<p>Decision checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If expected traffic increase &gt; 20% and error budget &lt; 20% -&gt; run full capacity plan.<\/li>\n<li>If deploying new service with unknown load -&gt; start with conservative autoscaling and mid-term planning.<\/li>\n<li>If SLOs stable and cost under budget -&gt; periodic review sufficient.<\/li>\n<\/ul>\n\n\n\n<p>Maturity ladder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Beginner: Manual thresholds and ad-hoc load tests.<\/li>\n<li>Intermediate: Automated telemetry ingestion, simple forecasting, IaC reservations.<\/li>\n<li>Advanced: ML-assisted forecasting, integrated cost models, cross-service 
optimization, policy-driven autoscaling.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does Capacity Planning work?<\/h2>\n\n\n\n<p>Step-by-step components and workflow:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Inputs collection: Historical telemetry, traffic patterns, release calendar, business events, capacity constraints.<\/li>\n<li>Workload modeling: Characterize request shapes, resource per-request, concurrency.<\/li>\n<li>Forecasting: Short\/mid\/long horizons; incorporate seasonality and event signals.<\/li>\n<li>Constraint application: Budget, quotas, compliance limitations.<\/li>\n<li>Decisioning: Recommend autoscaler parameters, reservations, instance types, shard counts.<\/li>\n<li>Execution: Apply IaC changes, update HPA\/VPA, reserve capacity, tune autoscalers.<\/li>\n<li>Validation: Run load tests, monitor SLOs, adjust plans.<\/li>\n<li>Feedback: Postmortem and telemetry feed back into models.<\/li>\n<\/ol>\n\n\n\n<p>Data flow and lifecycle:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Telemetry ingestion -&gt; Data warehouse \/ feature store -&gt; Forecast models -&gt; Capacity recommendations -&gt; IaC \/ orchestration -&gt; Observability -&gt; Model retraining.<\/li>\n<\/ul>\n\n\n\n<p>Edge cases and failure modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sudden unknown traffic patterns (viral growth).<\/li>\n<li>Hidden resource bottlenecks like ephemeral ports or DB connections.<\/li>\n<li>Quota limits blocking autoscaler expansion.<\/li>\n<li>Cross-service cascading failures where downstream throttles increase upstream load.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for Capacity Planning<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pattern: Reactive autoscaling with forecasted reserve \u2014 use when traffic is predictable with occasional bursts.<\/li>\n<li>Pattern: Reserved capacity with autoscaler for bursts \u2014 for high-throughput services 
requiring steady baseline.<\/li>\n<li>Pattern: Multi-cluster failover capacity \u2014 for resilience and region-level outages.<\/li>\n<li>Pattern: Serverless concurrency limits with pre-warming \u2014 for spiky workloads sensitive to cold starts.<\/li>\n<li>Pattern: Capacity-as-code pipeline \u2014 automated plan generation and PRs for IaC changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Underprovision<\/td>\n<td>SLO breaches and timeouts<\/td>\n<td>Forecast underestimated traffic<\/td>\n<td>Increase reserve and adjust model<\/td>\n<td>rising p95 latency<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Overprovision<\/td>\n<td>High cost with low utilization<\/td>\n<td>Conservative buffer too large<\/td>\n<td>Tune targets and rightsizing<\/td>\n<td>low CPU and mem usage<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>Autoscaler lag<\/td>\n<td>Sudden error spikes during scale-up<\/td>\n<td>Slow scaling or cool-downs<\/td>\n<td>Faster scale policies and pre-scale<\/td>\n<td>pod pending count<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Quota hit<\/td>\n<td>New instances blocked<\/td>\n<td>Cloud quota limits reached<\/td>\n<td>Increase quotas or pre-reserve<\/td>\n<td>vm launch failures<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>Dependency choke<\/td>\n<td>Upstream errors cascade<\/td>\n<td>Downstream overload<\/td>\n<td>Rate limit and backpressure<\/td>\n<td>downstream error rates<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Misconfigured metrics<\/td>\n<td>Incorrect signals drive wrong decisions<\/td>\n<td>Bad instrumentation or labels<\/td>\n<td>Fix metrics and validate<\/td>\n<td>mismatched telemetry<\/td>\n<\/tr>\n<tr>\n<td>F7<\/td>\n<td>Cost surprise<\/td>\n<td>Unexpected bill 
spike<\/td>\n<td>Unchecked scaling or runaway jobs<\/td>\n<td>Budget alerts and limits<\/td>\n<td>billing anomalies<\/td>\n<\/tr>\n<tr>\n<td>F8<\/td>\n<td>Hotspots<\/td>\n<td>Uneven load across shards<\/td>\n<td>Poor sharding or affinity<\/td>\n<td>Rebalance and reshard<\/td>\n<td>imbalanced utilization<\/td>\n<\/tr>\n<tr>\n<td>F9<\/td>\n<td>Cold starts<\/td>\n<td>Latency spikes in serverless<\/td>\n<td>Insufficient pre-warm or high cold start times<\/td>\n<td>Provisioned concurrency<\/td>\n<td>cold start rate<\/td>\n<\/tr>\n<tr>\n<td>F10<\/td>\n<td>Human process gap<\/td>\n<td>Runbooks not followed during incidents<\/td>\n<td>Lack of automation and training<\/td>\n<td>Automate and train on playbooks<\/td>\n<td>increased MTTR<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for Capacity Planning<\/h2>\n\n\n\n<p>(Each line: Term \u2014 definition \u2014 why it matters \u2014 common pitfall)<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Provisioning \u2014 Allocating resources for workloads \u2014 Ensures capacity exists \u2014 Over-commit without monitoring<\/li>\n<li>Autoscaling \u2014 Automatic scaling of resources \u2014 Handles variable load \u2014 Misconfigured thresholds<\/li>\n<li>Right-sizing \u2014 Matching resource sizes to needs \u2014 Reduces waste \u2014 Premature optimization<\/li>\n<li>Forecasting \u2014 Predicting future demand \u2014 Drives planning horizon \u2014 Ignoring variance<\/li>\n<li>SLO \u2014 Service Level Objective \u2014 Targets that guide capacity \u2014 Vague or unmeasured SLOs<\/li>\n<li>SLI \u2014 Service Level Indicator \u2014 Metric representing user experience \u2014 Wrong metric selection<\/li>\n<li>Error budget \u2014 Allowed error margin \u2014 Balances risk and releases \u2014 
Burned unnoticed<\/li>\n<li>Headroom \u2014 Reserved capacity above expected demand \u2014 Absorbs spikes \u2014 Too much cost<\/li>\n<li>Baseline capacity \u2014 Minimum required resources \u2014 Guarantees availability \u2014 Forgotten growth<\/li>\n<li>Burst capacity \u2014 Temporary scaling for spikes \u2014 Handles short bursts \u2014 Unbounded burst costs<\/li>\n<li>Concurrency \u2014 Simultaneous requests handled \u2014 Affects resource per request \u2014 Ignoring concurrency limits<\/li>\n<li>Throttling \u2014 Limiting requests to prevent overload \u2014 Protects systems \u2014 Poor UX if aggressive<\/li>\n<li>Capacity model \u2014 Mapping demand to resources \u2014 Core of planning \u2014 Outdated models<\/li>\n<li>Workload profile \u2014 Characteristics of a workload \u2014 Informs tuning \u2014 Mixing heterogeneous workloads<\/li>\n<li>Resource utilization \u2014 CPU\/memory\/disk usage \u2014 Shows efficiency \u2014 Misinterpreting averages<\/li>\n<li>Percentile latency \u2014 Tail performance measure \u2014 Captures user experience \u2014 Focus on mean only<\/li>\n<li>Backpressure \u2014 Flow control upstream \u2014 Prevents overload \u2014 Not implemented widely<\/li>\n<li>Queue depth \u2014 Pending work backlog \u2014 Early warning signal \u2014 Unmonitored queues<\/li>\n<li>IOPS \u2014 Storage operations per second \u2014 Limits throughput \u2014 Ignoring burst IO<\/li>\n<li>Network throughput \u2014 Bandwidth usage \u2014 External bottlenecks \u2014 Not testing cross-region<\/li>\n<li>Cold start \u2014 Latency for initializing serverless \u2014 Impacts latency \u2014 No pre-warm strategy<\/li>\n<li>Reserved instances \u2014 Long-term capacity reservations \u2014 Cost savings \u2014 Underutilized reservations<\/li>\n<li>Spot\/preemptible \u2014 Discounted transient compute \u2014 Cost-effective \u2014 Risk of eviction<\/li>\n<li>Quota \u2014 Provider resource limits \u2014 Can block scaling \u2014 Missing quota increases<\/li>\n<li>Pod density 
\u2014 Pods per node \u2014 Node-level efficiency \u2014 Too high causing noisy neighbors<\/li>\n<li>Sharding \u2014 Splitting data to scale \u2014 Improves throughput \u2014 Hot partition risk<\/li>\n<li>Thundering herd \u2014 Many clients retry simultaneously \u2014 Causes overload \u2014 Missing jitter\/backoff<\/li>\n<li>Rate limit \u2014 Maximum allowed requests \u2014 Protects endpoints \u2014 Incorrect limits impact UX<\/li>\n<li>Feature store \u2014 Storage of model inputs \u2014 Useful for forecasting \u2014 Data freshness issues<\/li>\n<li>Telemetry ingestion \u2014 Collecting metrics\/logs\/traces \u2014 Inputs for models \u2014 Sampling gaps<\/li>\n<li>Anomaly detection \u2014 Identifying outliers \u2014 Early warning \u2014 High false positives<\/li>\n<li>Headroom policy \u2014 Rules for reserve sizing \u2014 Governance \u2014 Not aligned with SLOs<\/li>\n<li>Load generator \u2014 Tool to simulate traffic \u2014 Validates plans \u2014 Not representative of real users<\/li>\n<li>Cluster autoscaler \u2014 Scales cluster nodes \u2014 Controls infra scale \u2014 Misalignment with pod metrics<\/li>\n<li>Horizontal scaling \u2014 Add more instances \u2014 Handles parallelism \u2014 Statefulness complicates<\/li>\n<li>Vertical scaling \u2014 Increase instance size \u2014 Simple for single-node workloads \u2014 Downtime risk<\/li>\n<li>Throttle budget \u2014 Allocation for throttled requests \u2014 Controls rate-limited impact \u2014 Hard to tune<\/li>\n<li>Capacity-as-code \u2014 Declarative capacity changes \u2014 Auditability \u2014 Overly rigid templates<\/li>\n<li>Cost model \u2014 Mapping usage to dollars \u2014 Enables trade-offs \u2014 Hidden cloud costs<\/li>\n<li>Postmortem \u2014 Incident analysis \u2014 Improves planning \u2014 Blame culture kills learning<\/li>\n<li>Observability signal \u2014 Metric or trace indicating state \u2014 Essential for feedback \u2014 Missing context<\/li>\n<li>Canary \u2014 Gradual rollout technique \u2014 Reduces 
blast radius \u2014 Small samples may hide issues<\/li>\n<li>Runbook \u2014 Step-by-step operational procedure \u2014 Reduces MTTR \u2014 Outdated runbooks<\/li>\n<li>Game day \u2014 Simulated outage\/drill \u2014 Validates capacity plans \u2014 Poorly scoped exercises<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure Capacity Planning (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>Request throughput (QPS)<\/td>\n<td>Load arriving at service<\/td>\n<td>Count requests per second per endpoint<\/td>\n<td>Use historical peak as baseline<\/td>\n<td>Bursty traffic skews mean<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>p95 latency<\/td>\n<td>User experience at tail<\/td>\n<td>Compute 95th percentile response time<\/td>\n<td>Below SLO threshold<\/td>\n<td>p95 hides p99 issues<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Error rate<\/td>\n<td>Failures impacting users<\/td>\n<td>errors\/total over window<\/td>\n<td>Below error budget burn<\/td>\n<td>Transient errors inflate rate<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>CPU utilization<\/td>\n<td>Processing capacity used<\/td>\n<td>avg CPU per instance<\/td>\n<td>50-70% as starting point<\/td>\n<td>High bursts cause noisy neighbor<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>Memory usage<\/td>\n<td>Resident working set<\/td>\n<td>rss per process or pod<\/td>\n<td>60-80% headroom<\/td>\n<td>OOM risk if underestimated<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>Pod pending count<\/td>\n<td>Insufficient cluster nodes<\/td>\n<td>count of pending pods<\/td>\n<td>Zero sustained pending<\/td>\n<td>Short spikes may be OK<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>Node utilization<\/td>\n<td>Cluster efficiency<\/td>\n<td>CPU and mem per node<\/td>\n<td>60-80% 
target<\/td>\n<td>High variance per node<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>DB connections<\/td>\n<td>Connection saturation risk<\/td>\n<td>active connections<\/td>\n<td>Below DB max minus reserve<\/td>\n<td>Leaked connections cause slowdowns<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>Queue depth<\/td>\n<td>Work backlog indicator<\/td>\n<td>pending messages<\/td>\n<td>Low single-digit steady<\/td>\n<td>Hidden spikes during failures<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Cold start rate<\/td>\n<td>Serverless warmup health<\/td>\n<td>fraction of cold starts<\/td>\n<td>Minimize for latency-sensitive<\/td>\n<td>Platform limits vary<\/td>\n<\/tr>\n<tr>\n<td>M11<\/td>\n<td>Error budget burn rate<\/td>\n<td>Risk of SLO breach<\/td>\n<td>error budget consumed per time<\/td>\n<td>Alert on elevated burn<\/td>\n<td>Fast burn needs rapid action<\/td>\n<\/tr>\n<tr>\n<td>M12<\/td>\n<td>Billing anomaly<\/td>\n<td>Cost change indicator<\/td>\n<td>daily cost vs baseline<\/td>\n<td>Small predictable variance<\/td>\n<td>Multi-currency\/discounts hide signals<\/td>\n<\/tr>\n<tr>\n<td>M13<\/td>\n<td>Pod restart rate<\/td>\n<td>Stability of pods<\/td>\n<td>restarts per time<\/td>\n<td>Near zero steady state<\/td>\n<td>Crashes can mask capacity issues<\/td>\n<\/tr>\n<tr>\n<td>M14<\/td>\n<td>Throttle count<\/td>\n<td>Requests rejected due to rate limit<\/td>\n<td>throttled requests<\/td>\n<td>Low single-digit percent<\/td>\n<td>Too strict causes UX regressions<\/td>\n<\/tr>\n<tr>\n<td>M15<\/td>\n<td>Replica count<\/td>\n<td>Scaling behavior<\/td>\n<td>desired vs available replicas<\/td>\n<td>Matches forecasted need<\/td>\n<td>Crash loops reduce available pods<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Best tools to measure Capacity Planning<\/h3>\n\n\n\n<p>Each tool below is presented in a consistent 
structure.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Prometheus + Thanos<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Time-series metrics like CPU, mem, request rates, custom SLIs.<\/li>\n<li>Best-fit environment: Kubernetes, hybrid cloud, open-source stacks.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument services with metrics and labels.<\/li>\n<li>Deploy Prometheus scrapers and recording rules.<\/li>\n<li>Configure Thanos for long-term storage and federation.<\/li>\n<li>Build queries for SLIs and forecast inputs.<\/li>\n<li>Export alerts to Alertmanager.<\/li>\n<li>Strengths:<\/li>\n<li>Flexible query language and ecosystem.<\/li>\n<li>Native for K8s and custom metrics.<\/li>\n<li>Limitations:<\/li>\n<li>Operational overhead at scale.<\/li>\n<li>Long-term storage requires extra components.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Grafana<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Visualization and dashboards for SLIs and utilization.<\/li>\n<li>Best-fit environment: Any metrics backend supported.<\/li>\n<li>Setup outline:<\/li>\n<li>Connect to Prometheus, cloud metrics, or APM.<\/li>\n<li>Create dashboards for exec\/on-call\/debug views.<\/li>\n<li>Configure panels for SLO and burn-rate.<\/li>\n<li>Set up reporting and playlists.<\/li>\n<li>Strengths:<\/li>\n<li>Rich visualization and alerting integration.<\/li>\n<li>Multi-tenant dashboards.<\/li>\n<li>Limitations:<\/li>\n<li>Dashboards need maintenance; alerting limited to datasource features.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Cloud provider monitoring (native)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Provider-level metrics and cost telemetry.<\/li>\n<li>Best-fit environment: IaaS and managed services in a single cloud.<\/li>\n<li>Setup outline:<\/li>\n<li>Enable provider monitoring for instances and 
services.<\/li>\n<li>Collect quota and billing metrics.<\/li>\n<li>Integrate with alerting and dashboards.<\/li>\n<li>Strengths:<\/li>\n<li>Deep integration with provider resources.<\/li>\n<li>Often has cost and quota signals.<\/li>\n<li>Limitations:<\/li>\n<li>Varies per provider and may not cover apps.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Load testing tools (k6, JMeter, bespoke)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Performance under controlled load and concurrency.<\/li>\n<li>Best-fit environment: Pre-production and staging environments.<\/li>\n<li>Setup outline:<\/li>\n<li>Model realistic user flows.<\/li>\n<li>Run ramp tests and soak tests.<\/li>\n<li>Collect SLIs under load.<\/li>\n<li>Compare to forecasts.<\/li>\n<li>Strengths:<\/li>\n<li>Simulates user pressure and validates models.<\/li>\n<li>Limitations:<\/li>\n<li>Hard to perfectly emulate real-world behavior.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 APM (Application Performance Monitoring)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Traces, service maps, per-request resource cost.<\/li>\n<li>Best-fit environment: Microservices and distributed systems.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument services for traces and spans.<\/li>\n<li>Identify high-cost endpoints.<\/li>\n<li>Combine with metrics for capacity planning.<\/li>\n<li>Strengths:<\/li>\n<li>Root-cause analysis and per-endpoint insights.<\/li>\n<li>Limitations:<\/li>\n<li>Cost and sampling constraints.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Cost management platforms<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Capacity Planning: Cost attribution and forecasted spend.<\/li>\n<li>Best-fit environment: Multi-cloud and large cloud spenders.<\/li>\n<li>Setup outline:<\/li>\n<li>Link billing accounts.<\/li>\n<li>Tag resources for allocation.<\/li>\n<li>Use 
forecasts for budget planning.<\/li>\n<li>Strengths:<\/li>\n<li>Financial perspective and anomaly detection.<\/li>\n<li>Limitations:<\/li>\n<li>Attribution complexity and tag discipline required.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for Capacity Planning<\/h3>\n\n\n\n<p>Executive dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels: Global SLO compliance, total cost and trend, error budget burn rate by service, regional capacity headroom, upcoming events impacting demand.<\/li>\n<li>Why: High-level view for product and finance stakeholders.<\/li>\n<\/ul>\n\n\n\n<p>On-call dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels: SLOs and SLIs per service, pod pending, node utilization, queue depth, DB connections, recent deploys, active incidents.<\/li>\n<li>Why: Rapid triage and resource-focused signals during incidents.<\/li>\n<\/ul>\n\n\n\n<p>Debug dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels: Per-endpoint latency percentiles, trace samples, CPU\/mem per pod, request rates, retry\/backoff counts, dependency error rates.<\/li>\n<li>Why: Deep investigation and tuning.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page vs ticket: Page for imminent SLO breach or significant capacity loss; ticket for capacity drift or cost anomalies that don\u2019t impact SLIs.<\/li>\n<li>Burn-rate guidance: Page when error budget burn rate indicates crossing SLO in next 1\u20132 hours; ticket for slower burn.<\/li>\n<li>Noise reduction tactics: Deduplicate alerts by grouping by service and region; suppress alerts during planned maintenance; use alert scoring and latency windows.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites\n&#8211; Instrumentation baseline: request counters, latency histograms, resource metrics.\n&#8211; Tagging and 
taxonomy: consistent service and environment labels.\n&#8211; Observability pipeline: metrics, traces, logs stored and queryable.\n&#8211; Stakeholder alignment: SRE, product, finance, security.<\/p>\n\n\n\n<p>2) Instrumentation plan\n&#8211; Define SLIs and label conventions.\n&#8211; Add per-request resource cost markers (time, DB calls).\n&#8211; Track queue depth, connection pools, and retry behavior.<\/p>\n\n\n\n<p>3) Data collection\n&#8211; Set retention policies for forecasting horizon.\n&#8211; Aggregate metrics into a feature store or data warehouse.\n&#8211; Ensure time-sync and consistent cardinality.<\/p>\n\n\n\n<p>4) SLO design\n&#8211; Define user-focused SLIs and SLOs with error budgets.\n&#8211; Map SLOs to capacity thresholds (e.g., p95 &lt; X ms at &lt; Y% error).<\/p>\n\n\n\n<p>5) Dashboards\n&#8211; Build executive, on-call, debug dashboards as above.\n&#8211; Include forecast panels and capacity headroom.<\/p>\n\n\n\n<p>6) Alerts &amp; routing\n&#8211; Implement alert tiers: Info (ticket), Warning (ticket+owner), Critical (page on-call).\n&#8211; Route by service and region; include runbook links.<\/p>\n\n\n\n<p>7) Runbooks &amp; automation\n&#8211; Create runbooks for common capacity incidents.\n&#8211; Automate routine actions (scale-up, warm caches) with safety checks.<\/p>\n\n\n\n<p>8) Validation (load\/chaos\/game days)\n&#8211; Run load tests and game days to validate headroom and autoscaling behavior.\n&#8211; Run chaos tests on dependencies to see impact on capacity.<\/p>\n\n\n\n<p>9) Continuous improvement\n&#8211; Postmortems for capacity incidents.\n&#8211; Update models with new telemetry and events.\n&#8211; Quarterly capacity reviews with finance and product.<\/p>\n\n\n\n<p>Checklists<\/p>\n\n\n\n<p>Pre-production checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instrument SLIs and resource metrics.<\/li>\n<li>Have load-test harness and sample traffic profiles.<\/li>\n<li>Baseline SLOs defined and 
monitored.<\/li>\n<li>Capacity model initialized with conservative estimates.<\/li>\n<\/ul>\n\n\n\n<p>Production readiness checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alerts and runbooks in place.<\/li>\n<li>Autoscaling and quota checks validated.<\/li>\n<li>Cost controls and billing alerts configured.<\/li>\n<li>On-call trained on runbooks.<\/li>\n<\/ul>\n\n\n\n<p>Incident checklist (Capacity Planning specific):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Verify SLO status and error budget.<\/li>\n<li>Check autoscaler and node events (scaling or failures).<\/li>\n<li>Inspect pending pods, queue depth, DB connections.<\/li>\n<li>Execute predefined scale or throttling actions.<\/li>\n<li>Record actions and timelines for postmortem.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of Capacity Planning<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p>Retail flash sale\n&#8211; Context: Massive but time-bound traffic spike.\n&#8211; Problem: Origin DB and cache saturation.\n&#8211; Why helps: Forecast and pre-warm cache and DB replicas.\n&#8211; What to measure: QPS, cache hit ratio, DB CPU\/IO.\n&#8211; Typical tools: Load testing, CDN config, DB monitoring.<\/p>\n<\/li>\n<li>\n<p>Global expansion\n&#8211; Context: Launching in new region.\n&#8211; Problem: Latency-sensitive user experience and legal residency.\n&#8211; Why helps: Plan regional clusters and failover capacity.\n&#8211; What to measure: regional latency, replica counts, failover time.\n&#8211; Typical tools: K8s cluster provisioning, metrics, tracing.<\/p>\n<\/li>\n<li>\n<p>Feature ramp\n&#8211; Context: Gradual feature rollout with increasing adoption.\n&#8211; Problem: Unknown resource per-user cost.\n&#8211; Why helps: Predict resource requirements and reserve capacity.\n&#8211; What to measure: resource per active user, event rates.\n&#8211; Typical tools: APM, feature flags, 
telemetry.<\/p>\n<\/li>\n<li>\n<p>CI\/CD pipeline scale\n&#8211; Context: Growing number of builds and tests.\n&#8211; Problem: Queueing and slow build times.\n&#8211; Why helps: Size runners and ephemeral capacity.\n&#8211; What to measure: job queue length, runner utilization.\n&#8211; Typical tools: CI metrics, autoscaling runners.<\/p>\n<\/li>\n<li>\n<p>Serverless API with cold starts\n&#8211; Context: Event-driven backend with sporadic spikes.\n&#8211; Problem: Cold starts increase latency.\n&#8211; Why helps: Provisioned concurrency or scheduled pre-warm.\n&#8211; What to measure: cold start rate, latency, concurrency.\n&#8211; Typical tools: Serverless platform metrics.<\/p>\n<\/li>\n<li>\n<p>Database scaling and sharding\n&#8211; Context: Growing data volume and hotspots.\n&#8211; Problem: Single shard saturates IOPS.\n&#8211; Why helps: Plan shards, replication, and read replicas.\n&#8211; What to measure: shard latency, hot partition metrics.\n&#8211; Typical tools: DB monitoring, query profilers.<\/p>\n<\/li>\n<li>\n<p>Incident remediation capacity\n&#8211; Context: Multiple incidents require human attention.\n&#8211; Problem: On-call overload and high MTTR.\n&#8211; Why helps: Capacity planning for human operations and automation.\n&#8211; What to measure: incidents per week, mean time to resolution.\n&#8211; Typical tools: Pager metrics, runbook automation.<\/p>\n<\/li>\n<li>\n<p>Cost containment during growth\n&#8211; Context: Rapid usage growth threatens budget.\n&#8211; Problem: Unexpected cloud bill increases.\n&#8211; Why helps: Forecast cost and evaluate spot\/commitment trade-offs.\n&#8211; What to measure: cost per feature, forecast spend.\n&#8211; Typical tools: Cost management platforms.<\/p>\n<\/li>\n<li>\n<p>Multi-tenant SaaS scaling\n&#8211; Context: Tenants with varied resource profiles.\n&#8211; Problem: Noisy neighbors and unfair resource consumption.\n&#8211; Why helps: Right-sizing, quota enforcement, and tenant isolation.\n&#8211; What to 
measure: per-tenant resource usage, isolation metrics.\n&#8211; Typical tools: Multi-tenant telemetry, quotas.<\/p>\n<\/li>\n<li>\n<p>Disaster recovery capacity\n&#8211; Context: Region outage requires failover.\n&#8211; Problem: Failover capacity needs to cover traffic surge.\n&#8211; Why helps: Reserve capacity and rehearse failovers.\n&#8211; What to measure: failover time, capacity headroom in secondary regions.\n&#8211; Typical tools: DR runbooks, failover drills.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes autoscaling for microservices<\/h3>\n\n\n\n<p><strong>Context:<\/strong> E-commerce service running on Kubernetes with daily traffic peaks.\n<strong>Goal:<\/strong> Ensure checkout service meets p95 latency SLO during peak traffic while minimizing cost.\n<strong>Why Capacity Planning matters here:<\/strong> K8s node and pod scaling must coordinate to avoid pod pending and high latency.\n<strong>Architecture \/ workflow:<\/strong> HPA on pods based on request rate and custom metric CPU per request; Cluster Autoscaler adds nodes when pods pending; Prometheus collects metrics; Grafana dashboards for SLOs.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instrument service for requests and per-request CPU.<\/li>\n<li>Create HPA using custom metrics and conservative target.<\/li>\n<li>Configure Cluster Autoscaler with node groups across zones.<\/li>\n<li>Forecast peak QPS from historical data and pre-warm nodes before predicted peak.<\/li>\n<li>Run load test to validate.\n<strong>What to measure:<\/strong> pod pending count, pod restart rate, p95 latency, node utilization.\n<strong>Tools to use and why:<\/strong> Prometheus for metrics, K8s HPA and Cluster Autoscaler, Grafana for dashboards, k6 for load testing.\n<strong>Common 
pitfalls:<\/strong> HPA using CPU alone misses IO-bound endpoints; cluster autoscaler cool-down too long.\n<strong>Validation:<\/strong> Run soak test at projected peak and measure SLO compliance for 2 hours.\n<strong>Outcome:<\/strong> Predictable scaling with &lt;1% SLO violations during real traffic peaks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless API with provisioned concurrency<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Event-driven image processing API on managed serverless platform.\n<strong>Goal:<\/strong> Reduce cold starts and keep p95 latency under threshold during campaigns.\n<strong>Why Capacity Planning matters here:<\/strong> Without pre-warm, response latency spikes on bursts.\n<strong>Architecture \/ workflow:<\/strong> Provisioned concurrency set based on forecasted bursts; SQS buffer with consumers scaled.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Collect historical invocation patterns and campaign calendar.<\/li>\n<li>Set baseline provisioned concurrency and schedule increases during campaigns.<\/li>\n<li>Monitor cold start rate and adjust schedule.\n<strong>What to measure:<\/strong> concurrency, cold start rate, queue depth, latency.\n<strong>Tools to use and why:<\/strong> Platform metrics, queue metrics, cost dashboard.\n<strong>Common pitfalls:<\/strong> Provisioned concurrency costs more; over-provisioning wastes budget.\n<strong>Validation:<\/strong> Schedule a test campaign and simulate traffic.\n<strong>Outcome:<\/strong> Significantly reduced cold-start latency with acceptable incremental cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Incident-response driven postmortem capacity adjustment<\/h3>\n\n\n\n<p><strong>Context:<\/strong> DB saturation incident during nightly batch causing daytime customer errors.\n<strong>Goal:<\/strong> Prevent recurrence and ensure daytime SLOs.\n<strong>Why Capacity Planning matters 
here:<\/strong> Nightly batch jobs consumed connections from the shared DB pool, degrading daytime traffic.\n<strong>Architecture \/ workflow:<\/strong> Separate DB pools for batch and interactive; throttle batch and schedule windows.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Postmortem identifies DB connection exhaustion.<\/li>\n<li>Update capacity plan to allocate separate clusters or pools.<\/li>\n<li>Implement job rate limits and monitor connections.\n<strong>What to measure:<\/strong> DB connections, query latency, job throughput.\n<strong>Tools to use and why:<\/strong> DB monitoring, job scheduler metrics, runbooks.\n<strong>Common pitfalls:<\/strong> Temporary fixes without architectural changes.\n<strong>Validation:<\/strong> Run batch in isolated pool and measure daytime performance.\n<strong>Outcome:<\/strong> No daytime SLO violations after changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost vs performance trade-off for compute instances<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Growing compute costs due to general-purpose instance family.\n<strong>Goal:<\/strong> Reduce cost while maintaining latency objectives.\n<strong>Why Capacity Planning matters here:<\/strong> Changing instance types or mixing spot instances affects both performance and risk.\n<strong>Architecture \/ workflow:<\/strong> Evaluate instance families, test performance under load, use spot for stateless services with fallback to on-demand.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Benchmark services on candidate instance types.<\/li>\n<li>Model cost per request and risk of eviction for spot.<\/li>\n<li>Implement mixed instance groups and fallback logic.\n<strong>What to measure:<\/strong> cost per request, p95 latency, evictions.\n<strong>Tools to use and why:<\/strong> Benchmarking tools, cost dashboards, autoscaler with mixed instances.\n<strong>Common pitfalls:<\/strong> 
Ignoring startup times of heavier instances.\n<strong>Validation:<\/strong> A\/B deploy on different instance families and compare SLO compliance and cost.\n<strong>Outcome:<\/strong> 25\u201340% lower cost with maintained SLOs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<p>Each entry below follows the pattern symptom -&gt; root cause -&gt; fix; observability pitfalls are called out at the end of the section.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Symptom: Frequent pod pending during peaks -&gt; Root cause: Cluster autoscaler cool-down too long -&gt; Fix: Tune autoscaler and pre-scale nodes.<\/li>\n<li>Symptom: High cost after enabling autoscaling -&gt; Root cause: Aggressive scale-out without scale-in policies -&gt; Fix: Add scale-in rules and usage-based limits.<\/li>\n<li>Symptom: SLO breach despite high average utilization -&gt; Root cause: Tail latency from noisy neighbors -&gt; Fix: Use lower avg target and isolate noisy workloads.<\/li>\n<li>Symptom: Missing spikes in dashboards -&gt; Root cause: Low-resolution metrics retention -&gt; Fix: Increase scrape frequency and retention for high-res windows.<\/li>\n<li>Symptom: False capacity alarms -&gt; Root cause: Alert thresholds on averages -&gt; Fix: Use percentiles and short evaluation windows.<\/li>\n<li>Symptom: Over-reserved DB replicas -&gt; Root cause: Conservative team estimates -&gt; Fix: Benchmark and right-size with auto-scaling replicas.<\/li>\n<li>Symptom: Autoscaler doesn&#8217;t scale stateful workloads -&gt; Root cause: Stateful design limits scaling -&gt; Fix: Re-architect for statelessness or plan capacity.<\/li>\n<li>Symptom: Repeated quota errors -&gt; Root cause: Missing quota increases from provider -&gt; Fix: Request quota increase and track quota metrics.<\/li>\n<li>Symptom: On-call overload during events -&gt; Root cause: No automation for routine scale actions -&gt; Fix: Automate scaling with safety 
gates.<\/li>\n<li>Symptom: Inaccurate forecasts -&gt; Root cause: Ignoring recent product changes -&gt; Fix: Incorporate release calendar and feature adoption signals.<\/li>\n<li>Symptom: Hidden cost from logs -&gt; Root cause: High log retention without sampling -&gt; Fix: Implement sampling and tiered retention.<\/li>\n<li>Symptom: Hot shard causing degraded throughput -&gt; Root cause: Poor partitioning key -&gt; Fix: Repartition or add hotspot mitigation.<\/li>\n<li>Symptom: Serverless cold-start spikes -&gt; Root cause: No provisioned concurrency -&gt; Fix: Use provisioned concurrency and warmers.<\/li>\n<li>Symptom: Missing context in metrics -&gt; Root cause: Poor labels and tagging -&gt; Fix: Enforce label taxonomies and reduce cardinality.<\/li>\n<li>Symptom: Inability to reproduce performance -&gt; Root cause: Test traffic doesn&#8217;t match production patterns -&gt; Fix: Capture real traffic traces or use production-like workloads.<\/li>\n<li>Symptom: Erroneous rightsizing recommendations -&gt; Root cause: Sampling bias in telemetry -&gt; Fix: Broader time windows and outlier treatment.<\/li>\n<li>Symptom: SLOs drifting over time -&gt; Root cause: Model not updated after product changes -&gt; Fix: Regular SLO review cadence.<\/li>\n<li>Symptom: Throttling causing UX issues -&gt; Root cause: Low rate limits or lack of graceful degrade -&gt; Fix: Implement backpressure and tiered rate limits.<\/li>\n<li>Symptom: Alert storm during scale events -&gt; Root cause: Multiple alerts firing on same root cause -&gt; Fix: Deduplicate and group alerts.<\/li>\n<li>Symptom: Inconsistent autoscaling across regions -&gt; Root cause: Different node types and quotas -&gt; Fix: Standardize instance families and policies.<\/li>\n<li>Symptom: Missing dependency capacity info -&gt; Root cause: Limited observability into third-party services -&gt; Fix: Add synthetic tests and SLAs for dependencies.<\/li>\n<li>Symptom: Long provisioning times -&gt; Root cause: Heavy instance 
images and boot scripts -&gt; Fix: Use smaller, pre-baked images to shorten boot time.<\/li>\n<li>Symptom: Runbooks ignored -&gt; Root cause: Runbooks not tested or accessible -&gt; Fix: Embed runbooks into incident tooling and train teams.<\/li>\n<li>Symptom: Billing anomalies detected late -&gt; Root cause: Low-frequency billing checks -&gt; Fix: Daily cost monitoring and alerts.<\/li>\n<li>Symptom: Forecasts fail on black-swan events -&gt; Root cause: Model lacks rare-event handling -&gt; Fix: Include stress tests and manual contingency capacity.<\/li>\n<\/ol>\n\n\n\n<p>Observability pitfalls recurring in the list above:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Low-resolution metrics, missing labels, sampling bias, lack of synthetic tests, alert storms due to noisy metrics.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<p>Ownership and on-call:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Capacity planning is a shared responsibility: platform\/SRE owns tooling and automation; product\/engineering owns workload forecasts and change signals.<\/li>\n<li>SREs should be on-call for platform-level capacity incidents; product teams should own per-service SLOs.<\/li>\n<\/ul>\n\n\n\n<p>Runbooks vs playbooks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runbook: executable steps for operators during incidents (short, precise).<\/li>\n<li>Playbook: higher-level steps and decision trees (who to engage, escalation paths).<\/li>\n<li>Keep runbooks automated where possible and versioned in a repo.<\/li>\n<\/ul>\n\n\n\n<p>Safe deployments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use canary deployments, progressive rollouts, and automatic rollback on SLO breach.<\/li>\n<li>Verify capacity impact in canary before full rollout.<\/li>\n<\/ul>\n\n\n\n<p>Toil reduction and automation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate routine scaling, pre-warming, and quota 
checks.<\/li>\n<li>Use policy-driven autoscaling and IaC to reduce manual changes.<\/li>\n<\/ul>\n\n\n\n<p>Security basics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Capacity changes must respect security boundaries and least privilege.<\/li>\n<li>Monitor for unexpected provisioning as an indicator of compromised credentials.<\/li>\n<\/ul>\n\n\n\n<p>Weekly\/monthly routines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly: Review spike patterns, failed autoscale events, and critical alerts.<\/li>\n<li>Monthly: SLO review, headroom adjustments, cost vs capacity report.<\/li>\n<li>Quarterly: Forecasting refresh and capacity reserve negotiation.<\/li>\n<\/ul>\n\n\n\n<p>Postmortem review items related to capacity planning:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Root cause mapping to capacity model assumptions.<\/li>\n<li>Whether SLOs or headroom were inadequate.<\/li>\n<li>Execution timelines for capacity actions and delays.<\/li>\n<li>Learnings applied to forecasting and automation.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for Capacity Planning<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Metrics store<\/td>\n<td>Stores time-series metrics for SLIs<\/td>\n<td>APM, exporters, dashboards<\/td>\n<td>Critical input to forecasts<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>Tracing\/APM<\/td>\n<td>Shows per-request cost and dependencies<\/td>\n<td>Metrics, logs<\/td>\n<td>Helps map resource hotspots<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>Cost management<\/td>\n<td>Allocates and forecasts cloud spend<\/td>\n<td>Billing, tagging<\/td>\n<td>Enables cost vs capacity tradeoffs<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>Load testing<\/td>\n<td>Simulates traffic for validation<\/td>\n<td>CI, 
staging env<\/td>\n<td>Validates autoscaling and SLOs<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>IaC \/ Orchestration<\/td>\n<td>Applies capacity changes as code<\/td>\n<td>CI\/CD, cloud APIs<\/td>\n<td>Auditable provisioning flow<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>Autoscaler<\/td>\n<td>Runtime scaling controller<\/td>\n<td>Metrics store, cloud API<\/td>\n<td>Needs tuned policies<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>Quota manager<\/td>\n<td>Tracks provider limits and requests<\/td>\n<td>Cloud APIs, alerting<\/td>\n<td>Prevents unexpected limits<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>Incident system<\/td>\n<td>Manages incidents and runbooks<\/td>\n<td>Alerting, chatops<\/td>\n<td>Records human capacity actions<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>Game day platform<\/td>\n<td>Schedules and runs simulations<\/td>\n<td>Monitoring, incident systems<\/td>\n<td>Validates plans under stress<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>Forecasting engine<\/td>\n<td>Predicts demand and resource needs<\/td>\n<td>Metrics store, feature store<\/td>\n<td>Can be ML or rules-based<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is the difference between autoscaling and capacity planning?<\/h3>\n\n\n\n<p>Autoscaling reacts to current metrics; capacity planning forecasts demand and sets strategic reserves and policies.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How often should capacity plans be updated?<\/h3>\n\n\n\n<p>Depends on volatility; at minimum monthly for stable workloads and weekly for fast-changing products.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can capacity planning be fully automated?<\/h3>\n\n\n\n<p>Parts can be automated (metrics ingestion, basic forecasting, IaC changes) 
but human review is required for high-risk decisions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How much headroom should I keep?<\/h3>\n\n\n\n<p>It depends; start with 10\u201330% for typical services and adjust for SLO risk and the event calendar.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I include cost in capacity decisions?<\/h3>\n\n\n\n<p>Use cost-per-request models and include finance in capacity reviews to trade off performance vs spend.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What forecasting methods work best?<\/h3>\n\n\n\n<p>A combination: seasonality-aware time-series models, recent trend adjustments, and event-driven overrides.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do error budgets influence capacity?<\/h3>\n\n\n\n<p>High error budget consumption should trigger capacity actions or release freezes until SLO stabilizes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to handle third-party service limits?<\/h3>\n\n\n\n<p>Model external dependencies, have fallback strategies, and track synthetic tests for dependency health.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is capacity planning relevant for serverless?<\/h3>\n\n\n\n<p>Yes; plan for provisioned concurrency, cold starts, and cost trade-offs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to validate a capacity plan?<\/h3>\n\n\n\n<p>Run load tests and game days, and monitor SLOs during controlled experiments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What telemetry is essential for capacity planning?<\/h3>\n\n\n\n<p>Throughput, latency percentiles, error rate, CPU\/memory, queue depth, and provider quotas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Who should own capacity planning?<\/h3>\n\n\n\n<p>Platform\/SRE leads tooling and automation; product and engineering own workload forecasts and SLOs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to prevent alert fatigue in capacity alerts?<\/h3>\n\n\n\n<p>Use multi-level alerts, group related alerts, set meaningful dedupe and 
suppression during known events.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to account for cloud quotas?<\/h3>\n\n\n\n<p>Monitor quotas as metrics, request increases ahead of major events, and include quotas in the decision engine.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is a reasonable starting SLO for p95 latency?<\/h3>\n\n\n\n<p>It depends on the product; set SLOs based on user-experience goals and iterate.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can I use spot instances for critical services?<\/h3>\n\n\n\n<p>Use spot for fault-tolerant stateless workloads with eviction handling; critical stateful services should avoid spot.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to handle sudden viral traffic?<\/h3>\n\n\n\n<p>Have contingency plans: temporary rate limiting, cache warm-up, and manual pre-scale triggers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What role does observability play?<\/h3>\n\n\n\n<p>Observability provides the signals to forecast, validate, and detect capacity issues early.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Capacity planning is a continuous, cross-functional practice that ensures services meet SLOs, handle demand, and control costs. 
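<\/p>\n\n\n\n<p>The loop this guide keeps returning to (forecast demand, apply a headroom policy, derive a provisioning target, and page only on fast error-budget burn) can be sketched in a few lines of Python. This is a minimal, illustrative sketch: the numbers, function names, and the 14.4x paging threshold (a common choice for a 1-hour window against a 30-day SLO) are assumptions, not the output of any specific tool.<\/p>

```python
import math


def required_instances(peak_qps: float, qps_per_instance: float,
                       headroom: float = 0.2) -> int:
    """Instances needed to serve the forecast peak plus a headroom fraction
    (e.g. 0.2, in line with the 10-30% starting guidance above)."""
    return math.ceil(peak_qps * (1 + headroom) / qps_per_instance)


def burn_rate(errors: int, requests: int, slo_error_rate: float) -> float:
    """How fast the error budget is being consumed; 1.0 means the service is
    burning budget exactly at the rate the SLO allows."""
    observed = errors / requests if requests else 0.0
    return observed / slo_error_rate


def should_page(rate: float, threshold: float = 14.4) -> bool:
    """Page the on-call only for fast burn; slower burn becomes a ticket."""
    return rate >= threshold


# Hypothetical service: forecast peak of 12,000 QPS, 500 QPS per instance,
# 20% headroom, and a 99.9% availability SLO (0.1% allowed error rate).
print(required_instances(12_000, 500, 0.2))       # 29 instances
print(should_page(burn_rate(90, 10_000, 0.001)))  # ~9x burn: ticket, not a page
```

<p>In practice the headroom fraction and the paging threshold would be owned by the SLO review cadence described earlier, not hard-coded.<\/p>\n\n\n\n<p>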
It relies on instrumentation, forecasting, constrained decision-making, automation, and regular validation via tests and game days.<\/p>\n\n\n\n<p>Next 7 days plan:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Audit current SLIs, SLOs, and instrumentation gaps.<\/li>\n<li>Day 2: Define capacity taxonomy and tag conventions.<\/li>\n<li>Day 3: Build executive and on-call dashboards with baseline panels.<\/li>\n<li>Day 4: Run a short load test on a critical service and record results.<\/li>\n<li>Day 5: Review quota and billing alerts; set up missing notifications.<\/li>\n<li>Day 6: Draft runbooks for top 3 capacity incidents.<\/li>\n<li>Day 7: Schedule a game day and assign roles.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Capacity Planning Keyword Cluster (SEO)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Primary keywords<\/li>\n<li>Capacity planning<\/li>\n<li>Cloud capacity planning<\/li>\n<li>Capacity planning SRE<\/li>\n<li>Capacity planning tutorial<\/li>\n<li>Capacity planning guide<\/li>\n<li>\n<p>Capacity forecasting<\/p>\n<\/li>\n<li>\n<p>Secondary keywords<\/p>\n<\/li>\n<li>Resource forecasting<\/li>\n<li>Autoscaling strategy<\/li>\n<li>Capacity modeling<\/li>\n<li>Headroom policy<\/li>\n<li>Right-sizing servers<\/li>\n<li>Cloud capacity management<\/li>\n<li>\n<p>Capacity-as-code<\/p>\n<\/li>\n<li>\n<p>Long-tail questions<\/p>\n<\/li>\n<li>How to do capacity planning for Kubernetes<\/li>\n<li>What is capacity planning in cloud computing<\/li>\n<li>Capacity planning best practices for SRE<\/li>\n<li>How to forecast capacity for serverless functions<\/li>\n<li>How to include error budgets in capacity planning<\/li>\n<li>When to pre-warm serverless concurrency<\/li>\n<li>How to plan capacity for database shards<\/li>\n<li>How to set headroom for peak traffic<\/li>\n<li>How to validate capacity plans with load tests<\/li>\n<li>How to automate capacity planning with 
IaC<\/li>\n<li>What metrics are required for capacity planning<\/li>\n<li>How to reduce cost while keeping capacity<\/li>\n<li>How to plan capacity for multi-tenant SaaS<\/li>\n<li>How to handle quota limits in cloud capacity planning<\/li>\n<li>\n<p>How to create capacity runbooks for on-call<\/p>\n<\/li>\n<li>\n<p>Related terminology<\/p>\n<\/li>\n<li>Autoscaler<\/li>\n<li>SLO<\/li>\n<li>SLI<\/li>\n<li>Error budget<\/li>\n<li>Cluster autoscaler<\/li>\n<li>Horizontal Pod Autoscaler<\/li>\n<li>Provisioned concurrency<\/li>\n<li>Cold start<\/li>\n<li>Load testing<\/li>\n<li>Spot instances<\/li>\n<li>Reserved instances<\/li>\n<li>Quota management<\/li>\n<li>Telemetry ingestion<\/li>\n<li>Feature store<\/li>\n<li>Forecasting engine<\/li>\n<li>Cost per request<\/li>\n<li>Headroom<\/li>\n<li>Workload profile<\/li>\n<li>Resource utilization<\/li>\n<li>Sharding<\/li>\n<li>Throttling<\/li>\n<li>Backpressure<\/li>\n<li>Queue depth<\/li>\n<li>Game day<\/li>\n<li>Runbook<\/li>\n<li>Playbook<\/li>\n<li>Capacity model<\/li>\n<li>Right-sizing<\/li>\n<li>Capacity-as-code<\/li>\n<li>Billing anomaly detection<\/li>\n<li>Observability signal<\/li>\n<li>Canary deployment<\/li>\n<li>Load generator<\/li>\n<li>Postmortem analysis<\/li>\n<li>Cluster node sizing<\/li>\n<li>Pod density<\/li>\n<li>Hotspot mitigation<\/li>\n<li>Rate limit<\/li>\n<li>Memory RSS<\/li>\n<li>Percentile 
latency<\/li>\n<li>IOPS<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8212;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-1170","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/posts\/1170","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/comments?post=1170"}],"version-history":[{"count":0,"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/posts\/1170\/revisions"}],"wp:attachment":[{"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/media?parent=1170"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/categories?post=1170"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devopsschool.org\/blog\/wp-json\/wp\/v2\/tags?post=1170"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}