// glossary
Concepts, decoded
Every term explained in plain English — what it is, why it matters, a real-world example, the tools around it, and the interview questions it shows up in.
-
CI/CD
Continuous Integration and Continuous Delivery/Deployment: the practice of automatically building, testing, and releasing every code change through a pipeline, so software moves from commit to production quickly, safely, and repeatably.
-
CloudOps
The discipline of operating workloads in public cloud environments: provisioning, monitoring, securing, scaling, and cost-managing cloud infrastructure, usually through automation and infrastructure as code.
-
DevOps
A culture and set of practices that unites software development and IT operations, using automation, CI/CD, and shared ownership to ship reliable software faster and shorten feedback loops between code and production.
-
DevSecOps
The practice of integrating security into every stage of the DevOps lifecycle, shifting security left with automated scanning, policy as code, and shared responsibility instead of a final security gate before release.
-
Distributed Tracing
A technique that follows a single request as it travels through multiple services, recording each hop as a timed span, so engineers can see exactly where latency and errors occur in a distributed system.
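As a rough illustration of spans, the sketch below models one request as a list of timed spans linked by parent IDs and finds the hop that contributes the most time of its own. The `Span` fields and helper names are hypothetical, not taken from any tracing library, and the self-time calculation assumes each span's children run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique ID of this hop
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in this span excluding its direct children.

    Assumes children are sequential and non-overlapping.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_hop(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Illustrative trace: gateway -> checkout -> payments
trace = [
    Span("t1", "a", None, "gateway", 0, 120),
    Span("t1", "b", "a", "checkout", 10, 110),
    Span("t1", "c", "b", "payments", 30, 100),
]
print(slowest_hop(trace).service)
```

In this made-up trace the gateway and checkout spans are long only because they wait on payments, which is what the self-time view is meant to expose.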
-
Error Budget
The amount of unreliability a service is allowed within its SLO period. If the SLO is 99.9% availability, the error budget is the remaining 0.1%; teams can spend it on risky changes and must slow down when it runs out.
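A minimal sketch of the arithmetic behind that 0.1%, assuming an availability SLO measured in wall-clock time over a 30-day window. The function names are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining_fraction(slo: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999, 30)
print(f"budget: {budget:.1f} minutes")
print(f"remaining: {budget_remaining_fraction(0.999, 30, 10.8):.0%}")
```

For a 99.9% SLO over 30 days the budget works out to about 43.2 minutes of downtime, so an outage of 10.8 minutes leaves roughly three quarters of it.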
-
FinOps
The practice of bringing financial accountability to cloud spending: giving engineering teams visibility into costs, optimizing usage through rightsizing and commitments, and aligning cloud investment with business value.
-
GitOps
An operating model where the desired state of infrastructure and applications lives in Git, and automated controllers continuously reconcile the live system to match it. Deployments and rollbacks become pull requests.
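The reconcile step can be sketched as a diff between two dictionaries: the desired state (what would be read from Git) and the live state (what the cluster reports). This is a toy model under those assumptions. The `reconcile` function and the resource names are hypothetical, and real controllers do considerably more than this.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec


def reconcile(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return the actions needed to make `live` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state as it might be declared in Git (illustrative values).
desired = {
    "web": {"image": "web:1.4", "replicas": 3},
    "api": {"image": "api:2.0", "replicas": 2},
}
# Live state as reported by the running system (illustrative values).
live = {
    "web": {"image": "web:1.3", "replicas": 3},
    "cron": {"image": "cron:0.9", "replicas": 1},
}
print(reconcile(desired, live))
```

A controller would run this comparison on a loop and apply the resulting actions, which is why reverting a commit in Git is enough to trigger a rollback.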
-
Incident Management
The structured process for detecting, responding to, resolving, and learning from service disruptions — covering on-call, alerting, severity levels, coordinated response roles, and blameless postmortems.
-
Infrastructure as Code (IaC)
The practice of defining servers, networks, and cloud resources in machine-readable code instead of manual clicks, so infrastructure can be versioned, reviewed, tested, and recreated identically on demand.
-
Internal Developer Platform (IDP)
The self-service product a platform team builds for its developers: a portal, APIs, and golden-path templates that abstract infrastructure so developers can create services, environments, and deployments on demand.
-
LLMOps
The operational discipline for building and running applications powered by large language models: prompt management, evaluation, guardrails, cost and latency control, and monitoring of non-deterministic outputs.
-
MLOps
The practice of applying DevOps principles to machine learning: versioning data and models, automating training and deployment pipelines, and monitoring models in production for drift and degradation.
-
Observability
The ability to understand a system's internal state from its external outputs — metrics, logs, and traces — so you can debug novel problems and ask new questions without shipping new code first.
-
Platform Engineering
The discipline of building internal platforms — golden paths, self-service tooling, and paved infrastructure — that let product teams ship software without each team reinventing CI/CD, Kubernetes, and observability themselves.
-
PromptOps
The discipline of treating prompts as production artifacts: versioning them in source control, testing changes against evaluation suites, deploying with rollout controls, and monitoring prompt performance over time.
-
RAGOps
The operational practice of running Retrieval-Augmented Generation systems in production: managing ingestion and embedding pipelines, vector stores, retrieval quality, and end-to-end evaluation of grounded LLM answers.
-
ReleaseOps
The discipline of managing how software reaches users: release planning and trains, progressive delivery with canary and blue-green strategies, feature flags, rollback readiness, and coordination across teams and environments.
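One common way to implement a percentage rollout behind a feature flag is to hash each user into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. The sketch below assumes that approach. The `in_rollout` function and the flag name are hypothetical and not tied to any feature-flag product.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user is inside a rollout.

    Each (flag, user) pair hashes to a bucket in 0..99; the user is
    included when the bucket falls below the rollout percentage.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Illustrative canary: expose a hypothetical "new-checkout" flag to 10% of users.
users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if in_rollout("new-checkout", u, 10)]
print(f"{len(exposed)} of {len(users)} users see the new path")
```

Including the flag name in the hash keeps different flags from always selecting the same slice of users.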
-
SecOps
The practice of integrating security operations with IT operations: continuous threat monitoring, detection, and incident response, typically centered on a SOC, SIEM tooling, vulnerability management, and threat intelligence.
-
SLA (Service Level Agreement)
A formal contract between a provider and its customers that defines promised service levels, such as 99.9% uptime, along with consequences like service credits or penalties if the promise is broken.
-
SLI (Service Level Indicator)
A quantitative measurement of some aspect of a service's behavior that users care about, such as request success rate, latency, or freshness. SLIs are the raw metrics on which SLOs and error budgets are built.
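As a small illustration, two of the SLIs named above can be computed as "good events divided by total events" over a batch of request records. The `Request` type, the rule that any status below 500 counts as a success, and the latency threshold are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def availability_sli(requests: List[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: List[Request], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(500, 90.0),
    Request(200, 80.0),
]
print(availability_sli(sample))    # 3 of 4 requests succeeded
print(latency_sli(sample, 100.0))  # 2 of 4 requests finished within 100 ms
```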
-
SLO (Service Level Objective)
A target value for a service level indicator over a period, such as '99.9% of requests succeed over 30 days.' SLOs define how reliable a service should be and drive error budgets, alerting, and engineering priorities.
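To make the "99.9% of requests succeed" example concrete, the sketch below checks a request-based SLO against counts of good and total requests and reports how much of the allowed failure count has been used. The `slo_report` function and the traffic numbers are hypothetical.

```python
def slo_report(good: int, total: int, target: float = 0.999) -> dict:
    """Compare a measured success rate against an SLO target.

    `budget_used` is the share of allowed failures already consumed;
    a value above 1.0 means the SLO has been missed for the window.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    sli = good / total
    allowed_bad = (1.0 - target) * total
    bad = total - good
    return {
        "sli": sli,
        "met": sli >= target,
        "budget_used": bad / allowed_bad,
    }


# Illustrative 30-day window: 1,000,000 requests, 500 of them failed.
report = slo_report(good=999_500, total=1_000_000, target=0.999)
print(report)
```

With these numbers the target allows about 1,000 failed requests, so 500 failures meets the SLO while using about half of the budget.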
-
SRE (Site Reliability Engineering)
An engineering discipline, pioneered at Google, that applies software engineering to operations problems, using SLOs, error budgets, and automation to keep services reliable while still enabling fast change.
-
TestOps
The discipline of operating testing at scale: managing test infrastructure, integrating automated suites into CI/CD, controlling flaky tests, provisioning test data and environments, and using quality analytics to guide releases.