Observability
Logs, metrics, and traces as one workflow. Know what is wrong before the customer tells you.
Intermediate · 20 labs
Recommended first
You can still start this course now. Earlier courses give you the mental model the labs here assume.
By the end you'll be able to
- Pick the right pillar for the question you are asking
- Write PromQL that answers questions instead of decorating dashboards
- Follow one request across services with a trace
- Turn an SLO into a deploy or do-not-deploy decision
Labs
- 01. Logs, metrics, and traces, the three pillars (15 min): Telemetry comes in three shapes. Logs record single events, metrics record numbers over time, and traces record the path of one request. Each shape pre-commits to a question, which is why some answers are cheap and others are expensive.
- 02. Cardinality and what it costs (15 min): Every unique label combination on a metric is a separate time series the backend stores and indexes. One wrong label can multiply a single metric into millions of series and take a monitoring system down. Understand what cardinality is, why it explodes, and where the fix lives.
- 03. What observable actually means (15 min): Observability is a property of the system, not a product you install. A system is observable when its outputs let you answer questions nobody planned for. Two services can run the same code with the same uptime, yet one is a black box and the other answers anything.
- 04. Structured logging, key=value and JSON (15 min): A log line written as prose can only be read by a human. The same event written as key=value or JSON can be filtered, extracted, and counted by machines. The difference is whether the writer named the fields, and it costs nothing at write time.
- 05. Reading system logs in depth (15 min): A Linux server narrates its life in syslog format, one timestamped line per event from the kernel, sshd, cron, and every daemon. Read that narration fluently, reconstruct an incident timeline with grep context, and aggregate attacker activity with an awk pipeline.
- 06. Loki basics, labels and queries (15 min): Loki is a log database with one big idea: index the labels and never the log content. Understand what a stream is, what the push and query APIs do, and why the cost model splits cleanly into cheap index work and scan work.
- 07. Log levels, what to keep and what to drop (15 min): Every log line carries a severity level, and most production volume is debug noise nobody will ever read. Understand the severity ladder as a threshold, why level is only the first input to a keep or drop decision, and what actually earns a line its storage cost.
- 08. The Prometheus data model, counter, gauge, histogram (15 min): Prometheus carries almost everything as labeled time series, and three metric types do most of the work. Understand what a counter, a gauge, and a histogram each represent, and why the type you choose decides how the metric must be queried.
- 09. What Prometheus is and how it scrapes (15 min): Prometheus does not wait for your services to send it data. It reaches out and pulls metrics from them on a schedule. Understand the scrape model, what a target is, and why pull beats push for a monitoring system.
- 10. Expose a /metrics endpoint (15 min): A scrape target is nothing more than an HTTP path that returns exposition text. Understand what the exposition format is, what a /metrics endpoint really exposes, and why a pull-based scraper always sees the current value with no buffer and no push.
- 11. How PromQL thinks, vectors, rate, and aggregation (15 min): PromQL is not SQL. It works on time series, and a raw counter answers almost nothing on its own. Understand instant and range vectors, why rate turns a total into a speed, how aggregation collapses labels, and how a percentile is read out of a histogram.
- 12. Alert rules and the Alertmanager handoff (15 min): An alert is a PromQL expression with a duration and a severity, and that is all it is. Understand what makes a good rule, why the for clause separates an incident from a blip, and how Prometheus hands firing alerts to Alertmanager, which decides who actually gets paged.
- 13. OpenTelemetry and the anatomy of a span (15 min): OpenTelemetry is the shared standard for emitting traces, metrics, and logs. The unit of tracing is the span. Understand what OTel standardizes and what every field on a span means.
- 14. Auto-instrumentation, the spans you get for free (15 min): Auto-instrumentation patches well-known libraries at startup so every HTTP request and database call emits a span without code changes. Understand which spans it produces, what the instrumentation scope tells you, and the one span it can never write for you.
- 15. Read a trace end to end (15 min): A trace is a tree of spans across services. Take a realistic seven-span checkout trace, rank the spans by duration, then follow the parent pointers to the one call that actually burned the time.
- 16. Define an SLI, latency, error rate, throughput (15 min): An SLI is a measured number that stands in for whether users are happy. The three classic ones are error rate, latency at a percentile, and throughput, and each is a computation you can write down from raw request data.
- 17. Set an SLO, a target and a window (15 min): An SLO attaches a target and a time window to an SLI, turning a measurement into a promise. The target also implies hard numbers: how many failed requests and how many minutes of downtime the window allows.
- 18. Error budget math, when to slow down and when to ship (15 min): The error budget is the failure allowance an SLO grants, treated as a resource to spend. Comparing how much is burned against how much of the window has elapsed turns a measured burn rate into a pre-agreed ship or slow-down decision.
- 19. The is-something-wrong workflow, dashboard to logs to trace (15 min): Real triage moves through the three pillars in a fixed order. A dashboard says something is wrong, a log search says what kind of wrong, and a trace says exactly where. Each pillar hands the next one a key, and the chain ends in a finding someone can act on.
- 20. On-call alert hygiene, page, ticket, or drop (15 min): A pager that cries wolf gets muted, and a muted pager is how real outages run for hours. Every alert earns one of three fates by the action it demands, and the runbook and escalation ladder behind a page are what make it survivable at 3 a.m.
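As a taste of the arithmetic behind labs 17 and 18, here is a minimal sketch of how an SLO target and window imply hard numbers, and how burned budget can be compared against elapsed window. The 99.9% target, the 30-day window, the request counts, and the function names are all hypothetical, picked only for illustration. The ship rule is one simple assumption (ship only while the share of budget burned does not exceed the share of window elapsed), not necessarily the policy the labs teach.

```python
def error_budget(target: float, window_days: int, total_requests: int) -> dict:
    """Derive the failure allowance an SLO target implies over a window."""
    allowed_fraction = 1.0 - target           # e.g. 0.001 for a 99.9% target
    window_minutes = window_days * 24 * 60    # length of the window in minutes
    return {
        "allowed_failed_requests": round(total_requests * allowed_fraction),
        "allowed_downtime_minutes": round(window_minutes * allowed_fraction, 1),
    }


def should_ship(failed_so_far: int, allowed_failed: int,
                days_elapsed: int, window_days: int) -> bool:
    """Assumed rule: ship only if budget burned <= share of window elapsed."""
    budget_burned = failed_so_far / allowed_failed
    window_elapsed = days_elapsed / window_days
    return budget_burned <= window_elapsed


# Hypothetical service: 99.9% target, 30-day window, 10 million requests.
budget = error_budget(target=0.999, window_days=30, total_requests=10_000_000)
print(budget)
# {'allowed_failed_requests': 10000, 'allowed_downtime_minutes': 43.2}

# Halfway through the window with 60% of the budget burned: slow down.
print(should_ship(6000, budget["allowed_failed_requests"], 15, 30))  # False
# Halfway through the window with 30% of the budget burned: ship.
print(should_ship(3000, budget["allowed_failed_requests"], 15, 30))  # True
```

The point of the comparison is that the decision is agreed before the incident: the same two ratios always give the same answer, whoever is on call.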