Observability

Logs, metrics, and traces as one workflow. Know what is wrong before the customer tells you.

intermediate · 20 labs

Recommended first

Linux operations

You can still start this course now. Earlier courses give you the mental model the labs here assume.

By the end you'll be able to

Pick the right pillar for the question you are asking
Write PromQL that answers questions instead of decorating dashboards
Follow one request across services with a trace
Turn an SLO into a deploy or do-not-deploy decision

Labs

01
Logs, metrics, and traces, the three pillars
Telemetry comes in three shapes. Logs record single events, metrics record numbers over time, traces record the path of one request. Each shape pre-commits to a question, which is why some answers are cheap and others are expensive.
15 min
02
Cardinality and what it costs
Every unique label combination on a metric is a separate time series the backend stores and indexes. One wrong label can multiply a single metric into millions of series and take a monitoring system down. Understand what cardinality is, why it explodes, and where the fix lives.
15 min
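A rough way to see the multiplication described above: the worst-case series count for one metric is the product of the distinct values of each label. The label names and value counts below are hypothetical, chosen only to illustrate the effect of adding one unbounded label.

```python
from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric:
    the product of the distinct values each label can take."""
    return prod(label_values.values())

# Hypothetical label sets for a single metric (illustrative numbers only)
bounded = {"method": 5, "status": 6, "route": 40}
with_user_id = {**bounded, "user_id": 100_000}  # one unbounded label added

print(series_count(bounded))       # 1200
print(series_count(with_user_id))  # 120000000
```

In this sketch the fix lives at the instrumentation site: dropping the `user_id` label brings the metric back to a bounded number of series.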
03
What observable actually means
Observability is a property of the system, not a product you install. A system is observable when its outputs let you answer questions nobody planned for. Two services can run the same code with the same uptime, yet one is a black box and the other answers anything.
15 min
04
Structured logging, key=value and JSON
A log line written as prose can only be read by a human. The same event written as key=value or JSON can be filtered, extracted, and counted by machines. The difference is whether the writer named the fields, and it costs nothing at write time.
15 min
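A minimal sketch of why named fields matter, assuming a simplified key=value format with no quoted values or embedded spaces. The field names (`level`, `route`, `status`, `duration_ms`) and the sample lines are made up for illustration.

```python
import json

def parse_kv(line: str) -> dict[str, str]:
    """Parse a simple key=value line. Assumes no quoted values
    and no spaces inside values (a simplification)."""
    return dict(pair.split("=", 1) for pair in line.split())

def to_json(fields: dict) -> str:
    """Render the same event as a JSON log line."""
    return json.dumps(fields, sort_keys=True)

# Hypothetical log lines
kv_lines = [
    "level=error route=/checkout status=500 duration_ms=812",
    "level=info route=/cart status=200 duration_ms=41",
    "level=error route=/checkout status=502 duration_ms=1230",
]

events = [parse_kv(line) for line in kv_lines]

# Filtering and counting become ordinary data operations
errors_on_checkout = [
    e for e in events if e["level"] == "error" and e["route"] == "/checkout"
]
print(len(errors_on_checkout))  # 2
print(to_json(events[1]))
```

The same question asked of prose log lines would need a hand-written pattern per message wording.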
05
Reading system logs in depth
A Linux server narrates its life in syslog format, one timestamped line per event from the kernel, sshd, cron, and every daemon. Read that narration fluently, reconstruct an incident timeline with grep context, and aggregate attacker activity with an awk pipeline.
15 min
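The aggregation step can be sketched in Python as well as in awk. The log lines below are invented samples in the traditional syslog layout, and the pattern assumes the usual OpenSSH "Failed password for ... from ..." wording; treat both as assumptions rather than a complete parser.

```python
import re
from collections import Counter

# Hypothetical sample lines (documentation IP ranges, made-up PIDs)
LOG = """\
Mar  3 02:14:07 web1 sshd[4121]: Failed password for invalid user admin from 203.0.113.9 port 52011 ssh2
Mar  3 02:14:09 web1 sshd[4121]: Failed password for root from 203.0.113.9 port 52013 ssh2
Mar  3 02:14:30 web1 CRON[4130]: (root) CMD (run-parts /etc/cron.hourly)
Mar  3 02:15:02 web1 sshd[4140]: Failed password for root from 198.51.100.23 port 40112 ssh2
Mar  3 02:15:40 web1 sshd[4151]: Accepted publickey for deploy from 192.0.2.10 port 50022 ssh2
Mar  3 02:16:11 web1 sshd[4160]: Failed password for invalid user test from 203.0.113.9 port 52090 ssh2
"""

# Capture the source address of each failed sshd password attempt
FAILED = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?\S+ from (\S+)"
)

def failed_logins_by_ip(text: str) -> Counter:
    """Count failed sshd password attempts per source address."""
    return Counter(m.group(1) for m in FAILED.finditer(text))

counts = failed_logins_by_ip(LOG)
for ip, n in counts.most_common():
    print(n, ip)
```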
06
Loki basics, labels and queries
Loki is a log database with one big idea: index the labels and never the log content. Understand what a stream is, what the push and query APIs do, and why the cost model splits cleanly into cheap index work and expensive scan work.
15 min
07
Log levels, what to keep and what to drop
Every log line carries a severity level, and most production volume is debug noise nobody will ever read. Understand the severity ladder as a threshold, why level is only the first input to a keep or drop decision, and what actually earns a line its storage cost.
15 min
08
The Prometheus data model, counter, gauge, histogram
Prometheus carries almost everything as labeled time series, and three metric types do most of the work. Understand what a counter, a gauge, and a histogram each represent, and why the type you choose decides how the metric must be queried.
15 min
09
What Prometheus is and how it scrapes
Prometheus does not wait for your services to send it data. It reaches out and pulls metrics from them on a schedule. Understand the scrape model, what a target is, and why pull beats push for a monitoring system.
15 min
10
Expose a /metrics endpoint
A scrape target is nothing more than an HTTP path that returns exposition text. Understand what the exposition format is, what a /metrics endpoint really exposes, and why a pull-based scraper always sees the current value with no buffer and no push.
15 min
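To make "exposition text" concrete, here is a small sketch that renders one metric in the Prometheus text format: a `# HELP` line, a `# TYPE` line, then one line per labeled sample. The helper `render_metric` and the metric shown are hypothetical, and the sketch skips label-value escaping, which a real client library handles.

```python
def render_metric(name, mtype, help_text, samples):
    """Render one metric family as Prometheus-style exposition text.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{k}="{v}"' for k, v in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical counter with two label combinations
text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(text)
```

A scraper fetching this text on a schedule sees whatever the values are at that moment, which is the "no buffer and no push" property the lab describes.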
11
How PromQL thinks, vectors, rate, and aggregation
PromQL is not SQL. It works on time series, and a raw counter answers almost nothing on its own. Understand instant and range vectors, why rate turns a total into a speed, how aggregation collapses labels, and how a percentile is read out of a histogram.
15 min
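The idea that rate turns a total into a speed can be sketched in a few lines. This is a deliberate simplification: it sums the increases between consecutive samples, treats a drop as a counter reset, and divides by the elapsed time. The real PromQL `rate()` also extrapolates to the window boundaries, which this sketch does not attempt. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp_s, value)
    samples. A value lower than the previous one is treated as a reset,
    so the new value counts as increase from zero."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical counter samples: (timestamp in seconds, cumulative total)
steady = [(0, 100), (30, 160), (60, 220)]
with_reset = [(0, 100), (30, 160), (60, 20)]  # process restarted before t=60

print(simple_rate(steady))                # 2.0
print(round(simple_rate(with_reset), 3))  # 1.333
```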
12
Alert rules and the Alertmanager handoff
An alert is a PromQL expression with a duration and a severity, and that is all it is. Understand what makes a good rule, why the for clause separates an incident from a blip, and how Prometheus hands firing alerts to Alertmanager, which decides who actually gets paged.
15 min
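The effect of the for clause can be modeled as a small state machine: a rule whose expression is true stays pending until it has been true continuously for the configured duration, and only then fires. The function below is an illustrative model, not Prometheus code, and the evaluation timestamps are made up.

```python
def alert_states(evaluations, for_seconds):
    """Model of a 'for' clause.

    evaluations: list of (timestamp_s, condition_is_true) in time order.
    Returns one state per evaluation: inactive, pending, or firing.
    """
    states = []
    active_since = None
    for t, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = t
        held = t - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Hypothetical evaluations every 60 s: a blip at t=0, then a sustained problem
evals = [(0, True), (60, False), (120, True), (180, True), (240, True), (300, True)]
print(alert_states(evals, for_seconds=120))
# ['pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```

The blip at t=0 never reaches Alertmanager in this model; only the condition that holds for the full two minutes does.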
13
OpenTelemetry and the anatomy of a span
OpenTelemetry is the shared standard for emitting traces, metrics, and logs. The unit of tracing is the span. Understand what OTel standardizes and what every field on a span means.
15 min
14
Auto-instrumentation, the spans you get for free
Auto-instrumentation patches well-known libraries at startup so every HTTP request and database call emits a span without code changes. Understand which spans it produces, what the instrumentation scope tells you, and the one span it can never write for you.
15 min
15
Read a trace end to end
A trace is a tree of spans across services. Take a realistic seven-span checkout trace, rank the spans by duration, then follow the parent pointers to the one call that actually burned the time.
15 min
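One possible way to practice this reading, sketched with an invented seven-span trace. The span names and timings are hypothetical, and "self time" (a span's duration minus its direct children's durations, assuming the children do not overlap) is used here as one reasonable way to find the call that burned the time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

# Hypothetical checkout trace (all values invented for illustration)
TRACE = [
    Span("a", None, "POST /checkout", 0, 1000),
    Span("b", "a", "cart.get", 10, 60),
    Span("c", "a", "payment.charge", 70, 950),
    Span("d", "c", "fraud.check", 80, 180),
    Span("e", "c", "bank.authorize", 190, 930),
    Span("f", "e", "db.query", 200, 240),
    Span("g", "a", "email.enqueue", 955, 990),
]
BY_ID = {s.span_id: s for s in TRACE}

def self_time_ms(span: Span) -> int:
    """Duration minus direct children (assumes children do not overlap)."""
    children = sum(s.duration_ms for s in TRACE if s.parent_id == span.span_id)
    return span.duration_ms - children

def path_to_root(span: Span) -> list[str]:
    """Follow parent pointers from a span up to the root."""
    path = [span.name]
    while span.parent_id is not None:
        span = BY_ID[span.parent_id]
        path.append(span.name)
    return path

ranked = sorted(TRACE, key=lambda s: s.duration_ms, reverse=True)
culprit = max(TRACE, key=self_time_ms)

print([s.name for s in ranked[:3]])  # longest spans first
print(culprit.name, self_time_ms(culprit))
print(" <- ".join(path_to_root(culprit)))
```

Ranking by raw duration puts the root span first, which is why the parent pointers and self time are needed to separate "contains the slow call" from "is the slow call".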
16
Define an SLI, latency, error rate, throughput
An SLI is a measured number that stands in for whether users are happy. The three classic ones are error rate, latency at a percentile, and throughput, and each is a computation you can write down from raw request data.
15 min
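Each of the three SLIs can be written down as a short computation over raw request records. The records below are invented, the percentile uses the nearest-rank method (one of several definitions), and counting only 5xx responses as errors is an assumption made for this sketch.

```python
import math

# Hypothetical raw data: (latency_ms, http_status) seen in a 5-second window
requests = [
    (12, 200), (15, 200), (20, 200), (22, 200), (25, 200),
    (30, 200), (45, 200), (60, 200), (120, 200), (900, 500),
]
WINDOW_SECONDS = 5

def error_rate(reqs):
    """Fraction of requests that failed (assumption: 5xx means failure)."""
    return sum(1 for _, status in reqs if status >= 500) / len(reqs)

def latency_percentile(reqs, pct):
    """Nearest-rank percentile of latency in milliseconds."""
    ordered = sorted(latency for latency, _ in reqs)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def throughput(reqs, seconds):
    """Requests per second over the window."""
    return len(reqs) / seconds

print(error_rate(requests))                  # 0.1
print(latency_percentile(requests, 90))      # 120
print(throughput(requests, WINDOW_SECONDS))  # 2.0
```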
17
Set an SLO, a target and a window
An SLO attaches a target and a time window to an SLI, turning a measurement into a promise. The target also implies hard numbers: how many failed requests and how many minutes of downtime the window allows.
15 min
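The hard numbers follow directly from the target and the window. The sketch below uses a 99.9% target over 30 days and a hypothetical volume of 10 million requests; both are example inputs, not figures from the course.

```python
def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Minutes of full downtime the window allows at this target."""
    return (1 - target) * window_days * 24 * 60

def allowed_failures(target: float, total_requests: int) -> float:
    """Failed requests the window allows at this target."""
    return (1 - target) * total_requests

# Example inputs (assumptions for illustration)
TARGET = 0.999
WINDOW_DAYS = 30
EXPECTED_REQUESTS = 10_000_000

print(round(allowed_downtime_minutes(TARGET, WINDOW_DAYS), 1))  # 43.2
print(round(allowed_failures(TARGET, EXPECTED_REQUESTS)))       # 10000
```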
18
Error budget math, when to slow down and when to ship
The error budget is the failure allowance an SLO grants, treated as a resource to spend. Comparing how much is burned against how much of the window has elapsed turns a measured burn rate into a pre-agreed ship or slow-down decision.
15 min
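The comparison described above can be sketched as a ratio: the fraction of budget consumed divided by the fraction of the window elapsed. A ratio above 1 means the budget will run out before the window ends if nothing changes. The threshold of 1.0 and the ship or slow-down wording are a hypothetical policy for this sketch; a team's real policy is whatever it agreed in advance.

```python
def burn_decision(failed, allowed_failures, days_elapsed, window_days):
    """Return (burn_rate, decision).

    burn_rate = budget consumed fraction / window elapsed fraction.
    Hypothetical policy: ship while burn_rate <= 1.0, otherwise slow down.
    """
    consumed = failed / allowed_failures
    elapsed = days_elapsed / window_days
    burn_rate = consumed / elapsed
    decision = "ship" if burn_rate <= 1.0 else "slow down"
    return burn_rate, decision

# Example: 10,000 failures allowed over a 30-day window, checked on day 10
calm = burn_decision(failed=2_500, allowed_failures=10_000,
                     days_elapsed=10, window_days=30)
hot = burn_decision(failed=6_000, allowed_failures=10_000,
                    days_elapsed=10, window_days=30)

print(round(calm[0], 2), calm[1])  # 0.75 ship
print(round(hot[0], 2), hot[1])    # 1.8 slow down
```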
19
The is-something-wrong workflow, dashboard to logs to trace
Real triage moves through the three pillars in a fixed order. A dashboard says something is wrong, a log search says what kind of wrong, and a trace says exactly where. Each pillar hands the next one a key, and the chain ends in a finding someone can act on.
15 min
20
On-call alert hygiene, page, ticket, or drop
A pager that cries wolf gets muted, and a muted pager is how real outages run for hours. Every alert earns one of three fates by the action it demands, and the runbook and escalation ladder behind a page are what make it survivable at 3 a.m.
15 min