DevOpsLabTH.dev

Observability

Logs, metrics, and traces as one workflow. Know what is wrong before the customer tells you.

intermediate · 20 labs
Recommended first

You can still start this course now. Earlier courses give you the mental model the labs here assume.

By the end you'll be able to
  • Pick the right pillar for the question you are asking
  • Write PromQL that answers questions instead of decorating dashboards
  • Follow one request across services with a trace
  • Turn an SLO into a deploy or do-not-deploy decision

Labs

  1. 01
    Logs, metrics, and traces, the three pillars
    Telemetry comes in three shapes. Logs record single events, metrics record numbers over time, and traces record the path of one request. Each shape pre-commits to a question, which is why some answers are cheap and others are expensive.
    15 min
  2. 02
    Cardinality and what it costs
    Every unique label combination on a metric is a separate time series the backend stores and indexes. One wrong label can multiply a single metric into millions of series and take a monitoring system down. Understand what cardinality is, why it explodes, and where the fix lives.
    15 min
  3. 03
    What observable actually means
    Observability is a property of the system, not a product you install. A system is observable when its outputs let you answer questions nobody planned for. Two services can run the same code with the same uptime, yet one is a black box and the other answers anything.
    15 min
  4. 04
    Structured logging, key=value and JSON
    A log line written as prose can only be read by a human. The same event written as key=value or JSON can be filtered, extracted, and counted by machines. The difference is whether the writer named the fields, and it costs nothing at write time.
    15 min
  5. 05
    Reading system logs in depth
    A Linux server narrates its life in syslog format, one timestamped line per event from the kernel, sshd, cron, and every daemon. Read that narration fluently, reconstruct an incident timeline with grep context, and aggregate attacker activity with an awk pipeline.
    15 min
  6. 06
    Loki basics, labels and queries
    Loki is a log database with one big idea: index the labels and never the log content. Understand what a stream is, what the push and query APIs do, and why the cost model splits cleanly into cheap index work and expensive scan work.
    15 min
  7. 07
    Log levels, what to keep and what to drop
    Every log line carries a severity level, and most production volume is debug noise nobody will ever read. Understand the severity ladder as a threshold, why level is only the first input to a keep or drop decision, and what actually earns a line its storage cost.
    15 min
  8. 08
    The Prometheus data model, counter, gauge, histogram
    Prometheus carries almost everything as labeled time series, and three metric types do most of the work. Understand what a counter, a gauge, and a histogram each represent, and why the type you choose decides how the metric must be queried.
    15 min
  9. 09
    What Prometheus is and how it scrapes
    Prometheus does not wait for your services to send it data. It reaches out and pulls metrics from them on a schedule. Understand the scrape model, what a target is, and why pull beats push for a monitoring system.
    15 min
  10. 10
    Expose a /metrics endpoint
    A scrape target is nothing more than an HTTP path that returns exposition text. Understand what the exposition format is, what a /metrics endpoint really exposes, and why a pull-based scraper always sees the current value with no buffer and no push.
    15 min
  11. 11
    How PromQL thinks, vectors, rate, and aggregation
    PromQL is not SQL. It works on time series, and a raw counter answers almost nothing on its own. Understand instant and range vectors, why rate turns a total into a speed, how aggregation collapses labels, and how a percentile is read out of a histogram.
    15 min
  12. 12
    Alert rules and the Alertmanager handoff
    An alert is a PromQL expression with a duration and a severity, and that is all it is. Understand what makes a good rule, why the for clause separates an incident from a blip, and how Prometheus hands firing alerts to Alertmanager, which decides who actually gets paged.
    15 min
  13. 13
    OpenTelemetry and the anatomy of a span
    OpenTelemetry is the shared standard for emitting traces, metrics, and logs. The unit of tracing is the span. Understand what OTel standardizes and what every field on a span means.
    15 min
  14. 14
    Auto-instrumentation, the spans you get for free
    Auto-instrumentation patches well-known libraries at startup so every HTTP request and database call emits a span without code changes. Understand which spans it produces, what the instrumentation scope tells you, and the one span it can never write for you.
    15 min
  15. 15
    Read a trace end to end
    A trace is a tree of spans across services. Take a realistic seven-span checkout trace, rank the spans by duration, then follow the parent pointers to the one call that actually burned the time.
    15 min
  16. 16
    Define an SLI, latency, error rate, throughput
    An SLI is a measured number that stands in for whether users are happy. The three classic ones are error rate, latency at a percentile, and throughput, and each is a computation you can write down from raw request data.
    15 min
  17. 17
    Set an SLO, a target and a window
    An SLO attaches a target and a time window to an SLI, turning a measurement into a promise. The target also implies hard numbers: how many failed requests and how many minutes of downtime the window allows.
    15 min
  18. 18
    Error budget math, when to slow down and when to ship
    The error budget is the failure allowance an SLO grants, treated as a resource to spend. Comparing how much is burned against how much of the window has elapsed turns a measured burn rate into a pre-agreed ship or slow-down decision.
    15 min
  19. 19
    The is-something-wrong workflow, dashboard to logs to trace
    Real triage moves through the three pillars in a fixed order. A dashboard says something is wrong, a log search says what kind of wrong, and a trace says exactly where. Each pillar hands the next one a key, and the chain ends in a finding someone can act on.
    15 min
  20. 20
    On-call alert hygiene, page, ticket, or drop
    A pager that cries wolf gets muted, and a muted pager is how real outages run for hours. Every alert earns one of three fates by the action it demands, and the runbook and escalation ladder behind a page are what make it survivable at 3 a.m.
    15 min
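
A taste of the arithmetic behind labs 17 and 18: the sketch below shows how an SLO target and window might translate into hard numbers, and how budget spent compared with window elapsed could become a ship or slow-down signal. It is a minimal illustration, not the course's own code. The function names, the sample figures (a 99.9% target, a 30-day window, one million expected requests), and the "burn rate above 1 means slow down" threshold are all assumptions chosen for the example.

```python
from decimal import Decimal

MINUTES_PER_DAY = 24 * 60


def error_budget(target_percent: str, window_days: int, expected_requests: int) -> dict:
    """Failure allowance implied by an SLO target over a window.

    target_percent is passed as a string (for example "99.9") so Decimal
    keeps the arithmetic exact instead of picking up float rounding.
    """
    allowed = (Decimal(100) - Decimal(target_percent)) / Decimal(100)
    return {
        "allowed_fraction": allowed,
        "downtime_minutes": allowed * window_days * MINUTES_PER_DAY,
        "failed_requests": allowed * expected_requests,
    }


def burn_rate(failed_so_far: int, budget_requests: Decimal,
              days_elapsed: int, window_days: int) -> Decimal:
    """Share of budget spent divided by share of window elapsed.

    1 means the budget runs out exactly when the window ends.
    """
    if days_elapsed <= 0 or window_days <= 0 or budget_requests <= 0:
        raise ValueError("elapsed days, window days, and budget must be positive")
    spent = Decimal(failed_so_far) / Decimal(budget_requests)
    elapsed = Decimal(days_elapsed) / Decimal(window_days)
    return spent / elapsed


def decision(rate: Decimal) -> str:
    """Hypothetical policy: ship while burn is at or below the sustainable pace."""
    return "ship" if rate <= 1 else "slow down"


if __name__ == "__main__":
    budget = error_budget("99.9", window_days=30, expected_requests=1_000_000)
    rate = burn_rate(600, budget["failed_requests"], days_elapsed=15, window_days=30)
    print(budget["downtime_minutes"], budget["failed_requests"], rate, decision(rate))
```

With these sample figures the window allows roughly 43 minutes of downtime or 1,000 failed requests. Spending 600 of those failures by the halfway point burns faster than the window elapses, so the assumed policy says slow down. A real policy would agree on the thresholds in advance.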