Skill · Monitoring and alerting

Monitoring and alerting.

Decide what to watch, what to alert on, and who finds out when it breaks.

Monitoring works in four layers: availability, correctness, performance, and errors and anomalies. Skip one and you miss the class of problem it would have caught. SLOs and error budgets then turn reliability into a feedback loop with release velocity.

The hard part is alerting without fatigue. A paging alert has to be actionable, important, and rare, because if pages fire several times a week people stop responding.

Audience: engineers and ops teams setting up monitoring, defining SLOs, designing on-call rotations, or fixing alert fatigue.

The framework

Four layers, each catching a different problem.

Monitoring works in layers. Skip one and you miss the class of problem it would have caught.

  1. Availability: is the site up? HTTP checks from multiple regions, DNS resolution, certificate expiration, and status-code checks. Sustained downtime pages.
  2. Correctness: is it serving the right thing? Synthetic checks on critical journeys (signup, checkout, search), content-presence checks, and API-contract checks.
  3. Performance: is it fast enough? Real-user Core Web Vitals, synthetic performance checks, API latencies at p50, p95, and p99, and slow-query and dependency times. Alert on regressions from baseline.
  4. Errors and anomalies: is something failing even when the site is up, correct, and fast? 5xx and client-error rates, log-volume spikes, traffic falling off a cliff, background-job failures, and queue depth.
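
The first three layers can be told apart on a single synthetic probe. The sketch below is illustrative only: `ProbeResult`, `failing_layers`, the `required_text` check, and the 1.5x regression factor are all assumptions made for the example, not part of the skill.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One synthetic probe of a page (hypothetical shape)."""
    status_code: int
    body: str
    latency_ms: float


def failing_layers(result: ProbeResult, required_text: str,
                   baseline_ms: float, regression_factor: float = 1.5) -> list[str]:
    """Return the monitoring layers this probe fails."""
    # Availability: did the server answer successfully at all?
    if not 200 <= result.status_code < 400:
        # Content and latency mean little if the page did not load.
        return ["availability"]

    failures = []
    # Correctness: the page loaded, but is the expected content present?
    if required_text not in result.body:
        failures.append("correctness")
    # Performance: compare against a baseline, not a fixed number.
    if result.latency_ms > baseline_ms * regression_factor:
        failures.append("performance")
    return failures


# A page that returns 200 quickly but renders blank.
blank_page = ProbeResult(status_code=200, body="<html></html>", latency_ms=120.0)
print(failing_layers(blank_page, required_text="Sign up", baseline_ms=100.0))
# prints ['correctness']
```

An availability check alone would have passed the blank page; only the content check catches it.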

What pages, what does not

Three alert tiers, not five.

Three tiers are plenty; a P0-through-P4 ladder becomes a sorting exercise. The page tier is the one that protects against fatigue.

  1. Page (wakes someone)

    Site down, a critical flow broken, an error-rate spike, a security incident. Every page must be actionable, important, and rare: no more than one or two a week. Routed to the paging system.

  2. Notify (business hours)

    Non-critical synthetic failures, performance regressions, slow queries, dependency degradation. Routed to a tagged chat channel.

  3. Log (no notification)

    Anomalies for later review, low-priority warnings, and info-level events. Routed to a dashboard or log only.
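
The three tiers amount to a small routing table. A minimal sketch, assuming hypothetical alert-kind names (`site_down`, `slow_query`, and so on); a real setup would use whatever labels its alerting tool emits.

```python
from enum import Enum


class Tier(Enum):
    PAGE = "paging system"
    NOTIFY = "tagged chat channel"
    LOG = "dashboard or log"


# Hypothetical alert kinds, grouped by the tier descriptions above.
PAGE_KINDS = {"site_down", "critical_flow_broken", "error_rate_spike", "security_incident"}
NOTIFY_KINDS = {"noncritical_synthetic_failure", "performance_regression",
                "slow_query", "dependency_degraded"}


def tier_for(kind: str) -> Tier:
    """Map an alert kind to its tier; anything unlisted is logged only."""
    if kind in PAGE_KINDS:
        return Tier.PAGE
    if kind in NOTIFY_KINDS:
        return Tier.NOTIFY
    return Tier.LOG


for kind in ("site_down", "slow_query", "info_event"):
    tier = tier_for(kind)
    print(f"{kind}: {tier.name} -> {tier.value}")
```

Defaulting unknown kinds to the log tier is a design choice in this sketch: a new alert has to be promoted into the page tier deliberately, which keeps that tier rare.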

The discipline

Reliability targets, and alerts people still read.

A Service Level Objective names what you measure, the success criterion, the target, and the window. For example: 99.9% of homepage requests succeed in under 2 seconds, measured over 30 days. Do not chase 100% or five nines reflexively, because each additional nine costs roughly an order of magnitude more. 99.9% (about 43 minutes of downtime in a 30-day month) is plenty for most marketing sites, and 99.95% (about 21 minutes) is reasonable for SaaS.
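
The downtime figures follow directly from the target and the window. A short sketch of the arithmetic; the function name is illustrative, and it treats the SLO as time-based availability over a 30-day window.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime a time-based availability target permits."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return window_minutes * (1 - target_percent / 100)


print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.95), 1))  # 21.6
```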

The error budget is the complement of the SLO: at 99.9%, a tenth of a percent of requests can fail. It creates a feedback loop between reliability and velocity: ship aggressively while the budget is healthy, slow down when it is half-spent, and freeze risky changes once it is exhausted until reliability recovers.
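
As a rough sketch, the budget policy reduces to two thresholds on the fraction of budget consumed. The function names are hypothetical, and the 0.5 and 1.0 cut-offs are a literal reading of "half-spent" and "exhausted"; a team may well pick different ones.

```python
def budget_consumed(total_requests: int, failed_requests: int,
                    target_percent: float) -> float:
    """Fraction of the error budget used so far in the window."""
    allowed_failures = total_requests * (1 - target_percent / 100)
    if allowed_failures == 0:
        # No traffic (or a 100% target): any failure exhausts the budget.
        return 0.0 if failed_requests == 0 else float("inf")
    return failed_requests / allowed_failures


def release_policy(consumed: float) -> str:
    """Translate budget consumption into a release stance."""
    if consumed >= 1.0:
        return "freeze risky changes"
    if consumed >= 0.5:
        return "slow down"
    return "ship"


# 1,000,000 requests at 99.9% allows roughly 1,000 failures.
for failed in (300, 600, 1200):
    used = budget_consumed(1_000_000, failed, 99.9)
    print(failed, release_policy(used))
```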

Alert on symptoms, not causes: 'users are slow' is a symptom worth a page, while 'CPU is high' is a cause to investigate. Every paging alert needs a runbook. Thresholds should be rate-based against a baseline rather than absolute counts. Audit the alert set quarterly, because alert fatigue starts when the on-call is paged more than once or twice a week.
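
The difference between a rate-based threshold and an absolute count is easiest to see side by side. A minimal sketch; the 3x `factor` and the `min_requests` floor are assumptions chosen for illustration, not recommended values.

```python
def should_alert(errors: int, requests: int, baseline_rate: float,
                 factor: float = 3.0, min_requests: int = 100) -> bool:
    """Alert when the error rate exceeds a multiple of its baseline."""
    if requests < min_requests:
        # Too little traffic for the rate to be meaningful.
        return False
    return errors / requests > baseline_rate * factor


BASELINE = 0.01  # assume 1% of requests normally fail

# Quiet hour: few errors in absolute terms, but 5% of traffic is failing.
print(should_alert(errors=50, requests=1_000, baseline_rate=BASELINE))     # True

# Busy day: ten times the error count, but only 0.5% of traffic.
print(should_alert(errors=500, requests=100_000, baseline_rate=BASELINE))  # False
```

An absolute threshold of, say, 100 errors would have stayed silent in the first case and fired in the second, which is the wrong way round.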

Reference files

The reference that goes alongside the SKILL.md.

  • references/slo-design-guide.md

    A detailed walkthrough of writing SLOs, error-budget policies, and the common SLO mistakes for web services.

Bridges to other skills

What monitoring feeds and what it is not.

Monitoring is the early-warning system. These cover the response it triggers, the retro it feeds, and the adjacent measurement work.

  • When it fires

    incident-response

    A page is a detection source for incident response. Monitoring decides what wakes someone up; incident response is what they do once awake.

  • After the incident

    after-action-report

    An incident often reveals a monitoring gap, and the retrospective names it. The fix loops back into the next round of checks and alerts here.

  • Product metrics

    analytics-strategy

    Designing dashboards for product metrics is a different job. Monitoring watches system health; analytics measures what users do with the product.

  • Fixing the slowness

    performance-optimization

    The performance layer here alerts on regressions; fixing them in depth is performance work. Monitoring says it got slow, that skill makes it fast.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. This page is a structured overview; the SKILL.md is the source. MIT licensed.

Frequently asked questions.

What are the four monitoring layers?
Availability (is the site up), correctness (is it serving the right thing), performance (is it fast enough), and errors and anomalies (are errors happening even when it is up, correct, and fast). Each layer catches a different class of problem, so skipping one leaves a blind spot: availability checks alone will not catch a homepage that loads but renders blank, which is what the correctness layer's synthetic checks are for.

What should page someone versus go to a quiet channel?
Three tiers. Page (wakes someone) is for a site outage, a broken critical flow, an error-rate spike, or a security incident. Every page-tier alert must be actionable, important, and rare: no more than one or two a week. Notify (business hours) is for non-critical synthetic failures and performance regressions, routed to a tagged chat channel. Log (no notification) is for anomalies and low-priority warnings. If page-tier alerts fire frequently, alert fatigue sets in and people stop responding, so the page tier is guarded tightly.

How do I pick an SLO?
Tie it to user-visible behavior, make it achievable on current infrastructure, and measure it automatically. Do not aim for 100% or five nines unless you genuinely need it, because each additional nine costs roughly an order of magnitude more. 99.9% allows about 43 minutes of downtime a month and is plenty for most marketing sites; 99.95% (about 21 minutes) is reasonable for SaaS; anything higher needs significant infrastructure investment. An SLO names what you measure, the success criterion, the target percentage, and the window.

What is an error budget for?
It is the complement of the SLO: at 99.9%, a tenth of a percent of requests are allowed to fail. Its value is the feedback loop it creates between reliability and velocity. When the budget is healthy, ship aggressively; when it is half-spent, slow down on risky changes; when it is exhausted, freeze risky changes until reliability recovers. That loop is what makes an SLO useful rather than a number on a dashboard.

How do I avoid alert fatigue?
Alert on symptoms, not causes ('users are slow' pages, 'CPU is high' is a cause to investigate). Give every paging alert a runbook so the on-call knows what to do, and use rate-based thresholds against a baseline rather than absolute counts that a busy day will trip. Keep to three severity tiers rather than five, monitor from multiple regions so a regional outage is caught, and run a quarterly alert audit: any alert that fires more than once a week with low actionability gets tuned, reduced, or fixed at the source.