Metrics

Metrics are served in Prometheus exposition format on metrics_bind (default 127.0.0.1:9090, path /metrics).
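
As a minimal sketch of reading that endpoint, the snippet below parses exposition lines into (name, labels, value) tuples. The parser is deliberately simplified (it assumes no escaped quotes inside label values and no trailing timestamps), and the sample payload is illustrative, not captured output.

```python
import re

# Simplified sample-line pattern: metric name, optional {labels}, value.
# Assumption: label values hold no escaped quotes and lines carry no timestamp.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this simplified pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload; the values are made up.
EXAMPLE = """\
# TYPE uptimepage_checks_total counter
uptimepage_checks_total{status="up"} 120
uptimepage_checks_total{status="down"} 3
uptimepage_targets_total 42
"""

samples = parse_exposition(EXAMPLE)
```

Against a running instance, the same function could be fed the body returned by urllib.request.urlopen("http://127.0.0.1:9090/metrics"), assuming the default bind is unchanged.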

Series

Names below are the on-wire names exactly as registered in src/observability/metrics.rs (observability::metrics::names) and sampled in src/observability/sampler.rs. Dashboard queries must use these names verbatim.

| Name | Type | Purpose |
| --- | --- | --- |
| uptimepage_checks_total{status} | counter | checks completed, partitioned by terminal status (up/down/degraded/error) |
| uptimepage_checks_errors_total{kind} | counter | error breakdown by kind; currently only circuit_open is emitted (a check skipped because its host breaker was open) |
| uptimepage_check_redirects_total{outcome} | counter | HTTP redirect hops (followed / limit_exceeded / invalid_location / blocked_scheme) |
| uptimepage_circuit_breaker_state_changes_total{from,to} | counter | breaker state transitions |
| uptimepage_storage_writes_total{store,result} | counter | batcher flush outcomes |
| uptimepage_storage_dropped_results_total{reason} | counter | results dropped before reaching the sink (queue full, etc.) |
| uptimepage_notifications_total{channel,kind} | counter | alert notifications dispatched |
| uptimepage_notifications_failures_total{channel} | counter | notification dispatches that returned an error |
| uptimepage_alerts_dropped_total{reason} | counter | incident paging signals dropped before reaching the escalation engine, by NotificationReason (opened/escalated/resolved/reopened/no_data/data_resumed). A lifecycle change never blocks on paging throughput, so a saturated signal channel drops here; the incident row stays in Postgres for the reconcile sweep |
| uptimepage_notifications_dead_lettered_total{transport} | counter | incident pages that exhausted all retries without delivering, by transport |
| uptimepage_telegram_send_deferred_total | counter | Telegram sends held back by the per-bot/per-chat send budget rather than sent immediately. Sustained growth means the central bot is rate-limit bound |
| uptimepage_host_throttle_waits_total{kind} | counter | per-(org,host,port) (kind=host) or per-TLD RDAP (kind=rdap) throttle acquire attempts |
| uptimepage_host_throttle_drops_total | counter | host-bulkhead rejections: kind=host over-cap checks recorded as degraded without firing alerts. RDAP drops do NOT increment this counter; they fall through to the sticky last-good path (see domain_expiry_stale_served_total) |
| uptimepage_rdap_singleflight_total{outcome} | counter | RDAP singleflight outcome per domain: hit (cached, no outbound request) or miss (fetcher invoked) |
| uptimepage_domain_expiry_stale_served_total{kind} | counter | times the domain-expiry executor served a cached last-good answer instead of a fresh probe. kind distinguishes the cause: throttled, timeout, lookup_error, or fresh_error (no usable last-good; emitted as a real Error instead) |
| uptimepage_domain_expiry_state_write_failed_total | counter | failures writing the last-good cache row after a successful probe. Sustained values mean the sticky cache is going cold even though probes succeed; the typical cause is Postgres write degradation |
| uptimepage_scheduler_refresh_failed_total | counter | registry refresh ticks that returned an error from Postgres. Alert on a sustained rate above your normal noise floor; persistent failures put the scheduler into exponential backoff (capped at 10× the configured refresh interval) and keep workers running with cached ScheduledTarget snapshots |
| uptimepage_rdap_singleflight_slots | gauge | live entries in the in-process RDAP singleflight cache. Bounded under normal load by the set of monitored domains; sudden growth signals a code path feeding non-target domains into the cache |
| uptimepage_scheduler_consecutive_refresh_failures | gauge | consecutive registry refresh failures since the last success. Primary alarm signal for a stuck scheduler: page when the value stays above 5 for more than a few minutes. Resets to 0 on the first successful refresh |
| uptimepage_scheduler_refresh_duration_ms | histogram | wall-clock duration of one registry refresh tick (Postgres query + decode + DashMap diff). p99 climbing into the hundreds of ms means the current full-scan refresh is starting to strain at scale, which is the trigger for the deferred incremental-sync work |
| uptimepage_build_info{version} | counter | set to 1 once at startup so the endpoint is never empty |
| uptimepage_check_duration_ms | histogram | per-check wall time. The uptimepage_check_*_ms family is exposed as histogram buckets (not summary quantiles) so percentiles aggregate correctly across regions; query with histogram_quantile() |
| uptimepage_check_dns_ms | histogram | DNS resolution latency (recorded in the hickory wrapper) |
| uptimepage_check_connect_ms | histogram | TCP connect latency (every HTTP check connects fresh) |
| uptimepage_check_tls_ms | histogram | TLS handshake latency (per HTTPS check) |
| uptimepage_check_ttfb_ms | histogram | time-to-first-byte: request sent to response headers |
| uptimepage_storage_batch_size | histogram | flush batch sizes |
| uptimepage_storage_write_duration_ms | histogram | flush durations |
| uptimepage_telegram_send_wait_ms | histogram | wait imposed on a Telegram send by the send budget before its slot opened |
| uptimepage_targets_total | gauge | targets in this process’s scheduler registry (sampled). Non-zero only where in-process probing runs; a brain doing agent-only probing reports 0 by design, so use uptimepage_targets_enabled for the configured-monitor count |
| uptimepage_targets_enabled{kind} | gauge | configured enabled monitors counted from Postgres, by kind. Slow-cadence inventory gauge, scrape-cached so request load never reaches Postgres; correct on a brain regardless of where probing runs |
| uptimepage_users_active | gauge | non-deleted user accounts counted from Postgres. Slow-cadence inventory gauge, scrape-cached |
| uptimepage_workers_in_flight | gauge | current worker-pool semaphore depth (sampled). Emitted by every probing process, so on a brain doing agent-only probing the real value is on the agent’s role=probe series, not the brain’s near-zero one |
| uptimepage_result_queue_depth | gauge | depth of the result channel buffer (sampled). Present on both the agent (egress to the control plane) and the brain (ingest to storage); separate them by role |
| uptimepage_circuit_breakers_open | gauge | currently-open breakers (sampled). Probe-side: read the role=probe series |
| uptimepage_monitors_unmonitored | gauge | monitors whose covering probes have all gone silent (no fresh results), from the silence sweep. Distinct from down: these have no data at all |
| uptimepage_agent_up{region,agent} | gauge | 1 if a regional agent checked in within the staleness window, else 0. Emitted by the control plane from agents.last_seen_at, so it covers remote agents that Alloy can’t scrape. Per-agent series can freeze on agent removal, so alerts use uptimepage_agents_enabled_down |
| uptimepage_agent_last_seen_age_seconds{region,agent} | gauge | seconds since a regional agent last checked in. Climbs unbounded when an agent goes dark |
| uptimepage_agents_enabled_down | gauge | count of enabled regional agents currently past the staleness window. Recomputed every sweep so it never latches. The dead-man signal for a probe region going dark |
| uptimepage_region_agents_total{region} | gauge | enabled agents configured for a region, the quorum denominator. Brain-side from the agents table |
| uptimepage_region_agents_up{region} | gauge | enabled agents in a region fresh within the staleness window, the quorum numerator. up / total is the region’s health fraction; up == 0 means the region’s agents have all gone stale. Recomputed each sweep; like the per-agent gauges it can freeze if a region’s last agent is removed. Covers agents Alloy can’t scrape |
| uptimepage_region_checks_window{region} | gauge | checks completed in a region over the recent sampling window. Brain-side count from ClickHouse, so it covers remote agents Alloy can’t scrape. Only regions with results in the window appear |
| uptimepage_region_checks_up_window{region} | gauge | checks that returned up in a region over the recent window. Divide by uptimepage_region_checks_window for the success ratio |
| uptimepage_region_check_latency_p95_ms{region} | gauge | approximate p95 check latency in a region over the recent window, in ms. Goes stale for a dark region (no new rows), so gate panels on uptimepage_region_agents_up |
| uptimepage_pg_pool_size | gauge | total connections held in the sqlx Postgres pool (idle + in-use). Bounded above by storage.postgres.max_connections |
| uptimepage_pg_pool_idle | gauge | connections sitting idle in the Postgres pool. A persistent idle = 0 alongside in_use at the max is the saturation signal |
| uptimepage_pg_pool_in_use | gauge | connections checked out of the Postgres pool right now (size − idle). Alert on a sustained high in_use / size ratio |
| uptimepage_process_resident_bytes | gauge | resident set size of the process (VmRSS) in bytes. Linux only; absent on non-Linux dev runs. Early-warning signal for slow leaks ahead of the OOM killer |
| uptimepage_clickhouse_max_part_count_for_partition | gauge | ClickHouse MaxPartCountForPartition (sampled from system.asynchronous_metrics). Partition-explosion early warning: climbs toward parts_to_throw_insert (default 3000) if a high-cardinality column is added to PARTITION BY |
| uptimepage_http_requests_total{method,route,status} | counter | inbound HTTP requests handled. route is MatchedPath (the path-pattern with placeholders); cardinality is bounded by the static router table, never by per-tenant ids. status is bucketed 2xx/3xx/4xx/5xx/other; query sum by (status) (rate(...[5m])) for the SLO ratio |
| uptimepage_http_request_duration_ms{method,route} | histogram | inbound HTTP request latency, exposed as summary quantiles (single web instance, no cross-instance merge). Query name{quantile="0.99"} for tail latency per route |
| uptimepage_http_responses_inflight | gauge | inbound HTTP requests currently being served. Climbing alongside flat throughput points at handler back-pressure on a downstream pool |
| uptimepage_ratelimit_drops_total{scope} | counter | HTTP 429s from the per-org / per-user rate-limit middleware. scope is the same string carried in the error body (per_org_api_writes, per_user_bulk_ops, …) so dashboards can join with record_quota_event audit rows. Abuse signal: a tenant hammering the API spikes one scope before shared resources notice |
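
The paging guidance for uptimepage_scheduler_consecutive_refresh_failures ("stays above 5 for more than a few minutes") is a sustained-threshold condition, which a Prometheus alerting rule expresses with its for: clause. The sketch below shows the same logic over a list of scraped samples. The function name is hypothetical, and the 5-minute hold is an assumption standing in for "a few minutes".

```python
def stays_above(samples, threshold, hold_seconds):
    """Return True if the newest sample ends an unbroken breach of hold_seconds.

    samples: list of (unix_timestamp, value) pairs sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any dip at or below the threshold resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Illustrative series at a 15 s scrape interval; the values are made up.
stuck = [(t, 6) for t in range(0, 315, 15)]   # above 5 for a full 300 s
recovered = stuck + [(315, 0)]                # a successful refresh resets to 0
page_now = stays_above(stuck, threshold=5, hold_seconds=300)
```

Because the gauge resets to 0 on the first successful refresh, a single good tick clears the condition, which is why the hold window matters more than the exact threshold.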

A scrape interval of 15 s is plenty: counters are written from hot tokio tasks, and histograms aggregate per bucket without lock contention.

Histogram exposition. Histograms are exported in one of two forms. The uptimepage_check_*_ms family is configured with explicit buckets and exported as a Prometheus histogram (name_bucket{le="..."} plus name_sum / name_count); query it with histogram_quantile(0.99, sum(rate(name_bucket[5m])) by (le)) so percentiles pool correctly across regional agents. Every other *_ms / *_size histogram keeps the default exposition, a Prometheus summary with precomputed quantile series (name{quantile="0.5|0.9|0.95|0.99|0.999"}) plus name_sum and name_count; query those as name{quantile="0.99"} directly. Gauges carry no org_id label: these are single-instance operator metrics, not per-tenant.
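
To make the pooling argument concrete, here is a minimal sketch of what histogram_quantile() does with cumulative buckets: sum the counts per le boundary across regions, then interpolate linearly inside the bucket that contains the target rank. The bucket boundaries and counts are invented for illustration; the real bucket layout is not shown in this chapter, and edge cases such as NaN inputs are omitted.

```python
import math


def pool(*regions):
    """Sum cumulative bucket counts across regions, keyed by le boundary."""
    merged = {}
    for buckets in regions:
        for le, count in buckets:
            merged[le] = merged.get(le, 0) + count
    return sorted(merged.items())


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative [(le, count), ...]."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # rank lands in +Inf: report the highest finite bound
            span = count - prev_count
            if span == 0:
                return le
            return prev_le + (le - prev_le) * (rank - prev_count) / span
        prev_le, prev_count = le, count
    return prev_le


# Made-up cumulative buckets (le in ms) for two regional agents.
eu = [(100, 90), (250, 99), (500, 100), (math.inf, 100)]
us = [(100, 10), (250, 50), (500, 95), (math.inf, 100)]

pooled = pool(eu, us)
p90_all_regions = histogram_quantile(0.9, pooled)
```

Summary quantiles cannot be combined this way: each quantile series is already a per-process estimate, so there is nothing left to sum before taking the percentile.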

Scrape labels. The collector stamps two labels the app does not set: role (control-plane on the brain, probe on a regional agent) and, on probe series, region. The brain and a probe both emit the prober and pipeline metrics (check_*, workers_in_flight, circuit_breakers_open, result_queue_depth, storage_*, process_resident_bytes), so filter by role to read the one you mean rather than summing two processes. The Ops dashboard pins probe panels to role=probe and filters them by a $region variable; the Business dashboard reads the control-plane-only inventory gauges.

The uptimepage_region_* gauges are different: the brain emits them with a region label it sets itself (from the agents table and from ClickHouse), not a collector-stamped scrape label. They are the per-region surface on a SaaS control plane, where the regional agents are not scraped at all: liveness and quorum come from the agents table (region_agents_up / region_agents_total), throughput and latency from ClickHouse (region_checks_window / region_checks_up_window / region_check_latency_p95_ms). There is one scrape point, and cost scales with regions, not with tenants or fleet size.
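
The ratios a dashboard derives from these gauges are simple divisions, sketched below. The function names are hypothetical, the input numbers are made up, and returning None for an empty denominator or a dark region is a presentation choice of this sketch rather than documented behaviour.

```python
def region_health(agents_up, agents_total, checks_up, checks_total):
    """Return (health_fraction, success_ratio) for one region.

    health_fraction = region_agents_up / region_agents_total
    success_ratio   = region_checks_up_window / region_checks_window
    A zero denominator yields None instead of raising.
    """
    health = agents_up / agents_total if agents_total else None
    success = checks_up / checks_total if checks_total else None
    return health, success


def gated_latency(p95_ms, agents_up):
    """Hide the p95 gauge when every agent in the region is stale.

    The latency gauge stops updating for a dark region, so the last value
    would otherwise keep showing as if it were current.
    """
    return p95_ms if agents_up > 0 else None


health, success = region_health(agents_up=2, agents_total=3,
                                checks_up=950, checks_total=1000)
```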

OpenTelemetry tracing

Spans are exported over OTLP/HTTP (protobuf) when both observability.tracing_enabled and observability.grafana.enabled are true. The exporter targets observability.grafana.otlp_endpoint (the OTLP base; /v1/traces is appended) and authenticates with Authorization: Basic base64(instance_id:api_key). The destination is any OTLP/HTTP collector — Grafana Cloud Tempo, Jaeger, an OpenTelemetry Collector, etc.
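
As an illustration of the two derived values named above, this sketch builds the traces URL and the Authorization header. The function names, the endpoint host, and the credentials are placeholders, and stripping a trailing slash before appending the path is an assumption about how the base is joined.

```python
import base64


def otlp_traces_url(otlp_endpoint):
    """Append the traces path to the OTLP base endpoint."""
    # Assumption: a trailing slash on the base is tolerated and removed.
    return otlp_endpoint.rstrip("/") + "/v1/traces"


def basic_auth_header(instance_id, api_key):
    """Build the header value: Basic base64(instance_id:api_key)."""
    raw = f"{instance_id}:{api_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder values, not real credentials or a real endpoint.
url = otlp_traces_url("https://otlp.example.invalid/otlp")
header = basic_auth_header("123456", "not-a-real-key")
```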

  • api_key is read only from UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY — never from a file.
  • Sampling is parent-based over a head ratio (grafana.trace_sample_ratio, default 0.05); a sampled parent keeps its children.
  • Resource attributes: service.name = uptimepage, service.version = the build version.
  • Disabled by default and zero-cost when off: no exporter is built, no network egress, no per-check overhead.
  • A batch exporter ships spans in the background; it is flushed and stopped on graceful shutdown so the final spans are not lost. A transport build failure logs a warning and the service continues without traces — telemetry never takes down monitoring.
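
The sampling bullet above can be sketched as follows. This is an illustration of the parent-based idea, not the exact algorithm of the OpenTelemetry SDK: the function name is hypothetical, dropping the children of an unsampled parent is the usual parent-based default and is assumed here, and comparing the low 64 bits of the trace id against the ratio is one common way to make a deterministic head decision.

```python
def should_sample(trace_id, ratio, parent_sampled=None):
    """Parent-based sampling decision over a head ratio.

    trace_id: integer trace id.
    ratio: head sampling ratio in [0.0, 1.0].
    parent_sampled: True or False when a parent span exists, None for a root.
    """
    if parent_sampled is not None:
        return parent_sampled  # children inherit the parent's decision
    # Root span: keep the trace when the low 64 bits fall under ratio * 2**64.
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


LOW_ID = 0
HIGH_ID = (1 << 64) - 1

root_kept = should_sample(LOW_ID, 0.05)
root_dropped = should_sample(HIGH_ID, 0.05)
child_of_sampled = should_sample(HIGH_ID, 0.05, parent_sampled=True)
```

Deciding from the trace id rather than a random draw means every process that sees the same trace reaches the same verdict.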

Inconsistent settings (export on but endpoint/instance/key missing, or a sample ratio outside [0.0, 1.0]) fail fast at startup as a config error, not a runtime surprise. See Configuration for the keys and env overrides.
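
A minimal sketch of such a fail-fast check follows. The flat TracingConfig shape, the ConfigError type, and the error messages are hypothetical, and validating the ratio even when export is off is an assumption of this sketch.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at startup when tracing settings are inconsistent."""


@dataclass
class TracingConfig:
    # Hypothetical flattened view of the observability settings.
    tracing_enabled: bool = False
    grafana_enabled: bool = False
    otlp_endpoint: str = ""
    instance_id: str = ""
    api_key: str = ""
    trace_sample_ratio: float = 0.05


def validate(cfg):
    """Raise ConfigError for inconsistent settings; return None when valid."""
    if not 0.0 <= cfg.trace_sample_ratio <= 1.0:
        raise ConfigError("trace_sample_ratio must be within [0.0, 1.0]")
    if cfg.tracing_enabled and cfg.grafana_enabled:
        required = ("otlp_endpoint", "instance_id", "api_key")
        missing = [name for name in required if not getattr(cfg, name)]
        if missing:
            raise ConfigError("export is on but missing: " + ", ".join(missing))


def is_valid(cfg):
    """Convenience wrapper that reports validity as a boolean."""
    try:
        validate(cfg)
    except ConfigError:
        return False
    return True


defaults_ok = is_valid(TracingConfig())
```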

HTTP connection phase timings

Every HTTP check opens a fresh connection. There is no pool: a monitor probes each target once per interval, so a pool would rarely reuse a socket, and connecting fresh is what lets the probe attribute time to each phase. check_dns_ms, check_connect_ms, and check_tls_ms are timed during connection establishment, and check_ttfb_ms runs from request-send to response headers. The same four values are written per check into ClickHouse, which powers the detail-page latency-breakdown chart.
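
The attribution itself is a set of differences between timestamps taken on one connection, as in this sketch. The PhaseMarks type and its field names are hypothetical, the timestamps are made up, and how plain-HTTP checks record the TLS phase is not covered here.

```python
from dataclasses import dataclass


@dataclass
class PhaseMarks:
    """Monotonic timestamps in seconds, all taken on one fresh connection."""
    start: float
    dns_done: float
    connect_done: float
    tls_done: float
    request_sent: float
    headers_received: float


def phase_timings_ms(m):
    """Derive the four per-phase values, in ms, from consecutive marks."""
    def ms(later, earlier):
        return round((later - earlier) * 1000, 3)  # keep microsecond precision

    return {
        "check_dns_ms": ms(m.dns_done, m.start),
        "check_connect_ms": ms(m.connect_done, m.dns_done),
        "check_tls_ms": ms(m.tls_done, m.connect_done),
        "check_ttfb_ms": ms(m.headers_received, m.request_sent),
    }


# Illustrative marks for a single HTTPS check.
marks = PhaseMarks(start=10.000, dns_done=10.012, connect_done=10.030,
                   tls_done=10.075, request_sent=10.076, headers_received=10.140)
timings = phase_timings_ms(marks)
```

With a pooled connection the first three marks would collapse to zero on reuse, which is the attribution a fresh connect preserves.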