uptimepage

Async Rust service that runs HTTP and TCP health checks against a configurable set of targets, applies per-host circuit breaking, batches results, and ships them to durable storage. Targets persist in PostgreSQL; check results land in ClickHouse for high-cardinality time-series queries. Exposes a REST API for target CRUD and result queries, a server-rendered operator UI on the same port, and Prometheus metrics on a separate port.

Built on Rust 1.95 (edition 2024), Tokio, Axum, hyper-util (custom phase-timing connector + tokio-rustls), sqlx, and the official clickhouse crate. UI layer uses askama 0.16 + HTMX 2 + Tailwind 4 + ECharts 6, all served from the same binary. Designed for low-overhead checks at ~50k concurrent in-flight.

Where to start

Source

github.com/uptimepage/uptimepage

Architecture

Goals

  • Run periodic HTTP + TCP health checks against an arbitrary, mutable set of targets
  • Stay below 50 ms p99 overhead per check (excluding network)
  • Sustain ~50k concurrent in-flight checks per node
  • Survive transient target failures (per-host circuit breakers) and storage flaps (in-process retry + batching)
  • Graceful shutdown within 10 s without losing in-flight results

Module layout

src/
├── api/             REST handlers, router, OpenAPI doc, middleware
│   ├── docs.rs        utoipa OpenApi descriptor (/api/openapi.json + /docs SwaggerUI)
│   ├── error.rs       ApiError envelope + stable error code constants
│   ├── handlers/      one module per resource (targets, results, tags, dashboard, health)
│   ├── idempotency.rs DashMap-backed 24h cache + middleware for bulk + bulk-action
│   ├── middleware.rs  charset=utf-8 rewriter
│   ├── page.rs        PageEnvelope<T> + PageOfTarget / PageOfCheckResult / PageOfIncident / PageOfTagCount
│   ├── redaction.rs   credential redaction wrapper
│   ├── routes.rs      build_router + per-route layer wiring
│   └── types.rs       wire types not in domain/ (TagCount, DashboardSummary, BulkActionRequest, TestRequest, ...)
├── app.rs           AppState (storage + worker pool + caches)
├── bin/loadtest.rs  in-process load test driver
├── config.rs        typed configuration + env override loader
├── domain/          Target, CheckSpec, CheckResult, Incident + coalescing helper
├── error.rs         AppError + IntoResponse → ApiError envelope
├── http_client/     custom hyper-util client + phase-timing connector + hickory resolver
├── observability/   tracing + Prometheus + OTLP setup
├── pipeline/        result batcher
├── scheduler/       target registry + per-target tick loop
├── storage/         Postgres (targets) + ClickHouse (results) + in-memory test doubles
├── web/             askama 0.16 + askama_web HTML routes (dashboard, targets, forms, error pages)
│   ├── routes.rs      Router<AppState> merged into the main router in main.rs
│   ├── assets.rs      rust-embed handler for /static/* with cache-control
│   ├── auth.rs        session cookie scaffolding (v1.1 — no-op today)
│   ├── error.rs       AppError → HTML error page mapper (not the JSON envelope)
│   └── views/         one module per page (dashboard, targets_list, targets_detail, targets_form)
└── worker/          worker pool + circuit breaker + check executors

templates/           askama HTML (compiled into the binary)
└── ... base.html, dashboard{,/region}.html, targets/{list,detail,form}.html, error/{404,500,503}.html

static/              rust-embed bundle
├── css/             Tailwind 4 build output (built by build.rs)
└── js/              HTMX 2 + json-enc + ECharts 6 + tiny UI/chart modules under ui/ and charts/

The web layer is a thin server-rendered surface on top of the existing JSON API: every UI mutation hits /api/v1/* (forms post JSON, list/detail uses HTMX swaps of partials). See ui.md for operator-level details.

Data flow

                ┌────────────────┐
                │ REST API       │  target CRUD
                │ (axum + AppState)
                └────────┬───────┘
                         │ writes
                         ▼
                ┌────────────────┐
                │ PostgreSQL     │  target metadata
                └────────┬───────┘
                         │ TargetRegistry.refresh() every N seconds
                         ▼
                ┌────────────────┐
                │ Scheduler      │  one task per target, jittered tick
                └────────┬───────┘
                         │ dispatch
                         ▼
                ┌────────────────┐
                │ WorkerPool     │  semaphore-bounded, circuit-breaker-gated
                │  ├── http_check (hyper-util + hickory DNS)
                │  └── tcp_check  (tokio::net::TcpStream)
                └────────┬───────┘
                         │ CheckResult on mpsc channel
                         ▼
                ┌────────────────┐
                │ ResultBatcher  │  size + timeout flush
                └────────┬───────┘
                         │ write_batch
                         ▼
                ┌────────────────┐
                │ ClickHouse     │  check_results + 1-min agg MV
                └────────────────┘

On-demand checks (POST /targets/{id}/check-now and POST /targets/test) are dispatched to an agent in the target’s region over the agent’s held long-poll, and the request waits for the result. The agent persists check-now results (test results are returned but not stored). If no agent is currently serving the region the request returns 503 PROBE_UNAVAILABLE.

Key design choices

  • Two storage backends. Targets are low-cardinality, mutated by API operations → relational (Postgres) is the right fit. Results are append-only, high-cardinality, queried by time range → columnar (ClickHouse) keeps queries fast at 90-day retention.
  • Fresh-connect HTTP checks, two TLS modes. HttpClients holds two rustls TlsConnectors (verifying and insecure) plus the shared DNS cache and SSRF guard. There is no connection pool: a monitor probes each target once per interval, so a pool would rarely reuse a socket, and connecting fresh per check is what lets the probe time DNS resolve, TCP connect, and TLS handshake separately (timed_connect in src/http_client/connector.rs) and write those phases into each result. The request runs over hyper::client::conn (h1/h2 by ALPN); the connection task is aborted once the body is read. Per-target verify_tls picks the connector at dispatch time.
  • Per-host circuit breakers. Failing hosts open their breaker quickly; subsequent checks fail fast with error=circuit_open without consuming a worker slot. Half-open probes after open_duration_secs.
  • Per-tenant host throttle (bulkhead). A fail-fast semaphore caps how many in-flight checks one tenant can run against the same (host, port). Bursts beyond the cap are recorded as degraded with error="throttled: host concurrency cap" and do not fire alerts — the upstream is fine, the back-pressure is operator-side. The cap is keyed per-tenant so one customer’s burst can never starve another’s monitor of the same host. RDAP carries its own per-TLD cap so one slow registry can’t correlate failures across every customer’s daily domain-expiry check.
  • Sticky last-good for domain-expiry probes. Each successful RDAP probe writes (expiry_at, registrar, last_success_at) to domain_expiry_state (PK target_id, denormalised org_id, FK CASCADE on the target). Every trait method requires OrgId and the row is filtered by both keys — a handler taking target_id from request input cannot read another tenant’s row. A subsequent transient failure — RDAP timeout, throttle drop, registry 5xx, 404 — does not flip the monitor: the executor reads the cached row and emits a CheckResult with the cached verdict. For Up the error field stays empty; for Degraded/Down it carries a served_stale: … annotation, so operators can tell the surface from a fresh probe. Cached rows older than 7d (measured against last_success_at, never advanced by failures) escalate to Error, which is alert-eligible. Cross-tenant singleflight (keyed by canonical domain) collapses concurrent probes for the same domain to one outbound request — RDAP is public registry data, coalescing across tenants is safe and IANA-friendly.
  • Bounded result channel. The mpsc between worker pool and batcher has a fixed buffer (storage.clickhouse.buffer_size). When full, the worker increments storage_dropped_total{reason="queue_full"} and drops the result. Back-pressure is explicit, not hidden.
  • Idempotent migrations. Postgres uses sqlx::migrate! (tracked in _sqlx_migrations). ClickHouse migrations are bare CREATE TABLE IF NOT EXISTS statements run at startup. No external migrator.
  • Shared DNS cache. A single hickory resolver instance is invoked directly by timed_connect; lookups cache per RFC TTL plus configurable bounds. Per-resolution latency is recorded into check_dns_ms.
  • Cancellation tokens for shutdown. The root token is cloned to scheduler, batcher, sampler, idempotency pruner, and graceful axum shutdown. SIGINT/SIGTERM cancels root; subsystems drain in tokio::join!.
  • Self-describing API. utoipa derives an OpenAPI 3.1 document at compile time, exposed at /api/openapi.json and rendered at /docs via Swagger UI. Every handler annotation carries at least one example. The 4xx/5xx error envelope and the list PageEnvelope are unified across every endpoint.
  • In-process caches with bounded TTL. The dashboard summary holds a 5-second parking_lot::Mutex<Option<(Instant, DashboardSummary)>> to absorb operator polling. The Idempotency-Key cache is a DashMap keyed by (header, body-hash) with a 24-hour TTL; a background pruner sweeps expired entries hourly.
  • Incident coalescing. A shared helper in domain/incident.rs consumes ordered (timestamp, status, error) tuples and emits Incident rows. Memory + ClickHouse storage call into the same logic; the ClickHouse path uses a narrow column projection to keep bandwidth low.
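
The per-host circuit breaker described in the list above can be pictured as a small state machine. The following is a minimal sketch using only the standard library, not the project's implementation: the `Breaker` type, the `failure_threshold` knob, and the rule that a failed half-open probe re-opens the breaker are assumptions made for illustration. Only `open_duration` corresponds to a setting named in the text (open_duration_secs).

```rust
use std::time::{Duration, Instant};

/// Breaker states: closed (normal), open (fail fast), half-open (one trial probe).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

/// Hypothetical per-host breaker; `failure_threshold` is an assumed knob.
struct Breaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
    open_duration: Duration,
}

impl Breaker {
    fn new(failure_threshold: u32, open_duration: Duration) -> Self {
        Self { state: State::Closed, consecutive_failures: 0, failure_threshold, open_duration }
    }

    /// Returns true if a check may run now. An open breaker rejects calls
    /// until `open_duration` has elapsed, then admits a single half-open probe.
    fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed => true,
            State::HalfOpen => false, // a trial probe is already in flight
            State::Open { since } => {
                if now.duration_since(since) >= self.open_duration {
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// Records the outcome of a check that was allowed to run.
    fn record(&mut self, success: bool, now: Instant) {
        if success {
            self.consecutive_failures = 0;
            self.state = State::Closed;
            return;
        }
        self.consecutive_failures += 1;
        // Assumption: a failed half-open probe re-opens the breaker immediately.
        if self.state == State::HalfOpen || self.consecutive_failures >= self.failure_threshold {
            self.state = State::Open { since: now };
        }
    }
}
```

A caller would check `allow` before taking a worker slot and, when it returns false, emit a result with error=circuit_open instead of running the check.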

Concurrency model

  • One Tokio runtime, multi-threaded scheduler (default worker_threads = num_cpus)
  • One Tokio task per active target in the scheduler — sleeps interval ± jitter, dispatches, sleeps again
  • WorkerPool::execute spawns a new task per dispatch, gated by Arc<Semaphore> sized to max_concurrent_checks
  • Batcher is a single task with tokio::select! over channel-recv, timeout, and cancellation
  • Sampler is a single task that periodically reads gauge sources (pool semaphore counts, target count, breaker counts) and records into the metrics registry
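
The hand-off between the worker pool and the batcher relies on a bounded channel that drops on overflow rather than blocking. Here is a minimal sketch of that behaviour, assuming `std::sync::mpsc::sync_channel` as a stand-in for the Tokio mpsc the service uses; `CheckResult` is reduced to two fields, and the `publish` helper and the static counter are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Stand-in for the storage_dropped_total{reason="queue_full"} counter.
static DROPPED_QUEUE_FULL: AtomicU64 = AtomicU64::new(0);

/// Simplified result type; the real CheckResult carries more fields.
#[derive(Debug, PartialEq)]
struct CheckResult {
    target_id: u64,
    ok: bool,
}

/// Builds the bounded channel; `buffer_size` plays the role of
/// storage.clickhouse.buffer_size.
fn result_channel(buffer_size: usize) -> (SyncSender<CheckResult>, Receiver<CheckResult>) {
    sync_channel(buffer_size)
}

/// Non-blocking publish: returns false and bumps the drop counter when the
/// queue is full, so back-pressure shows up in a metric instead of a stall.
fn publish(tx: &SyncSender<CheckResult>, result: CheckResult) -> bool {
    match tx.try_send(result) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            DROPPED_QUEUE_FULL.fetch_add(1, Ordering::Relaxed);
            false
        }
        // Receiver gone (for example during shutdown): nothing to deliver to.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

The worker never waits on storage: a full queue costs one dropped result and one counter increment.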

Multi-region probes

By default one process is the whole system: it schedules and runs every check itself, in one region. A deployment can add regions by running extra processes as agents ([agent] enabled = true) — stateless probes with no database, web, or alerting. Each agent pulls its region’s decrypted monitor config from the control plane and POSTs results back; region is the partition key, so one agent per region needs no coordination. The control plane’s own region is a normal region row (scheduler.region), not a sentinel. Results carry their region + agent through both ClickHouse rollups, so reads can slice by region. Regions and agents are provisioned through the instance-admin /operator/* surface. See Multi-region probes for the full model, operator surface, and read-path behaviour.

REST API

Mounted under /api/v1 on the configured API bind. JSON in, JSON out. No authentication in v1 — bind to loopback or front it with a reverse proxy you trust.

OpenAPI 3.1 document at GET /api/openapi.json; Swagger UI at GET /docs.

All responses use Content-Type: application/json; charset=utf-8.

Response headers

  • POST /api/v1/targets (201) sets Location: /api/v1/targets/{id} so clients can follow up without re-deriving the path.
  • Cache-Control is stamped on every /api/v1/* response:
    • mutations (POST / PATCH / DELETE) → no-store
    • /api/v1/dashboard/summary → private, max-age=5 (matches the server-side cache)
    • all other reads → private, max-age=10
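
The Cache-Control rules above reduce to a small decision function. This is a sketch only: the function name is hypothetical, method and path are plain strings, and treating PUT as a mutation is an assumption, since the list names only POST, PATCH, and DELETE.

```rust
/// Picks the Cache-Control value for an /api/v1/* response.
fn cache_control(method: &str, path: &str) -> &'static str {
    match method {
        // Mutations are never cached. PUT is grouped here as an assumption.
        "POST" | "PATCH" | "DELETE" | "PUT" => "no-store",
        // The dashboard summary matches its 5-second server-side cache.
        _ if path == "/api/v1/dashboard/summary" => "private, max-age=5",
        // Every other read.
        _ => "private, max-age=10",
    }
}
```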

Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/targets | create one target |
| POST | /api/v1/targets/bulk | bulk-create up to 10,000 targets |
| POST | /api/v1/targets/bulk-action | enable / disable / delete / tag-add / tag-remove on many ids |
| POST | /api/v1/targets/test | run a one-shot check against a CheckSpec without persisting |
| POST | /api/v1/targets/{id}/check-now | run an immediate check using the target’s stored credentials |
| GET | /api/v1/targets | list targets (limit, offset, tag, enabled, q); paginated |
| GET | /api/v1/targets/{id} | get one target |
| PATCH | /api/v1/targets/{id} | update name, check spec, interval, enabled, tags |
| DELETE | /api/v1/targets/{id} | delete a target |
| GET | /api/v1/targets/{id}/results | recent check results (from, to, limit, offset, region); paginated |
| GET | /api/v1/targets/{id}/latency | bucketed latency series (from, to, region); server-side quantiles + per-phase means |
| GET | /api/v1/targets/{id}/latency/by-region | per-region latency series (from, to); one series per region, for overlay charts |
| GET | /api/v1/targets/{id}/uptime | uptime summary over a range (from, to, region) |
| GET | /api/v1/targets/{id}/regions | list the regions a monitor probes from |
| PUT | /api/v1/targets/{id}/regions | set the regions a monitor probes from |
| GET | /api/v1/regions | list the enabled probe-region catalog (id, name, location) |
| GET | /api/v1/targets/{id}/incidents | coalesced incident periods (from, to, ongoing_only); paginated |
| POST | /api/v1/targets/{id}/shares | mint a read-only share link; returns the share (token included) |
| GET | /api/v1/targets/{id}/shares | list a monitor’s live share links (token included, re-copyable) |
| DELETE | /api/v1/targets/{id}/shares/{share_id} | revoke a share link |
| GET | /api/v1/tags | tag inventory with target counts (q prefix); paginated |
| GET | /api/v1/dashboard/summary | per-org rollup (5-second in-process cache, keyed by OrgId) |
| GET | /healthz | liveness; always 200 once the process is up |
| GET | /readyz | readiness; pings the target store; 503 if unreachable |
| GET | /api/openapi.json | OpenAPI 3.1 document |
| GET | /docs | Swagger UI |

Instance-admin and agent surfaces

Two surfaces sit outside /api/v1 with their own auth, used only for multi-region deployments:

  • /operator/* — instance-admin regions + agents CRUD, gated by a static bearer secret (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); 404s when unset.
  • /api/agent/* — the pull/ingest endpoints an agent uses, authenticated by its sm_agent_… token (not a tenant api_token).

Both are documented in Multi-region probes.

Operator endpoints (maintenance + incident narration)

These mutate the public surface; they live under the same auth boundary as /api/v1/targets. Operator workflow + validation rules in Public status page.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/maintenance | schedule a maintenance window |
| GET | /api/v1/maintenance | list windows (status is one of active, upcoming, past, all; paginated) |
| GET | /api/v1/maintenance/{id} | get one window |
| PATCH | /api/v1/maintenance/{id} | edit title / description / time range / components (rejected after ends_at) |
| DELETE | /api/v1/maintenance/{id} | cancel a window |
| PATCH | /api/v1/incidents/{id} | update narration: public_title, public_description, severity (JSON null clears, omit to leave alone) |
| POST | /api/v1/incidents/{id}/updates | append a status update; phase is one of investigating/identified/monitoring/resolved/postmortem, message ≤ 2 000 chars |

Operator endpoints (status pages)

An org owns one or more public status pages, each with its own slug, branding, and curated set of monitors. Reads are open to any active member; every mutation is owner-only. Scoped to the caller’s active org (a foreign page id is 404). Adding a monitor already on the page returns 409 COMPONENT_ALREADY_ON_PAGE — edit it with PATCH. Model + caps in Per-org status pages.

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/v1/status-pages | list this org’s pages |
| POST | /api/v1/status-pages | create a page (capped at max_status_pages; slug globally unique) |
| GET | /api/v1/status-pages/{id} | one page + its live URL and logo URL |
| PATCH | /api/v1/status-pages/{id} | rename, change slug, publish/unpublish, edit branding |
| DELETE | /api/v1/status-pages/{id} | delete the page |
| GET | /api/v1/status-pages/{id}/components | the monitors curated onto the page |
| POST | /api/v1/status-pages/{id}/components | add a monitor (distinct-target cap max_public_components) |
| PATCH | /api/v1/status-pages/{id}/components/{target_id} | per-page public_name / public_description / public_group (JSON null clears) |
| DELETE | /api/v1/status-pages/{id}/components/{target_id} | remove a monitor from the page |
| POST | /api/v1/status-pages/{id}/components/reorder | set component order |
| POST | /api/v1/status-pages/{id}/logo | upload a logo (multipart) |
| DELETE | /api/v1/status-pages/{id}/logo | remove the logo |

Public status endpoints

Unauthenticated; mounted at /api/public/v1/* and bypassed at Caddy via the @public matcher (see Deployment). Each response carries Cache-Control: public, max-age=10, stale-while-revalidate=30. A monitor not curated onto the page being served is invisible on every public surface — direct lookups return 404 and it never appears in any list. Wire types literally cannot serialise sensitive target fields (url, headers, basic_auth, bearer_token).

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /status | server-rendered HTML status page (?fragment=1 returns the dynamic region only) |
| GET | /status/incidents/{id} | per-incident detail page |
| GET | /api/public/v1/status | the same data as /status in JSON |
| GET | /api/public/v1/components/{id}/history | per-component 90-day history (days query, default 90, max 90) |
| GET | /api/public/v1/incidents | recent public incidents (paginated) |
| GET | /api/public/v1/incidents/{id} | one public incident with its update timeline |
| GET | /api/public/v1/incidents.rss | RSS 2.0 feed of recent incidents |
| GET | /api/public/v1/maintenance | active + upcoming maintenance windows |
| GET | /api/public/v1/badge.svg | embeddable SVG status badge (overall, or ?component={id}) |

See Public status page for the operator workflow and the per-page component fields (public_name, public_description, public_group, sort_order) that drive what’s published.

A share link is a capability URL that renders one monitor’s full read-only detail view to anyone who has it, with no account required. Managing share links (mint, list, revoke) is a monitor action gated on member-level targets:write (not owner-only). Listing is gated the same way because it returns the live token, so a read-only caller can’t harvest working public links. Scoped to the caller’s active org (a foreign monitor id is 404). expires_at is optional; omit it for a link that never expires. The public surface those tokens unlock is documented in Share links.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/targets/{id}/shares | mint a share; body { "label"?, "expires_at"? }, returns the MonitorShare |
| GET | /api/v1/targets/{id}/shares | list live (non-revoked) shares |
| DELETE | /api/v1/targets/{id}/shares/{share_id} | revoke immediately; the link 404s on its next request |

Both POST and GET return the token; build the link as /m/{token} (prepend your origin). The token stays re-copyable: it is stored encrypted at rest (the app KEK, same as basic_auth/bearer_token), and the public resolve path matches on a separate hash, so a hot link never triggers a decrypt. token is null only when a row was sealed under a KEK that is no longer configured. Two plan caps apply (columns on plans, overridable per-org via plan_overrides): max_share_links_per_monitor (active links on one monitor) and max_shared_monitors (distinct monitors in the org that have any link). On the free plan these are 1 and 2, respectively. Exceeding either is 422 QUOTA_EXCEEDED (the body names the quota). A label longer than 80 characters is 400 SHARE_LABEL_INVALID; an expires_at in the past is 400 INVALID_EXPIRY.

Check specs

Tagged enum, type discriminator.

HTTP

{
  "type": "http",
  "url": "https://example.com/healthz",
  "method": "GET",
  "timeout": 5000,                              // ms, total request budget
  "follow_redirects": false,
  "max_redirects": 0,
  "expected_status": { "kind": "exact", "value": 200 },
  "expected_body_contains": null,               // optional substring match
  "headers": {},
  "body": null,
  "verify_tls": true,
  "basic_auth": null,                           // ["user", "pass"] or null
  "bearer_token": null
}

Credential redaction

GET, POST, PATCH, and bulk responses replace populated basic_auth / bearer_token fields with the sentinel "***". A null field stays null, so clients can distinguish “auth is configured” from “no auth”. When you PATCH a target’s check, you must re-supply the real credential — a body that contains "***" is rejected with 400 Bad Request. If you only need to change other fields (name, tags, enabled, interval), omit check from the PATCH body. Encryption at rest is gated on security.credentials_kek_base64; the redaction behavior applies in either mode.
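
The redaction round-trip can be sketched as two small helpers, one for each direction. This is an illustration under stated assumptions: the secret is modelled as a single string (as for bearer_token), the function names are hypothetical, and the error text is only modelled on the validation message quoted later in this document.

```rust
/// The sentinel that replaces a populated credential in responses.
const REDACTED: &str = "***";

/// Response side: a populated secret becomes the sentinel, None stays None,
/// so clients can still tell "auth configured" from "no auth".
fn redact(secret: Option<&str>) -> Option<&'static str> {
    secret.map(|_| REDACTED)
}

/// Request side: reject a body that echoes the sentinel back, so a
/// read-then-write round trip cannot overwrite the stored credential.
fn reject_sentinel(field: &str, value: Option<&str>) -> Result<(), String> {
    match value {
        Some(v) if v == REDACTED => Err(format!("{field} contains redaction sentinel")),
        _ => Ok(()),
    }
}
```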

expected_status variants:

{ "kind": "exact", "value": 200 }
{ "kind": "range", "value": { "min": 200, "max": 299 } }
{ "kind": "one_of", "value": [200, 204] }
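
The three variants map naturally onto an enum with a matcher. The sketch below is illustrative rather than the project's type: the `ExpectedStatus` name and the `matches` method are assumptions, JSON decoding is omitted, and the range is assumed to be inclusive at both ends (as the 200 to 299 example suggests).

```rust
/// In-memory form of the expected_status variants shown above.
#[derive(Debug, Clone, PartialEq)]
enum ExpectedStatus {
    Exact(u16),
    Range { min: u16, max: u16 },
    OneOf(Vec<u16>),
}

impl ExpectedStatus {
    /// True when the response status satisfies the expectation.
    fn matches(&self, status: u16) -> bool {
        match self {
            ExpectedStatus::Exact(v) => status == *v,
            // Assumption: both bounds are inclusive.
            ExpectedStatus::Range { min, max } => (*min..=*max).contains(&status),
            ExpectedStatus::OneOf(list) => list.contains(&status),
        }
    }
}
```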

Rate-limited responses

A response with 429 Too Many Requests or 503 Service Unavailable is recorded as degraded, not down — the upstream is telling us “I’m here, back off.” The error field carries rate-limited <code> (Retry-After: <value>) when the header is present so operators can size the polling interval against what the upstream actually wants. A check that explicitly accepts 429 / 503 via expected_status is honored first and stays up.

Some third-party APIs rate-limit by source IP regardless. GitHub’s unauthenticated REST API is the canonical case: 60 req/h per IP, 5 000 req/h with a token in the Authorization header. Poll those endpoints at ≥ 300 s, or attach the token via a header in this spec.
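
The precedence described in this section (an explicit expectation first, then the rate-limit rule, then failure) can be sketched as follows. The `classify` function, the `Status` enum, the error text for a missing Retry-After header, and the "unexpected status" message are assumptions; only the `rate-limited <code> (Retry-After: <value>)` form comes from the text.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Up,
    Degraded,
    Down,
}

/// `accepted` says whether the check's expected_status matched the code.
fn classify(code: u16, accepted: bool, retry_after: Option<&str>) -> (Status, Option<String>) {
    if accepted {
        // An explicit expectation (even of 429 / 503) is honored first.
        return (Status::Up, None);
    }
    if code == 429 || code == 503 {
        let error = match retry_after {
            Some(v) => format!("rate-limited {code} (Retry-After: {v})"),
            // Assumption: without the header only the code is reported.
            None => format!("rate-limited {code}"),
        };
        return (Status::Degraded, Some(error));
    }
    // Hypothetical message for any other unexpected status.
    (Status::Down, Some(format!("unexpected status {code}")))
}
```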

Per-host throttle

The worker side caps the number of concurrent checks one tenant can fan at the same (host, port) so a burst of monitors against one upstream doesn’t look like a probe. When the cap is reached, the over-cap check is recorded as degraded with error="throttled: host concurrency cap" and no alert fires — the upstream is fine, the back-pressure is operator-side. The cap is per-tenant: one customer’s burst never starves another customer’s monitor of the same host. Default cap is two in-flight per (org, host, port); tune via checker.per_host_max_inflight. RDAP queries (domain expiry) carry their own per-TLD cap via checker.rdap_max_inflight.
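
A fail-fast cap keyed by (org, host, port) might look like the sketch below. It uses a mutex-guarded counter map from the standard library rather than whatever semaphore the service actually uses; `HostThrottle`, `Permit`, and `try_acquire` are hypothetical names, and `cap` stands in for checker.per_host_max_inflight.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Key = (String, String, u16); // (org id, host, port)

/// Fail-fast bulkhead: at most `cap` in-flight checks per key.
#[derive(Clone)]
struct HostThrottle {
    cap: usize,
    inflight: Arc<Mutex<HashMap<Key, usize>>>,
}

/// Holds one slot; the slot is released when the permit is dropped.
struct Permit {
    key: Key,
    inflight: Arc<Mutex<HashMap<Key, usize>>>,
}

impl HostThrottle {
    fn new(cap: usize) -> Self {
        Self { cap, inflight: Arc::new(Mutex::new(HashMap::new())) }
    }

    /// Never waits: returns None when the key is already at the cap, in which
    /// case the caller records a degraded, throttled result.
    fn try_acquire(&self, org: &str, host: &str, port: u16) -> Option<Permit> {
        let key = (org.to_string(), host.to_string(), port);
        let mut map = self.inflight.lock().unwrap();
        let count = map.entry(key.clone()).or_insert(0);
        if *count >= self.cap {
            return None;
        }
        *count += 1;
        Some(Permit { key, inflight: Arc::clone(&self.inflight) })
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        let mut map = self.inflight.lock().unwrap();
        let remaining = match map.get_mut(&self.key) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return,
        };
        if remaining == 0 {
            map.remove(&self.key); // keep the map from growing without bound
        }
    }
}
```

Because the org id is part of the key, one tenant reaching its cap leaves another tenant's slots for the same host untouched.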

TCP

{ "type": "tcp", "host": "db.internal", "port": 5432, "timeout": 2000 }

TLS certificate expiry

{
  "type": "tls_cert",
  "host": "example.com",
  "port": 443,
  "server_name": null,         // optional SNI override; defaults to `host`
  "warn_days": 14,
  "critical_days": 7,
  "timeout": 5000
}

Opens a TCP connection, performs a TLS handshake against the host (accepting any presented chain so that expired or self-signed certs can still be inspected), and parses the leaf certificate’s notAfter. Status mapping:

  • days_remaining < 0 (expired) → down
  • days_remaining < critical_days → down
  • days_remaining < warn_days → degraded
  • otherwise → up

error carries a JSON document with days_remaining, not_after, subject_common_name, issuer_common_name. A handshake failure (plain-TCP host, network error) returns error status with the underlying message. warn_days must be strictly greater than critical_days. Floor is interval >= 3600 (enforced); default for a new monitor is 86400 (daily).
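
The status mapping and the threshold rule above fit in two short functions. This is a sketch with hypothetical names (`cert_status`, `validate_thresholds`); certificate parsing and the handshake are out of scope, so `days_remaining` is taken as an input.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Up,
    Degraded,
    Down,
}

/// Maps days until notAfter onto a status, checking the strictest rule first.
fn cert_status(days_remaining: i64, warn_days: i64, critical_days: i64) -> Status {
    if days_remaining < 0 || days_remaining < critical_days {
        Status::Down // expired, or inside the critical window
    } else if days_remaining < warn_days {
        Status::Degraded
    } else {
        Status::Up
    }
}

/// warn_days must be strictly greater than critical_days.
fn validate_thresholds(warn_days: i64, critical_days: i64) -> Result<(), &'static str> {
    if warn_days > critical_days {
        Ok(())
    } else {
        Err("warn_days must be > critical_days")
    }
}
```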

Domain expiration

{
  "type": "domain_expiry",
  "domain": "example.com",
  "warn_days": 30,
  "critical_days": 7,
  "timeout": 10000
}

Queries the IANA RDAP bootstrap registry to find the authoritative RDAP server for the domain’s TLD, then fetches /domain/<domain> and reads the events[?eventAction == "expiration"] entry. Status mapping is the same as TLS cert: < critical_days → down, < warn_days → degraded, else up. Non-up results carry a JSON error body with domain, days_remaining, expiration_date, and (when present) registrar.

The bootstrap registry is fetched lazily on the first lookup and cached for the lifetime of the process. The SSRF guard does not apply — the check’s network destination is an IANA-published RDAP server, not the user-supplied domain. Floor is interval >= 3600 (enforced); default for a new monitor is 86400 (daily). RDAP servers rate-limit clients — keep this near daily, not hourly. warn_days must be strictly greater than critical_days.

Target payload

{
  "name": "internal-api",
  "check": { /* check spec */ },
  "interval": 60,             // seconds between ticks; effective floor is
                              // max(plan.min_check_interval_secs, kind_min).
                              // kind_min is 10 for http/tcp/dns and 3600 for
                              // tls_cert/domain_expiry. Plan-free min = 60.
                              // 10 is the absolute DB CHECK hard floor.
  "enabled": true,
  "tags": ["prod", "tier1"],
  "alerts": { /* optional, see below */ }
}

Server returns the full Target including id (UUIDv7), created_at, updated_at, and write_source.

write_source is a read-only field recording where the resource was last written from: ui, api, or terraform (decided server-side from the request, never the body — sending it is ignored). It also appears on notification channels and maintenance windows, and drives the “managed by” badge in the web UI. A write through any endpoint restamps it, so it reflects the most recent author.
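
The interval floor noted in the target payload comment, max(plan.min_check_interval_secs, kind_min), can be worked through in a few lines. The function names are hypothetical, and check kinds are plain strings here for brevity.

```rust
/// Per-kind minimum interval in seconds, as listed in the payload comment.
fn kind_min_secs(kind: &str) -> Option<u32> {
    match kind {
        "http" | "tcp" | "dns" => Some(10),
        "tls_cert" | "domain_expiry" => Some(3600),
        _ => None, // unknown kind: no floor can be derived
    }
}

/// Effective floor: the larger of the plan minimum and the kind minimum.
fn effective_floor(plan_min_check_interval_secs: u32, kind: &str) -> Option<u32> {
    kind_min_secs(kind).map(|kind_min| plan_min_check_interval_secs.max(kind_min))
}
```

With a plan minimum of 60, an http monitor has a floor of 60 seconds, while a tls_cert monitor has a floor of 3600 seconds because the kind minimum dominates.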

Alert config

alerts is an optional array of channel bindings. Each binding is just a reference to a notification channel (see Notification channels); the firing policy lives on the monitor itself. An empty/omitted array disables channel alerting for that target (incidents still open and show on status pages).

"alerts": [
  { "channel_id": "0192a1ce-0000-7000-8000-000000000001" },
  { "channel_id": "0192a1ce-0000-7000-8000-000000000002" }
],
"alert_confirmations": 3,
"notify_recovery": true,
"renotify_interval_secs": 3600,
"region_policy": "majority"
  • channel_id — id of a notification channel owned by the same org. A binding to an unknown or another tenant’s channel is rejected.
  • alert_confirmations — consecutive failing checks before an incident opens (and the same number of passing checks before it closes, which damps flapping). Default 2, must be >= 1.
  • notify_recovery — when true (default), the recovery is announced to the monitor’s channels. When false, recovery is silent.
  • renotify_interval_secs — seconds between reminder notifications while an outage stays unacknowledged. 0 disables reminders; otherwise must be >= 60. Default 3600. Acknowledging or resolving the incident stops the reminders.
  • region_policy — how many probe regions must agree the target is down before an incident opens: "any", "majority" (default), "all", or { "count": N }.

Notifications are driven by the incident engine: one notification per incident open (then reminders per renotify_interval_secs), one on recovery. Failed deliveries retry on exponential backoff and dead-letter after the attempt cap; per-incident delivery state is visible at GET /api/v1/incidents/{id}/notifications.

Alert validation errors

POST and PATCH return 400 Bad Request (INVALID_ALERT_CONFIG) for:

  • a duplicate channel_id in the array
  • notification channel <id> does not exist — unknown id, or one owned by another org
  • alert_confirmations must be >= 1
  • renotify_interval_secs must be 0 (off) or at least 60

A region_policy of { "count": N } where N is 0 or exceeds the available regions is 422 INVALID_REGION_POLICY.
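
The four region_policy values and the count validation can be sketched as below. The type and function names are hypothetical, and reading "majority" as strictly more than half of the regions is an assumption the text does not spell out.

```rust
/// In-memory form of region_policy: "any", "majority", "all", or { "count": N }.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RegionPolicy {
    Any,
    Majority,
    All,
    Count(usize),
}

/// True when enough regions report the target down for an incident to open.
fn quorum_met(policy: RegionPolicy, down: usize, total: usize) -> bool {
    if total == 0 {
        return false;
    }
    match policy {
        RegionPolicy::Any => down >= 1,
        // Assumption: strict majority, so 1 of 2 regions is not enough.
        RegionPolicy::Majority => down * 2 > total,
        RegionPolicy::All => down == total,
        RegionPolicy::Count(n) => down >= n,
    }
}

/// A count of 0, or one above the available regions, is rejected.
fn validate_policy(policy: RegionPolicy, available: usize) -> Result<(), &'static str> {
    match policy {
        RegionPolicy::Count(n) if n == 0 || n > available => Err("INVALID_REGION_POLICY"),
        _ => Ok(()),
    }
}
```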

Validation errors

POST and PATCH return 400 Bad Request for:

  • Unsupported URL scheme (url scheme '...' not allowed — only http and https)
  • Missing URL host, empty TCP host, or TCP/TLS port 0
  • tls_cert warn_days must be > critical_days
  • domain_expiry domain must contain a TLD label (no dot in domain)
  • domain_expiry warn_days must be > critical_days
  • SSRF guard: target address ... is in a blocked range. Triggered when the URL or TCP host is an IP literal that resolves to loopback / private / link-local / reserved space (see Configuration → security.allow_private_targets). Hostname literals are checked again at connect time after DNS resolution, so DNS rebinding cannot bypass the guard.
  • Redaction sentinel: basic_auth contains redaction sentinel — re-supply the real credential, or the equivalent for bearer_token. Rejected to prevent a GET → PATCH round-trip from silently overwriting the stored credential with "***".
  • TLS verification + credentials: verify_tls = false cannot be combined with basic_auth or bearer_token over https. When verification is disabled any host presenting a forged certificate can collect the stored credential on every check interval. Set verify_tls = true (recommended) or remove the credential from the target.
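
The SSRF guard in the list above comes down to classifying an IP address before connecting. The sketch below uses only `std::net`; the exact set of blocked ranges in the service is not listed in this document, so the ranges checked here (loopback, private, link-local, unspecified, broadcast, documentation, IPv6 unique-local) are an assumption, as are the function names. The same check would run on each resolved address at connect time.

```rust
use std::net::IpAddr;

/// True for addresses a public monitor should not probe by default.
fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback()
                || v4.is_private()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
                || v4.is_documentation()
        }
        IpAddr::V6(v6) => {
            // Treat ::ffff:a.b.c.d like the IPv4 address it wraps.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // fc00::/7 unique local
                || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
        }
    }
}

/// `allow_private_targets` mirrors security.allow_private_targets.
fn check_target(ip: IpAddr, allow_private_targets: bool) -> Result<(), String> {
    if !allow_private_targets && is_blocked(ip) {
        return Err(format!("target address {ip} is in a blocked range"));
    }
    Ok(())
}
```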

Notification channels

Per-org delivery destinations that targets bind to via their alerts array. Org scoping is implicit in the caller’s authenticated context — one tenant can never read, mutate, or test another’s channels.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/notification-channels | Create a channel (201 + Location) |
| GET | /api/v1/notification-channels | List the org’s channels |
| GET | /api/v1/notification-channels/{id} | Get one |
| PATCH | /api/v1/notification-channels/{id} | Partial update |
| DELETE | /api/v1/notification-channels/{id} | Delete (204); also removes the channel’s alert bindings from every monitor |
| POST | /api/v1/notification-channels/test | Test an unsaved transport config |
| POST | /api/v1/notification-channels/{id}/test | Send a synthetic test alert through a saved channel |
| POST | /api/v1/notification-channels/{id}/resend-verification | Resend the verification mail for an unverified email channel |

{
  "name": "Ops Slack",
  "enabled": true,
  "config": { "type": "slack", "webhook_url": "https://hooks.slack.com/services/T/B/XXXX" }
}

config is type-tagged. Supported transports:

  • slack: { "type": "slack", "webhook_url": "https://…" } (incoming webhook; posts { "text": "…" })
  • discord: { "type": "discord", "webhook_url": "https://discord.com/api/webhooks/…" } (channel webhook; posts { "content": "…" } with ?wait=true so delivery failures surface synchronously; text capped at 2000 chars)
  • msteams: { "type": "msteams", "webhook_url": "https://….logic.azure.com/…" } (Teams Workflows webhook; posts an Adaptive Card. Retired O365 connector URLs are not accepted)
  • google_chat: { "type": "google_chat", "webhook_url": "https://chat.googleapis.com/v1/spaces/…" } (space webhook; posts { "text": "…" }, capped at 4096 chars)
  • webhook: { "type": "webhook", "url": "https://…", "headers": { … }, "secret": "…" } (POSTs the alert JSON; optional custom headers; optional signing secret, see below). The escape hatch: no host restrictions, for services the named kinds don’t cover
  • telegram: { "type": "telegram", "bot_token": "…", "chat_id": "…" } (bring-your-own bot)
  • telegram_app: { "type": "telegram_app", "chat_id": "…", "chat_title": "…" }, linked through the platform’s central bot. Not creatable from request bodies: a POST/PATCH/test carrying this kind returns 422 CHANNEL_KIND_MANAGED (the chat id rides the operator bot’s credentials, so accepting one would let any caller page an arbitrary chat). Channels of this kind are created only by the link-code flow below.
  • whatsapp: { "type": "whatsapp", "access_token": "…", "phone_number_id": "…", "to": "…", "template_name": "…", "language_code": "en" } (Business Cloud API; language_code optional, default en)
  • whatsapp_app: { "type": "whatsapp_app", "phone": "…", "profile_name": "…" }, linked through the platform’s WhatsApp number. Not creatable from request bodies (422 CHANNEL_KIND_MANAGED, same rationale as telegram_app); created only by the WhatsApp link-code flow below.
  • pagerduty: { "type": "pagerduty", "routing_key": "…" } (the 32-character Events API v2 integration key of a PagerDuty service). The only transport that drives the destination’s own incident lifecycle: opens/reopens/escalations send trigger and resolution sends resolve, all correlated by dedup_key = the incident id, so one uptimepage incident maps to exactly one PagerDuty alert that opens and closes with it. Severity maps Critical→critical, Major→error, Minor→warning. A test send fires a trigger+resolve pair on a throwaway dedup key and never leaves an open PagerDuty incident
  • ntfy: { "type": "ntfy", "server_url": "https://ntfy.sh", "topic": "…", "access_token": "tk_…" } (JSON publish to the server root; server_url optional, defaults to ntfy.sh, must be the bare server root; access_token optional, sent as a Bearer token). High-urgency opens publish at priority 4, the rest at 3; resolves tag white_check_mark, opens rotating_light. On ntfy.sh an unprotected topic’s name is its only access control
  • pushover: { "type": "pushover", "token": "…", "user": "…", "device": "…" } (30-character application token and user/group key, both treated as secrets; device optional). High-urgency alerts go out at priority 1 (bypasses quiet hours), low at 0, resolves at −1 (no sound). Emergency priority 2 is not used
  • sms: { "type": "sms", "provider": "twilio", "to": "+15551234567", "from": "…", … }, a bring-your-own SMS gateway; one text message per alert, body trimmed to a few segments to bound per-segment cost. to is E.164; from is an E.164 number or sender id. The provider-specific credentials are: twilio → account_sid + auth_token; telnyx → api_key (+ optional messaging_profile_id); vonage → api_key + api_secret; plivo → auth_id + auth_token; sinch → service_plan_id + api_token + region (us/eu/au/br/ca, default us). Only the gateway secret is treated as a secret (Twilio/Plivo auth_token, Telnyx api_key, Vonage api_secret, Sinch api_token); account identifiers stay visible
  • email: { "type": "email", "to": "oncall@example.com" }, one lowercase address per channel, delivered through the platform’s transactional sender. Verification-gated: the channel is created unverified and a mail with a single-use 24 h link is sent to the address; until the link is confirmed every delivery (incident page or test send) fails with email address not verified. Replacing the config resets the gate and re-sends the mail. POST /api/v1/notification-channels/{id}/resend-verification re-sends it (capped per channel and per org per day, else 422 CHANNEL_VERIFICATION_LIMIT; on a non-email channel, 422 CHANNEL_NOT_VERIFIABLE); a test against an unverified or unsaved email config is 422 CHANNEL_UNVERIFIED.

Webhook signing. When a webhook channel carries a secret (≥ 16 characters), every delivery is signed: the request includes X-Uptimepage-Timestamp (unix seconds) and X-Uptimepage-Signature: sha256=<hex>, where the hex is HMAC-SHA256(secret, "{timestamp}.{body}") over the exact bytes sent. Receivers should recompute the digest and reject stale timestamps (e.g. older than 5 minutes) to block replays. Channels without a secret deliver unsigned.
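
A receiver-side check for the signing scheme above might look like the following minimal sketch. The function name verify_delivery and the MAX_SKEW_SECONDS constant are illustrative, not part of uptimepage; the header values are assumed to have been read from X-Uptimepage-Timestamp and X-Uptimepage-Signature, and body must be the raw request bytes before any JSON parsing.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # the 5-minute replay window suggested above (receiver's choice)


def verify_delivery(secret: str, timestamp: str, body: bytes, signature: str,
                    now: Optional[float] = None) -> bool:
    """Return True when the signature matches and the timestamp is fresh."""
    prefix = "sha256="
    if not signature.startswith(prefix):
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > MAX_SKEW_SECONDS:
        return False  # stale (or far-future) timestamp: treat as a replay
    # The signed payload is "{timestamp}.{body}" over the exact bytes received.
    signed = timestamp.encode("utf-8") + b"." + body
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    provided = signature[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))
```

Verifying against the raw bytes matters: re-serialising parsed JSON can reorder keys or change whitespace, which changes the digest.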

WhatsApp templates. Create a one-parameter utility template (body {{1}}) in the WhatsApp Business Manager and set template_name (plus language_code, which must match the template’s exact language — en and en_US are distinct). The alert text is sent as that single parameter, collapsed to one line. A template is required: WhatsApp accepts free-form text only within 24 hours of the recipient’s last message, and out-of-window sends are accepted by the API yet dropped asynchronously — a silent-loss mode an alerting channel must not have.

Behaviour:

  • Secrets sealed at rest with the credentials KEK; never echoed back. Every read path masks secret-bearing fields with *** (the webhook URL is masked whole — it can carry a token; header names and chat_id are kept so the UI stays useful).
  • Redaction-sentinel guard: submitting a config that still contains *** returns 400 REDACTION_SENTINEL. Omit config on PATCH to keep the stored secret unchanged.
  • Validation (400): every webhook URL must be https; the provider-branded kinds are additionally host-pinned (discord: discord.com/discordapp.com with an /api/webhooks/ path; msteams: *.logic.azure.com/*.powerplatform.com; google_chat: chat.googleapis.com) and a URL elsewhere is rejected with a hint to use the generic webhook kind; telegram requires non-empty bot_token and chat_id; whatsapp requires access_token, a numeric phone_number_id, an international-format to, and a template_name (lowercase/digits/underscore); email requires a lowercase single-address to; pagerduty requires a 32-char alphanumeric routing_key; ntfy requires an https root-only server_url and a 1–64 char topic (letters/digits/_/-); pushover requires 30-char alphanumeric token and user; sms requires an E.164 to, a from, and the selected provider’s credentials (Twilio account_sid is AC + 32 hex; Plivo auth_id and Sinch service_plan_id are alphanumeric; Sinch region is one of us/eu/au/br/ca); channel name is required and ≤ 100 chars.
  • Destination deny-list: the customer-controlled outbound URL (slack/discord/msteams/google_chat/webhook/ntfy’s server_url) is checked against the platform’s abuse deny-list on create, update, and both test endpoints — a match is rejected (ABUSE_BLOCKED / DOMAIN_DENYLISTED). telegram/whatsapp/email/pagerduty/pushover/sms deliver to fixed vendor endpoints.
  • Quota: capped per org by the plan’s max_notification_channels (atomic, advisory-locked). A duplicate name within the org is 422 CHANNEL_NAME_TAKEN; the cap is 422 CHANNEL_QUOTA_EXCEEDED.
  • Test sends deliver one clearly-labelled synthetic alert. The per-channel form tests the stored config (works on a disabled channel too); the collection-level POST …/test takes { "config": { … } } in the body, validates it exactly as create would, and persists nothing — the UI uses it for “test now” before a channel is saved. A transport failure is 422 CHANNEL_TEST_FAILED. Both count against the test_now rate-limit bucket.
  • Platform disables: when a linked Telegram chat unlinks from its side (the bot is removed, or the chat sends /stop), every channel linked to that chat is disabled with a disabled_reason the UI shows. Re-enabling the channel clears the note.
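
The masking and redaction-sentinel rules in the list above can be sketched as follows. This is an illustration of the behaviour, not the server's code: the helper names are hypothetical, the caller supplies which fields count as secret, and treating only an exact *** string value as the sentinel is an assumption.

```python
from typing import Any, Dict, Set

SENTINEL = "***"


def redact(config: Dict[str, Any], secret_fields: Set[str]) -> Dict[str, Any]:
    """Read path: replace every secret-bearing field with the sentinel."""
    return {k: (SENTINEL if k in secret_fields else v) for k, v in config.items()}


def contains_sentinel(value: Any) -> bool:
    """Write path: detect a config that still carries a masked value.

    Assumption: only a string exactly equal to *** counts as the sentinel.
    """
    if isinstance(value, str):
        return value == SENTINEL
    if isinstance(value, dict):
        return any(contains_sentinel(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_sentinel(v) for v in value)
    return False
```

A client that round-trips a fetched channel back into a PATCH would trip the guard, which is why omitting config is the documented way to keep the stored secret.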

Telegram one-tap linking

Deployments running the central bot expose a link-code flow; on deployments without it, these endpoints return 404 TELEGRAM_LINK_NOT_FOUND:

  • POST /api/v1/notification-channels/telegram-link (channels:write) with an optional { "name": "…" } hint mints a single-use code (15-minute expiry, capped outstanding codes per org → 422 TELEGRAM_LINK_LIMIT). The response carries the raw code (shown once, only its hash is stored), a deep_link (t.me/<bot>?start=<code>, private chat) and a group_deep_link (?startgroup=<code>, picks a group). The same code works for either destination.
  • Sending the code to the bot (tap Start, or /link <code> in a group) creates the telegram_app channel for the minting org. The org is resolved only from the code — never from the Telegram payload.
  • GET /api/v1/notification-channels/telegram-link/{id} (channels:read) polls the code: pending, consumed (with channel_id), or expired.
  • Unlink = delete the channel; deleting the last channel linked to a group also walks the bot out of that group. From the chat side, /stop or removing the bot disables the channel (see platform disables above).
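
The single-use, hash-only code handling described above can be sketched like this. It is an in-memory illustration under stated assumptions: the LinkCodes class, the choice of SHA-256, and the 16-byte code length are all hypothetical, and the real service persists this state rather than holding it in a dict.

```python
import hashlib
import secrets
import time
from typing import Callable, Dict, Optional, Tuple

CODE_TTL_SECS = 15 * 60  # 15-minute expiry from the text above


class LinkCodes:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # Maps sha256(code) -> (org_id, expires_at); the raw code is never stored.
        self._pending: Dict[str, Tuple[str, float]] = {}

    @staticmethod
    def _digest(code: str) -> str:
        return hashlib.sha256(code.encode("utf-8")).hexdigest()

    def mint(self, org_id: str) -> str:
        """Create a code for org_id and return the raw value (shown once)."""
        code = secrets.token_urlsafe(16)
        self._pending[self._digest(code)] = (org_id, self._clock() + CODE_TTL_SECS)
        return code

    def consume(self, code: str) -> Optional[str]:
        """Return the minting org exactly once; None if unknown, spent, or expired."""
        entry = self._pending.pop(self._digest(code), None)
        if entry is None:
            return None
        org_id, expires_at = entry
        return org_id if self._clock() < expires_at else None
```

Resolving the org from the stored entry, and never from the incoming chat payload, is what keeps a forged message from attaching a chat to someone else's org.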

WhatsApp one-tap linking

Deployments with the operator WhatsApp number enabled expose the same flow; on deployments without it, these endpoints return 404 WHATSAPP_LINK_NOT_FOUND:

  • POST /api/v1/notification-channels/whatsapp-link (channels:write) with an optional { "name": "…" } hint mints a single-use code (15-minute expiry, capped per org → 422 WHATSAPP_LINK_LIMIT). The response carries the raw code and a deep_link (wa.me/<number>?text=<code>) that opens WhatsApp with the code prefilled.
  • Sending the prefilled message creates the whatsapp_app channel for the minting org, bound to the sender’s number. The org is resolved only from the code — never from the webhook payload.
  • GET /api/v1/notification-channels/whatsapp-link/{id} (channels:read) polls the code: pending, consumed (with channel_id), or expired.
  • Unlink = delete the channel; from the phone side, sending stop disables every channel bound to the number (platform disable, reason shown in the UI).

The person who owns the Slack workspace / Telegram group / inbox often isn’t the person configuring monitors — a delegation link hands off just the connect step.

  • POST /api/v1/notification-channels/delegate (channels:write) with optional { "name": "…", "kind": "…" } hints mints a single-use /c/<code> URL (7-day expiry, capped outstanding links per org → 422 DELEGATE_LINK_LIMIT; unknown kind → 400 DELEGATE_KIND_INVALID). Only the code’s hash is stored.
  • GET /c/<code> is public and chrome-less: it offers exactly the connect-capable transports of the deployment — the telegram one-tap link + QR (the delegation code doubles as the t.me start payload), “add to Slack” / “add to Discord” when the operator OAuth apps are configured, and a manual webhook/address form. The link can create one channel in the inviting org and read nothing; expired, revoked, and spent codes all render the same 404 page. Every delegated create lands in the org audit log.
  • GET /api/v1/notification-channels/delegate (channels:read) lists the org’s links (pending / consumed / expired); DELETE /api/v1/notification-channels/delegate/{id} (channels:write) revokes an unconsumed one (revoked links read as expired).

Rate limiting

/api/v1/* is rate-limited per authenticated subject — by (org, category) and by (user, category), whichever trips first — with the per-minute budgets taken from the org’s plan. Categories: api_writes (POST/PATCH/DELETE), api_reads (GET/HEAD/OPTIONS), bulk_ops (/bulk*), test_now (/test), check_now (/check-now). Exceeding a budget returns 429 Too Many Requests with a Retry-After header (seconds until the next token) and code: RATE_LIMITED. /healthz and /readyz are never throttled. Unauthenticated and per-IP limiting is the reverse proxy’s job (see Deployment). Full model: Quotas & rate limits.
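
On the client side, honouring Retry-After is usually enough to stay within the budgets. A minimal sketch, assuming a caller-supplied send function that returns a (status, headers, body) tuple; call_with_retry and its one-second fallback are illustrative choices, not part of any uptimepage client.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send(), waiting out 429 responses using their Retry-After header."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            break
        # Header names are case-insensitive, so normalise before the lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            delay = max(0.0, float(lowered.get("retry-after", "1")))
        except ValueError:
            delay = 1.0  # assumption: wait one second on a malformed value
        sleep(delay)
    return status, headers, body
```

The last 429 is returned rather than raised, so the caller can still inspect the RATE_LIMITED envelope.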

CORS

Disabled by default. When api.cors.enabled = true, /api/v1/* answers preflight OPTIONS with Access-Control-Allow-Origin (matching allowed_origins or * when allow_any_origin = true), Access-Control-Allow-Methods (the configured list), and Access-Control-Allow-Headers: content-type. /healthz and /readyz carry no CORS headers regardless.

Error envelope

Every 4xx and 5xx response uses one wire shape:

{
  "error": {
    "code": "INVALID_URL_SCHEME",
    "message": "url scheme 'ftp' not allowed",
    "field": "check.url",
    "details": null,
    "trace_id": null
  }
}
  • code is stable, machine-readable, UPPER_SNAKE_CASE. Never repurposed once published.
  • field names the offending input for 400s as a dotted path (check.url in the example above); null for non-field errors.
  • details carries optional structured context (e.g., { "range": "127.0.0.0/8" } for SSRF rejections).
  • trace_id is the W3C traceparent when tracing is enabled.

Common codes: INVALID_URL_SCHEME, INVALID_URL_FORMAT, SSRF_BLOCKED, INVALID_INTERVAL, INVALID_TIMEOUT, INVALID_TCP_PORT, INVALID_TCP_HOST, INVALID_STATUS_RANGE, INVALID_TLS_CERT_PARAMS, INVALID_DOMAIN_PARAMS, INVALID_TLS_CRED_COMBO, INVALID_ALERT_CONFIG, REDACTION_SENTINEL, BULK_EMPTY, BULK_TOO_LARGE, BAD_TIME_RANGE, TARGET_NOT_FOUND, CHANNEL_NOT_FOUND, CHANNEL_NAME_TAKEN, CHANNEL_NAME_INVALID, CHANNEL_QUOTA_EXCEEDED, INVALID_CHANNEL_CONFIG, CHANNEL_TEST_FAILED, CIRCUIT_OPEN, DEPENDENCY_DOWN, INTERNAL.
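
Because the envelope has one shape, a client can map every failure onto a single exception type and branch on code. A minimal sketch; the ApiError class here is a client-side illustration (unrelated to the server's Rust type of the same name), and the UNKNOWN code is this sketch's own placeholder for bodies that are not the documented envelope, such as a proxy error page.

```python
import json
from typing import Any, Optional


class ApiError(Exception):
    def __init__(self, status: int, code: str, message: str,
                 field: Optional[str] = None, details: Any = None,
                 trace_id: Optional[str] = None) -> None:
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.field = field
        self.details = details
        self.trace_id = trace_id


def parse_error(status: int, body: bytes) -> ApiError:
    """Build an ApiError from a 4xx/5xx response body."""
    try:
        err = json.loads(body)["error"]
        return ApiError(status, err["code"], err["message"],
                        err.get("field"), err.get("details"), err.get("trace_id"))
    except (ValueError, KeyError, TypeError, AttributeError):
        # Not the documented envelope; keep the raw text for debugging.
        return ApiError(status, "UNKNOWN", body.decode("utf-8", "replace"))
```

Matching on code rather than message is the safer choice, since only the code is documented as stable.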

Quota, rate-limit and abuse codes

| Code | HTTP | Meaning |
| --- | --- | --- |
| QUOTA_EXCEEDED | 422 | A plan quota would be exceeded. details carries quota (e.g. max_targets, max_members, max_public_components), current, limit, plan. |
| MIN_CHECK_INTERVAL | 422 | Requested check interval is below the effective floor (max(plan.min_check_interval_secs, kind_min)), where kind_min is 3600 for tls_cert / domain_expiry and 10 for http / tcp / dns. Enforced on create, bulk, and PATCH. |
| INVITATIONS_LIMIT | 409 | The org is at its pending-invitation cap. |
| RATE_LIMITED | 429 | A per-minute rate budget was exceeded. Retry-After (seconds) is set; details.scope names the tier, e.g. per_org_api_writes. |
| ABUSE_BLOCKED | 400 | Target blocked by abuse protection. details.reason explains. |
| URL_PATTERN_BLOCKED | 400 | Target URL matched an abuse pattern (recon path). |
| DOMAIN_DENYLISTED | 400 | Target domain (or a parent) is on the deny-list. |

See Quotas & rate limits for the quota model, the per-minute categories, and the deny-list policy.
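
The MIN_CHECK_INTERVAL floor from the table above is a plain max over two values. A small sketch of the rule, useful for validating intervals client-side before a bulk import; the function names are illustrative and the plan value passed in the usage is made up.

```python
# Per-kind minimums, in seconds, as listed for MIN_CHECK_INTERVAL.
KIND_MIN_SECS = {
    "tls_cert": 3600,
    "domain_expiry": 3600,
    "http": 10,
    "tcp": 10,
    "dns": 10,
}


def effective_floor(plan_min_check_interval_secs: int, kind: str) -> int:
    """Smallest interval the API accepts for this plan and check kind."""
    return max(plan_min_check_interval_secs, KIND_MIN_SECS[kind])


def interval_allowed(interval_secs: int, plan_min_check_interval_secs: int, kind: str) -> bool:
    return interval_secs >= effective_floor(plan_min_check_interval_secs, kind)
```

With a hypothetical plan minimum of 60 seconds, an http check at 30 seconds is rejected, while a tls_cert check is held to 3600 seconds whatever the plan allows.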

Pagination envelope

Every list endpoint returns:

{ "items": [ /* ... */ ], "total": 1240, "limit": 50, "offset": 0 }

limit defaults to 50 for /targets and /tags, 1000 for /results, 100 for /incidents. limit is silently capped server-side (10,000 for results, 1,000 for incidents/tags). total reflects rows matching the filters, ignoring limit/offset.
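
Since limit is capped silently, a client that walks a list should advance by the number of rows it actually received rather than by the limit it asked for. A minimal sketch, assuming a caller-supplied fetch_page(limit, offset) function (hypothetical) that performs the HTTP request and returns the decoded envelope:

```python
from typing import Any, Callable, Dict, Iterator


def iter_items(fetch_page: Callable[[int, int], Dict[str, Any]],
               limit: int = 50) -> Iterator[Any]:
    """Yield every item of a paginated list endpoint, page by page."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["items"]
        yield from items
        # Step by len(items): the server may have returned fewer rows than requested.
        offset += len(items)
        if not items or offset >= page["total"]:
            return
```

Rows inserted or deleted between requests can still shift offsets, so a long walk over a busy list is best treated as approximate.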

Results query

GET /api/v1/targets/{id}/results?from=2026-05-12T00:00:00Z&to=2026-05-12T23:59:59Z&limit=100&offset=0

  • from / to default to the last 24 h; to must be strictly greater than from (400 BAD_TIME_RANGE otherwise).
  • Returns a PageEnvelope of CheckResult ordered by timestamp DESC.

Latency series

GET /api/v1/targets/{id}/latency?from=…&to=…

Pre-bucketed quantiles and per-phase means read straight from the per-minute rollup — powers the monitor-detail latency line and phase-breakdown area charts. The server divides the range into ~60 slices (floored to the 60-second rollup grain), so any range returns a comparably dense series and the cost stays O(buckets), not O(samples). Switching range re-scales the buckets.

  • from / to default to the last 24 h; to must be strictly greater than from (400 BAD_TIME_RANGE).
{
  "bucket_seconds": 1440,
  "buckets": [
    {
      "t": 1747137600000,      // unix-ms at bucket start (JS new Date(t))
      "p50": 120, "p95": 180, "p99": 240,
      "avg": 130,              // mean total; breakdown chart derives "processing" = avg − (dns+connect+tls+ttfb)
      "dns": 12, "connect": 20, "tls": 35, "ttfb": 60,  // mean per-phase ms; 0 for kinds that skip the phase
      "samples": 24            // 0 marks a gap the chart leaves unconnected
    }
  ]
}

bucket_seconds is always a multiple of 60 (1h→60, 24h→1440, 7d→10080, 30d→43200).
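
The four documented values are consistent with dividing the range by 60 and flooring to the rollup grain. The sketch below reproduces them; the 60-second minimum for ranges shorter than an hour is an assumption, and the server's exact rounding for arbitrary ranges is not specified here.

```python
ROLLUP_GRAIN_SECS = 60  # per-minute rollup
TARGET_SLICES = 60      # "~60 slices" per range


def bucket_seconds(range_secs: int) -> int:
    """Bucket width for a query range, floored to a multiple of the rollup grain."""
    raw = range_secs // TARGET_SLICES
    floored = (raw // ROLLUP_GRAIN_SECS) * ROLLUP_GRAIN_SECS
    # Assumption: never go below one rollup row per bucket.
    return max(ROLLUP_GRAIN_SECS, floored)
```

For the presets this gives 60, 1440, 10080 and 43200 seconds for 1h, 24h, 7d and 30d respectively.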

Region filter

results, latency, and uptime accept an optional region= query parameter to scope the read to one probe region; omit it for an all-regions view. Region ids are the slugs registered via the operator surface. See Multi-region probes.

Per-region latency series

GET /api/v1/targets/{id}/latency/by-region?from=…&to=…

Same bucketing and cost as /latency, but split by region so each can be overlaid as its own line — powers the monitor-detail overlay chart. One entry per region that has samples in the range; each region’s buckets use the same shape as /latency.

{
  "bucket_seconds": 1440,
  "regions": [
    { "region": "default",  "buckets": [ /* LatencyBucket… */ ] },
    { "region": "eu-west",  "buckets": [ /* LatencyBucket… */ ] }
  ]
}

Uptime query

GET /api/v1/targets/{id}/uptime?from=…&to=…

{ "total": 8640, "up": 8635, "down": 0, "degraded": 0, "error": 5, "uptime_pct": 99.94 }
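
The sample figures are consistent with uptime_pct being up divided by total, rounded to two decimals. A sketch under that assumption; how the server weighs degraded results and what it returns for an empty range are not stated here, so both choices below are guesses.

```python
def uptime_pct(up: int, total: int) -> float:
    """Percentage of checks that were up, rounded to two decimals."""
    if total == 0:
        return 0.0  # assumption: an empty range reports 0.0
    return round(up / total * 100, 2)
```

The same formula matches the dashboard summary sample (50360 of 50400 checks up, 99.92).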

Incidents query

GET /api/v1/targets/{id}/incidents?from=…&to=…&ongoing_only=false&limit=100&offset=0

Returns coalesced down / error periods. A contiguous run of bad statuses becomes one incident; an up result between two bad runs splits them. Ongoing incidents return ended_at: null and duration_secs: null.

{
  "items": [
    {
      "id": "01h7m8z4n6v0e1m7v7y6x8x8x8",
      "target_id": "01h7m...",
      "started_at": "2026-05-13T11:30:00.000Z",
      "ended_at":   "2026-05-13T11:35:00.000Z",
      "status":     "down",
      "duration_secs": 300,
      "check_count": 5,
      "error_sample": "connection refused"
    }
  ],
  "total": 1, "limit": 100, "offset": 0
}
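
The coalescing rule can be sketched as a single pass over time-ordered results. This is an illustration, not the server's query: the Incident shape is trimmed, an incident takes the status of its first bad result, ended_at is set to the timestamp of the up result that closes the run (which matches the 300-second, 5-check sample above at a 60-second interval), and statuses other than down, error and up are ignored. All four points are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

BAD_STATUSES = {"down", "error"}


@dataclass
class Incident:
    started_at: int
    ended_at: Optional[int]  # None while the incident is ongoing
    status: str
    check_count: int


def coalesce(results: Iterable[Tuple[int, str]]) -> List[Incident]:
    """results: (unix_seconds, status) pairs sorted by time ascending."""
    incidents: List[Incident] = []
    current: Optional[Incident] = None
    for ts, status in results:
        if status in BAD_STATUSES:
            if current is None:
                # First bad result after a healthy stretch opens a new incident.
                current = Incident(ts, None, status, 0)
                incidents.append(current)
            current.check_count += 1
        elif status == "up" and current is not None:
            # An up result closes the run and splits it from any later one.
            current.ended_at = ts
            current = None
    return incidents
```

An incident still open at the end of the input keeps ended_at as None, mirroring the ongoing case in the API.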

Tags inventory

GET /api/v1/tags?q=prod&limit=100

Returns every tag currently in use across the caller’s targets (enabled or disabled), with target count, sorted by descending count then alphabetical. q is a prefix filter for autocomplete. Scoped to the active org — in SaaS mode another org’s tags are invisible.

{ "items": [ { "name": "prod", "count": 12 }, { "name": "staging", "count": 4 } ],
  "total": 2, "limit": 100, "offset": 0 }

Dashboard summary

GET /api/v1/dashboard/summary — per-org rollup cached in-process for 5 seconds (keyed by OrgId, so two tenants never share an entry).

{
  "targets":        { "total": 42, "enabled": 40, "disabled": 2 },
  "current_status": { "up": 38, "down": 1, "degraded": 1, "error": 0, "unknown": 2 },
  "last_24h":       { "checks_total": 50400, "checks_up": 50360, "uptime_pct": 99.92, "incidents": 3 },
  "system":         { "in_flight_checks": 5, "result_queue_depth": 12, "dropped_results_last_5m": 0, "circuit_breakers_open": 0 }
}

On-demand operations

  • POST /api/v1/targets/test: runs one check against a raw CheckSpec, no persistence. Same SSRF / URL-scheme / port validation as POST /targets. Returns TestResponse { result, matched_expectations, warnings }.
  • POST /api/v1/targets/{id}/check-now: runs one check against an existing target using its stored credentials, dispatched to an agent in the target’s region. Result is persisted. Returns 503 PROBE_UNAVAILABLE if no agent is currently serving the region.
  • POST /api/v1/targets/bulk-action: applies one action to up to 10,000 ids in a single request. The batch is not all-or-nothing: partial failure is allowed, and the response lists succeeded and failed separately, with per-id code + message.
{
  "ids": ["01h7m...", "01h7n..."],
  "action": { "type": "disable" }
  // alternatives: { "type": "enable" }, { "type": "delete" },
  //   { "type": "tag_add",    "tags": ["frozen"] },
  //   { "type": "tag_remove", "tags": ["frozen"] }
}

Idempotency

POST /api/v1/targets/bulk and POST /api/v1/targets/bulk-action accept an optional Idempotency-Key header. The server stores the response for 24 hours keyed by (header value, body hash). A retry with the same key and body returns the original response without re-executing. A retry with the same key but a different body executes normally — the body hash is part of the cache key. The cache is in-process; entries are lost on restart.

POST /api/v1/targets/bulk-action HTTP/1.1
Idempotency-Key: 01h7m8z4n6v0e1m7v7y6x8x8x8
Content-Type: application/json

{ "ids": ["..."], "action": { "type": "disable" } }
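
The cache semantics above can be sketched in a few lines. This is an in-memory illustration, not the service's DashMap-backed implementation: the IdempotencyCache class is hypothetical, SHA-256 is an assumed choice for the body hash, and expired entries are simply overwritten rather than swept.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECS = 24 * 60 * 60  # responses are replayed for 24 hours


class IdempotencyCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # (Idempotency-Key, body hash) -> (stored_at, response)
        self._entries: Dict[Tuple[str, str], Tuple[float, bytes]] = {}

    def run(self, key: Optional[str], body: bytes,
            execute: Callable[[], bytes]) -> bytes:
        """Replay the stored response for a repeated (key, body), else execute."""
        if key is None:
            return execute()  # the header is optional; no key means no caching
        cache_key = (key, hashlib.sha256(body).hexdigest())
        now = self._clock()
        hit = self._entries.get(cache_key)
        if hit is not None and now - hit[0] < TTL_SECS:
            return hit[1]
        response = execute()
        self._entries[cache_key] = (now, response)
        return response
```

Including the body hash in the key means a reused key with a different body is treated as a new request, not as a conflict.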

Terraform

Manage your monitors and notification channels as code with the official Terraform provider, uptimepage/uptimepage.

The Terraform Registry page is the full reference — every resource, attribute, and data source, regenerated from the provider on each release. This page is a quick start; it links out rather than duplicating that reference.

Quick start

terraform {
  required_providers {
    uptimepage = {
      source = "uptimepage/uptimepage"
    }
  }
}

provider "uptimepage" {
  token = var.uptimepage_token # or set UPTIMEPAGE_TOKEN
  org   = "your-org-slug"      # or set UPTIMEPAGE_ORG
  # endpoint defaults to https://app.uptimepage.dev; set it for a self-hosted instance
}

resource "uptimepage_target" "api" {
  name     = "api prod"
  interval = 60
  check = {
    type = "http"
    http = {
      url             = "https://example.com/healthz"
      expected_status = { kind = "exact", exact = 200 }
    }
  }
}

Credentials

  • Token: create one at Settings → API tokens (/settings/api-tokens; requires a verified email). Supply it via the token attribute or the UPTIMEPAGE_TOKEN environment variable. The full token is shown once. Grant the least scope the provider needs: targets:write + channels:write covers creating and updating both managed resources (write implies read). Terraform deletes a resource on terraform destroy, and also during apply when the resource is removed from your configuration or a change forces replacement, so add targets:delete + channels:delete only if you expect those operations. For defence in depth, bind the token to the org you manage so a leak can’t reach your other orgs.
  • Org: API tokens are user-scoped, so every request must name an organization. Set org (the org slug) or UPTIMEPAGE_ORG; it is sent as the X-Uptimepage-Org header. Without it the API returns 400 ORG_REQUIRED. Find your slug from GET /api/v1/orgs or your dashboard URL. A token bound to an org requires org to match it (else 403 ORG_HEADER_MISMATCH).
  • Endpoint: defaults to the hosted API at https://app.uptimepage.dev. For a self-hosted instance, set endpoint to your host (the apex marketing domain does not serve /api/v1).

Resources & data sources

| Name | Kind | Manages |
| --- | --- | --- |
| uptimepage_target | resource | Monitors: http, tcp, tls_cert, domain_expiry, dns checks |
| uptimepage_notification_channel | resource | Alert destinations: webhook, slack, telegram, whatsapp. The pagerduty/ntfy/pushover/sms kinds land in a provider release after the API ships them. The one-tap telegram_app and whatsapp_app kinds are not manageable: their configs are minted by the link flows and the API rejects them in request bodies (CHANNEL_KIND_MANAGED) |
| uptimepage_target | data source | Look up an existing target by id |

For the full attribute reference and an example per check type, see the provider docs on the Terraform Registry.

Managed-by badge

Resources the provider creates or updates carry a terraform source marker (the provider identifies itself on every request). The web UI shows a small terraform chip next to those monitors and channels, plus a banner on the monitor detail page, so anyone browsing knows the resource is managed as code.

The marker is informational — the UI does not lock the resource. But an edit made in the UI flips its badge to ui and will be overwritten the next time you run terraform apply, since your .tf files remain the source of truth. Change managed resources in Terraform, not the UI.

Source

Provider source and issue tracker: https://github.com/uptimepage/terraform-provider-uptimepage.

Web UI

The same Rust binary that serves /api/v1/* also serves a server-rendered HTML UI on the same port. Open http://<host>:<api-port>/ in a browser.

Stack

| Layer | What | Where |
| --- | --- | --- |
| Templates | askama 0.16 + askama_web 0.16 (compile-time HTML, type-checked by cargo build) | templates/ |
| Interactivity | HTMX 2.0.9 + json-enc (partial swaps, JSON form submission; no SPA framework) | static/js/htmx.min.js, static/js/json-enc.js |
| Charts | ECharts 6 (lazy-loaded only on pages that need it) | static/js/echarts.min.js, static/js/charts/ |
| CSS | Tailwind 4.3 (CSS-first config via @import, @source, @theme, @layer) | static/css/input.css → app.css |
| Asset serving | rust-embed (assets are baked into the binary at compile time) | src/web/assets.rs |

After cargo build --release you have one ~23 MB executable that contains every template, every CSS byte, and every vendored JS file. No node, no bundler, no separate frontend service.

Routes

| Path | Purpose |
| --- | --- |
| GET / | Dashboard. Auto-refreshing region polls /web/partials/dashboard every 5 s; donut + 24h bar pull from /api/v1/dashboard/summary. |
| GET /targets | Targets list. Filter by name (client-side), tag, enabled. Row delete + paginate via HTMX. Rows authored by an API token or Terraform carry a managed-chip (api / terraform); UI-authored rows show none. |
| GET /targets/{id} | Target detail. Status badge, four time-range presets (1h/24h/7d/30d), uptime KPIs, latency p50/p95/p99 line, DNS/connect/TLS/TTFB stacked area, recent-results table, redacted JSON config. Externally-managed monitors also get a managed-by chip and a banner warning that UI edits may be overwritten on the next apply. |
| GET /targets/new | Create form. Posts JSON to /api/v1/targets. Detection (open-incident-after-N-fails, region quorum) and Notifications (channel bindings, remind-while-down cadence, notify-on-recovery) are separate sections; the notification controls only render when the org has channels. |
| GET /targets/{id}/edit | Edit form. Same template as new but data-mode="edit"; credential fields land in redacted mode and the operator must explicitly toggle “Replace credentials” before new values are sent. |
| GET /web/targets/list | HTMX partial (<tbody> fragment) for filter/paginate swaps on the targets list. |
| GET /settings/notifications | Notification-channel list. Send-test / edit / delete are HTMX row actions against /api/v1/notification-channels; the table body polls /web/partials/settings/notifications every 60 s. |
| GET /settings/notifications/new, …/{id}/edit | Channel create/edit form (Slack / Discord / Teams / Google Chat / generic webhook / Telegram / WhatsApp / SMS; the provider-branded cards take just the provider’s webhook URL, host-checked on create; the SMS card carries a gateway sub-selector, Twilio / Vonage / Telnyx / Plivo / Sinch, and takes that gateway’s own credentials). With provider OAuth configured, the slack/discord panel grows an “add to Slack” / “add to Discord” button (plus a QR variant for a signed-in phone): the provider’s consent screen picks the destination channel and the callback lands on the ready-made channel’s edit page; cancelling, a failed exchange, or the plan’s channel limit bounce back to the form with a quiet note. On deployments running the central Telegram bot, a one-tap telegram card joins the lineup (the BYO card reads telegram bot): “connect telegram” mints a single-use code, shows it as a t.me link + QR with a private-chat/group toggle, polls until the chat presses Start, then opens the channel the webhook created. Linked channels are display-only (chat title + id, no secrets, no replace toggle); unlink = delete. If the chat side unlinks first (bot removed, /stop), the channel is disabled with a visible “unlinked from the Telegram side” note that re-enabling clears. The Telegram panel has a setup helper: a t.me QR for the bot (scan, press Start) and a one-click chat-id probe, both talking to the Bot API straight from the browser. “Test now” delivers a synthetic alert before saving (create posts the form config to …/test; a locked edit tests the stored channel by id). On edit the stored secret stays masked behind a “Replace transport config” toggle; leaving it off omits config from the PATCH and locks the type cards (the kind rides the config). The edit page also lists the monitors bound to the channel, lets a “+ add monitor” picker bind more (it updates the monitor’s alert bindings through PATCH /api/v1/targets/{id}), and offers delete with that blast radius spelled out: deleting a channel also removes its bindings from every monitor. |
| GET /settings/pages | Status-pages list: create / rename / publish / delete pages (free plan: one). Create posts to /api/v1/status-pages; the list body refreshes via /web/partials/settings/pages. |
| GET /settings/pages/{id} | Per-page editor: URL slug (own save; a rename is a hard cutover), branding, logo, and the component curation list (per-monitor on-page toggle, public name/group). Each edit autosaves via the /api/v1/status-pages/{id} + /components endpoints. |
| GET /settings/team | Team management (owner-only): invite by email + role, pending-invitation revoke, member remove / leave, owner⇄member role toggle; all row actions confirm via modal and hit /api/v1/orgs/{id}/members + /api/v1/orgs/{id}/invitations. Non-owner members see a read-only note. |
| GET /web/partials/settings/team | HTMX partial: seats line + members + pending-invitations tables; re-pins the target org id on every refresh. |
| GET /web/partials/settings/pages | HTMX partial: the page rows for the list above. |
| GET /web/partials/dashboard | HTMX partial: chrome-free dashboard region; self-rearms so each refresh still carries hx-trigger="every 5s". |
| GET /m/{token} | Public read-only share of one monitor: same detail dashboard, no operator chrome, credentials redacted, no login. Sub-resources (/live, /incidents, /latency, /results) are twinned under the token so the page never calls an operator URL. See Share links. |
| GET /docs | Swagger UI generated from /api/openapi.json. |
| GET /static/{path} | Embedded assets (css/, js/, img/). |

Every mutation goes through /api/v1/*. There are no /web/* write routes — the JSON API stays the single source of truth, which means a future SvelteKit port is a templates-only rewrite. The /m/{token} share surface is read-only and serves no write method.

Build pipeline

cargo build [--release]
   └─► build.rs
         ├─► (first build only) scripts/fetch-tailwind.sh — downloads the Tailwind
         │     standalone CLI (~30 MB, not committed) for the host platform into bin/
         └─► ./bin/tailwindcss --minify
                --input  static/css/input.css
                --output static/css/app.css
   └─► rustc
         └─► rust-embed bakes static/ + templates/ into the binary

build.rs declares rerun-if-changed on templates/, src/, static/css/input.css, and scripts/fetch-tailwind.sh. Editing any of them triggers a Tailwind rebuild on the next cargo build.

Tailwind 4 scans both templates/**/*.html and src/**/*.rs for utility class names (declared via @source in input.css), so utility classes written inside Rust strings are preserved through tree-shaking.

Styling: the semantic layer

static/css/input.css is layered: design tokens (@theme, e.g. --color-ink) → primitives (.sticker-card, .sticker-btn, .sticker-pill) → semantic classes (.page-title, .panel-label, .kpi-value, .stat-tile, .status-badge--*, .btn-ghost, .sticker-btn--primary/--danger, .nav-link, .day-cell). Templates reference only the semantic names — no raw colour/shape utility clusters. State is one --modifier (.status-badge--down, .stat-tile--ok). Result: re-skinning the internal app is an input.css-only edit, no template touched. When adding UI, reuse/extend a semantic class rather than inlining bg-*/rounded-*/heading-scale clusters. The public status page is deliberately exempt — it’s a flat, brand-themed surface with its own view-supplied palette (public_status.rs), not the cartoon sticker system.

Dashboard refresh model

The dashboard splits into three regions:

  1. Chrome (nav, page header) — rendered once.
  2. Auto-refresh region (<div id="dashboard-region">) — KPI cards + system-health card. Polls /web/partials/dashboard every 5 s and swaps its own outer HTML so the trigger remains armed.
  3. Charts (donut + 24h composition bar) — placed outside the refresh region so the ECharts instances persist across polls. The chart wrapper listens for htmx:afterSettle on the region and re-fetches /api/v1/dashboard/summary once per cycle, fanning out to both charts (single network round-trip, not one per chart).

The dashboard_summary handler caches its result in state.dashboard_cache for 5 s, so the polling load on Postgres + ClickHouse is bounded to one query set per 5 s regardless of how many tabs are open.

Credential redaction

For basic_auth and bearer_token the form runs a three-state machine in static/js/ui/auth_field.js:

| data-mode | Inputs | Submit behaviour |
| --- | --- | --- |
| create | enabled, empty | Field included in POST body if filled. |
| redacted | disabled, sentinel *** shown | Field omitted from PATCH body. |
| replacing | enabled, empty | Field included with the real value. |

The API rejects the *** sentinel on write as defence-in-depth — but the state machine prevents the form from ever submitting it. End-to-end coverage in tests/web_e2e_test.rs::edit_form_shows_redacted_auth_state_for_existing_target asserts that real credentials never appear in the rendered edit form.
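
The three states reduce to one decision: does the credential field go into the request body or not. The real logic lives in static/js/ui/auth_field.js; the sketch below restates the table in Python for clarity, and the function name and the choice to raise on an unknown mode are illustrative.

```python
from typing import Dict


def credential_fields(mode: str, field: str, value: str) -> Dict[str, str]:
    """Return the part of the request body contributed by one credential field."""
    if mode == "create":
        # New target: send the credential only if the operator filled it in.
        return {field: value} if value else {}
    if mode == "redacted":
        # Existing credential left alone: omit the field so the stored value is kept.
        return {}
    if mode == "replacing":
        # Operator toggled "Replace credentials": send the real value.
        return {field: value}
    raise ValueError(f"unknown data-mode: {mode!r}")
```

Raising on an unknown mode is stricter than falling back to including the field, which would risk submitting the sentinel or a stale value.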

Tests

| Layer | What |
| --- | --- |
| Unit (template render) | Every view in src/web/views/ ships a #[test] that renders the template with a fixtures struct and asserts on the output: HTMX hooks, redaction sentinels, chart data-endpoints, table scaffolding. |
| End-to-end | tests/web_e2e_test.rs drives the merged api+web router via tower::ServiceExt::oneshot, covering dashboard (full + partial), list (full + partial), forms (create + redacted-edit), target detail with chart anchors + time-range nav, 404 paths, and the immutable cache header on /static/*. |
| Build-time | cargo build rejects template type mismatches: askama checks templates against the corresponding Rust struct at compile time. |

cargo test --lib web::          # unit render tests
cargo test --test web_e2e_test  # end-to-end

Adding a new page

  1. Add a template under templates/ extending base.html.
  2. Add a #[derive(Template, WebTemplate)] struct and an axum handler in src/web/views/.
  3. Register the route in src/web/routes.rs.
  4. Tailwind picks up new utility classes automatically (the @source directive scans templates/**/*.html + src/**/*.rs).
  5. Add a render test next to the view and, if there’s a route worth covering end-to-end, append a case to tests/web_e2e_test.rs.

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| failed to spawn ./bin/tailwindcss during cargo build | First-build fetch failed. Run bash scripts/fetch-tailwind.sh manually and confirm bin/tailwindcss is executable. |
| Page renders unstyled HTML | static/css/app.css empty or stale. Touch static/css/input.css and rebuild; the build script runs Tailwind with --minify. |
| Charts render blank | Open DevTools console. Most likely a fetch to /api/v1/dashboard/summary or /api/v1/targets/{id}/results failed; the chart module logs chart load failed with the URL and status. |
| Dashboard never refreshes | Confirm <script defer src="/static/js/htmx.min.js"> is in the page source. The HTMX bundle is loaded from base.html. |
| Edit form submitted credentials despite the toggle being off | Look for a console error from auth_field.js. The submit handler reads data-mode from the credential <fieldset>; if the fieldset is missing the data attributes, it will fall back to “include”. |

Migrating to a SPA later

The design keeps a SPA port cheap. Every templates/*.html maps one-to-one to a Svelte (or React) component, every chart module under static/js/charts/ is already a pure (element, endpoint) → disposer function that imports unchanged into onMount, and there are zero /web/* write endpoints to refactor — only read partials. To swap frameworks:

  1. Generate a typed JSON client from /api/openapi.json.
  2. Port the templates page-by-page; keep /api/v1/* unchanged.
  3. Drop src/web/views/ (keep src/web/assets.rs pointing at the new bundle).
  4. Delete templates/ and static/js/{htmx,json-enc,ui} — no longer needed.

The backend (src/api/, src/storage/, src/scheduler/, src/worker/) stays untouched.

Public status page

The public status page is the customer-facing surface — an unauthenticated HTML page at /status plus a small JSON + RSS API under /api/public/v1/*. It’s the only part of uptimepage that’s safe to expose on the open internet without basic auth in front of it.

This chapter is for operators: how to publish a component, narrate an incident, and schedule a maintenance window. For the wire-level details of the underlying endpoints see REST API. For Caddy + the rate-limit plugin see Deployment.

Multi-tenant operators: read this first. This chapter describes the page itself; the workflow is identical on every page. In a multi-tenant deployment each org runs one or more pages at {slug}.{base_domain}: set tenancy.subdomain_public_routes = true and leave tenancy.path_based_public_routes off. The path-based /status surface is single-org and is for single-tenant deploys only (the default). See Per-org status pages for the routing, branding, and isolation model, and Public status routing for the flag matrix.

What’s published vs what’s private

By default every target is private. A monitor becomes a “component” on a status page only when it is curated onto that page — there is no per-target “public” flag. The aggregator filters at the SQL layer (a page renders only the monitors bound to it) and the wire types literally cannot serialise sensitive fields (url, headers, basic_auth, bearer_token are not part of any public schema), so a misconfiguration cannot leak credentials.

A monitor is published by adding it to a page; the per-page presentation lives on that binding, so the same monitor can appear on several pages under different names:

| Per-page field | Purpose |
|---|---|
| (binding exists) | the monitor appears as a component on that page |
| public_name | display name on this page; falls back to the operator-side monitor name when unset |
| public_description | optional one-liner shown under the component name |
| public_group | optional group label; components with the same value cluster together. Ungrouped components render last |
| sort_order | integer sort key within a group (ASC); the reorder endpoint rewrites it |

A page belongs to an org and is managed by that org’s owner; see Per-org status pages for the page model, the max_status_pages / max_public_components caps, and isolation.

Enabling a component

The quickest path is the UI: open the page in Settings → Pages → {your page}. The editor lists every monitor in the org; toggle one on page, optionally set a Public name (blank shows the real monitor name) and a Group. Each edit autosaves via the components API below.

For scripting, add the monitor to the page, then set its per-page curation:

# Add monitor $TARGET_ID to page $PAGE_ID
curl -X POST http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components \
  -H 'content-type: application/json' \
  -d '{"target_id": "'"$TARGET_ID"'", "public_name": "Public API", "public_group": "Core APIs"}'

# Edit the per-page name / description / group later
curl -X PATCH http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components/$TARGET_ID \
  -H 'content-type: application/json' \
  -d '{"public_description": "Primary REST surface, all regions."}'

# Remove it from the page
curl -X DELETE http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components/$TARGET_ID

On the PATCH, public_name, public_description, and public_group use the same three-state semantics as incident narration: omit the field to leave it unchanged, send a string to set it, or send JSON null to clear it back to the default (real monitor name / no group). Blanking the field in the UI clears it for you.
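The omit / string / null distinction maps naturally onto a three-variant type. The sketch below is illustrative only: Patch, apply, and shown_name are hypothetical names, and it deliberately leaves out JSON decoding, so it says nothing about how the real handlers deserialise the body.

```rust
/// One PATCH field in its three possible states.
#[derive(Debug, Clone, PartialEq)]
enum Patch<T> {
    /// Field omitted from the body: keep the stored value.
    Unchanged,
    /// Field sent as JSON null: clear back to the default.
    Clear,
    /// Field sent as a value: overwrite.
    Set(T),
}

impl<T> Patch<T> {
    /// Apply the patch to the currently stored value.
    fn apply(self, current: Option<T>) -> Option<T> {
        match self {
            Patch::Unchanged => current,
            Patch::Clear => None,
            Patch::Set(value) => Some(value),
        }
    }
}

/// Name shown on the page: per-page name, else the monitor's own name.
fn shown_name(public_name: Option<&str>, monitor_name: &str) -> String {
    public_name.unwrap_or(monitor_name).to_string()
}

fn main() {
    let stored = Some("Public API".to_string());
    let kept = Patch::Unchanged.apply(stored.clone());
    let cleared = Patch::Clear.apply(stored.clone());
    let renamed = Patch::Set("Edge API".to_string()).apply(stored);
    println!("{}", shown_name(kept.as_deref(), "api-prod-1"));
    println!("{}", shown_name(cleared.as_deref(), "api-prod-1"));
    println!("{}", shown_name(renamed.as_deref(), "api-prod-1"));
}
```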

Adding a monitor that’s already on the page is an idempotent no-op. Adding a brand-new monitor when the org is at its max_public_components cap is a quota error; a monitor already published on another page costs nothing to add here.

The page is cached for 10 s in-process (moka single-flight, with a second moka last-known-good cache so transient ClickHouse failures don’t break the page). Changes appear on the next refresh.

Narrating an incident

The background incident writer opens an incident automatically when a public target trips the threshold; it closes it again when checks recover. Both events happen without operator action. What’s manual is the narration — the human-readable title, description, severity, and the running timeline of “investigating → identified → monitoring → resolved” entries that show up on /status and in the RSS feed.

Update the title + severity:

curl -X PATCH http://127.0.0.1:8080/api/v1/incidents/$INCIDENT_ID \
  -H 'content-type: application/json' \
  -d '{
    "public_title": "Elevated 5xx in EU-WEST",
    "public_description": "Origin rollout regression — rolling back.",
    "severity": "major"
  }'

Sending JSON null for public_title or public_description clears the field and lets the page fall back to its auto-generated wording. Omitting the field leaves it unchanged.

Append a status update to the timeline:

curl -X POST http://127.0.0.1:8080/api/v1/incidents/$INCIDENT_ID/updates \
  -H 'content-type: application/json' \
  -d '{
    "phase": "identified",
    "message": "Rolled back the offending deploy. Verifying recovery."
  }'

phase is one of investigating, identified, monitoring, resolved, postmortem. Posting resolved does not end the incident — the incident lifecycle is driven by check results, so manual “resolved” entries are advisory only. Posting an update to an already-ended incident is allowed (useful for postmortems).

Validation rules:

| Field | Rule | Error code |
|---|---|---|
| public_title | non-whitespace, ≤ 200 chars (use JSON null to clear) | EMPTY_TITLE / TITLE_TOO_LONG |
| public_description | ≤ 5 000 chars (use null to clear) | DESCRIPTION_TOO_LONG |
| message (update) | non-whitespace, ≤ 2 000 chars | EMPTY_MESSAGE / MESSAGE_TOO_LONG |
| phase (update) | exactly one of the five values above | 400 / 422 from the JSON extractor |

Scheduling maintenance

A maintenance window is a planned outage. While the window is active, the page renders affected components as Maintenance (the truth-table rule is: maintenance dominates outage, so a real failure during the window still classifies as Maintenance, not MajorOutage). On the 90-day history strip, any day that overlapped a maintenance window renders as a maintenance cell rather than an outage cell.
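The precedence rule reads as a one-line override. A minimal sketch, assuming a reduced three-state enum (the real component state type has more variants than the ones named in this chapter) and a hypothetical displayed_state helper:

```rust
/// Reduced set of component states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ComponentState {
    Operational,
    MajorOutage,
    Maintenance,
}

/// Maintenance dominates whatever the checks say while a window is active.
fn displayed_state(from_checks: ComponentState, window_active: bool) -> ComponentState {
    if window_active {
        ComponentState::Maintenance
    } else {
        from_checks
    }
}

fn main() {
    // A real failure during the window still shows as Maintenance.
    println!("{:?}", displayed_state(ComponentState::MajorOutage, true));
    // Outside the window the check result is shown as-is.
    println!("{:?}", displayed_state(ComponentState::MajorOutage, false));
    println!("{:?}", displayed_state(ComponentState::Operational, true));
}
```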

Create:

curl -X POST http://127.0.0.1:8080/api/v1/maintenance \
  -H 'content-type: application/json' \
  -d '{
    "title": "PG13 → PG16 cutover",
    "description": "Read-only for ~30 minutes.",
    "starts_at": "2026-05-14T22:00:00Z",
    "ends_at":   "2026-05-14T23:00:00Z",
    "component_ids": ["01a7b1ce-0000-7000-8000-000000000001"]
  }'

List, edit, delete:

curl 'http://127.0.0.1:8080/api/v1/maintenance?status=upcoming&limit=10'
curl -X PATCH http://127.0.0.1:8080/api/v1/maintenance/$ID \
     -H 'content-type: application/json' \
     -d '{"title": "PG cutover (postponed)"}'
curl -X DELETE http://127.0.0.1:8080/api/v1/maintenance/$ID

Validation rules:

| Field | Rule | Error code |
|---|---|---|
| title | non-whitespace, ≤ 200 chars | EMPTY_TITLE / TITLE_TOO_LONG |
| description | ≤ 5 000 chars | DESCRIPTION_TOO_LONG |
| ends_at | strictly after starts_at | INVALID_TIME_RANGE |
| ends_at - starts_at | ≤ 30 days | INVALID_DURATION |
| component_ids | every id must reference an existing target | INVALID_COMPONENT_ID |
| PATCH on a window whose ends_at is already past | rejected | 422 MAINTENANCE_COMPLETED |
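The field-level rules above can be sketched as a single validator. This is an illustration, not the service's code: validate_window is a hypothetical name, timestamps are plain Unix seconds, lengths are counted in Unicode scalar values (the source only says "chars"), the order of the checks is an assumption, and the component_ids and already-ended checks are left out because they need database state.

```rust
const MAX_TITLE_CHARS: usize = 200;
const MAX_DESCRIPTION_CHARS: usize = 5_000;
const MAX_DURATION_SECS: i64 = 30 * 24 * 60 * 60; // 30 days

/// Returns the error code from the table above, or Ok(()) if the window is valid.
fn validate_window(
    title: &str,
    description: &str,
    starts_at: i64,
    ends_at: i64,
) -> Result<(), &'static str> {
    if title.trim().is_empty() {
        return Err("EMPTY_TITLE");
    }
    if title.chars().count() > MAX_TITLE_CHARS {
        return Err("TITLE_TOO_LONG");
    }
    if description.chars().count() > MAX_DESCRIPTION_CHARS {
        return Err("DESCRIPTION_TOO_LONG");
    }
    // ends_at must be strictly after starts_at.
    if ends_at <= starts_at {
        return Err("INVALID_TIME_RANGE");
    }
    if ends_at - starts_at > MAX_DURATION_SECS {
        return Err("INVALID_DURATION");
    }
    Ok(())
}

fn main() {
    println!("{:?}", validate_window("PG cutover", "Read-only.", 0, 3_600));
    println!("{:?}", validate_window("   ", "", 0, 3_600));
    println!("{:?}", validate_window("PG cutover", "", 3_600, 3_600));
}
```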

For audit, prefer PATCHing a cancelled window’s title (e.g. "[cancelled] PG cutover") over hard-deleting historical entries.

What the public page renders

  • Banner — one of All Systems Operational, Maintenance in progress, Minor Service Disruption, Partial System Outage, Major System Outage. Driven by the worst component state, with maintenance precedence as described above.
  • Component groups — each component shows its current state, a 90-day history strip (one cell per day, oldest-first), and the operator-supplied description.
  • Active and recent incidents — operator-set public_title if present, otherwise an auto-generated "<component> <status>" string. Each incident links to a permalink at /status/incidents/{id} with the full timeline.
  • Maintenance — active + the next 7 days of upcoming windows.
  • RSS feed at /api/public/v1/incidents.rss. RSS 2.0; each item is a public incident with the latest update as the description.

Refresh behaviour

The page is statically rendered and works without JavaScript. With JS enabled, an HTMX hx-trigger="every 30s" swaps the dynamic region (the banner, the component grid, and the incident lists) without a full page reload. The chrome around it — header, footer, RSS link — stays put. A small (~35 LoC) static/js/public/tz.js helper rewrites ISO timestamps into the visitor’s local timezone tooltip; everything else is plain HTML.

Caddy and the rate-limit plugin

The public surface bypasses basic auth at the Caddy layer through an @public matcher in deployment/Caddyfile. The matcher also applies a per-IP rate limit (60 requests / minute), which requires the caddy-ratelimit plugin. The stock caddy:2-alpine image doesn’t include it — build a custom-caddy:2 image once via xcaddy. The procedure is in Deployment and deployment/README.md.

If you’d rather not maintain a custom Caddy image, comment out the rate_limit { … } block in the Caddyfile. The public surface still serves; you just lose per-IP throttling. Putting Cloudflare in front of Caddy is the other option.

Embeddable status badge

GET /api/public/v1/badge.svg returns a shields.io-style SVG badge that operators can embed in README files or external dashboards. Two modes:

<!-- Overall page status -->
![status](https://status.example.com/api/public/v1/badge.svg)

<!-- Single component -->
![api status](https://status.example.com/api/public/v1/badge.svg?component=<uuid>)

The badge reuses the cached page payload, so it tracks the /status view inside the 10-second cache window. Unknown component ids return 404 with the public error envelope; only style=flat is recognised (others return 400).

The page editor renders ready-to-copy markdown for the overall badge and each on-page component. The copyable URL is built from the page’s public origin, so on path-based/self-host deploys set auth.public_base_url to the externally reachable URL (the same value subscriber links need); otherwise the badge URL points at localhost.

?component=<uuid> works for any public component regardless of check type — an HTTP, DNS, or TLS-certificate monitor each gets its own badge that reflects that component’s current status.

Common questions

Can I have a component that’s public but doesn’t trigger incidents? No. Incident materialisation walks the same binding the page does — a monitor on any enabled page is eligible for incidents. If you want a check that’s published but not alerting, set enabled = false on the alert channels — the incident will still open, but no notification fires.

Can I publish a maintenance window without listing the affected components? No. component_ids may be empty in the request body, but the aggregator filters maintenance windows that touch zero public components out of the page (and out of the JSON), so they wouldn’t appear anywhere. List at least one public component.

What’s the cache TTL? 10 s. Single-flight: only one task computes the page when the entry expires; others wait for the result. On ClickHouse failure the last-known-good snapshot serves until the next successful compute.

How long does the 90-day history go back? Exactly 90 days, oldest day on the left. Cells with no recorded checks render as NoData (grey); the aggregator does not fabricate data.

Is there an Atom feed? No, RSS 2.0 only. Most feed readers consume both.

Per-org status pages

Each org owns one or more public status pages. A page lives at {slug}.{base_domain} in SaaS mode (acme.example.com, status.acme.example.com, …, apex-wildcard shape) and renders only the monitors that org has curated onto it, with that page’s branding, incidents, and maintenance. A new org starts with one default page (slug = the org slug) created at signup; the owner can rename it, add more pages, or take any page offline.

The number of pages an org can run is plan-capped (max_status_pages); the free plan gets one. Multiple pages let an org split surfaces — e.g. a public page and a separate internal-stakeholder page — each showing a different subset of monitors under a different URL.

This chapter is the per-org / per-page model. For the component, incident, and maintenance workflow (identical on every page) see Public status page. For the wildcard cert and reverse-proxy setup see Deployment and the full runbook in deployment/README.md.

When it applies

| Shape | Config | Public surface |
|---|---|---|
| Single-tenant | tenancy.path_based_public_routes = true (default) | the lone org’s default page, served path-based at /status on the operator host |
| Multi-tenant SaaS | tenancy.subdomain_public_routes = true, tenancy.path_based_public_routes = false | every enabled page at {slug}.{base_domain} |

Single-tenant deploys never go through subdomain routing: there is one live org, so its default page is mounted on the operator host at /status.

Path-based and subdomain public routes are mutually exclusive — serving /status on the operator host alongside subdomains would publish one page’s data at every tenant’s expected URL. Pick one.

Host routing

A page is resolved from the request Host header, not the path. The slug names a page, not an org; the lookup admits only enabled pages whose org is not soft-deleted.

| Host | Result |
|---|---|
| acme.example.com, page enabled | that page |
| acme.example.com, page disabled (draft) or org soft-deleted | 404 |
| nope.example.com, no such page slug | 404 |
| a.b.example.com (extra label) | 404 |
| example.com (no slug label, bare base) | 404 |
| missing Host header | 404 |

A page slug is globally unique (it routes a subdomain), so two orgs can never claim the same slug. base_domain must be a multi-label domain (it needs at least one dot); the boot assertion refuses an empty or single-label value, because a loose base would let the slug extractor match arbitrary Host headers.
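The shape rules in the table above (exactly one slug label in front of the base domain, nothing else) can be sketched as a pure function. This is not the service's extractor: slug_from_host, is_valid_slug, and base_domain_ok are hypothetical names, stripping a port and rejecting uppercase hosts are assumptions, and whether the slug names an enabled page is a separate database lookup not shown here.

```rust
/// Slug rule from this chapter: 3-30 chars, lowercase letters / digits /
/// hyphens, starting with a letter.
fn is_valid_slug(s: &str) -> bool {
    (3..=30).contains(&s.len())
        && s.chars().next().is_some_and(|c| c.is_ascii_lowercase())
        && s.chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

/// Boot-time check: the base domain must be non-empty and multi-label.
fn base_domain_ok(base_domain: &str) -> bool {
    !base_domain.is_empty() && base_domain.contains('.')
}

/// Extract the page slug from a Host header value, or None for a 404.
fn slug_from_host<'a>(host: Option<&'a str>, base_domain: &str) -> Option<&'a str> {
    let host = host?; // missing Host header
    let host = host.split(':').next()?; // drop an optional port (assumption)
    let suffix = format!(".{base_domain}");
    // The bare base domain and unrelated hosts fail here.
    let slug = host.strip_suffix(suffix.as_str())?;
    // An extra label leaves a '.' in the slug, which the slug rule rejects.
    if is_valid_slug(slug) {
        Some(slug)
    } else {
        None
    }
}

fn main() {
    let base = "example.com";
    assert!(base_domain_ok(base));
    for host in ["acme.example.com", "a.b.example.com", "example.com"] {
        println!("{host} -> {:?}", slug_from_host(Some(host), base));
    }
}
```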

The apex wildcard *.{base_domain} DNS record plus a wildcard TLS cert (Let’s Encrypt via the Hetzner DNS-01 challenge) means a new page works the instant it is enabled — no per-page DNS or cert step. Operator subdomains (app.{base_domain}, mail.{base_domain}, …) use explicit DNS records that take precedence over the wildcard, and the operator host is kept on its own per-host cert.

Managing pages

The org owner manages pages from the operator UI at /settings/pages (a list to create / rename / publish / delete pages) and the per-page editor at /settings/pages/{id} (URL slug, branding, logo, and which monitors appear). The same operations are available over the API:

| Endpoint | Purpose |
|---|---|
| GET /api/v1/status-pages | list this org’s pages |
| POST /api/v1/status-pages | create a page (capped at max_status_pages) |
| GET /api/v1/status-pages/{id} | one page + its live URL and logo URL |
| PATCH /api/v1/status-pages/{id} | rename, change slug, publish/unpublish, edit branding |
| DELETE /api/v1/status-pages/{id} | delete the page (its component bindings cascade) |
| GET /api/v1/status-pages/{id}/components | the monitors curated onto the page |
| POST /api/v1/status-pages/{id}/components | add a monitor to the page |
| PATCH /api/v1/status-pages/{id}/components/{target_id} | set per-page name / description / group |
| DELETE /api/v1/status-pages/{id}/components/{target_id} | remove a monitor from the page |
| POST /api/v1/status-pages/{id}/components/reorder | set the component order |
| POST /api/v1/status-pages/{id}/logo | upload a logo (multipart) |
| DELETE /api/v1/status-pages/{id}/logo | remove the logo |

Every route is scoped to the caller’s active org: a page id that isn’t in that org resolves to 404 (the same cloak as the rest of the API), so an owner of one org can neither see nor mutate another org’s page.

Page identity and branding

| Field | Rule | Default when unset |
|---|---|---|
| name | 1–80 chars; the operator-facing label in the Pages list (not shown publicly) | none (required) |
| slug | globally-unique subdomain slug; 3–30 chars, lowercase letters / digits / hyphens, starts with a letter. A rename is a hard cutover: the old URL stops working immediately | none (required) |
| enabled | published? a draft (false) 404s on its public host | off on create via the API; the signup default page is on |
| public_display_name | 1–80 chars | the org’s name |
| public_brand_color | #RRGGBB (6-digit hex) | #3b82f6 |
| public_about | Markdown, ≤ 500 chars, rendered to sanitised HTML | omitted |
| public_style | one of the named themes | default |
| public_show_powered_by | footer attribution toggle | on |
| logo | PNG / JPEG / WebP, ≤ 1 MB, ≤ 1200 px; larger images are downscaled. Format is sniffed from the bytes (the declared content-type is ignored, so a script/SVG can’t masquerade as an image) and the decoder is allocation- and dimension-bounded against decompression bombs | header shows the display name as text |

A PATCH with a branding object replaces the display fields wholesale; name, slug, and enabled are independent partial fields. The logo has its own endpoints and is never touched by a branding edit. The editor shows the live URL so the owner can preview exactly what visitors see.

Curating components

A monitor appears on a page only while a status_page_components binding exists for that (page, target) pair. Adding the monitor in the editor creates the binding; removing it deletes the binding. The per-page curation lives on the binding, so the same monitor can sit on several pages under different names:

| Per-page field | Purpose |
|---|---|
| public_name | display name on this page; falls back to the operator-side monitor name when unset (1–80 chars) |
| public_description | optional one-liner under the component name (≤ 200 chars) |
| public_group | optional group label; same value clusters together, ungrouped renders last (≤ 50 chars) |
| sort_order | integer sort key within a group (ASC); the reorder endpoint rewrites it |

The distinct-target cap, max_public_components, is per-org rather than per-page: it counts unique monitors across all of the org’s pages. A monitor already published on one page costs nothing to add to another; a brand-new monitor at the cap is rejected with a quota error. Adding a monitor already on the page is an idempotent no-op; adding a page or target that isn’t in the caller’s org is a 404, not a quota error.
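The counting rule is easiest to see with sets. The sketch below assumes an in-memory map of page id to bound target ids for one org; Bindings, AddOutcome, and add_component are hypothetical names, integer ids stand in for UUIDs, and the 404 path for foreign pages or targets is not modelled.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum AddOutcome {
    Added,
    AlreadyOnPage, // idempotent no-op
    QuotaExceeded,
}

/// One org's bindings: page id -> target ids curated onto that page.
type Bindings = HashMap<u32, HashSet<u32>>;

fn add_component(
    bindings: &mut Bindings,
    page: u32,
    target: u32,
    max_public_components: usize,
) -> AddOutcome {
    if bindings.get(&page).is_some_and(|targets| targets.contains(&target)) {
        return AddOutcome::AlreadyOnPage;
    }
    // The cap counts distinct monitors across every page of the org.
    let distinct: HashSet<u32> = bindings.values().flatten().copied().collect();
    if !distinct.contains(&target) && distinct.len() >= max_public_components {
        return AddOutcome::QuotaExceeded;
    }
    bindings.entry(page).or_default().insert(target);
    AddOutcome::Added
}

fn main() {
    let mut bindings = Bindings::new();
    let cap = 1;
    println!("{:?}", add_component(&mut bindings, 1, 10, cap)); // new monitor, under cap
    println!("{:?}", add_component(&mut bindings, 2, 10, cap)); // already published elsewhere
    println!("{:?}", add_component(&mut bindings, 2, 11, cap)); // brand-new monitor at the cap
}
```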

About text

public_about is Markdown. It is parsed and then run through an HTML sanitiser before it ever reaches a template: only p, strong, em, a, br, ul, ol, li survive, links get rel="noopener nofollow", and there is no raw-HTML escape hatch. Scripts and inline styles are stripped.

Brand colour

The colour is validated at three independent layers — the database constraint, the application validator, and again in the template right before it is written into the page’s <style>. Any value that isn’t a strict 6-digit hex falls back to the default at render time, so a relaxed constraint at one layer can’t open a CSS-injection path on its own.
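The render-time layer amounts to "strict check, else default". A minimal sketch, assuming hypothetical helper names (is_strict_hex, render_brand_color); the default value is the one listed in the branding table.

```rust
const DEFAULT_BRAND_COLOR: &str = "#3b82f6";

/// True only for '#' followed by exactly six ASCII hex digits.
fn is_strict_hex(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 7 && bytes[0] == b'#' && bytes[1..].iter().all(|b| b.is_ascii_hexdigit())
}

/// Value that is safe to write into the page's <style> block.
fn render_brand_color(stored: Option<&str>) -> &str {
    match stored {
        Some(colour) if is_strict_hex(colour) => colour,
        // Anything else (unset, short hex, injected CSS) falls back.
        _ => DEFAULT_BRAND_COLOR,
    }
}

fn main() {
    println!("{}", render_brand_color(Some("#A1b2C3")));
    println!("{}", render_brand_color(Some("red;}body{display:none")));
    println!("{}", render_brand_color(None));
}
```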

Logo storage

An uploaded image’s format is detected from its bytes, not its declared content type. The on-disk filename is derived from the page and a hash of the content, never from anything the client sends, so a crafted filename can’t escape public_status.logo_dir. Replacing or removing a logo deletes the previous file.
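Byte sniffing for the three accepted formats comes down to checking well-known file signatures. The sketch below illustrates that idea only; ImageFormat and sniff_format are hypothetical names, and the real upload path also bounds size and dimensions, which is not shown.

```rust
#[derive(Debug, PartialEq)]
enum ImageFormat {
    Png,
    Jpeg,
    WebP,
}

/// Detect the format from leading bytes; the declared content type is never consulted.
fn sniff_format(bytes: &[u8]) -> Option<ImageFormat> {
    const PNG_SIGNATURE: &[u8] = b"\x89PNG\r\n\x1a\n";
    if bytes.starts_with(PNG_SIGNATURE) {
        return Some(ImageFormat::Png);
    }
    // JPEG streams start with the SOI marker FF D8 followed by another marker byte FF.
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        return Some(ImageFormat::Jpeg);
    }
    // WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        return Some(ImageFormat::WebP);
    }
    None
}

fn main() {
    println!("{:?}", sniff_format(b"\x89PNG\r\n\x1a\n\x00\x00"));
    // An SVG uploaded with "image/png" as its declared type is still rejected.
    println!("{:?}", sniff_format(b"<svg xmlns=\"http://www.w3.org/2000/svg\">"));
}
```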

Caching and turning a page off

Each rendered page is cached for public_status.cache_ttl_secs (default 10 s), keyed by page id. A separate last-known-good layer keeps the most recent successful render per page so a transient Postgres/ClickHouse blip serves slightly stale data instead of an error. That layer is bounded by cache_max_orgs and idle-evicts after last_good_ttl_secs, so churn through many pages can’t grow it without limit.
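The two layers interact as "fresh entry, else re-render, else last good render". The sketch below shows that control flow with plain HashMaps and an explicit clock; it is not the moka-based implementation, does not model single-flight, cache_max_orgs, or idle eviction, and PageCache and its method names are hypothetical.

```rust
use std::collections::HashMap;

struct Entry {
    html: String,
    rendered_at: u64, // seconds
}

struct PageCache {
    ttl_secs: u64,
    fresh: HashMap<u32, Entry>,      // keyed by page id
    last_good: HashMap<u32, String>, // most recent successful render per page
}

impl PageCache {
    fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, fresh: HashMap::new(), last_good: HashMap::new() }
    }

    /// Serve from cache while fresh; otherwise re-render, falling back to the
    /// last good render when the render fails.
    fn get(
        &mut self,
        page_id: u32,
        now: u64,
        render: impl FnOnce() -> Result<String, String>,
    ) -> Result<String, String> {
        if let Some(entry) = self.fresh.get(&page_id) {
            if now.saturating_sub(entry.rendered_at) < self.ttl_secs {
                return Ok(entry.html.clone());
            }
        }
        match render() {
            Ok(html) => {
                self.fresh.insert(page_id, Entry { html: html.clone(), rendered_at: now });
                self.last_good.insert(page_id, html.clone());
                Ok(html)
            }
            // Storage blip: serve stale data if there is any, else the error.
            Err(err) => self.last_good.get(&page_id).cloned().ok_or(err),
        }
    }
}

fn main() {
    let mut cache = PageCache::new(10);
    println!("{:?}", cache.get(1, 0, || Ok("render A".to_string())));
    println!("{:?}", cache.get(1, 12, || Err("clickhouse timeout".to_string())));
}
```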

Unpublishing a page (enabled → false) makes the host resolver stop resolving its slug; the cache entry idles out, so the page is a 404 within one TTL window at most. Deleting a page or soft-deleting the org has the same effect (the purge worker handles the org case).

Security model

  • Published only. The public host resolver admits a page only when it is enabled and its org is not soft-deleted. A draft or deleted page’s slug resolves to 404 even though the string still exists. The authenticated org lookup is a separate function and is never used on the public path.
  • Operator sessions never reach status subdomains. The session cookie is host-only (auth.session.cookie_domain = ""), so the browser scopes it to the operator host and never sends it to *.{base_domain}. The binary refuses to boot if cookie_domain is set to a parent zone that would overlap the apex wildcard.
  • No operator surface on the page. The status page renders no operator UI, sets no cookies, and never echoes request auth headers.
  • Tenant isolation. A request for one page returns only that page’s curated monitors; the page cache and every data source are keyed by page id, and the underlying queries bind the org id, end to end. A monitor not bound to the page is never queried for it, so its operator-side name can’t leak.

Configuration

The [public_status] block and the split tenancy flags are documented in Configuration → Public status page and Configuration → Multi-tenancy mode.

Coming later: custom domains

Today every page is served under the shared *.{base_domain} apex wildcard. A future release will let an org point its own hostname (e.g. status.theirbrand.com) at a specific page:

  • the org adds a CNAME to {slug}.{base_domain} and registers the custom hostname on the page’s settings;
  • the reverse proxy issues a per-hostname certificate on demand (no wildcard for custom domains — each is a distinct name);
  • host resolution gains a custom-domain → page lookup ahead of the subdomain parser; everything downstream (cache, branding, isolation) is unchanged.

This is intentionally additive: the subdomain path keeps working as the always-available default, and nothing in the current data model blocks it. Custom domains are not available yet — track the roadmap before promising a customer a vanity status URL.

Share links

A share link gives anyone a read-only window into a single monitor — no account, no login. Open /m/{token} and you get the same detail view a logged-in member sees: live status, uptime, latency and response-time charts, recent check results, and the incident history. Paste the link in a chat channel, drop it in a ticket, or send it to a customer who needs to watch one endpoint without access to your org.

It is distinct from a status page: a status page is a branded, curated, multi-monitor public surface on its own subdomain; a share link is a capability URL to one monitor’s full dashboard.

What a viewer sees

Everything the operator detail view shows, with two deliberate differences:

  • Read-only. No edit, delete, run-check-now, enable/disable, or navigation to the rest of the app. The page is its own shell with none of the operator chrome.
  • Credentials redacted. The monitor’s check configuration is shown (so a viewer can see what is being checked and how), but any bearer_token or basic_auth is replaced with ***. The live credential never reaches the page.

The page auto-refreshes its live region and charts just like the operator view, scoped entirely to the token — it never calls an operator or API URL.

The token

Minting a link returns a 256-bit random token; the URL is /m/{token}. The token is the capability — anyone holding it can view the monitor, and forwarding the link grants access. The controls are revoke (kill it now) and an optional expiry (kill it at a set time); a link with no expiry lives until revoked or the monitor is deleted.

The link is re-copyable, like a Google Docs or Dropbox share link: open the Share modal (or the list endpoint) any time to copy the same URL again. Lost the chat you posted it in? Copy it again — you only get a new token when you revoke and create one.

Limits come from the org’s plan (plans columns, overridable per-org): the free plan allows 1 active link per monitor and shares on at most 2 distinct monitors per org. Revoke a link to free a slot.

To make re-copying possible, the token is stored encrypted at rest with the app KEK (the same Cipher that protects basic_auth/bearer_token), so a raw database or backup dump without the key yields nothing usable. The public lookup matches on a separate one-way hash, so a hot link never triggers a decrypt. With no KEK configured the token is stored in plaintext (same fallback as target credentials); if a token was sealed under a key that is later removed, the link shows as un-copyable rather than broken.

A bad, expired, revoked, or deleted-monitor token all return the same 404 — there is no signal that distinguishes “wrong token” from “revoked token”, so the surface cannot be enumerated.

From the API (member-level targets:write):

# Mint a link (optionally labelled, optionally expiring)
curl -X POST https://app.example.com/api/v1/targets/$ID/shares \
  -H 'Content-Type: application/json' \
  -d '{"label":"Slack #ops","expires_at":"2026-12-31T00:00:00Z"}'
# → { "id": "...", "label": "Slack #ops", "token": "…", "view_count": 0, ... }
# build the link as /m/{token}

# List the monitor's live links (each carries its token for re-copy)
curl https://app.example.com/api/v1/targets/$ID/shares

# Revoke one
curl -X DELETE https://app.example.com/api/v1/targets/$ID/shares/$SHARE_ID

The same actions are available from the monitor’s detail page in the UI. See the REST API for the endpoint contract.

Share links resolve on the operator app host, not on a per-tenant status subdomain. A monitor’s deletion cascades to its shares, so removing a monitor revokes every link to it.

Abuse

The surface is anonymous, so per-IP request throttling is handled at the reverse proxy. App-side, the live region is served from a short-lived shared cache and every data read inherits the same time-window and page-size limits as the operator API, bounding the cost of any single request.

Incident management

uptimepage turns a failing check into a first-class operational incident: a tracked lifecycle with acknowledgement, ownership, paging, on-call rotations, escalation, and a retrospective — not just a banner on a status page. This chapter is for operators running incident response. For the customer-facing surface it publishes to, see Public status page; for the wire-level endpoints see REST API.

The core idea: internal state is not public phase

The single most important distinction is that what your responders see is orthogonal to what your customers see. Conflating the two is the classic incident-tooling bug, so uptimepage keeps three independent axes on one incident:

| Axis | Values | Audience | Changed by |
|---|---|---|---|
| Internal state | triggered → acknowledged → resolved | Responders | Acknowledge / resolve / reopen actions |
| Public phase | investigating / identified / monitoring / resolved / postmortem | Customers on a status page | Operator-posted public updates only |
| Visibility | internal / public | | An explicit publish action |

Acknowledging an incident stops escalation and records who took it — it posts nothing to a status page. Customers see something only when you publish the incident and post a public update. An incident can run its whole internal lifecycle while staying internal.

How an incident opens

A background writer scans every enabled monitor (not only status-page components). When a monitor sustains a bad state — down, error, or degraded — it opens one incident; a sustained recovery to up resolves it automatically (with no human resolver recorded). One open incident per monitor at a time; duplicate failures fold into it.

Visibility is derived at open time: if the monitor is a component of an enabled status page the incident opens public, otherwise internal. A monitor on no page still gets a fully tracked internal incident.

You can also declare an incident by hand from the console (/incidents/declare) — for a problem a monitor can’t see, like a customer report or a partner outage. A manual incident may stand alone or link to a monitor, and opens internal.

Each incident carries a severity (minor / major / critical) and an urgency (high pages on-call, low notifies only). A declared incident takes the severity you choose; an auto-opened one currently defaults to major until an operator changes it.

The console

/incidents is the operator console — a management surface distinct from the dashboard’s at-a-glance banner. It lists incidents with severity, state, monitor, assignee, and age, filterable by state. /incidents/{id} is the detail view: header, the action bar, the trigger sample, and the activity log.

The action bar drives the lifecycle:

| Action | Effect |
|---|---|
| Acknowledge | state = acknowledged, records the first acker, stops escalation. Re-acking keeps the original acker and time. |
| Resolve | state = resolved, records the resolver. (A sustained recovery auto-resolves with no resolver.) |
| Reopen | A resolved incident returns to triggered and re-arms escalation. |
| Assign / unassign | Set or clear the owning responder. |
| Add note | Free-text entry on the internal timeline. |

Acknowledge and resolve prompt for an optional note so you can capture the why at the moment you act.
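The lifecycle actions above form a small state machine whose transitions never touch the public phase. The sketch below models that; Incident, its fields, and its methods are hypothetical names. Ignoring an acknowledge on a resolved incident is an assumption, and so is leaving the recorded acker and resolver untouched on reopen (the source does not say what happens to them).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Triggered,
    Acknowledged,
    Resolved,
}

#[derive(Debug)]
struct Incident {
    state: State,
    acked_by: Option<String>,
    resolved_by: Option<String>,
    escalating: bool,
    /// Customer-facing phase; only a public update would change it.
    public_phase: Option<&'static str>,
}

impl Incident {
    fn open() -> Self {
        Self {
            state: State::Triggered,
            acked_by: None,
            resolved_by: None,
            escalating: true,
            public_phase: None,
        }
    }

    fn acknowledge(&mut self, who: &str) {
        if self.state == State::Resolved {
            return; // assumption: acking a resolved incident is ignored
        }
        // Re-acking keeps the original acker.
        if self.acked_by.is_none() {
            self.acked_by = Some(who.to_string());
        }
        self.state = State::Acknowledged;
        self.escalating = false;
    }

    /// `who` is None for an automatic recovery.
    fn resolve(&mut self, who: Option<&str>) {
        self.state = State::Resolved;
        self.resolved_by = who.map(str::to_string);
        self.escalating = false;
    }

    fn reopen(&mut self) {
        if self.state == State::Resolved {
            self.state = State::Triggered;
            self.escalating = true; // re-arm escalation
        }
    }
}

fn main() {
    let mut incident = Incident::open();
    incident.acknowledge("alice@example.com");
    incident.acknowledge("bob@example.com");
    incident.resolve(None);
    incident.reopen();
    println!("{incident:?}");
}
```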

The activity log

Every lifecycle action writes an append-only event to the incident’s internal timeline. Each entry answers who, when, and what: the acting member’s email (system-driven transitions show system; an action taken through the MCP server is badged via MCP), an exact timestamp, and any note. This is the audit trail: tracking a response well depends on responders habitually leaving notes, and the log makes that habit visible.

Paging and escalation

When an incident opens, the escalation engine pages the responsible channels. Paging reuses the existing Slack / Discord / Teams / Google Chat / Telegram (one-tap linked or bring-your-own bot) / WhatsApp / Webhook transports (see Configuration); email and SMS are not wired yet. Telegram rate-limit responses are honoured: a 429 with retry_after pushes the retry out at least that far.

An escalation policy is an ordered ladder of levels. Each level waits a delay, then pages its targets; if no one acknowledges, the engine advances to the next level, and can repeat the ladder a configured number of times before giving up. Acknowledging the incident halts the walk.
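To make the ladder walk concrete, the sketch below computes when each level would page. Several details are assumptions rather than documented behaviour: each level's delay is measured from the previous page, repeats counts extra passes after the first, and an acknowledgement at or before a page time suppresses that page. Level and page_schedule are hypothetical names.

```rust
struct Level {
    delay_secs: u64,
}

/// Returns (seconds since the incident opened, level index) for every page sent.
fn page_schedule(levels: &[Level], repeats: u32, acked_at: Option<u64>) -> Vec<(u64, usize)> {
    let mut pages = Vec::new();
    let mut elapsed = 0u64;
    // First pass plus `repeats` additional passes over the ladder.
    for _pass in 0..=repeats {
        for (index, level) in levels.iter().enumerate() {
            elapsed += level.delay_secs;
            // Acknowledging the incident halts the walk.
            if acked_at.is_some_and(|ack| ack <= elapsed) {
                return pages;
            }
            pages.push((elapsed, index));
        }
    }
    pages
}

fn main() {
    let ladder = [
        Level { delay_secs: 0 },
        Level { delay_secs: 300 },
        Level { delay_secs: 600 },
    ];
    // Nobody acknowledges: every level pages once.
    println!("{:?}", page_schedule(&ladder, 0, None));
    // Acknowledged 400 s in: the third level is never paged.
    println!("{:?}", page_schedule(&ladder, 0, Some(400)));
}
```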

A policy’s targets can be:

  • a channel — pages that notification channel directly;
  • a user — pages the channels that member has chosen to be reached on (see on-call below);
  • a schedule — resolves who is on call right now and pages them.

Policies are owner-managed at /settings/escalation: build the ladder, set per-level targets, and pick an org-default policy. Bind a specific policy to a monitor from the monitor’s edit form. Resolution at page time is: the monitor’s own policy, else the org default, else simple mode — the monitor’s bound notification channels are paged directly, with no laddered re-paging.

One notification source. Every down/up notification flows through the incident engine — there is no separate per-monitor alert dispatch, so a monitor can never double-page. The escalation.enabled switch gates only the policy machinery (ladder walk, policy UI); with it off, monitors still page their bound channels in simple mode.
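The resolution order and the escalation.enabled switch described above reduce to a small pure function. The sketch below is illustrative only; `Paging` and `resolve_paging` are hypothetical names, and policy ids are modelled as plain integers.

```rust
/// How an incident's pages are dispatched.
#[derive(Debug, PartialEq, Eq)]
pub enum Paging {
    /// Walk the ladder of this escalation policy (id is illustrative).
    Policy(u64),
    /// Page the monitor's bound notification channels directly.
    Simple,
}

/// Monitor's own policy, else the org default, else simple mode.
/// With escalation disabled, everything falls back to simple mode.
pub fn resolve_paging(
    escalation_enabled: bool,
    monitor_policy: Option<u64>,
    org_default: Option<u64>,
) -> Paging {
    if !escalation_enabled {
        return Paging::Simple;
    }
    match monitor_policy.or(org_default) {
        Some(id) => Paging::Policy(id),
        None => Paging::Simple,
    }
}
```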

While an incident stays unacknowledged, the engine re-sends a reminder on the monitor’s renotify_interval_secs cadence (default hourly, 0 disables); acknowledging or resolving stops both the reminders and any escalation walk. Failed deliveries retry on exponential backoff and are dead-lettered after the attempt cap. Every attempt is auditable: the incident detail page has a Delivery section, and GET /api/v1/incidents/{id}/notifications returns the same log.
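A rough sketch of the retry decision follows. The base delay and the doubling factor are assumptions (the document only says "exponential backoff"); `next_step` and `Next` are hypothetical names. The attempt cap corresponds to max_attempts, and a provider's retry_after is treated as a floor, as described for Telegram.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Next {
    /// Try again after this many seconds.
    RetryIn(u64),
    /// Attempt cap reached: stop retrying and dead-letter the page.
    DeadLetter,
}

/// Decide what to do after the `failed_attempts`-th failure.
/// Assumption: the delay doubles per failure, starting at `base_secs`.
pub fn next_step(
    failed_attempts: u32,
    max_attempts: u32,
    base_secs: u64,
    retry_after_secs: Option<u64>,
) -> Next {
    if failed_attempts >= max_attempts {
        return Next::DeadLetter;
    }
    // Cap the exponent so the shift cannot overflow.
    let exp = failed_attempts.saturating_sub(1).min(20);
    let backoff = base_secs.saturating_mul(1u64 << exp);
    // A rate-limit hint pushes the retry out at least that far.
    Next::RetryIn(backoff.max(retry_after_secs.unwrap_or(0)))
}
```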

On-call schedules

On-call schedules (owner-managed at /settings/on-call) decide which human a user or schedule target pages.

A schedule has a timezone and one or more layers. Higher layers win when stacked. Within a layer, participants rotate in listed order on a cadence:

| Rotation | Handoff |
| --- | --- |
| daily / weekly | Hands off at the same wall-clock time each period, in the schedule’s timezone; stable across daylight-saving changes. |
| custom | A fixed number of seconds. |

Overrides cover a specific window with a chosen person (vacations, swaps) and beat the rotation while active. The editor’s calendar builds one by clicking a start day, then an end day, then choosing who covers. A “who’s on call now” widget resolves the current responder, and GET /api/v1/on-call/who answers it programmatically.

Resolution at page time, for a given instant: an override covering that instant wins; otherwise the highest layer that has participants, advanced by its rotation. The result is a set of users.
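The resolution rule can be sketched for the custom (fixed-seconds) rotation, which avoids timezone arithmetic. All type and field names here are hypothetical; the sketch assumes layers are passed lowest first, that each layer has an anchor timestamp from which rotation is counted, and that override windows are half-open.

```rust
/// A window during which one person covers (start inclusive, end exclusive).
pub struct OverrideWindow {
    pub start: u64,
    pub end: u64,
    pub user: String,
}

/// A rotation layer with a fixed-length shift.
pub struct Layer {
    pub participants: Vec<String>,
    pub rotation_secs: u64,
    pub anchor: u64, // assumption: rotation index 0 starts here
}

/// Overrides covering `now` win; otherwise the highest layer that has
/// participants decides, advanced by its rotation.
pub fn on_call(overrides: &[OverrideWindow], layers: &[Layer], now: u64) -> Vec<String> {
    let covering: Vec<String> = overrides
        .iter()
        .filter(|o| o.start <= now && now < o.end)
        .map(|o| o.user.clone())
        .collect();
    if !covering.is_empty() {
        return covering;
    }
    // `layers` is ordered lowest to highest, so scan from the end.
    for layer in layers.iter().rev() {
        if layer.participants.is_empty() || layer.rotation_secs == 0 {
            continue;
        }
        let elapsed = now.saturating_sub(layer.anchor);
        let idx = (elapsed / layer.rotation_secs) as usize % layer.participants.len();
        return vec![layer.participants[idx].clone()];
    }
    Vec::new()
}
```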

Contact channels

A resolved user is paged through the org channels they have opted into — each member picks, on the on-call page, which notification channels reach them. A user/schedule target therefore resolves to people, then to their chosen channels; the paging log records the targeted user alongside the channel. If a member has chosen no channels, they resolve but cannot be paged.

Publishing to a status page

Internal incidents never reach customers. Publishing is the explicit gate.

Every public read (the status page, its JSON API, the RSS feed, and the history markers) filters on visibility = 'public', so an internal incident on a public-component monitor never leaks. Incidents on monitors that sit on an enabled status page open as public automatically; everything else (manual incidents, monitors not on a page) stays internal until you publish.

From the incident detail page, publish flips visibility to public (optionally seeding a public title) and unpublish hides it again. A published incident appears on any status page whose components include its monitor. Narrate it for customers with public updates (the investigating → monitoring → resolved timeline); posting an update is separate from the internal state, exactly as the two-axis model intends.

Postmortems

A resolved incident can carry one postmortem — a retrospective with a summary, root cause, impact, and a list of action items (each with optional owner and a done flag). Write it from the incident detail page (write / edit postmortem).

Publishing a postmortem surfaces it on the public incident page: customers see the summary, root cause, impact, and the action-item text and done state. Internal detail — the action-item owner — is never exposed publicly. A draft stays private until you publish, and publish/unpublish are recorded on the incident’s activity timeline with the acting member, so the retrospective’s own history is auditable.

Metrics and reporting

/incidents/reports is a metrics dashboard over a trailing window (7 / 30 / 90 days):

  • MTTA — mean time to acknowledge (acknowledged_at − started_at).
  • MTTR — mean time to resolve (ended_at − started_at).
  • Total incidents, counts by severity and by state, auto-resolved vs human-resolved, and the noisiest monitors.
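As a worked illustration of the two means, here is a minimal sketch over in-memory rows. The `Incident` struct mirrors the column names used above; excluding incidents that were never acknowledged (or not yet resolved) from the respective mean is an assumption, not something the document states.

```rust
/// Timestamps in seconds; field names follow the columns named above.
pub struct Incident {
    pub started_at: u64,
    pub acknowledged_at: Option<u64>,
    pub ended_at: Option<u64>,
}

fn mean(values: impl Iterator<Item = u64>) -> Option<f64> {
    let (sum, n) = values.fold((0u64, 0u64), |(s, n), v| (s + v, n + 1));
    if n == 0 { None } else { Some(sum as f64 / n as f64) }
}

/// Mean of (acknowledged_at - started_at) over acknowledged incidents.
pub fn mtta_secs(incidents: &[Incident]) -> Option<f64> {
    mean(incidents.iter().filter_map(|i| {
        i.acknowledged_at.map(|a| a.saturating_sub(i.started_at))
    }))
}

/// Mean of (ended_at - started_at) over resolved incidents.
pub fn mttr_secs(incidents: &[Incident]) -> Option<f64> {
    mean(incidents.iter().filter_map(|i| {
        i.ended_at.map(|e| e.saturating_sub(i.started_at))
    }))
}
```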

The same numbers are available to automation through the MCP get_incident_metrics tool.

MCP tools

An LLM connected through the MCP server can triage and operate incidents within its granted scopes: read the incident list and detail, read metrics, and — with write scope — acknowledge, resolve, and post public updates. Customer-supplied incident text is always returned as labelled data, never as instructions. See MCP server for the full tool table and scopes.

Auth and scopes

| Surface | Requirement |
| --- | --- |
| Incident lifecycle (ack / assign / resolve / note / publish / declare) | incidents:write (any member; responders are not owners) |
| Reading incidents and metrics | incidents:read |
| Escalation policies + on-call schedules (config) | oncall:write (owner-only); oncall:read to view |

There is no incident-delete: incidents are resolved, never deleted, to keep the audit trail intact. Owner and member are the only roles — any member can be assigned, put on a schedule, paged, and can operate an incident; owners manage the escalation/on-call configuration.

Configuration

The [escalation] block (env prefix UPTIMEPAGE_ESCALATION__*) controls the engine:

| Key | Default | Purpose |
| --- | --- | --- |
| enabled | false | Enable escalation policies (ladder walk + policy/on-call UI). Off, incidents still page the monitor’s bound channels directly (simple mode). |
| tick_interval_secs | 15 | How often the engine sweeps for due escalations and failed-page retries. |
| max_pages_per_tick | 500 | Backpressure cap on pages re-sent per sweep. |
| max_attempts | 5 | Give up paging a channel after this many failed attempts. |

Per-org limits (max_escalation_policies, max_on_call_schedules, on_call_enabled) are plan quotas; see Quotas & rate limits.

Multi-tenancy

uptimepage runs as a multi-tenant SaaS from a single binary. The active org is always resolved from the authenticated session; there is no compile-time “self-host vs SaaS” mode and no ambient default org.

A single-tenant deployment is just a SaaS deployment where you sign up as the first user — the OAuth callback creates the user, an auto-provisioned org and the owner membership in one transaction. Teams who would rather skip the OAuth round-trip can seed users + organizations + memberships directly with a one-shot SQL script.

The org model

Three tables form the access-control core:

organizations ── memberships ── users
                     │
                     └── role: 'owner' | 'member'

Every tenant-scoped table (targets, incidents, incident_updates, maintenance_windows, maintenance_window_components, notification_channels, …) carries org_id NOT NULL and an ON DELETE CASCADE foreign key to organizations. ClickHouse check_results and check_results_1m are partitioned by (org_id, target_id, ts) so single-org queries never full-scan the table.

Slugs

Org slugs are case-insensitive (CITEXT), 3–30 characters, must start with a lowercase letter, and otherwise contain [a-z0-9-] only — no leading or trailing hyphen and no consecutive hyphens. A static reserved list (api, admin, login, …) is rejected at creation.
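The slug rules translate directly into a validator. This is a sketch rather than the project's own function: `is_valid_slug` is a hypothetical name, and the reserved list holds only the three entries named above (the real list is longer). Case-insensitive uniqueness is left to the database.

```rust
/// Only the entries the text names; the real reserved list is longer.
const RESERVED: &[&str] = &["api", "admin", "login"];

/// Check a candidate slug against the documented shape rules.
pub fn is_valid_slug(slug: &str) -> bool {
    let bytes = slug.as_bytes();
    // 3 to 30 characters; byte length equals char length once the
    // charset check below has ruled out non-ASCII input.
    let len_ok = (3..=30).contains(&bytes.len());
    let charset_ok = bytes
        .iter()
        .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || *b == b'-');
    // Starting with a letter also rules out a leading hyphen.
    let starts_with_letter = matches!(bytes.first(), Some(b) if b.is_ascii_lowercase());
    let no_trailing_hyphen = bytes.last() != Some(&b'-');
    let no_double_hyphen = !slug.contains("--");
    len_ok
        && charset_ok
        && starts_with_letter
        && no_trailing_hyphen
        && no_double_hyphen
        && !RESERVED.contains(&slug)
}
```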

The placeholder slug a brand-new user’s first org gets at signup takes the shape {adj}-{noun}-{6char} from inline word lists in src/domain/word_lists.rs. The signup transaction returns Ok(None) on a slug collision so the caller wraps the generate-and-insert pair in a 5-attempt retry loop; the birthday-paradox tail above 5 retries is astronomically small. Users typically rename the slug after signup from settings; the org’s default status page is created with the same slug, which the owner can change independently in the page editor.
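The generate-and-insert retry can be sketched generically. `create_with_retry` is a hypothetical helper, not the caller in the codebase; it assumes the insert step reports a collision as Ok(None) (as described above) and propagates any other error immediately.

```rust
/// Retry a generate-and-insert pair up to `max_attempts` times.
/// `try_insert` returns Ok(None) on a slug collision, Ok(Some(_)) on
/// success, and Err(_) for anything that should not be retried.
pub fn create_with_retry<T, E>(
    max_attempts: u32,
    mut generate: impl FnMut() -> String,
    mut try_insert: impl FnMut(&str) -> Result<Option<T>, E>,
) -> Result<Option<T>, E> {
    for _ in 0..max_attempts {
        let slug = generate();
        if let Some(created) = try_insert(&slug)? {
            return Ok(Some(created));
        }
        // Collision: loop and draw a fresh candidate.
    }
    // Every attempt collided; the caller decides how to surface this.
    Ok(None)
}
```

Keeping the collision as a value rather than an error lets the loop distinguish "draw again" from a genuine database failure.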

Three-org owner limit

A user can be owner of at most free_tier_owner_org_limit (default 3) active organizations. The limit is enforced in a single SQL statement that puts the count subquery inside the INSERT … WHERE …, so two concurrent creates cannot both win. Soft-deleted orgs do not count against the cap. Invited memberships (role member) are unlimited.

Soft delete and the 30-day purge

Deletion is two-phase (a soft delete, then a purge once the restore window has passed) to give operators a recovery window and to keep ClickHouse rows from being orphaned forever.

  1. Soft delete. DELETE /api/v1/orgs/{id} sets organizations.deleted_at = now(). The org disappears from the user’s switcher and every URL referencing it returns 404, because is_active_member only matches orgs where deleted_at IS NULL.
  2. Restore window. The original deleter can call POST /api/v1/orgs/{id}/restore within deletion_grace_period_days (default 30); the slug stays held to prevent squatting during this window.
  3. Purge. A daily job (src/jobs/retention.rs) runs at 03:00 UTC. It first runs the soft-delete purge (src/jobs/purge_deleted.rs::purge_tick):
    • Selects up to 10 orgs whose deleted_at is past the grace window.
    • Per org, in one PG transaction: insert into clickhouse_purge_queue (idempotent via ON CONFLICT (org_id) DO NOTHING), then DELETE FROM organizations; ON DELETE CASCADE empties every tenant table.
    • Drains pending queue rows by issuing ALTER TABLE check_results DELETE WHERE org_id = ? against ClickHouse for each. The mutation is idempotent; a process restart between halves replays cleanly.
    • Then hard-deletes up to 10 soft-deleted users past the grace window that hold no live (unexpired, unused) recovery token. The users ON DELETE CASCADE erases memberships, oauth_identities, api_tokens, invitations, sessions and recovery tokens; rows referencing the user as an actor (login_attempts, org_audit_log, quota_events, plan_overrides) are kept with the actor nulled.

The same daily job then enforces long-horizon data retention from the [retention] config: it deletes login_attempts, quota_events and org_audit_log rows past their windows and reaps sessions that are absolute-expired or idle past auth.session.idle_timeout_days. ClickHouse check_results retention is the table’s own TTL (background merge), kept equal to retention.check_results_days. Short-cadence security sweeps (OAuth-state, magic-link) keep their own faster loops — their frequency is the property.

The outbox table (clickhouse_purge_queue) is the load-bearing piece. A naive “DELETE in PG, then DELETE in CH” sequence leaves CH rows orphaned if the worker dies between the two calls: invisible to queries but on disk forever, which would break the “data fully erased within 30 days” privacy claim.

Per-org caches

AppState keeps tenant-derived caches keyed by OrgId so one tenant’s data cannot leak into another’s response:

| Cache | Type | TTL |
| --- | --- | --- |
| dashboard_cache | moka::sync::Cache<OrgId, Arc<DashboardSummary>> | 5 s |
| public_status::cache::PageCache | moka::future::Cache<StatusPageId, Arc<PageData>> | 10 s |
| PageCache::last_good | moka::sync::Cache<StatusPageId, Arc<PageData>> | retained across inner’s TTL eviction for stale-fallback |

The public-page caches are keyed by StatusPageId, not OrgId: an org can run several pages, each rendering a different subset of monitors, so the cache unit is the page. The underlying aggregator query still binds the org id, so a page only ever sees its own org’s data. PageCache::get_or_compute does per-page single-flight via moka’s try_get_with, so a thundering herd against one page doesn’t fan out into N expensive aggregator builds.

Public status routes gating

Public-status routing has two shapes, gated by tenancy.path_based_public_routes and tenancy.subdomain_public_routes. Path-based routing (/status, /api/public/v1/* on the operator host, scoped to the single live org) is the default and is correct only for a single-tenant deploy. Multi-tenant deployments must flip to subdomain routing ({slug}.{base_domain}) — otherwise every visitor sees the lone org’s data regardless of which slug they expected. The binary panics at boot on the dangerous combinations (subdomain routes with an empty base_domain, or a cookie_domain that overlaps the status wildcard); see Public status routing for the full flag matrix.

Tenant-isolation invariants

These are checked in CI:

  • Every runtime SQL statement against a tenant table must include org_id in its WHERE clause. Enforced by scripts/check_tenant_isolation.sh via an ast-grep rule. The allow-listed call sites are src/storage/admin.rs (AdminRepo, cross-tenant by design), src/storage/orgs.rs (operates on the organizations table itself), and src/jobs/purge_deleted.rs (drains soft-deleted orgs and users across tenants).
  • Every ClickHouse SELECT … WHERE target_id = … must have a sibling org_id = ? term. Enforced by scripts/check_clickhouse_org_scope.sh.
  • A Postgres trigger on every child table (incident_updates, maintenance_window_components) raises on org_id mismatch between child and parent rows.
  • An integration test (tests/tenant_isolation_test.rs) provisions two orgs and asserts every per-org store backed by Postgres or ClickHouse only sees its own org’s rows.

If you add a new tenant-scoped table or a new repository, make sure both ast-grep rules cover it before merge.

Org-management API

See REST API for full schemas. The catalogue:

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/orgs | Create org (slug, name); caller becomes owner |
| GET | /api/v1/orgs | List orgs the caller is a member of |
| GET | /api/v1/orgs/{id} | Get one org (member-only) |
| PATCH | /api/v1/orgs/{id} | Edit org (owner-only) |
| DELETE | /api/v1/orgs/{id} | Soft-delete (owner-only) |
| POST | /api/v1/orgs/{id}/restore | Restore within the grace window (only by the deleter) |
| GET | /api/v1/orgs/check-slug?slug=… | Slug availability for signup forms |
| GET | /api/v1/orgs/{id}/members | List members (owner-only) |
| DELETE | /api/v1/orgs/{id}/members/{user_id} | Remove a member (owner-only) |
| PATCH | /api/v1/orgs/{id}/members/{user_id} | Change a member’s role (owner-only; refuses to demote the last owner) |
| POST | /api/v1/me/active-org | Switch the session’s active org |
| GET | /api/v1/me/orgs | Active (non-deleted) orgs |
| GET | /api/v1/me/deleted-orgs | Soft-deleted orgs you deleted (restore UI) |

Multi-region probes

Run checks from more than one location and keep every result attributed to the region that produced it. A single control plane owns all state (Postgres, ClickHouse, the web UI, alerting, and a scheduler for its own region); additional boxes run as stateless agents that pull their region’s monitor config and ship results back.

This is opt-in. A default deployment is a single region — the control plane checks everything itself and nothing below changes.

Model

  • Control plane — one process holding Postgres + ClickHouse + the web UI + alerting + a scheduler. Its own region is a normal region row identified by scheduler.region (default "default"); rename it to a real location, it is not a sentinel.
  • Agent — a process started with [agent] enabled = true. It runs no database, web UI, or alerting. It pulls its region’s decrypted monitor config from the control plane over authenticated HTTPS, runs the checks locally, and POSTs results back to the central ingest API. Agents never touch ClickHouse or fire alerts.
  • Region is the partition key. One agent per region needs no coordination — there is no leader election. (Running more than one agent in the same region, or more than one control plane, is out of scope for this version.)

New targets are assigned to scheduler.default_region (empty falls back to scheduler.region). At boot the control plane reconciles the configured region rows and backfills any unassigned target to the default region, so enabling regions never leaves a target unchecked.

Running an agent

On the agent box, point at the control plane and name the region. The token carries the agent’s capability — supply it by environment variable, never in a committed file:

[agent]
enabled = true
control_plane_url = "https://app.example.com"
region = "eu-west"
pull_interval_secs = 30
flush_interval_secs = 5
buffer_capacity = 10000

UPTIMEPAGE_AGENT__TOKEN=sm_agent_…   # the token minted by POST /operator/agents

The agent must reference a region and a token that already exist (see the operator surface below). Pull and ingest behaviour:

  • Pull (GET /api/agent/targets) — 401/403 is terminal: the agent clears its cached config and pauses, so revoking or disabling the agent stops the probe. 5xx/timeout is transient: it keeps serving the last-known config. Responses are content-hashed with an ETag, so a credential re-encrypt invalidates the cache even without a config change.
  • Ingest (POST /api/agent/results) — region and agent id are taken from the token, never trusted from the body. Rows that are clock-skewed or belong to a region the agent isn’t assigned are dropped per-row (the rest of the batch still lands) and counted, rather than rejecting the whole batch. Cross-process de-duplication is authoritative in ClickHouse; a re-sent identical batch is idempotent.
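The per-row drop behaviour on ingest can be sketched as a partition over the batch. The `Row` shape, the `Dropped` counters, and the `max_skew_secs` threshold are all hypothetical; the document does not give the skew limit or the wire format.

```rust
/// One result row as submitted by an agent (illustrative fields only).
pub struct Row {
    pub region: String, // region the row's target is assigned to
    pub target_id: u64,
    pub ts: i64, // check timestamp, seconds since epoch
}

/// Counters for rows that were dropped instead of failing the batch.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Dropped {
    pub clock_skew: usize,
    pub wrong_region: usize,
}

/// Keep rows that match the token's region and sit within the allowed
/// clock skew; count the rest. The region comes from the token,
/// never from the request body.
pub fn filter_batch(
    rows: Vec<Row>,
    token_region: &str,
    now: i64,
    max_skew_secs: i64,
) -> (Vec<Row>, Dropped) {
    let mut kept = Vec::new();
    let mut dropped = Dropped::default();
    for row in rows {
        if (row.ts - now).abs() > max_skew_secs {
            dropped.clock_skew += 1;
        } else if row.region != token_region {
            dropped.wrong_region += 1;
        } else {
            kept.push(row);
        }
    }
    (kept, dropped)
}
```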

Operator surface

Regions and agents are managed instance-wide (across all tenants) under /operator/*, gated by a static bearer secret. Set it by environment variable; an empty value disables the surface entirely (it 404s, so it is invisible when off):

UPTIMEPAGE_OPERATOR__ADMIN_TOKEN=…

Authorization: Bearer <that-secret>

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /operator/regions | list regions |
| POST | /operator/regions | create a region (id is a [a-z0-9-] slug, name, optional location) |
| PATCH | /operator/regions/{id} | rename / relocate, or enable / disable a region (enabled) |
| DELETE | /operator/regions/{id} | delete a region; 409 while it still holds agents or assigned targets |
| GET | /operator/agents | list agents |
| POST | /operator/agents | mint an agent; the response carries its sm_agent_… token once |
| PATCH | /operator/agents/{id} | rename / enable / disable an agent |
| DELETE | /operator/agents/{id} | delete an agent |

The agent token is shown only at creation; store it when it is minted. Disabling an agent takes effect on its next pull. There is no token-rotation endpoint yet; rotate by deleting and re-creating the agent.

Disabling a region stops it being scheduled and stops config-pull for it (its agents receive no targets) while keeping its stored history — a reversible alternative to deleting, which the foreign keys block while the region is in use.

A typical bring-up: create the region, mint an agent in it, copy the token to the agent box’s UPTIMEPAGE_AGENT__TOKEN, start the agent.

Viewing per-region data

Once results carry a region, the operator surfaces let you slice by it:

  • Dashboard — a region: filter in the subhead (shown only when the org spans more than one region) scopes every fleet metric to one region. ?region= is reflected in the URL.
  • Monitor detail — a region selector scopes the KPI cards, latency and breakdown charts, and recent results. In the all-regions view the latency chart overlays one p95 line per region, and a by region table summarises uptime, p50, p95, and last status per region. Pick a region to drill into a single line.
  • REST API: /api/v1/targets/{id}/results, /latency, and /uptime accept an optional region= query parameter; /api/v1/targets/{id}/latency/by-region returns one series per region. GET /api/v1/regions lists the enabled region catalog and GET/PUT /api/v1/targets/{id}/regions read and set a monitor’s assignment, all under targets:read/targets:write. See REST API.

What deliberately blends across regions: the public status page’s component status (the public “is it up” answer is region-agnostic by design), the monitors list, and incident timelines. Those aggregate every region so a viewer sees one verdict.

Incident detection across regions

Detection evaluates each region’s recent run independently and then combines the verdicts, so one region’s transient network blip can’t corrupt the picture for a target probed from several places. There is always exactly one incident per target — its region is unset.

How the per-region verdicts combine is a per-monitor policy, set on the monitor form (default majority):

  • any — open as soon as a single region is sustained-unhealthy.
  • majority — open once more than half the regions agree it’s down (the standard defence against a single-location false positive).
  • all — open only when every region is down.
  • count: N — open once at least N regions are down.

A monitor probed from a single region behaves the same under every policy.
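The four policies can be expressed as one predicate over "regions down" and "regions probing". This is a sketch with hypothetical names (`RegionPolicy`, `should_open`). Clamping N to the number of regions is an assumption, made so that a single-region monitor behaves the same under count: N as under the other policies.

```rust
/// How per-region verdicts combine into one open/no-open decision.
#[derive(Debug, Clone, Copy)]
pub enum RegionPolicy {
    Any,
    Majority,
    All,
    Count(usize),
}

/// `down` = regions currently sustained-unhealthy, `total` = regions probing.
pub fn should_open(policy: RegionPolicy, down: usize, total: usize) -> bool {
    if total == 0 {
        return false; // nothing is probing, so there is no verdict
    }
    match policy {
        RegionPolicy::Any => down >= 1,
        // Strictly more than half, so 1 of 2 and 2 of 4 do not open.
        RegionPolicy::Majority => down * 2 > total,
        RegionPolicy::All => down == total,
        // Assumption: N is clamped into 1..=total.
        RegionPolicy::Count(n) => down >= n.clamp(1, total),
    }
}
```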

See Configuration for the [scheduler], [agent], and [operator] keys, and Architecture for where the pieces sit.

Authentication

uptimepage ships with an in-binary auth stack: GitHub and Google OAuth for the operator UI, opaque per-user API tokens for the REST surface, and optional magic-link sign-in for users without an OAuth identity. The binary always runs as multi-tenant SaaS; single-tenant deployments are just SaaS with one signed-up user. See Multi-tenancy for the full model.

Concepts

  • User. A row in users, keyed by id. Email is CITEXT. A user can belong to multiple orgs.
  • Session. A 32-byte random id (43 base64url chars) stored in an HttpOnly; Secure; SameSite=Lax cookie, default _sm_session. Backed by a sessions row with idle + absolute timeouts.
  • API token. An opaque bearer token (sm_live_…) presented in the Authorization: Bearer … header. Stored as an argon2id hash plus a 16-char prefix for indexed lookup. Returned once at create time and never again.
  • Org. Container for the user-visible data (targets, incidents, maintenance, …). Memberships carry a role: Owner, Member.
  • Invitation. A pending row in invitations carrying an argon2id hash of a single-use token sent to a prospective member’s email.
  • Magic-link token. A single-use row in magic_link_tokens (auth.magic_link.expiry_minutes, default 15). Enabled by default; gated by auth.enabled_methods.

Flows

OAuth sign-in (GitHub, Google)

Both providers share one callback runner; only the upstream identity fetch differs. The callback is split into three strict phases:

  1. Phase A: DELETE … RETURNING consumes the oauth_states row in one statement (provider-bound: a state minted for one provider cannot complete another’s callback). No upstream call has happened yet, so the DB connection is released before any HTTP.
  2. Phase B — exchange code for an access token, then fetch the profile: GitHub /user + /user/emails (verified primary only), Google OIDC userinfo (email accepted only with email_verified). No DB connection is held.
  3. Phase C — a fresh transaction materialises the user + identity, links a new provider to an existing account on verified-email match (restoring a soft-deleted account if needed), auto-creates a signup org if this is a new sign-up, and commits. The user’s default org (oldest active membership) is resolved after commit for the session row.

After commit, the previous session cookie (if any) is destroyed for session-fixation defence, a fresh session row is INSERTed, the cookie is set, and the user is redirected. Failure modes:

  • Invalid or expired state → 400 INVALID_STATE, logged to login_attempts.
  • User denied consent / provider sent no code → redirect back to /login, logged with failure_reason = "oauth_denied" (or "missing_code").
  • Upstream failure → 500, logged with failure_reason = "oauth_upstream_failed" (rows from before 2026-06 carry the old "github_upstream_failed").
  • Disabled (enabled_methods) or incompletely configured provider → 404 AUTH_METHOD_UNAVAILABLE on both start and callback; the listed-but-misconfigured case logs a warning.

API token auth

Bearer tokens skip the cookie path entirely. The middleware checks the Authorization: Bearer … header against the api_tokens table via the indexed token_prefix (first 16 chars of the raw token), then argon2-verifies the survivor. last_used_at is updated through the same 60-second debounce as session cookies.

CSRF protection does not apply: browsers do not attach the Authorization header automatically on cross-origin requests, so there is no forgery surface.

To manage resources with a token as code, see Terraform. To let an LLM client query and act on an org with a token, see the MCP server.

Scopes

Every token carries a set of resource:action scopes. A request is rejected with 403 INSUFFICIENT_SCOPE unless the token holds the scope its endpoint requires. full_access is a superset that grants all of them; unknown scope strings are ignored (forward-compatible).

| Resource | read | write | delete | execute |
| --- | --- | --- | --- | --- |
| targets | list / get / results / uptime / latency / incident history | create / update / bulk | delete, bulk-delete | run a check now, test-probe a config |
| channels | list / get | create / update | delete | send a test notification |
| incidents | — (target incident history is under targets:read; the public timeline needs no token) | narrate / post update | | |
| maintenance | list / get | create / update | delete | |
| status_page | read settings | update settings, upload logo | remove logo | |

write implies read for the same resource. delete and execute are independent — they are not granted by write, so a config-management token (*:write) can change resources but cannot destroy them or trigger side effects. Grant delete/execute explicitly when you need them.
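The scope rules amount to a short check. The sketch below uses a hypothetical `allows` function over plain strings; the real token model is not shown in this document.

```rust
/// Does a token holding `granted` satisfy the endpoint's `required` scope?
pub fn allows(granted: &[&str], required: &str) -> bool {
    // full_access is a superset of every scope.
    if granted.contains(&"full_access") {
        return true;
    }
    // Exact match. Unknown strings in `granted` never match anything,
    // which is how they end up ignored.
    if granted.contains(&required) {
        return true;
    }
    // write implies read for the same resource; delete and execute
    // are never implied and must be granted explicitly.
    if let Some((resource, "read")) = required.split_once(':') {
        let write = format!("{resource}:write");
        return granted.iter().any(|g| *g == write);
    }
    false
}
```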

Org binding

A token is user-scoped, so each request names an org via the X-Uptimepage-Org: <slug> header. A token can additionally be bound to one org at creation:

  • Bound — the header is optional; if sent it must name the bound org, else 403 ORG_HEADER_MISMATCH. The token can never act on the user’s other orgs.
  • Unbound — the header is required (400 ORG_REQUIRED if absent). A malformed/unknown slug is 400 ORG_HEADER_INVALID on either kind.
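The bound/unbound rules can be sketched as a single resolver. `resolve_org` and `OrgError` are hypothetical names; the variants map to the error codes above. Checking the header against a list of known slugs, and doing so before the mismatch check, are assumptions about ordering that the document does not spell out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum OrgError {
    OrgRequired,       // 400 ORG_REQUIRED
    OrgHeaderInvalid,  // 400 ORG_HEADER_INVALID
    OrgHeaderMismatch, // 403 ORG_HEADER_MISMATCH
}

/// Pick the org a bearer request acts on.
/// `bound` is the token's org binding (if any), `header` the
/// X-Uptimepage-Org value (if sent), `known_slugs` the slugs that
/// resolve to a real org (illustrative stand-in for a lookup).
pub fn resolve_org(
    bound: Option<&str>,
    header: Option<&str>,
    known_slugs: &[&str],
) -> Result<String, OrgError> {
    if let Some(h) = header {
        if !known_slugs.contains(&h) {
            return Err(OrgError::OrgHeaderInvalid);
        }
    }
    match (bound, header) {
        (Some(b), None) => Ok(b.to_string()),
        (Some(b), Some(h)) if b == h => Ok(b.to_string()),
        (Some(_), Some(_)) => Err(OrgError::OrgHeaderMismatch),
        (None, Some(h)) => Ok(h.to_string()),
        (None, None) => Err(OrgError::OrgRequired),
    }
}
```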

Expiry

A token may carry an expiry (1–365 days); an expired token authenticates as invalid. Tokens without an expiry never lapse — prefer a bounded lifetime.

Managing tokens

Token management — create, list, rename, revoke — is browser-session only: these endpoints read the session cookie and reject bearer tokens, so a token can never mint another token (which would escape its own scopes) or reach account/org administration. Mint tokens in the UI at Settings → API tokens (a verified email is required).

Magic-link sign-in

Available only when auth.enabled_methods contains "magic_link":

  1. POST /auth/magic-link/request {email} — generates a 32-byte token, hashes it, INSERTs into magic_link_tokens with a 15-minute expiry, and emails the verify URL via the configured EmailSender. Anti-enumeration: the response is identical for known, unknown, and malformed emails — {"sent": true}.
  2. GET /auth/magic-link/verify?token=… atomically marks the row used_at = now(), destroys any pre-login session, mints a new session (restoring a soft-deleted account, since email ownership is the re-auth proof), auto-accepts a carried invitation, and redirects by priority: /?joined=<slug> → /?invite=missed (carried invitation failed to redeem) → /?restored=1 (welcome-back banner) → carried redirect_after → /. An invalid, used, or expired token renders an HTML “link expired” page with status 410: one indistinguishable state, no JSON error envelope.
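The redirect priority after a successful verify is a first-match chain. A minimal sketch, with a hypothetical `redirect_target` function and parameters standing in for whatever the handler actually tracks:

```rust
/// Choose the post-verify redirect, highest priority first.
pub fn redirect_target(
    joined_slug: Option<&str>,    // carried invitation was redeemed
    invite_missed: bool,          // carried invitation failed to redeem
    restored: bool,               // a soft-deleted account was restored
    redirect_after: Option<&str>, // destination carried through login
) -> String {
    if let Some(slug) = joined_slug {
        return format!("/?joined={slug}");
    }
    if invite_missed {
        return "/?invite=missed".to_string();
    }
    if restored {
        return "/?restored=1".to_string();
    }
    // Fall back to the carried destination, else the dashboard root.
    redirect_after.unwrap_or("/").to_string()
}
```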

The schema and email template ship in v1 even when the flow is gated, so flipping the config doesn’t require a migration.

Invitations

Owners issue invitations to email addresses. The recipient gets emailed accept/decline links embedding the raw token (single-use, hashed at rest with the same argon2id parameters as API tokens).

  • GET /invitations/accept?token=…: with a session, redeems right there (clicking the emailed link is the consent; email must match); without one, bounces to /login?invitation=… and every sign-in method carries the invitation through and auto-accepts after login. The session’s active org rotates to the joined org and the dashboard shows a “welcome to <org>” banner (/?joined=<slug>). A carried invitation that can’t be redeemed (mismatched email, seat race, revoked) never breaks the login; the dashboard shows a generic “invitation couldn’t be applied” banner instead.
  • GET /invitations/decline?token=… — render-only confirm page (mail scanners prefetch links, so the GET never mutates); its button POSTs the decline.
  • A magic link requested for an unknown email that carries a valid invitation for that same address bootstraps the account at verify time: user created (verified, consent stamped, no personal org) and joined directly into the inviter’s org. Without a matching invitation, unknown emails still get the indistinguishable invalid-link page.
  • A seat-race loser’s invitation is un-consumed (accepted_at reverted), so “try again once a seat frees up” stays true.

Endpoints

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| GET | /login | none | Login page (HTML) |
| GET | /auth/github/login | none | Initiate GitHub OAuth |
| GET | /auth/github/callback | none | Handle OAuth callback |
| POST | /auth/logout | session | Destroy current session |
| POST | /auth/logout-all | session | Destroy all sessions for current user |
| POST | /auth/magic-link/request | none | Request magic link (gated) |
| GET | /auth/magic-link/verify | none | Verify magic-link token (gated) |
| GET | /auth/google/login | none | Initiate Google OAuth |
| GET | /auth/google/callback | none | Handle Google OAuth callback |
| GET | /invitations/accept | optional session | Emailed accept link (HTML; redeems with session, else login bounce) |
| GET | /invitations/decline | none | Emailed decline link (HTML confirm page; POST does the decline) |
| GET | /api/v1/me | session/token | Current user info |
| GET | /api/v1/me/sessions | session | List active sessions |
| DELETE | /api/v1/me/sessions/{id} | session | Revoke a session |
| GET | /api/v1/me/api-tokens | session | List tokens (prefix only) |
| POST | /api/v1/me/api-tokens | session | Create token (returned once) |
| PATCH | /api/v1/me/api-tokens/{id} | session | Rename token |
| DELETE | /api/v1/me/api-tokens/{id} | session | Revoke token |
| POST | /api/v1/orgs/{org_id}/invitations | session, owner | Issue invitation |
| GET | /api/v1/orgs/{org_id}/invitations | session, owner | List pending |
| DELETE | /api/v1/orgs/{org_id}/invitations/{id} | session, owner | Revoke |
| POST | /api/v1/invitations/accept | session | Accept (token in body) |
| POST | /api/v1/invitations/decline | none | Decline (token in body) |

Security model

  • CSRF. State-changing cookie-authenticated requests must carry X-Requested-With: uptimepage. Bearer requests skip the check. The header value is compared in constant time via subtle::ConstantTimeEq.
  • Session fixation. Both the OAuth callback and the magic-link verify endpoint destroy any pre-existing session bound to the browser before minting the new one.
  • Hashed PII. IP addresses and User-Agent strings in sessions, login_attempts, and magic_link_tokens are stored as HMAC-SHA256(salt, value); the salt lives in auth.fingerprint_salt / auth_salt_history. After a salt rotation the binary refuses to boot unless an explicit override env var is set, which makes audit-log breakage loud.
  • Argon2id parameters. Default parameters from the argon2 crate (Argon2::default()). Tokens carry 256 bits of entropy, so the factor of safety is in the token, not the params.
  • Anti-enumeration. Magic-link request and invitation lookup return the same response whether the underlying row exists.
  • Per-email send throttle. auth.magic_link.rate_limit_seconds (default 60) caps a single address to one outgoing email per window regardless of source IP. The check runs inside the spawned send task so it never branches the response path. Concurrent requests for the same address all still INSERT (preserving anti-enum work) but only the earliest row in the window — ordered by (created_at, id) — actually mails the user. Set to 0 to disable.
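The CSRF rule from the list above can be sketched with the standard library alone. The hand-written `ct_eq` is only a stand-in for subtle::ConstantTimeEq (the compiler gives no timing guarantee for it), `csrf_ok` is a hypothetical name, and treating GET, HEAD, and OPTIONS as the non-state-changing methods is an assumption.

```rust
const CSRF_HEADER_VALUE: &[u8] = b"uptimepage";

/// Compare without an early exit on the first differing byte.
/// Stand-in for subtle::ConstantTimeEq; not a hardened primitive.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Decide whether a request passes the CSRF gate.
pub fn csrf_ok(method: &str, bearer_auth: bool, x_requested_with: Option<&str>) -> bool {
    // Assumption: these methods are treated as read-only.
    let state_changing = !matches!(method, "GET" | "HEAD" | "OPTIONS");
    // Bearer requests skip the check, as do read-only requests.
    if bearer_auth || !state_changing {
        return true;
    }
    matches!(x_requested_with, Some(v) if ct_eq(v.as_bytes(), CSRF_HEADER_VALUE))
}
```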

Background workers

  • oauth_state_cleanup: DELETE FROM oauth_states WHERE expires_at < now() every 10 minutes.
  • invitations::purge_old — daily cleanup of accepted/declined/expired rows older than a configurable window.
  • magic_link_cleanup — every 6 hours when magic_link is in auth.enabled_methods. Drops expired rows and used rows older than 7 days (the forensic window for “was this token redeemed?”). When the method is disabled the routes 404 and no rows are ever inserted, so the ticker stays asleep.

Sign-in audit

Every authentication attempt — success or failure — writes a row to login_attempts:

  • method: 'github_oauth' | 'api_token' | 'magic_link'
  • success boolean
  • failure_reason text ('invalid_state', 'token_expired', 'invalid_token', …)
  • ip_hash, user_agent_hash for forensic correlation without storing raw PII

The “recent activity” panel on the user’s settings page reads from this table.

Deployment shape

Every authenticated request carries an active org id; data writes scope through repositories that enforce isolation. The cross-tenant test suite confirms a user can’t read or mutate another org’s rows via slug URL or session token. Single-tenant deployments work the same way — they just have one user and one org. See docs/multi-tenancy.md for the data model and isolation guarantees.

MCP server

uptimepage exposes a Model Context Protocol server so an LLM client — the claude.ai connector, Claude Desktop, an IDE, or MCP Inspector — can answer operational questions about one organization and take a few guarded actions, through typed, authorized, audited tools.

It is another authorized front door to the same stores the web app and /api/v1 use, not a bypass: tenant isolation, scopes, rate limits, and audit all apply. Every tool takes the org from the credential — never from a tool argument — so a connection can only ever see and touch its own org.

  • Transport — Streamable HTTP at POST/GET /mcp, served on its own host (mcp.{DOMAIN} in production).
  • Auth — an org-bound scoped API token (sm_live_…), minted either by hand (Settings → API tokens) or by the one-click OAuth 2.1 connector flow.
  • Surface — 10 read tools (always) + 4 write tools (each scope-gated, confirmed per action, and audited).

The server only mounts when enabled (see Enabling); a deployment that leaves it off never exposes /mcp.

Tools

All tools return typed structuredContent. Customer free text (monitor names, group names, tags, error messages, incident text) is returned as labelled data, never as instructions to the model — the server’s instructions tell the client to treat it that way.

Read tools

Side-effect-free (readOnlyHint). Require targets:read, status_page:read, or incidents:read.

Tool | Scope | Returns
--- | --- | ---
get_org_health | targets:read | Per-state monitor totals + the worst currently-failing monitors, each with its open incident_id. The one-shot “what is broken right now?” answer; start here.
list_monitors | targets:read | Monitors with optional state / type / tag filters, cursor-paginated; each item carries current state + last-checked time.
get_monitor | targets:read | One monitor’s config, current state, last error, last HTTP status, and 24h / 30d uptime.
get_monitor_history | targets:read | One monitor’s history over a window (1h / 24h / 7d / 30d): uptime, latency series, failures with error text, incident windows.
list_incidents | incidents:read | Currently-open incidents on the org’s status pages: incident id, affected monitor, severity, latest update phase. Cursor-paginated.
get_incident | incidents:read | One incident: affected monitor, severity, open/resolved times, error sample, and the full operator-update timeline.
get_incident_metrics | incidents:read | Incident metrics over a trailing window (default 30 days): MTTA/MTTR, total, counts by severity and state, auto- vs human-resolved, and the noisiest monitors.
list_status_pages | status_page:read | The org’s status pages: slug, name, public URL, enabled. Cursor-paginated.
get_status_page | status_page:read | One status page with its components and each linked monitor’s current state.
get_org_usage | targets:read | Resource usage against plan limits (monitors, status pages, members, components) + key policy values.

A status-page monitor is down → get_org_health gives the incident_id → get_incident shows the timeline → acknowledge_incident posts an update. Incidents (and the incident_id / ack workflow) exist only for monitors that are status-page components; a monitor not on any status page can be failing with incident_id: null, and its since value still reports how long it’s been down. run_check_now and get_monitor return http_status for HTTP monitors so you can tell “wrong status code” from “no response”.

Write tools

Not read-only. Each requires its scope and an interactive confirmation before it runs, and writes exactly one audit row for every outcome (success, declined, denied, error).

Tool | Scope | Effect
--- | --- | ---
run_check_now | targets:execute | Probe a monitor immediately and record the result. A down result may fire the org’s normal alerts.
pause_monitor | targets:write | Stop a monitor’s checks until resumed. Idempotent.
resume_monitor | targets:write | Restart a paused monitor’s checks. Idempotent.
acknowledge_incident | incidents:write | Post an update to an incident; it appears on the public status page. Optional phase (investigating / identified / monitoring / resolved / postmortem, default investigating) and an explicit notify choice (no default).

Write scopes are never granted unless explicitly requested — the OAuth connector defaults to read-only (see Scopes).

Authentication

The /mcp endpoint is an OAuth 2.1 protected resource. It accepts an Authorization: Bearer sm_live_… token that must be:

  • a live scoped API token,
  • bound to one org and held by a current member of that org (an unbound token is rejected, because the connection has no org header to fall back on),
  • carrying the scope each tool requires (else 403 insufficient_scope), and
  • when OAuth is configured, stamped with this endpoint as its audience (RFC 8707) — a token minted for a different audience is refused.

A request with no/invalid token gets 401 with a WWW-Authenticate: Bearer … header pointing at the resource metadata, which kicks off discovery for OAuth clients.
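
The token checks above, combined with the per-tool scopes from the Tools tables, can be sketched as one authorization function. The tool-to-scope mapping is copied from those tables; the Token shape, the AuthError variants, and the order of the checks are assumptions made for illustration.

```rust
/// Scope each tool requires, as listed in the read and write tool tables.
fn required_scope(tool: &str) -> Option<&'static str> {
    match tool {
        "get_org_health" | "list_monitors" | "get_monitor"
        | "get_monitor_history" | "get_org_usage" => Some("targets:read"),
        "list_incidents" | "get_incident" | "get_incident_metrics" => Some("incidents:read"),
        "list_status_pages" | "get_status_page" => Some("status_page:read"),
        "run_check_now" => Some("targets:execute"),
        "pause_monitor" | "resume_monitor" => Some("targets:write"),
        "acknowledge_incident" => Some("incidents:write"),
        _ => None,
    }
}

#[derive(Debug, PartialEq)]
enum AuthError {
    Unauthorized,      // 401: missing or invalid token
    UnboundToken,      // token is not bound to an org
    InsufficientScope, // 403 insufficient_scope
    UnknownTool,
}

/// Hypothetical view of a validated bearer token.
struct Token<'a> {
    org_id: Option<u64>,
    scopes: &'a [&'a str],
}

/// Returns the org id the call runs under. The org always comes from the
/// credential; there is no tool argument that could override it.
fn authorize(token: Option<&Token<'_>>, tool: &str) -> Result<u64, AuthError> {
    let token = token.ok_or(AuthError::Unauthorized)?;
    let org_id = token.org_id.ok_or(AuthError::UnboundToken)?;
    let needed = required_scope(tool).ok_or(AuthError::UnknownTool)?;
    if token.scopes.contains(&needed) {
        Ok(org_id)
    } else {
        Err(AuthError::InsufficientScope)
    }
}
```

Membership and audience checks are left out of the sketch; they would sit alongside the org-binding check before any scope comparison.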

Two ways to get a token

1. By hand (manual connector). Mint an org-bound, read-only, expiring token in the UI (Settings → API tokens; a verified email is required) and paste it into the client. Grant the least scope you need: targets:read + status_page:read + incidents:read for the read tools. This is the simplest path for Claude Desktop / Inspector and needs only UPTIMEPAGE_MCP__ENABLED.

2. One-click OAuth (claude.ai connector). With UPTIMEPAGE_MCP__OAUTH_ENABLED on, the client discovers the authorization server, you log in with your existing session and approve a consent screen, and the server mints the same org-bound expiring token behind the scenes, with no copy-paste. This is the only path that mints write scopes, and only when the consent screen’s opt-in boxes are checked.

Why OAuth at all?

The manual path works but pushes a long-lived bearer token through copy-paste and client config. OAuth replaces that with a browser consent: the user authenticates against the existing login, the connector receives a short-lived access token plus a rotating refresh token, and the connection lifetime (refresh-token lifetime) is the user’s explicit choice on the consent screen (default 90 days, max 365 — there is deliberately no “never”). Reused refresh tokens revoke the whole family. The connector never sees the user’s password and the access token is bound to this one resource.
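
Refresh-token rotation with family revocation on reuse can be sketched as follows. This is an in-memory illustration under stated assumptions: TokenFamily is a hypothetical type, tokens are plain integers rather than random opaque strings, and lifetimes and persistence are omitted.

```rust
use std::collections::HashSet;

/// One connection's refresh-token family (illustrative, in-memory only).
struct TokenFamily {
    current: u64,          // the only token that may be exchanged
    retired: HashSet<u64>, // tokens already rotated out
    revoked: bool,
    next: u64,
}

#[derive(Debug, PartialEq)]
enum RefreshError {
    Revoked,
    Unknown,
}

impl TokenFamily {
    fn new(first: u64) -> Self {
        TokenFamily { current: first, retired: HashSet::new(), revoked: false, next: first + 1 }
    }

    /// Exchange a refresh token for its successor.
    fn refresh(&mut self, presented: u64) -> Result<u64, RefreshError> {
        if self.revoked {
            return Err(RefreshError::Revoked);
        }
        if self.retired.contains(&presented) {
            // An already-rotated token came back: revoke the whole family.
            self.revoked = true;
            return Err(RefreshError::Revoked);
        }
        if presented != self.current {
            return Err(RefreshError::Unknown);
        }
        self.retired.insert(presented);
        self.current = self.next;
        self.next += 1;
        Ok(self.current)
    }
}
```

Revoking on reuse means that a stolen refresh token stops working as soon as either party replays an old value, at the cost of forcing the legitimate client through consent again.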

OAuth endpoints

Discovery + authorization-server endpoints live on the app host (where the session cookie lives); the protected resource is /mcp on its own host.

Endpoint | Host | Purpose
--- | --- | ---
/.well-known/oauth-protected-resource | resource (mcp.) | RFC 9728 resource metadata (resource id, authorization servers, scopes)
/.well-known/oauth-authorization-server | app | RFC 8414 AS metadata (PKCE S256 only, public clients, code + refresh grants)
/oauth/register | app | RFC 7591 Dynamic Client Registration
/oauth/authorize | app | Login + consent screen (PKCE S256, RFC 8707 resource)
/oauth/token | app | Issue / refresh the audience-bound token

Redirect URIs are restricted to HTTPS hosts (web connectors) and loopback HTTP (local tooling); custom schemes, non-loopback cleartext, userinfo, and fragments are rejected at registration.
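
The redirect-URI rule can be sketched with the standard library alone. This is a simplified, hand-rolled check for illustration, not the server's validator: redirect_uri_allowed and its helpers are hypothetical names, and treating localhost as loopback is an assumption.

```rust
use std::net::IpAddr;

/// Hypothetical registration-time check for a redirect URI.
fn redirect_uri_allowed(uri: &str) -> bool {
    // Fragments are rejected outright.
    if uri.contains('#') {
        return false;
    }
    let Some((scheme, rest)) = uri.split_once("://") else {
        return false;
    };
    // The authority is everything up to the first '/' or '?'.
    let authority = rest.split(|c: char| c == '/' || c == '?').next().unwrap_or("");
    // Userinfo (user:pass@host) is rejected.
    if authority.is_empty() || authority.contains('@') {
        return false;
    }
    match scheme.to_ascii_lowercase().as_str() {
        "https" => true,
        "http" => is_loopback(host_of(authority)), // cleartext only on loopback
        _ => false,                                // custom schemes
    }
}

/// Strip an optional :port, keeping bracketed IPv6 literals intact.
fn host_of(authority: &str) -> &str {
    if authority.starts_with('[') {
        match authority.find(']') {
            Some(end) => &authority[..=end],
            None => authority,
        }
    } else {
        authority.split(':').next().unwrap_or(authority)
    }
}

fn is_loopback(host: &str) -> bool {
    if host.eq_ignore_ascii_case("localhost") {
        return true; // assumption: the name localhost counts as loopback
    }
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    bare.parse::<IpAddr>().map(|ip| ip.is_loopback()).unwrap_or(false)
}
```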

GET /oauth/authorize renders the consent screen — the one page the user sees during the OAuth flow. It appears after login, once the client + redirect URI are validated, and only when mcp.oauth_enabled is on. Approving here is what mints the token; nothing is granted until the user clicks Approve.

It shows:

  • Who and what — the client name and the single org it’s connecting to. Access is always scoped to that one org.
  • Granted abilities — one line per scope, in plain language (e.g. “Read your monitors and their current status”, “Pause and resume your monitors”). Write abilities are flagged with a ⚠ marker, and a warning banner appears at the top stating the connection can make changes — each of which still asks for per-action confirmation.
  • Connection expires — a picker (30 / 60 / 90 / 365 days, default 90) that sets the refresh-token (connection) lifetime. There is no “never”.
  • Approve / Deny — Deny aborts the flow; Approve mints the org-bound scoped token and returns the user to the client.

A read-only request shows “wants read-only access” with no warning banner; a request that includes any write scope switches to the “is requesting access” wording plus the banner and ⚠ markers.

Scopes

The connector advertises six grantable scopes. A request with no scope (or only unknown scopes) grants the read-only default; write scopes are opt-in.

Scope | Grants | In default set?
--- | --- | ---
targets:read | all read tools over monitors | yes
status_page:read | status-page read tools | yes
incidents:read | list_incidents, get_incident, get_incident_metrics | yes
targets:write | pause_monitor, resume_monitor | opt-in
targets:execute | run_check_now | opt-in
incidents:write | acknowledge_incident | opt-in

A granted write scope is necessary but not sufficient — every write tool still asks the user to confirm the specific action at call time.
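
The defaulting rule for the scope parameter can be sketched as a small resolver. The scope names come from the table above; resolve_scopes is a hypothetical name, and dropping unknown scopes from a mixed request is an assumption. The sketch covers only request parsing; the consent screen's opt-in boxes still decide whether a requested write scope is actually minted.

```rust
const READ_DEFAULT: [&str; 3] = ["targets:read", "status_page:read", "incidents:read"];
const WRITE_OPT_IN: [&str; 3] = ["targets:write", "targets:execute", "incidents:write"];

/// Resolve a space-delimited OAuth `scope` parameter into known scopes.
/// An empty request, or one naming only unknown scopes, yields the
/// read-only default set.
fn resolve_scopes(requested: &str) -> Vec<&'static str> {
    let known: Vec<&'static str> = READ_DEFAULT
        .iter()
        .chain(WRITE_OPT_IN.iter())
        .copied()
        .filter(|scope| requested.split_whitespace().any(|r| r == *scope))
        .collect();
    if known.is_empty() {
        READ_DEFAULT.to_vec()
    } else {
        known
    }
}
```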

Confirmations

Before any write tool acts, the server sends an MCP elicitation request describing the exact action (the monitor’s name, the effect, and — for acknowledge_incident — the message and notify choice). The tool proceeds only on an explicit approval; a decline, a dismissal, or a client that can’t elicit all fail closed with not_confirmed. There is no “remember my choice” — each action is confirmed on its own.

Audit

Every write-tool invocation writes one row to mcp_audit, on every path — success, user-declined, scope-denied, bad input, not-found, or server error — recording: actor_type = mcp, the token id, the acting user + org, the tool name, the arguments, the outcome (success / denied / error), and a short detail code. The same event is emitted to tracing. Reads are not audit-logged (they’re side-effect-free and already rate-limited).
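
The fail-closed confirmation and the one-row-per-invocation audit rule fit together roughly as below. This is a sketch under assumptions: the Confirmation and AuditRow types are hypothetical, the row carries only a subset of the recorded fields, the "ok" detail code is invented for illustration, and mapping a declined confirmation to the denied outcome is a guess at how the three outcomes are assigned.

```rust
/// What came back from the elicitation round-trip.
enum Confirmation {
    Approved,
    Declined,
    Dismissed,
    Unsupported, // the client cannot elicit at all
}

#[derive(Debug, PartialEq)]
struct AuditRow {
    actor_type: &'static str,
    tool: String,
    outcome: &'static str, // "success" | "denied" | "error"
    detail: &'static str,
}

/// Runs `action` only after explicit approval and returns the single
/// audit row for this invocation, whatever the path taken.
fn run_write_tool<F>(tool: &str, confirmation: Confirmation, action: F) -> AuditRow
where
    F: FnOnce() -> Result<(), &'static str>,
{
    let (outcome, detail) = match confirmation {
        Confirmation::Approved => match action() {
            Ok(()) => ("success", "ok"),
            Err(code) => ("error", code),
        },
        // Decline, dismissal, and non-eliciting clients all fail closed.
        Confirmation::Declined | Confirmation::Dismissed | Confirmation::Unsupported => {
            ("denied", "not_confirmed")
        }
    };
    AuditRow { actor_type: "mcp", tool: tool.to_string(), outcome, detail }
}
```

Building the row from the same match that gates the action is one way to keep the count at exactly one per invocation: no path returns without producing it.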

Enabling

Off by default. Config keys (TOML under [mcp], or env with the UPTIMEPAGE_ prefix and __ nested separator):

Key | Env | Default | Purpose
--- | --- | --- | ---
mcp.enabled | UPTIMEPAGE_MCP__ENABLED | false | Mount /mcp + the read/write tools.
mcp.oauth_enabled | UPTIMEPAGE_MCP__OAUTH_ENABLED | false | Add the OAuth 2.1 endpoints that back the one-click connector.
mcp.resource_uri | UPTIMEPAGE_MCP__RESOURCE_URI | empty | Canonical absolute URI of /mcp: the OAuth resource id + RFC 8707 audience, e.g. https://mcp.uptimepage.dev/mcp. Empty disables audience binding (static-token mode).
mcp.allowed_origins | UPTIMEPAGE_MCP__ALLOWED_ORIGINS | empty | RFC 6454 Origin allow-list (DNS-rebinding defense). Empty disables the check; a missing Origin header always passes (non-browser clients send none).
mcp.access_token_ttl_secs | UPTIMEPAGE_MCP__ACCESS_TOKEN_TTL_SECS | 3600 | Access-token lifetime (short; auto-renewed via the rotating refresh token).

When OAuth is on, the app refuses to boot unless mcp.resource_uri and auth.public_base_url are real HTTPS origins — the issuer and audience must be well-formed. Migrations 016 (OAuth) + 017 (audit) must be applied.

Production (GitHub-managed)

The deploy pipeline upserts the two switches from repo variables (Settings → Secrets and variables → Actions → Variables):

  • MCP_ENABLED=true
  • MCP_OAUTH_ENABLED=true

deploy.yml writes the corresponding UPTIMEPAGE_MCP__* keys into the server .env on each deploy. The resource URI defaults to https://mcp.{UPTIMEPAGE_DOMAIN}/mcp; mcp.{DOMAIN} rides the existing *.{DOMAIN} wildcard cert + Caddy route (no new DNS). See deployment/.env.example and Deployment.

Connecting a client

claude.ai connector (OAuth)

Settings → Connectors → Add custom connector → URL https://mcp.{DOMAIN}/mcp → Connect. You’ll be sent to the login + consent screen; approve, and the tools appear. This exercises the full OAuth path and is the recommended end-user flow.

Claude Desktop / IDE (manual token via mcp-remote)

mcp-remote bridges a local stdio client to the remote Streamable HTTP endpoint. Add to your client config:

{
  "mcpServers": {
    "uptimepage": {
      "command": "npx",
      "args": [
        "-y", "mcp-remote",
        "https://mcp.uptimepage.dev/mcp",
        "--header", "Authorization: Bearer sm_live_YOUR_TOKEN"
      ]
    }
  }
}

For a local dev server over plain HTTP, add --allow-http to the args.

MCP Inspector (testing)

npx @modelcontextprotocol/inspector

Set transport Streamable HTTP, URL https://mcp.uptimepage.dev/mcp, and an Authorization: Bearer sm_live_… header. Inspector lists every tool with its schema and lets you exercise the elicitation approve/deny flow.

Examples

Raw protocol (curl)

The transport is JSON-RPC over Streamable HTTP. initialize returns a session id the client echoes on later calls.

# initialize → 200 + Mcp-Session-Id response header
curl -sD- https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize",
       "params":{"protocolVersion":"2025-11-25",
                 "capabilities":{},"clientInfo":{"name":"curl","version":"0"}}}'

# list tools (reuse the session id from the initialize response)
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'

# call a tool: open incidents on your status pages
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call",
       "params":{"name":"list_incidents","arguments":{}}}'

# read one incident's timeline (id from list_incidents or get_org_health)
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":4,"method":"tools/call",
       "params":{"name":"get_incident","arguments":{"id":"INCIDENT_ID"}}}'

Write tools (acknowledge_incident, pause_monitor, …) follow the same tools/call shape but the client must support elicitation — curl can’t approve the confirmation, so they’re driven from a real MCP client.

A missing/invalid token returns 401 with WWW-Authenticate: Bearer …; a wrong Host returns 403; a missing MCP-Protocol-Version on a non-initialize call returns 400; notifications get 202.

Asking an LLM

Once connected, drive it in natural language — the client picks the tool:

  • “What’s broken in my org right now?” → get_org_health
  • “Show me every DNS monitor that’s degraded.” → list_monitors(type=dns, state=degraded)
  • “How has the checkout API done over the last 7 days?” → get_monitor_history(window=7d)
  • “What incidents are open, and what’s been posted on them?” → list_incidents → get_incident
  • “Acknowledge the payments incident — we’re investigating.” → acknowledge_incident(phase=investigating) (asks you to confirm)
  • “Am I near any plan limits?” → get_org_usage
  • “Run a check on the payments monitor now.” → run_check_now (asks you to confirm; may alert)
  • “Pause the staging monitor.” → pause_monitor (asks you to confirm)

Security model

  • Org isolation. Org comes from the token, never an argument; the token must be org-bound and the holder a live member. The cross-tenant guarantees in Multi-tenancy apply unchanged.
  • Least privilege. Read-only by default; write scopes are opt-in and each write is separately confirmed and audited.
  • Audience binding. With OAuth on, tokens are pinned to this /mcp resource (RFC 8707), so a token leaked from elsewhere can’t be replayed here.
  • DNS-rebinding defense. The transport enforces a Host allow-list (the configured resource host) and an optional Origin allow-list.
  • Prompt-injection posture. Customer-supplied text is returned as labelled data and the server instructions tell the client not to treat it as commands — but the ultimate guard is that the dangerous tools are scope-gated and human-confirmed.
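
The Host and Origin checks behind the DNS-rebinding defense can be sketched as two predicates. The Origin semantics follow the mcp.allowed_origins description (empty list disables the check, missing header passes); the function names, the case-insensitive comparison, and rejecting a missing Host header are assumptions.

```rust
/// Origin allow-list check. An absent Origin header always passes
/// (non-browser clients send none) and an empty list disables the check.
fn origin_allowed(allow_list: &[&str], origin: Option<&str>) -> bool {
    match origin {
        None => true,
        Some(_) if allow_list.is_empty() => true,
        Some(o) => allow_list.iter().any(|a| a.eq_ignore_ascii_case(o)),
    }
}

/// Host allow-list check against the configured resource host.
/// Assumption: an exact, case-insensitive match with no port handling.
fn host_allowed(resource_host: &str, host_header: Option<&str>) -> bool {
    matches!(host_header, Some(h) if h.eq_ignore_ascii_case(resource_host))
}
```

A request failing the Host check corresponds to the 403 on a wrong Host noted in the curl examples.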

Configuration

Defaults live in config/default.toml. Every key can be overridden by an environment variable using the prefix UPTIMEPAGE_ and __ as the nested separator.

Example: UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080

Override UPTIMEPAGE_CONFIG_PATH to point at an alternate base config file.
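
The key-to-variable mapping is mechanical, so it can be written as a one-line helper. env_name is a hypothetical function used here only to illustrate the naming rule; it is not part of the codebase.

```rust
/// Map a dotted config key to its environment-variable override:
/// prefix UPTIMEPAGE_, upper-case, and "__" for each nesting level.
fn env_name(key: &str) -> String {
    format!("UPTIMEPAGE_{}", key.replace('.', "__").to_ascii_uppercase())
}
```

Under this rule server.api_bind maps to UPTIMEPAGE_SERVER__API_BIND, and a doubly nested key such as storage.clickhouse.url would map to UPTIMEPAGE_STORAGE__CLICKHOUSE__URL.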

Sections

Section | Key | Purpose
--- | --- | ---
server | api_bind, metrics_bind | bind addresses for REST API and Prometheus exporter
runtime | worker_threads, max_blocking_threads | Tokio runtime sizing (0 = num_cpus)
checker | max_concurrent_checks | global concurrency cap enforced by worker pool semaphore
checker | default_timeout_ms, connect_timeout_ms | client-side timeouts applied to outbound checks
checker | default_check_interval_secs | fallback interval when target spec omits it
checker | per_host_max_inflight, rdap_max_inflight | per-(org, host, port) and per-TLD RDAP concurrency caps. Fail-fast bulkhead: over-cap checks return a degraded result instead of queueing
http_client | tcp_keepalive_secs, user_agent | per-check connection keep-alive (one request’s lifetime; checks connect fresh, no pool) and the outbound User-Agent
dns | cache_size, positive_ttl_secs, negative_ttl_secs, servers | hickory resolver; point at internal resolvers when needed
security | allow_private_targets | SSRF guard: when false (default) any target resolving to loopback / private / link-local / reserved IPs is rejected
security | credentials_kek_base64 | 32-byte base64 key encrypting basic_auth / bearer_token at rest. Empty (default) stores plaintext (dev only)
circuit_breaker | failure_threshold, success_threshold, open_duration_secs, half_open_max_calls | per-host breaker state machine
storage.postgres | url, max_connections, min_connections, acquire_timeout_secs | target metadata store
storage.clickhouse | url, database, user, password, batch_size, batch_timeout_ms, buffer_size | result sink and pipeline back-pressure
scheduler | target_refresh_interval_secs, jitter_pct | how often the registry is reconciled against Postgres, and how much jitter is applied to each target’s tick
scheduler | region, default_region | this control plane’s own region id (a normal region row, default "default") and the region new targets are assigned to (empty falls back to region). See Multi-region probes
agent | enabled, control_plane_url, region, pull_interval_secs, flush_interval_secs, buffer_capacity | run this process as a stateless regional probe instead of a control plane. token is env-only (UPTIMEPAGE_AGENT__TOKEN). Off by default. See Multi-region probes
operator | admin_token | static bearer secret for the instance-admin /operator/* surface (regions + agents). Env-only (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); empty disables the surface (404s)
observability | log_level, log_format | tracing-subscriber filter + JSON vs pretty output
observability | metrics_enabled, gauge_sample_interval_ms | Prometheus exporter toggle and sampler cadence
observability | tracing_enabled | Master on/off for OTLP trace export. Export is active only when this and observability.grafana.enabled are true
observability.grafana | enabled, otlp_endpoint, instance_id, api_key, trace_sample_ratio | OTLP/HTTP trace export to Grafana Cloud / any OTLP collector. api_key is env-only. See Trace export below
api.rate_limit | enabled, per_second, burst | per-IP token-bucket rate limiter on /api/v1/*. Disabled by default
api.cors | enabled, allowed_origins, allowed_methods, allow_any_origin | browser CORS for /api/v1/*. Disabled by default. Wildcard only via allow_any_origin = true
notification channels | (none) | Not a config block. Channels are per-org runtime resources managed via the /api/v1/notification-channels API; secrets are sealed at rest with the credentials KEK
tenancy | path_based_public_routes, subdomain_public_routes, free_tier_owner_org_limit, deletion_grace_period_days | Public-status routing shape + org limits. See Public status routing below and docs/multi-tenancy.md for the full model
retention | check_results_days, login_attempts_days, quota_events_days, audit_log_days | Long-horizon data-retention windows for the daily 03:00-UTC purge job. Every key is bound by the job; no decorative knobs
public_status | base_domain, cache_max_orgs, cache_ttl_secs, last_good_ttl_secs, logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px, default_brand_color, default_show_powered_by, public_per_ip_rate_limit_per_min | Per-org public status pages at {slug}.{base_domain}. See Public status page below and Per-org status pages
auth | enabled_methods, fingerprint_salt, public_base_url | Sign-in methods, HMAC salt for IP/UA hashes, base URL embedded in invitation + magic-link emails. See Auth configuration below
auth.session | idle_timeout_days, absolute_timeout_days, cookie_name, cookie_secure, cookie_domain, renew_on_use | Session cookie shape + lifetime. cookie_secure = true in production
auth.github | client_id, client_secret, redirect_url, scopes | GitHub OAuth client. The button renders on /login only when client_id, client_secret, and redirect_url are all set
auth.google | client_id, client_secret, redirect_url, scopes | Google OAuth client, same gating as auth.github. Email is trusted only with Google’s email_verified attestation
auth.api_tokens | max_per_user, prefix_visible_chars | Cap per user, indexed prefix length for token lookup
auth.invitations | expiry_hours, max_pending_per_org | Invitation lifetime and per-org pending cap
auth.magic_link | expiry_minutes, rate_limit_seconds | Magic-link token lifetime. Routes only mount when enabled_methods includes "magic_link"
mcp | enabled, oauth_enabled, resource_uri, allowed_origins, access_token_ttl_secs | LLM connector (MCP) server at /mcp. Off by default; OAuth requires real HTTPS resource_uri + auth.public_base_url. See MCP server
email | provider, from_name, from_address | Transactional email backend. provider is one of "resend", "log", "memory"
email.resend | api_key, webhook_secret | api_key required when email.provider = "resend". A set webhook_secret (the endpoint’s Svix whsec_… signing secret) mounts POST /hooks/resend: a permanently bounced or spam-complaining address gets every email channel pointed at it disabled, with the reason shown on the channel form
whatsapp_app | enabled, access_token, phone_number_id, public_number, app_secret, verify_token, template_name, language_code | Operator WhatsApp number behind one-tap whatsapp_app channels (wa.me deep link + /hooks/whatsapp Meta webhook). enabled = true AND complete creds mount the surface; the flag is a deliberate spend gate, since alert sends are operator-paid Meta template messages. Inbound stop disables the sender’s channels

Public status routing

uptimepage ships from one binary as a multi-tenant SaaS. The active org is always resolved from the authenticated session; there is no ambient “default org” and no compile-time self-host mode. A single-tenant deployment is just a SaaS deployment where you sign up as the first user (or seed users + organizations + memberships via a SQL one-shot).

The public status surface is gated by two independent flags because path-based and subdomain routing have opposite safety profiles:

  • tenancy.path_based_public_routes — serve /status and /api/public/v1/* on the operator host, scoped to the single live org. Useful for a single-tenant deploy (one org, one page). Defaults to true. Must be set to false once you have more than one tenant — otherwise every visitor sees the lone org’s data regardless of which slug they expected.
  • tenancy.subdomain_public_routes — serve one page per org at {slug}.{public_status.base_domain} (apex wildcard). Defaults to false; requires a well-formed base_domain.

Shape | Recommended flags | Public surface
--- | --- | ---
Single-tenant | path_based_public_routes = true (default) | /status on the operator host (one org)
Multi-tenant SaaS | subdomain_public_routes = true, path_based_public_routes = false | {slug}.{base_domain} per org

The binary refuses to boot in the dangerous combinations: subdomain_public_routes with an empty or single-label public_status.base_domain; or an auth.session.cookie_domain that overlaps the status wildcard. Each is a loud panic at startup, not a silent runtime leak. See Per-org status pages for the full model.
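
The boot-time refusals can be sketched as a validation function that returns an error instead of panicking. This is an illustration under assumptions: validate_public_routing is a hypothetical name, the cookie check is applied only when subdomain routing is on, and "overlaps" is read as one domain being equal to, or a subdomain of, the other.

```rust
/// True for domains with at least two non-empty labels, e.g. "example.test".
fn is_multi_label(domain: &str) -> bool {
    let labels: Vec<&str> = domain.split('.').collect();
    labels.len() >= 2 && labels.iter().all(|l| !l.is_empty())
}

/// True when one domain equals the other or is a subdomain of it.
fn overlaps(a: &str, b: &str) -> bool {
    let a = a.trim_start_matches('.').to_ascii_lowercase();
    let b = b.trim_start_matches('.').to_ascii_lowercase();
    a == b
        || a.ends_with(format!(".{b}").as_str())
        || b.ends_with(format!(".{a}").as_str())
}

/// Hypothetical startup check for the public status routing flags.
fn validate_public_routing(
    subdomain_public_routes: bool,
    base_domain: &str,
    cookie_domain: &str,
) -> Result<(), String> {
    if !subdomain_public_routes {
        return Ok(());
    }
    if !is_multi_label(base_domain) {
        return Err("public_status.base_domain must be a multi-label domain".into());
    }
    // An empty cookie_domain means a host-only cookie, which cannot leak.
    if !cookie_domain.is_empty() && overlaps(cookie_domain, base_domain) {
        return Err("auth.session.cookie_domain overlaps the status wildcard".into());
    }
    Ok(())
}
```

The overlap check matters because a session cookie scoped to a parent of the status wildcard would be sent to every tenant's public page.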

Org limits and the purge worker

  • free_tier_owner_org_limit (default 3) caps how many orgs a single user can own. Soft-deleted orgs don’t count. Enforced inside the membership INSERT so concurrent creates can’t exceed the cap.
  • deletion_grace_period_days (default 30) is how long a soft-deleted org’s slug is held and how long the original deleter has to restore it.
  • The soft-delete purge now runs inside the daily retention job (src/jobs/retention.rs) at a fixed 03:00 UTC, not on a configurable interval. Each run cascades up to 10 past-grace orgs, drains any pending entries from clickhouse_purge_queue (the outbox between PG cascade and ClickHouse ALTER TABLE DELETE), hard-purges past-grace users, then enforces the [retention] windows. See Soft delete and the 30-day purge for the full implementation and failure-recovery guarantees.

The [retention] section sets the long-horizon windows. Defaults: login_attempts_days = 180, quota_events_days = 90, audit_log_days = 730. Check-result retention is not a config knob — the physical TTLs are baked into the ClickHouse tables at migration time (a value here would be silently ignored, since the TTL is never re-issued as an ALTER on boot): raw per-check rows in check_results keep 90 days, and the hourly rollup check_results_1h keeps 13 months. Those are the widest-tier ceilings; what a given plan actually sees is narrowed at read time by a per-plan window clamp (separate windows for raw forensics and chart history), so a plan change is an instant tag flip with no data rewrite. The public status page’s daily history strip still shows 90 days, and the Privacy Policy’s retention table pins these same physical windows. Session idle/absolute reaping uses [auth.session]; soft-deleted org/user grace uses tenancy.deletion_grace_period_days; OAuth-state and magic-link tokens are swept by their own short-cadence jobs.
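
The read-time clamp amounts to taking the smallest of three windows. The sketch below is illustrative only: effective_window_days is a hypothetical name, and the per-plan window values passed in are made up, since the plan tiers are not listed here.

```rust
/// Physical TTL of raw rows in check_results, as stated above.
const RAW_PHYSICAL_DAYS: u32 = 90;

/// Days of history a query may actually read: the request, narrowed by
/// the plan's window, narrowed again by what physically exists.
fn effective_window_days(requested: u32, plan_window: u32, physical: u32) -> u32 {
    requested.min(plan_window).min(physical)
}
```

Because the clamp is applied per query, moving an org to a plan with a wider window changes what it sees immediately, up to the physical ceiling, without rewriting any stored rows.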

See Multi-tenancy for the full model, slug rules, and the storage-layer isolation invariants the CI checks enforce.

Auth configuration

[auth]
enabled_methods = ["github_oauth", "google_oauth", "magic_link"]
fingerprint_salt = ""                # HMAC salt for IP/UA hashes; rotate-aware
public_base_url = "https://status.example.test"

[auth.session]
idle_timeout_days = 30
absolute_timeout_days = 90
cookie_name = "_sm_session"
cookie_secure = true                 # set false only for plain-HTTP local dev
cookie_domain = ""                   # empty = host-only cookie
renew_on_use = true

[auth.github]
client_id = ""                       # from https://github.com/settings/developers
client_secret = ""
redirect_url = "https://status.example.test/auth/github/callback"
scopes = ["user:email", "read:user"]

[auth.google]
client_id = ""                       # Google Cloud Console OAuth web client
client_secret = ""
redirect_url = "https://status.example.test/auth/google/callback"
scopes = ["openid", "email", "profile"]

[auth.invitations]
expiry_hours = 168                   # 7 days
max_pending_per_org = 50

[auth.api_tokens]
max_per_user = 25
prefix_visible_chars = 16            # floor; lower values fail boot

[auth.magic_link]
expiry_minutes = 15
rate_limit_seconds = 60                # per-email send throttle; 0 disables

[email]
provider = "log"                     # "resend" in prod, "log" in dev, "memory" in tests
from_name = "Uptimepage"
from_address = "no-reply@example.test"

[email.resend]
api_key = ""                         # required when provider = "resend"
webhook_secret = ""                  # whsec_… of the Resend webhook endpoint

[whatsapp_app]                       # operator WhatsApp number (one-tap linking)
enabled = false                      # deliberate spend gate — creds alone stay off
access_token = ""                    # Meta Cloud API token (env-only)
phone_number_id = ""                 # Cloud API sender id
public_number = ""                   # display number digits — the wa.me target
app_secret = ""                      # signs webhook deliveries (env-only)
verify_token = ""                    # echoed by Meta's GET subscribe handshake
template_name = ""                   # approved alert template, single body param
language_code = "en"

auth.enabled_methods is the policy switch per sign-in method: removing an entry disables that method’s login start/callback (404) and hides its button. OAuth providers additionally need client_id + client_secret + redirect_url set — a listed but incompletely configured provider stays hidden and logs a warning on probe. "magic_link" mounts the magic-link request/verify endpoints and the login-page email form.

auth.fingerprint_salt is paired with the auth_salt_history table. If the value is rotated mid-deployment, the app refuses to boot unless the override env var documented in docs/troubleshooting.md is set. This is deliberate, so that audit-trail breakage is loud.

Central Telegram bot

[telegram]
bot_token = ""            # env UPTIMEPAGE_TELEGRAM__BOT_TOKEN; presence enables the feature
bot_username = ""         # verified against the Bot API at boot; used for t.me deep links
webhook_secret = ""       # random, 32+ chars; Telegram echoes it on every webhook delivery

Setting bot_token switches on one-tap Telegram channel linking: the type card in the channel form, the link-code API, and the /hooks/telegram receiver. Empty token (the default) leaves the feature absent entirely — self-host deployments keep the bring-your-own telegram transport, which needs no operator config.

When enabled, boot validates the trio: non-empty bot_username, webhook_secret of 32+ characters, and an https:// auth.public_base_url (Telegram only delivers webhooks to public https endpoints). The app then verifies the token against the Bot API and registers the webhook on every boot; a Telegram outage logs a warning and disables the bot for that boot instead of failing the deploy.
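
The static part of that boot check can be sketched as below. validate_telegram is a hypothetical name; the Bot API token verification and webhook registration need network access and are not modelled.

```rust
/// Returns Ok(false) when the feature is absent (empty token),
/// Ok(true) when the trio is valid, and Err with a reason otherwise.
fn validate_telegram(
    bot_token: &str,
    bot_username: &str,
    webhook_secret: &str,
    public_base_url: &str,
) -> Result<bool, String> {
    if bot_token.is_empty() {
        return Ok(false); // default: feature absent entirely
    }
    if bot_username.is_empty() {
        return Err("telegram.bot_username must be set".into());
    }
    if webhook_secret.chars().count() < 32 {
        return Err("telegram.webhook_secret must be 32+ characters".into());
    }
    if !public_base_url.starts_with("https://") {
        return Err("auth.public_base_url must be https://".into());
    }
    Ok(true)
}
```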

All three values are operator secrets: env-only in production, never in a committed config file.

Provider OAuth connect (“Add to Slack” / “Add to Discord”)

[slack_oauth]
client_id = ""            # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_ID
client_secret = ""        # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_SECRET

[discord_oauth]
client_id = ""            # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_ID
client_secret = ""        # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_SECRET

Credentials of operator-owned OAuth apps — Slack with the incoming-webhook scope, Discord with webhook.incoming. When a pair is set, that provider’s panel in the channel form grows a connect button (plus a QR variant): the provider’s consent screen picks the destination channel and the callback stores the returned webhook as a regular slack/discord channel — access tokens are discarded. The app’s redirect URL must be <auth.public_base_url>/auth/slack/callback (or …/auth/discord/callback). Empty credentials (the default) hide the button; manual webhook paste always works. Env-only in production, never in a committed config file.

Public status page

The [public_status] block configures the per-org public surface. It is load-bearing only when tenancy.subdomain_public_routes = true; the defaults are safe to leave untouched for self-host.

[public_status]
base_domain = ""                       # REQUIRED when subdomain_public_routes = true
cache_max_orgs = 1000                  # hot + last-good cache bound
cache_ttl_secs = 10                    # per-org rendered-page TTL
last_good_ttl_secs = 3600              # idle eviction for the stale-fallback layer
logo_dir = "/var/lib/uptimepage/logos"
max_logo_size_bytes = 1048576          # 1 MiB byte ceiling (pre-decode)
allowed_logo_mime_types = ["image/png", "image/jpeg", "image/webp"]
max_logo_dimension_px = 1200           # larger uploads are downscaled; decode
                                       # is also allocation-bounded (bomb guard)
default_brand_color = "#3b82f6"        # used when an org sets no colour
default_show_powered_by = true
public_per_ip_rate_limit_per_min = 60  # in-app limit behind the Caddy-side one
| Key | Purpose |
| --- | --- |
| base_domain | parent domain for {slug}.{base_domain}. Must be multi-label; boot fails on empty/single-label when subdomain routing is on |
| cache_max_orgs / cache_ttl_secs | per-org page cache size and freshness window |
| last_good_ttl_secs | how long an idle org’s last-known-good snapshot is retained before eviction |
| logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px | logo upload storage and limits |
| default_brand_color, default_show_powered_by | fallbacks when an org leaves branding unset |
| public_per_ip_rate_limit_per_min | second-layer rate limit behind the reverse proxy’s |
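The multi-label rule for base_domain can be sketched as follows. The function names are hypothetical, and "multi-label" is assumed here to mean at least two non-empty dot-separated labels; the service may apply stricter hostname rules.

```rust
/// True when the value has at least two non-empty dot-separated labels.
pub fn is_multi_label(base_domain: &str) -> bool {
    let labels: Vec<&str> = base_domain.split('.').collect();
    labels.len() >= 2 && labels.iter().all(|label| !label.is_empty())
}

/// base_domain is only required when subdomain routing is switched on.
pub fn validate_base_domain(subdomain_public_routes: bool, base_domain: &str) -> Result<(), String> {
    if subdomain_public_routes && !is_multi_label(base_domain) {
        return Err("public_status.base_domain must be a multi-label domain".to_string());
    }
    Ok(())
}

/// Builds the per-org public host as {slug}.{base_domain}.
pub fn public_host(slug: &str, base_domain: &str) -> String {
    format!("{slug}.{base_domain}")
}
```

With subdomain routing off, an empty base_domain passes, which matches the note that the defaults are safe to leave untouched for self-host.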

History-strip length (90 days) and the recent-incidents horizon (30 days) remain hard-coded defaults in src/public_status/aggregator.rs. What a page publishes is curated per-page — a monitor appears as a component only while it’s bound to that page, and its presentation lives on the binding:

| Per-page component field | Purpose |
| --- | --- |
| (binding exists) | the monitor is published as a component on that page |
| public_name | display name (falls back to operator-side monitor name) |
| public_description | optional one-liner |
| public_group | optional group label; ungrouped components render last |
| sort_order | ASC integer sort within a group |

See Public status page for the operator workflow and Per-org status pages for the SaaS subdomain model.

Trace export

OpenTelemetry spans are exported over OTLP/HTTP (protobuf) when both observability.tracing_enabled and observability.grafana.enabled are true. Disabled by default and zero-cost when off.

[observability]
tracing_enabled = false                # master on/off for trace export

[observability.grafana]
enabled = false                        # second switch; both must be true
otlp_endpoint = ""                     # OTLP base, no /v1/traces suffix; e.g.
                                       # https://otlp-gateway-<zone>.grafana.net/otlp
instance_id = ""                       # Grafana Cloud numeric instance / stack id
trace_sample_ratio = 0.05              # parent-based head sampling, [0.0, 1.0]
# api_key                              # NEVER in TOML — env var only (below)
| Key | Purpose |
| --- | --- |
| tracing_enabled | master switch; with grafana.enabled gates all export |
| grafana.enabled | second switch (kept separate so the block is inert until explicitly turned on) |
| grafana.otlp_endpoint | OTLP/HTTP base URL; the service appends /v1/traces (a value already ending in it is left as-is). Empty fails boot when export is on |
| grafana.instance_id | basic-auth username (Grafana Cloud instance id). Empty fails boot when export is on |
| grafana.api_key | basic-auth password. Env-only: UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY. Never read from a config file; redacted in any serialised config |
| grafana.trace_sample_ratio | head sampling ratio under a parent-based sampler. Must be in [0.0, 1.0] or boot fails |

Auth is Authorization: Basic base64(instance_id:api_key). Resource attributes service.name = uptimepage and service.version are attached. The batch exporter is flushed and stopped on graceful shutdown. A transport build failure logs a warning and the service continues without traces — telemetry never takes down monitoring. Inconsistent settings (export on with a missing endpoint / instance / key, or an out-of-range ratio) are a clean startup config error.
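The gating and endpoint rules above can be sketched like this. TraceSettings, traces_url, and export_target are hypothetical names; trimming a trailing slash and checking the sample ratio even when export is off are assumptions made for the sketch.

```rust
/// Hypothetical mirror of the [observability] keys described above.
pub struct TraceSettings {
    pub tracing_enabled: bool,
    pub grafana_enabled: bool,
    pub otlp_endpoint: String,
    pub instance_id: String,
    /// Populated from the environment only, never from a file.
    pub api_key: Option<String>,
    pub trace_sample_ratio: f64,
}

/// Appends /v1/traces to the OTLP base unless the value already ends in it.
pub fn traces_url(base: &str) -> String {
    let trimmed = base.trim_end_matches('/'); // assumption: tolerate a trailing slash
    if trimmed.ends_with("/v1/traces") {
        trimmed.to_string()
    } else {
        format!("{trimmed}/v1/traces")
    }
}

/// Ok(None) when export is off, Ok(Some(url)) when on and consistent,
/// Err(..) for the startup config errors described in the text.
pub fn export_target(s: &TraceSettings) -> Result<Option<String>, String> {
    // A NaN ratio also fails this range check.
    if !(0.0..=1.0).contains(&s.trace_sample_ratio) {
        return Err("grafana.trace_sample_ratio must be in [0.0, 1.0]".to_string());
    }
    // Both switches must be true before anything is exported.
    if !(s.tracing_enabled && s.grafana_enabled) {
        return Ok(None);
    }
    if s.otlp_endpoint.is_empty() {
        return Err("grafana.otlp_endpoint is empty".to_string());
    }
    if s.instance_id.is_empty() {
        return Err("grafana.instance_id is empty".to_string());
    }
    if s.api_key.as_deref().map_or(true, str::is_empty) {
        return Err("grafana.api_key is missing from the environment".to_string());
    }
    Ok(Some(traces_url(&s.otlp_endpoint)))
}
```

Keeping the "off" case as Ok(None) rather than an error reflects the zero-cost-when-off behaviour: nothing is built unless both switches are set.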

Tuning notes

  • max_concurrent_checks caps simultaneous in-flight checks. Per-check memory is small (a tokio task plus an in-flight hyper request), so the practical ceiling is set by file descriptors and ephemeral ports rather than RAM.
  • per_host_max_inflight (default 2) is the per-tenant per-(host, port) in-flight cap. One tenant fanning a burst of checks at the same upstream looks like a probe; this cap keeps that fingerprint flat. Tenant-scoped — one customer’s burst never starves another customer’s monitor of the same host. Fail-fast: a check that would exceed the cap is recorded as degraded with error="throttled: host concurrency cap" and skipped (no alert fired — the upstream is fine, the back-pressure is operator-side). Counters: uptimepage_host_throttle_waits_total{kind="host"} (attempts) and uptimepage_host_throttle_drops_total (rejections).
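  • rdap_max_inflight (default 1) is the process-wide per-TLD RDAP concurrency cap (across all tenants). Daily check cadence plus a per-TLD slot means deep queues drain quickly without bursting any registry. Fail-fast like the per-host cap: attempts count under uptimepage_host_throttle_waits_total{kind="rdap"}, but a throttled RDAP probe falls through to the sticky last-good path instead of incrementing uptimepage_host_throttle_drops_total.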
  • storage.clickhouse.buffer_size is the mpsc capacity between worker pool and batcher. Sized for ~1 s of bursts at peak RPS. Drops increment uptimepage_storage_dropped_results_total{reason="queue_full"}; that metric is your back-pressure signal.
  • storage.clickhouse.batch_size vs batch_timeout_ms trade tail latency for throughput. 1000 / 500ms is a good starting point at ~20k rps.
  • scheduler.jitter_pct prevents synchronized fleet-wide ticks. Default 10% is enough to spread N targets across an interval without making individual schedules unpredictable.
  • dns.servers accepts either bare IPs ("1.1.1.1") or ip:port form. Used as is — no system resolver fallback.
  • security.allow_private_targets is the SSRF guard. Default false blocks:
    • Loopback (127.0.0.0/8, ::1)
    • RFC1918 private (10/8, 172.16/12, 192.168/16)
    • Link-local (169.254/16, fe80::/10) — covers AWS/GCP metadata 169.254.169.254
    • Carrier-grade NAT (100.64/10)
    • IPv6 ULA (fc00::/7), discard, IPv4-mapped private, documentation ranges
    • Multicast, broadcast, unspecified, reserved-for-future-use
    • IPv6 transition mechanisms: 2002::/16 (6to4) and 64:ff9b::/96 (NAT64) are decoded to their embedded IPv4 and rejected when the inner IPv4 falls in any blocked range

    The guard runs both at API submission (rejects IP-literal URLs synchronously) and after DNS resolution at connect time (catches DNS rebinding). Flip to true for internal monitoring where private targets are the goal; operators are then responsible for network segmentation.
  • security.credentials_kek_base64 enables AES-256-GCM encryption of HTTP basic_auth and bearer_token values inside the targets.check_spec JSONB column. Generate with openssl rand -base64 32. Each write produces a fresh 12-byte random nonce; the on-disk shape is {"$enc":"v1:<nonce>:<ciphertext>"}. When the key is unset the service logs a startup warning and stores credentials plaintext (dev-friendly upgrade path — existing plaintext rows continue to read after a key is provisioned). Rotation and KMS integration are out of scope for the current version; treat the KEK as long-lived and protect it via your secret-management of choice (env file with restricted mode, container secret, etc.). A malformed KEK fails the process at startup.
  • api.rate_limit applies a per-peer-IP token bucket only to /api/v1/* routes (/healthz and /readyz are excluded so liveness probes never see 429). per_second is the refill rate; burst is the bucket capacity. Excess requests get 429 Too Many Requests with a Retry-After header. The bucket key is the TCP peer IP — when the service sits behind a reverse proxy, every client appears as the proxy IP, so prefer doing rate limiting at the proxy in that topology. Disabled by default; leave it off and let your reverse proxy enforce limits unless you bind the API directly to the internet.
  • TLS cert checks (type = "tls_cert") open a dedicated TCP+TLS handshake per probe — separate from the HTTP check path. Recommended interval >= 3600 so probe traffic stays light. The check accepts any cert chain (the goal is to report expiry status, not enforce trust), so an expired or self-signed cert still produces a structured result rather than a generic handshake error.
  • Domain expiry checks (type = "domain_expiry") query RDAP via a process-shared outbound HTTPS client. The IANA bootstrap registry (https://data.iana.org/rdap/dns.json) is fetched lazily on first use and cached for process lifetime — a registry update or a transient bootstrap failure persists until restart. RDAP servers rate-limit clients, so interval >= 3600 is enforced server-side and daily is typical. SSRF guard does not gate these requests because the destination is an IANA-published endpoint, not the user-supplied domain.
    • Sticky last-good. Each successful probe persists (expiry_at, registrar, last_success_at) to the domain_expiry_state table (PK target_id, denormalised org_id; every query filters on both). On a transient probe failure — throttle, timeout, registry 5xx, RDAP 404, network blip — the executor returns the cached verdict instead of flipping the monitor to Degraded/Down. For Up the customer-facing error field stays empty; Degraded/Down carry a served_stale: … annotation so operators can distinguish a stale serve from a fresh probe. Operators also see the staleness via the uptimepage_domain_expiry_stale_served_total counter.
    • Staleness ceiling: 7 days. A cached row older than 7d is treated as “registry unreachable for too long” and surfaces as a real Error, which is alert-eligible.
    • Cross-tenant singleflight. Concurrent probes for the same domain coalesce to one outbound RDAP request. Cache TTL on the singleflight slot is 60s — short enough that each scheduled cycle still fetches fresh, long enough to absorb scheduler-jitter waves at scale. Counter: uptimepage_rdap_singleflight_total{outcome="hit"|"miss"}.
  • Notification channels are no longer global config. They are per-org runtime resources (Slack / Discord / Teams / Google Chat webhooks, generic HTTP webhook, Telegram bot, WhatsApp Cloud API) created via the /api/v1/notification-channels API; a target binds them by id in its alerts array. Transport secrets are sealed at rest with the credentials KEK and never echoed back. Slack POSTs { "text": "..." }; the generic webhook POSTs the incident-notice JSON (plus any configured custom headers, optionally HMAC-signed — see docs/api.md). Notifications are driven by the incident engine and persisted per attempt, so delivery state survives a restart. The binding syntax and the monitor-level firing policy (confirmations, recovery, reminders, region quorum) are documented in docs/api.md.
  • api.cors opens /api/v1/* to browser-origin access. Each entry in allowed_origins must be a full origin (https://app.example.com) — wildcards are not parsed; set allow_any_origin = true to send Access-Control-Allow-Origin: * explicitly. The two are mutually exclusive — combining them or enabling CORS with an empty list aborts startup. allowed_methods is echoed in the preflight response (Access-Control-Allow-Methods); Access-Control-Allow-Headers is fixed to content-type, which is what the JSON API needs. /healthz and /readyz are not wrapped, so liveness probes are unaffected.
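The range checks behind security.allow_private_targets = false can be sketched with the standard library alone. This is an illustration under stated assumptions, not the service's guard: is_blocked and its helpers are hypothetical names, and the ranges covered are the ones listed in the SSRF bullet above (the service's full list may be longer).

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// True when the address falls in one of the blocked IPv4 ranges.
fn blocked_v4(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    ip.is_loopback()
        || ip.is_private()
        || ip.is_link_local() // 169.254/16, includes 169.254.169.254
        || ip.is_multicast()
        || ip.is_broadcast()
        || ip.is_unspecified()
        || ip.is_documentation()
        || (o[0] == 100 && (o[1] & 0xC0) == 64) // 100.64/10 carrier-grade NAT
        || o[0] >= 240 // 240/4 reserved for future use
}

/// Rebuilds an IPv4 address from two 16-bit IPv6 segments.
fn embedded_v4(hi: u16, lo: u16) -> Ipv4Addr {
    Ipv4Addr::new((hi >> 8) as u8, hi as u8, (lo >> 8) as u8, lo as u8)
}

fn blocked_v6(ip: Ipv6Addr) -> bool {
    let s = ip.segments();
    // IPv4-mapped (::ffff:a.b.c.d): judge the inner IPv4.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return blocked_v4(v4);
    }
    // 6to4 (2002::/16): the IPv4 sits in segments 1 and 2.
    if s[0] == 0x2002 {
        return blocked_v4(embedded_v4(s[1], s[2]));
    }
    // NAT64 (64:ff9b::/96): the IPv4 sits in the last 32 bits.
    if s[0] == 0x0064 && s[1] == 0xff9b && s[2..6].iter().all(|&x| x == 0) {
        return blocked_v4(embedded_v4(s[6], s[7]));
    }
    ip.is_loopback()
        || ip.is_unspecified()
        || ip.is_multicast()
        || (s[0] & 0xfe00) == 0xfc00 // fc00::/7 unique local
        || (s[0] & 0xffc0) == 0xfe80 // fe80::/10 link-local
        || (s[0] == 0x2001 && s[1] == 0x0db8) // 2001:db8::/32 documentation
        || (s[0] == 0x0100 && s[1..4].iter().all(|&x| x == 0)) // 100::/64 discard
}

/// Entry point: call on IP-literal URLs at submission time and again on
/// every resolved address at connect time.
pub fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => blocked_v4(v4),
        IpAddr::V6(v6) => blocked_v6(v6),
    }
}
```

Running the same predicate after DNS resolution is what closes the rebinding gap: a hostname that passed at submission is re-checked against the address actually being dialled.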

Quotas & rate limits

Every organization is bound to a plan. The plan is the single source of truth for resource quotas and per-minute rate budgets — the number a request is enforced at is the same number the API reports back. Adding a paid tier later is one row in the plans table plus a UI page; nothing in the enforcement path changes.

The free plan

Shipped and seeded on first migration. Generous for a small team, bounded enough to keep abuse on a small VM cheap.

| Quota | Free | Meaning |
| --- | --- | --- |
| max_targets | 10 | Monitored targets in the org |
| min_check_interval_secs | 60 | Plan-side floor on a target’s check interval. The effective floor is max(this, kind_min); kind_min is 3600 for tls_cert / domain_expiry and 10 for http / tcp / dns. |
| retention_days | 90 | Informational: actual check-result retention is the flat ClickHouse table TTL (90d for every org), not this column |
| max_members | 5 | Active members in the org |
| max_pending_invitations | 10 | Outstanding (unaccepted) invitations |
| max_api_tokens_per_user | 5 | API tokens a single user may hold |
| max_status_pages | 1 | Public status pages the org can run |
| max_public_components | 10 | Distinct monitors published across all of the org’s pages (a monitor on several pages counts once) |
| max_maintenance_windows | 20 | Scheduled maintenance windows |
| max_notification_channels | 20 | Notification channels (Slack/webhook/Telegram/WhatsApp/SMS/…) in the org |
| max_logo_size_bytes | 1048576 | Status-page logo upload ceiling (1 MiB) |

| Rate budget (per minute) | Free | Category |
| --- | --- | --- |
| api_writes_per_minute | 600 | POST/PATCH/DELETE on /api/v1/* |
| api_reads_per_minute | 6000 | GET/HEAD/OPTIONS on /api/v1/* |
| bulk_ops_per_minute | 30 | /api/v1/targets/bulk* |
| test_now_per_minute | 60 | POST /api/v1/targets/test + the notification-channel test endpoints |
| check_now_per_minute | 60 | POST /api/v1/targets/{id}/check-now |

How quotas are enforced

A resource quota is checked atomically at the write, not by a check-then-act in the handler. The friendly handler-side pre-check exists only to produce a clean error on the common, uncontended path; the race-safe guarantee is in the store:

  • Targets — the count bound is inside the INSERT (single and bulk), handed the same max_targets. Concurrent creates at limit - 1 settle at exactly limit, never more.
  • Members — the membership insert runs under a per-org advisory lock, counts, and rolls itself back if it crossed max_members. Re-adding an existing member stays a no-op.
  • Pending invitations — dedupe and the pending cap are enforced in one transaction under the same per-org lock; parallel duplicate-email invites yield exactly one row.
  • Public components — flipping a target public is gated on create, bulk, and PATCH (so “create private, then edit public” cannot bypass the cap).
  • API tokens — count-in-INSERT, scoped per user, handed max_api_tokens_per_user.

Exceeding a resource quota returns 422:

{
  "error": {
    "code": "QUOTA_EXCEEDED",
    "message": "max_targets limit reached: 10 of 10 used on the free plan.",
    "field": null,
    "details": { "quota": "max_targets", "current": 10, "limit": 10, "plan": "free" },
    "trace_id": null
  }
}

The pending-invitation cap is the one exception to this error code: it predates the unified envelope and returns 409 INVITATIONS_LIMIT instead of 422 QUOTA_EXCEEDED. The cap itself is enforced identically (atomic, never overshoot).

A sub-minimum check interval is its own 422, MIN_CHECK_INTERVAL, enforced on create and PATCH, single and bulk — a target created at the floor cannot be edited below it. The floor is max(plan.min_check_interval_secs, kind_min): the per-kind value (3600 for tls_cert / domain_expiry, 10 for the rest) applies regardless of plan tier — polling an expiry probe faster than once an hour yields no signal.
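The floor arithmetic is small enough to show directly. A minimal sketch, assuming a hypothetical CheckKind enum and function names; the per-kind values are the ones stated above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CheckKind {
    Http,
    Tcp,
    Dns,
    TlsCert,
    DomainExpiry,
}

/// Per-kind minimum interval in seconds, independent of plan tier.
pub fn kind_min_secs(kind: CheckKind) -> u64 {
    match kind {
        CheckKind::TlsCert | CheckKind::DomainExpiry => 3600,
        CheckKind::Http | CheckKind::Tcp | CheckKind::Dns => 10,
    }
}

/// Effective floor: max(plan.min_check_interval_secs, kind_min).
pub fn effective_floor_secs(plan_min_secs: u64, kind: CheckKind) -> u64 {
    plan_min_secs.max(kind_min_secs(kind))
}

/// Accepts the requested interval or rejects it with the error code from the text.
pub fn check_interval(requested_secs: u64, plan_min_secs: u64, kind: CheckKind) -> Result<u64, &'static str> {
    if requested_secs < effective_floor_secs(plan_min_secs, kind) {
        Err("MIN_CHECK_INTERVAL")
    } else {
        Ok(requested_secs)
    }
}
```

On the free plan (plan floor 60) an http target may run at 60 s, while a tls_cert target is held at 3600 s whatever the plan says.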

Rate limiting

Two app-side tiers, both keyed on the authenticated subject (never the TCP peer): (org, category) and (user, category). Both are checked; the org tier fires first because it protects shared resources. The per-minute budget comes from the org’s plan. The request category is derived from the path and method:

  • path contains /bulk → bulk_ops
  • path ends /test → test_now
  • path ends /check-now → check_now
  • otherwise GET/HEAD/OPTIONS → api_reads, else → api_writes
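The derivation rules above map onto a short function. This sketch assumes the rules are evaluated in the order listed (so a bulk path wins over the method fallback); the Category enum and categorize name are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Category {
    BulkOps,
    TestNow,
    CheckNow,
    ApiReads,
    ApiWrites,
}

/// Derives the rate-limit category from the request method and path.
pub fn categorize(method: &str, path: &str) -> Category {
    if path.contains("/bulk") {
        Category::BulkOps
    } else if path.ends_with("/test") {
        Category::TestNow
    } else if path.ends_with("/check-now") {
        Category::CheckNow
    } else if matches!(method.to_ascii_uppercase().as_str(), "GET" | "HEAD" | "OPTIONS") {
        Category::ApiReads
    } else {
        Category::ApiWrites
    }
}
```

For example, categorize("POST", "/api/v1/targets/bulk-action") lands in BulkOps even though the method alone would have put it in ApiWrites.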

Exceeding a budget returns 429 with a Retry-After header:

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Too many requests.",
    "field": null,
    "details": { "scope": "per_org_api_writes", "retry_after_secs": 30 },
    "trace_id": null
  }
}

The limiter is a governor cell per (scope, category) key in a DashMap. A janitor evicts entries idle past the threshold so the map stays bounded by the number of active tenants, not by request volume; its lifetime is bound to the limiter so a refactor cannot silently drop the sweep and leak the map. Unauthenticated requests fall through untouched — per-IP limiting for those (auth endpoints, org creation, the public status surface) is the reverse proxy’s job; see Deployment.
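The shape of that structure (one cell per key, plus an idle sweep) can be illustrated with the standard library. This is a deliberately simplified stand-in: it swaps in a plain token bucket in a single-threaded HashMap where the service uses governor cells in a DashMap, and the Limiter type, its methods, and the string key format are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// One cell per "(scope, category)" key.
struct Bucket {
    tokens: f64,
    last_refill: Instant,
    last_seen: Instant,
}

pub struct Limiter {
    per_minute: f64,
    buckets: HashMap<String, Bucket>,
}

impl Limiter {
    pub fn new(per_minute: u32) -> Self {
        Limiter { per_minute: f64::from(per_minute), buckets: HashMap::new() }
    }

    /// Ok(()) when the request fits the budget, Err(retry_after_secs) otherwise.
    pub fn check(&mut self, key: &str, now: Instant) -> Result<(), u64> {
        let cap = self.per_minute;
        let bucket = self.buckets.entry(key.to_string()).or_insert(Bucket {
            tokens: cap,
            last_refill: now,
            last_seen: now,
        });
        // Refill continuously at per_minute / 60 tokens per second, capped at the budget.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * cap / 60.0).min(cap);
        bucket.last_refill = now;
        bucket.last_seen = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            Ok(())
        } else {
            // Seconds until one full token is available again.
            Err(((1.0 - bucket.tokens) * 60.0 / cap).ceil() as u64)
        }
    }

    /// Janitor pass: evicts keys idle for at least `idle`; returns how many were removed.
    pub fn sweep(&mut self, now: Instant, idle: Duration) -> usize {
        let before = self.buckets.len();
        self.buckets.retain(|_, b| now.saturating_duration_since(b.last_seen) < idle);
        before - self.buckets.len()
    }

    pub fn len(&self) -> usize {
        self.buckets.len()
    }
}
```

Because eviction is driven by last_seen rather than request count, the map's size tracks the number of recently active keys, which is the bound the paragraph above describes.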

Checks themselves are not rate-limited — the scheduler path never enters this middleware, so monitoring throughput is unaffected.

Every quota / rate-limit / abuse rejection is recorded to the append-only quota_events table (event, quota_name, details, hashed IP) as fire-and-forget — it never blocks the response. It is the data source for abuse review.

Usage transparency

| Endpoint | Returns |
| --- | --- |
| GET /api/v1/orgs/{id}/usage | Plan + current vs limit for every org-scoped quota, policy values, rate budgets, feature flags. Member-gated (a non-member gets the same 404 as GET /orgs/{id}). |
| GET /api/v1/me/usage | The caller’s api_tokens and owned_orgs current/limit. |

The operator UI surfaces the same numbers at /settings/usage as progress bars (an unlimited self-host limit renders as ∞). Reported limit == enforced limit by construction: both read the same plan and the same count query.

Anti-abuse

Two deny-lists, applied when a target is created, bulk-created, updated, or test-run. A block is a 400, audited to quota_events with event = abuse_blocked.

  • URL patterns — a case-insensitive regex set of attack-recon paths (exposed VCS dirs, .env, credential paths, admin panels, WordPress xmlrpc pingback, Spring actuator, backup/dump extensions, …). A match is 400 URL_PATTERN_BLOCKED / ABUSE_BLOCKED. The shipped patterns and the compiled fallback are kept byte-identical by a drift guard.
  • Domains — a YAML deny-list (config/abuse_denylist.yaml) matched hierarchically: listing example.com also blocks eu.status.example.com. It carries the operator’s own domain (don’t monitor yourself) and competing uptime/status providers (monitoring another monitor forms a load-amplification chain). A match is 400 DOMAIN_DENYLISTED. Dedicated monitoring SaaS are listed at the apex; multi-tenant status-page hosts are listed narrowly so legitimate vendor-status checks are not over-blocked.
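Hierarchical matching on the domain deny-list can be sketched as a label-boundary suffix test. The is_denylisted function is a hypothetical name, and lowercasing plus trailing-dot trimming are assumptions about normalisation.

```rust
/// True when `host` equals a deny-list entry or is a subdomain of one.
pub fn is_denylisted(host: &str, denylist: &[&str]) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    denylist.iter().any(|entry| {
        let entry = entry.trim_end_matches('.').to_ascii_lowercase();
        // Match on a label boundary so "notexample.com" does not hit "example.com".
        host == entry || host.ends_with(&format!(".{entry}"))
    })
}
```

The leading dot in the suffix test is what keeps the match hierarchical rather than textual: eu.status.example.com is blocked by example.com, while example.com.evil.net is not.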

The list loads once at startup; changes need a restart in this release. A bad regex or malformed YAML is a clean startup config error, never a crash loop.

Configuration

[quotas]
plan_cache_ttl_secs  = 300   # org→plan cache; a plans-table edit takes
usage_cache_ttl_secs = 10    #   effect within this window

A plans-table change is invisible until the plan cache’s TTL elapses (a cache hit is zero DB round-trips on the hot path), then the next lookup refetches.

Single-tenant deploys raise limits the same way SaaS does: edit (or INSERT) the plans row the org is assigned to, or attach a plan_overrides row with the cap fields you want to raise. There is no config-side override knob — every quota lives in Postgres so the audit-trail covers both modes.

Every numeric quota / rate / interval is validated at config load — < 1 is rejected with the offending field named, never a panic in router or limiter construction.

The reverse-proxy per-IP tiers (auth endpoints, org creation, public surface) are documented in Deployment.

Metrics

Prometheus exposition on metrics_bind (default 127.0.0.1:9090/metrics).

Series

Names below are the on-wire names exactly as registered in src/observability/metrics.rs (observability::metrics::names) and sampled in src/observability/sampler.rs. Dashboard queries must use these names verbatim.

| Name | Type | Purpose |
| --- | --- | --- |
| uptimepage_checks_total{status} | counter | checks completed, partitioned by terminal status (up/down/degraded/error) |
| uptimepage_checks_errors_total{kind} | counter | error breakdown by kind; currently only circuit_open is emitted (a check skipped because its host breaker was open) |
| uptimepage_check_redirects_total{outcome} | counter | HTTP redirect hops (followed / limit_exceeded / invalid_location / blocked_scheme) |
| uptimepage_circuit_breaker_state_changes_total{from,to} | counter | breaker state transitions |
| uptimepage_storage_writes_total{store,result} | counter | batcher flush outcomes |
| uptimepage_storage_dropped_results_total{reason} | counter | results dropped before reaching the sink (queue full, etc.) |
| uptimepage_notifications_total{channel,kind} | counter | alert notifications dispatched |
| uptimepage_notifications_failures_total{channel} | counter | notification dispatches that returned an error |
| uptimepage_alerts_dropped_total{reason} | counter | incident paging signals dropped before reaching the escalation engine, by NotificationReason (opened/escalated/resolved/reopened/no_data/data_resumed). A lifecycle change never blocks on paging throughput, so a saturated signal channel drops here; the incident row stays in Postgres for the reconcile sweep |
| uptimepage_notifications_dead_lettered_total{transport} | counter | incident pages that exhausted all retries without delivering, by transport |
| uptimepage_telegram_send_deferred_total | counter | Telegram sends held back by the per-bot/per-chat send budget rather than sent immediately. Sustained growth means the central bot is rate-limit bound |
| uptimepage_host_throttle_waits_total{kind} | counter | per-(org,host,port) (kind=host) or per-TLD RDAP (kind=rdap) throttle acquire attempts |
| uptimepage_host_throttle_drops_total | counter | host-bulkhead rejections: kind=host over-cap checks recorded as degraded without firing alerts. RDAP drops do NOT increment this counter; they fall through to the sticky last-good path (see domain_expiry_stale_served_total) |
| uptimepage_rdap_singleflight_total{outcome} | counter | RDAP singleflight outcome per domain: hit (cached, no outbound request) or miss (fetcher invoked) |
| uptimepage_domain_expiry_stale_served_total{kind} | counter | times the domain-expiry executor served a cached last-good answer instead of a fresh probe. kind distinguishes the cause: throttled, timeout, lookup_error, or fresh_error (no usable last-good; emitted as a real Error instead) |
| uptimepage_domain_expiry_state_write_failed_total | counter | failures writing the last-good cache row after a successful probe. Sustained values mean the sticky cache is going cold even though probes succeed; the typical cause is Postgres write degradation |
| uptimepage_scheduler_refresh_failed_total | counter | registry refresh ticks that returned an error from Postgres. Alert on a sustained rate above your normal noise floor; persistent failures put the scheduler into exponential backoff (capped at 10× the configured refresh interval) and keep workers running with cached ScheduledTarget snapshots |
| uptimepage_rdap_singleflight_slots | gauge | live entries in the in-process RDAP singleflight cache. Bounded under normal load by the set of monitored domains; sudden growth signals a code path feeding non-target domains into the cache |
| uptimepage_scheduler_consecutive_refresh_failures | gauge | consecutive registry refresh failures since the last success. Primary alarm signal for a stuck scheduler: page when the value stays above 5 for more than a few minutes. Resets to 0 on the first successful refresh |
| uptimepage_scheduler_refresh_duration_ms | histogram | wall-clock duration of one registry refresh tick (Postgres query + decode + DashMap diff). p99 climbing into the hundreds of ms means the current full-scan refresh is starting to strain at scale, which is the trigger for the deferred incremental-sync work |
| uptimepage_build_info{version} | counter | set to 1 once at startup so the endpoint is never empty |
| uptimepage_check_duration_ms | histogram | per-check wall time. The uptimepage_check_*_ms family is exposed as histogram buckets (not summary quantiles) so percentiles aggregate correctly across regions; query with histogram_quantile() |
| uptimepage_check_dns_ms | histogram | DNS resolution latency (recorded in the hickory wrapper) |
| uptimepage_check_connect_ms | histogram | TCP connect latency (every HTTP check connects fresh) |
| uptimepage_check_tls_ms | histogram | TLS handshake latency (per HTTPS check) |
| uptimepage_check_ttfb_ms | histogram | time-to-first-byte: request sent to response headers |
| uptimepage_storage_batch_size | histogram | flush batch sizes |
| uptimepage_storage_write_duration_ms | histogram | flush durations |
| uptimepage_telegram_send_wait_ms | histogram | wait imposed on a Telegram send by the send budget before its slot opened |
| uptimepage_targets_total | gauge | targets in this process’s scheduler registry (sampled). Non-zero only where in-process probing runs; a brain doing agent-only probing reports 0 by design, so use uptimepage_targets_enabled for the configured-monitor count |
| uptimepage_targets_enabled{kind} | gauge | configured enabled monitors counted from Postgres, by kind. Slow-cadence inventory gauge, scrape-cached so request load never reaches Postgres; correct on a brain regardless of where probing runs |
| uptimepage_users_active | gauge | non-deleted user accounts counted from Postgres. Slow-cadence inventory gauge, scrape-cached |
| uptimepage_workers_in_flight | gauge | current worker-pool semaphore depth (sampled). Emitted by every probing process, so on a brain doing agent-only probing the real value is on the agent’s role=probe series, not the brain’s near-zero one |
| uptimepage_result_queue_depth | gauge | depth of the result channel buffer (sampled). Present on both the agent (egress to the control plane) and the brain (ingest to storage); separate them by role |
| uptimepage_circuit_breakers_open | gauge | currently-open breakers (sampled). Probe-side: read the role=probe series |
| uptimepage_monitors_unmonitored | gauge | monitors whose covering probes have all gone silent (no fresh results), from the silence sweep. Distinct from down: these have no data at all |
| uptimepage_agent_up{region,agent} | gauge | 1 if a regional agent checked in within the staleness window, else 0. Emitted by the control plane from agents.last_seen_at, so it covers remote agents that Alloy can’t scrape. Per-agent series can freeze on agent removal, so alerts use uptimepage_agents_enabled_down |
| uptimepage_agent_last_seen_age_seconds{region,agent} | gauge | seconds since a regional agent last checked in. Climbs unbounded when an agent goes dark |
| uptimepage_agents_enabled_down | gauge | count of enabled regional agents currently past the staleness window. Recomputed every sweep so it never latches. The dead-man signal for a probe region going dark |
| uptimepage_region_agents_total{region} | gauge | enabled agents configured for a region (the quorum denominator). Brain-side from the agents table |
| uptimepage_region_agents_up{region} | gauge | enabled agents in a region fresh within the staleness window (the quorum numerator). up / total is the region’s health fraction; up == 0 means the region’s agents have all gone stale. Recomputed each sweep; like the per-agent gauges it can freeze if a region’s last agent is removed. Covers agents Alloy can’t scrape |
| uptimepage_region_checks_window{region} | gauge | checks completed in a region over the recent sampling window. Brain-side count from ClickHouse, so it covers remote agents Alloy can’t scrape. Only regions with results in the window appear |
| uptimepage_region_checks_up_window{region} | gauge | checks that returned up in a region over the recent window. Divide by uptimepage_region_checks_window for the success ratio |
| uptimepage_region_check_latency_p95_ms{region} | gauge | approximate p95 check latency in a region over the recent window, in ms. Goes stale for a dark region (no new rows), so gate panels on uptimepage_region_agents_up |
| uptimepage_pg_pool_size | gauge | total connections held in the sqlx Postgres pool (idle + in-use). Bounded above by storage.postgres.max_connections |
| uptimepage_pg_pool_idle | gauge | connections sitting idle in the Postgres pool. A persistent idle = 0 alongside in_use at the max is the saturation signal |
| uptimepage_pg_pool_in_use | gauge | connections checked out of the Postgres pool right now (size − idle). Alert on a sustained high in_use / size ratio |
| uptimepage_process_resident_bytes | gauge | resident set size of the process (VmRSS) in bytes. Linux only; absent on non-Linux dev runs. Early-warning signal for slow leaks ahead of the OOM killer |
| uptimepage_clickhouse_max_part_count_for_partition | gauge | ClickHouse MaxPartCountForPartition (sampled from system.asynchronous_metrics). Partition-explosion early warning: climbs toward parts_to_throw_insert (default 3000) if a high-cardinality column is added to PARTITION BY |
| uptimepage_http_requests_total{method,route,status} | counter | inbound HTTP requests handled. route is MatchedPath (the path-pattern with placeholders), so cardinality is bounded by the static router table, never by per-tenant ids. status is bucketed 2xx/3xx/4xx/5xx/other; query sum by (status) (rate(...[5m])) for the SLO ratio |
| uptimepage_http_request_duration_ms{method,route} | histogram | inbound HTTP request latency, exposed as summary quantiles (single web instance, no cross-instance merge). Query name{quantile="0.99"} for tail latency per route |
| uptimepage_http_responses_inflight | gauge | inbound HTTP requests currently being served. Climbing alongside flat throughput points at handler back-pressure on a downstream pool |
| uptimepage_ratelimit_drops_total{scope} | counter | HTTP 429s from the per-org / per-user rate-limit middleware. scope is the same string carried in the error body (per_org_api_writes, per_user_bulk_ops, …) so dashboards can join with record_quota_event audit rows. Abuse signal: a tenant hammering the API spikes one scope before shared resources notice |

Scrape interval of 15 s is plenty — counters are written from hot tokio tasks; histograms aggregate per bucket without lock contention.

Histogram exposition. Two forms. The uptimepage_check_*_ms family is configured with explicit buckets and exported as a Prometheus histogram (name_bucket{le="..."} plus name_sum / name_count); query it with histogram_quantile(0.99, sum(rate(name_bucket[5m])) by (le)) so percentiles pool correctly across regional agents. Every other *_ms / *_size histogram keeps the default exposition, a Prometheus summary with precomputed quantile series (name{quantile="0.5|0.9|0.95|0.99|0.999"}) plus name_sum and name_count; query those as name{quantile="0.99"} directly. Gauges carry no org_id label: these are single-instance operator metrics, not per-tenant.

Scrape labels. The collector stamps two labels the app does not set: role (control-plane on the brain, probe on a regional agent) and, on probe series, region. The brain and a probe both emit the prober and pipeline metrics (check_*, workers_in_flight, circuit_breakers_open, result_queue_depth, storage_*, process_resident_bytes), so filter by role to read the one you mean rather than summing two processes. The Ops dashboard pins probe panels to role=probe and filters them by a $region variable; the Business dashboard reads the control-plane-only inventory gauges.

The uptimepage_region_* gauges are different: the brain emits them with a region label it sets itself (from the agents table and from ClickHouse), not a collector-stamped scrape label. They are the per-region surface on a SaaS control plane, where the regional agents are not scraped at all: liveness and quorum from the agents table (region_agents_up / _total), throughput and latency from ClickHouse (region_checks_window / _up_window / region_check_latency_p95_ms). One scrape point, cost scales with regions, not tenants or fleet size.

OpenTelemetry tracing

Spans are exported over OTLP/HTTP (protobuf) when both observability.tracing_enabled and observability.grafana.enabled are true. The exporter targets observability.grafana.otlp_endpoint (the OTLP base; /v1/traces is appended) and authenticates with Authorization: Basic base64(instance_id:api_key). The destination is any OTLP/HTTP collector — Grafana Cloud Tempo, Jaeger, an OpenTelemetry Collector, etc.

  • api_key is read only from UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY — never from a file.
  • Sampling is parent-based over a head ratio (grafana.trace_sample_ratio, default 0.05); a sampled parent keeps its children.
  • Resource attributes: service.name = uptimepage, service.version = the build version.
  • Disabled by default and zero-cost when off: no exporter is built, no network egress, no per-check overhead.
  • A batch exporter ships spans in the background; it is flushed and stopped on graceful shutdown so the final spans are not lost. A transport build failure logs a warning and the service continues without traces — telemetry never takes down monitoring.

Inconsistent settings (export on but endpoint/instance/key missing, or a sample ratio outside [0.0, 1.0]) fail fast at startup as a config error, not a runtime surprise. See Configuration for the keys and env overrides.

HTTP connection phase timings

Every HTTP check opens a fresh connection. There is no pool: a monitor probes each target once per interval, so a pool would rarely reuse a socket, and a fresh connect is what lets the probe attribute time to each phase. check_dns_ms, check_connect_ms, and check_tls_ms are timed during that establishment, and check_ttfb_ms runs from request-send to response headers. The same four values are written per-check into ClickHouse, which is what powers the detail-page latency-breakdown chart.

Deployment

Production deployment with Caddy + basic auth

For real-world operation, use the production stack under deployment/ in the repo. It puts a Caddy reverse proxy in front of the Rust service with:

  • Automatic TLS via Let’s Encrypt (HTTP/2 and HTTP/3 on by default)
  • Basic auth on the UI and API
  • Postgres and ClickHouse on the internal docker network — no published ports
  • ClickHouse memory-capped at ~2 GB (see deployment/clickhouse-config.xml)

Setup:

cd deployment
cp .env.example .env
$EDITOR .env            # set domain, ACME email, bcrypt hash, DB passwords, KEK
docker compose up -d

deployment/README.md is the authoritative source for setup, user management, password rotation, backups, and troubleshooting.

Authentication boundary

The Rust service ships an in-binary auth stack (GitHub OAuth + opaque API tokens; magic-link sign-in is gated by config). The native auth is the boundary; a basic-auth layer in front of Caddy would double-prompt. Single-tenant deploys behave the same way — sign up as the first user and the operator surface is yours.

/healthz and /readyz are intentionally exposed without auth so uptime probes, load balancers, and orchestrators can hit them. /metrics on the public domain returns 404 — scrape it on the internal docker network instead.

The public status page (/status, /status/*, /api/public/*, /static/*, /robots.txt, /favicon.ico) is also unauthenticated by design — see Public status surface below.

See Authentication for the in-binary flow.

Email provider (Resend)

Transactional email (invitations, magic-link sign-in) goes through the EmailSender trait. Production uses Resend; dev and test default to the log provider, which writes the action URL to the tracing log so you can copy-paste it into a browser.

Setup:

  1. Create a Resend account and verify your sending domain. Resend will give you DKIM and DMARC records to add to DNS.

  2. Generate an API key with emails.send permission only.

  3. Configure the service:

    [email]
    provider = "resend"
    from_name = "Acme Status"
    from_address = "no-reply@status.acme.test"
    
    [email.resend]
    api_key = "re_…"
    

    Or via env: UPTIMEPAGE_EMAIL__PROVIDER=resend, UPTIMEPAGE_EMAIL__RESEND__API_KEY=re_….

  4. auth.public_base_url must be set to the externally-reachable origin (e.g. https://status.acme.test); the value is embedded in the links the recipient receives.

The factory refuses to boot when provider = "resend" is set without a non-empty API key: a fail-fast error at startup rather than a surprise at send time.

Public status surface

The Caddyfile carries an @public matcher that short-circuits basic_auth for the public status paths and adds a per-IP rate limit (60 req/min) via the caddy-ratelimit plugin. The stock caddy:2-alpine image doesn’t include that plugin, so the production deployment uses a custom image, custom-caddy:2, built with xcaddy:

docker build -t custom-caddy:2 - <<'EOF'
FROM caddy:2-builder AS builder
RUN xcaddy build --with github.com/mholt/caddy-ratelimit

FROM caddy:2-alpine
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
EOF

Then point the caddy service in deployment/docker-compose.yml at custom-caddy:2. Full procedure (including the opt-out path that drops the rate-limit block) is in deployment/README.md.

The same custom image carries two more per-IP zones: auth_endpoints (10/min on /auth/*, /api/v1/me, invitation accept) and org_creation (3 per 24 h on POST /api/v1/orgs). These are the edge tier; the per-org / per-user budgets the service enforces from each org’s plan are the Quotas & rate limits tier — complementary, since behind the proxy the app sees only the proxy as the peer.

Per-org subdomains (SaaS)

When tenancy.subdomain_public_routes = true, each org’s page is served at {slug}.{public_status.base_domain} (apex-wildcard shape). That needs:

  • a wildcard DNS record *.{domain} pointing at the host (plus explicit A/AAAA records for any operator subdomain — app, mail, etc. — which take precedence over the wildcard);
  • a wildcard TLS cert for *.{domain}. HTTP-01 can’t validate a wildcard, so the custom Caddy image also bundles caddy-dns/hetzner and solves the ACME DNS-01 challenge using a HETZNER_DNS_API_TOKEN (zone-edit scope) from .env. The operator host (app.{domain}) is kept on its own per-host HTTP-01 cert in a separate Caddyfile block so a wildcard-key compromise does not reach the operator surface.

The wildcard means a new org’s page works the moment its owner enables it — no per-org DNS or cert step. The end-to-end runbook (Hetzner zone setup, token scope, building the image, verifying the wildcard cert) is in deployment/README.md. The model — host routing, branding, opt-in gating, cookie scoping — is in Per-org status pages.
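To make the `{slug}.{base_domain}` shape concrete, here is a rough sketch of how a request host could be mapped to an org slug. The function `org_slug`, its signature, and the reserved-subdomain list are hypothetical; the sketch also assumes the host has already been lowercased.

```rust
/// Hypothetical host-to-org resolver for the `{slug}.{base_domain}` shape.
/// `reserved` lists operator subdomains that must never resolve to an org.
fn org_slug<'a>(host: &'a str, base_domain: &str, reserved: &[&str]) -> Option<&'a str> {
    // Drop an optional `:port` suffix.
    let host = host.split(':').next().unwrap_or(host);
    // Require "<slug>." directly in front of the base domain.
    let slug = host.strip_suffix(base_domain)?.strip_suffix('.')?;
    // Exactly one label: non-empty, no further dots, not an operator host.
    if slug.is_empty() || slug.contains('.') || reserved.contains(&slug) {
        return None;
    }
    Some(slug)
}

fn main() {
    let reserved = ["app", "mail"];
    for host in ["devorg.lvh.me:8080", "app.lvh.me", "lvh.me", "a.b.lvh.me"] {
        println!("{host} -> {:?}", org_slug(host, "lvh.me", &reserved));
    }
}
```

Requiring the dot before the base domain matters: without it, a host such as `evillvh.me` would be treated as a subdomain of `lvh.me`.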

For the operator workflow (enabling components, narrating incidents, scheduling maintenance) see Public status page.

Docker

docker compose up -d brings up Postgres 17, ClickHouse 26.3, and the monitor on the same network. Compose env vars wire the monitor to the stack:

UPTIMEPAGE_STORAGE__POSTGRES__URL: postgres://monitor:monitor@postgres:5432/monitor
UPTIMEPAGE_STORAGE__CLICKHOUSE__URL: http://clickhouse:8123
UPTIMEPAGE_STORAGE__CLICKHOUSE__USER: monitor
UPTIMEPAGE_STORAGE__CLICKHOUSE__PASSWORD: monitor
UPTIMEPAGE_OBSERVABILITY__LOG_FORMAT: json

The runtime image is gcr.io/distroless/static-debian12:nonroot for a minimal attack surface, no shell, and no glibc. Built from a static musl binary via rust:1-alpine. Final image is 16 MB — both uptimepage and loadtest binaries fit in the same image.

Bind addresses

Defaults are loopback (127.0.0.1:8080 API, 127.0.0.1:9090 metrics). Override via env for non-loopback exposure:

UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080 \
UPTIMEPAGE_SERVER__METRICS_BIND=0.0.0.0:9090 \
./uptimepage

There is no built-in auth on the API port. Front it with a proxy or keep it on a private network. The ready-made Caddy stack under deployment/ does this for you.

Metrics shipping (Grafana Cloud)

The Prometheus /metrics endpoint can be shipped to Grafana Cloud by a Grafana Alloy sidecar. It is opt-in: the compose stack only starts it under the metrics profile (docker compose --profile metrics up -d), so the default deployment is unchanged. Credentials are read from .env (gitignored) and never written into deployment/config.alloy.

deployment/README.md (“Metrics”) is the authoritative setup, including how to obtain the Grafana Cloud URL/token, the internal-network bind, the ready-made dashboard, and how to verify ingestion.

Migrations

  • Postgres: migrations/postgres/*.sql, applied at startup via sqlx::migrate! (tracked in _sqlx_migrations)
  • ClickHouse: migrations/clickhouse/*.sql, applied idempotently via CREATE … IF NOT EXISTS at startup

No external migrator. The app owns its schema lifecycle symmetrically.

Resource sizing

  • checker.max_concurrent_checks caps simultaneous in-flight checks
  • Per-check memory: small (a tokio task + an in-flight hyper request + bookkeeping)
  • The practical ceiling is set by file descriptors and ephemeral ports, not RAM
  • At 50k concurrent checks against external targets, RSS sits around 200-400 MB depending on response sizes
  • The optional metrics profile adds a Grafana Alloy container (~100 MB RSS plus a small bounded remote-write WAL volume) — account for it when sizing the host if you enable it

Graceful shutdown

The binary listens for SIGINT and SIGTERM, cancels the scheduler and batcher via a shared CancellationToken, awaits both background tasks, and exits within 10 s. The batcher’s cancel branch drains any pending results before returning. A warning is logged if the deadline is exceeded.
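The drain-on-cancel behaviour can be illustrated with standard-library threads. This sketch swaps in `std::sync::mpsc` and an `AtomicBool` for the service's Tokio tasks and CancellationToken, does not model the 10 s deadline, and uses a hypothetical `run_batcher` function.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical batcher loop: collects results until cancelled, then drains
/// whatever is still queued before returning.
fn run_batcher(rx: Receiver<u32>, cancel: Arc<AtomicBool>) -> Vec<u32> {
    let mut flushed = Vec::new();
    loop {
        if cancel.load(Ordering::Acquire) {
            // Cancel branch: drain pending results so none are lost.
            while let Ok(result) = rx.try_recv() {
                flushed.push(result);
            }
            return flushed;
        }
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(result) => flushed.push(result),
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => return flushed,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let cancel = Arc::new(AtomicBool::new(false));
    let worker = {
        let cancel = Arc::clone(&cancel);
        std::thread::spawn(move || run_batcher(rx, cancel))
    };
    for result in 0..100u32 {
        tx.send(result).unwrap();
    }
    // Signal shutdown only after the producers have finished sending.
    cancel.store(true, Ordering::Release);
    let flushed = worker.join().unwrap();
    println!("flushed {} results", flushed.len());
}
```

The ordering is the important part: producers stop first, then the consumer is cancelled and drains, so nothing queued before the signal is dropped.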

Development

Local setup for iterating on the service. For production deployment see deployment.md.

Prerequisites

  • Rust 1.95+ (edition 2024) via rustup
  • Docker + Docker Compose (for Postgres + ClickHouse)
  • Optional: just (brew install just) — every workflow below has a one-word just recipe equivalent. Run just to list them.

Two workflows

| Workflow | First build | Incremental | Notes |
| --- | --- | --- | --- |
| Host workflow | ~2 min | ~3 s | cargo run natively; only deps in Docker. Best for iteration. |
| Docker dev (cargo-watch) | ~3 min | ~3 s | Source bind-mounted, rebuilds happen inside the container with a cached target/. Live reload. |
| Docker prod-shape | ~5 min | ~30 s | Rebuilds image. Matches the prod build. Use for CI-shaped smoke tests. |

Bring up just Postgres + ClickHouse:

docker compose -f compose.dev.yml up -d

Run the binary natively:

cargo run --bin uptimepage

config/default.toml already points at localhost:5432 and localhost:8123, so no env overrides are needed. Edit code → Ctrl-C → cargo run again.

Tear down (keeps DB volumes):

docker compose -f compose.dev.yml down

Wipe data too:

docker compose -f compose.dev.yml down -v

Docker dev workflow (live reload inside a container)

Runs the binary inside a container that bind-mounts the repo and re-runs cargo run via cargo-watch on every source change. The compiled target/ and the linux Tailwind CLI live in named volumes, so they persist across restarts and don’t clash with the host build.

docker compose -f compose.dev.yml --profile dev-app up -d --build
docker compose -f compose.dev.yml logs -f uptimepage

First run takes ~3 min (toolchain + cargo-watch install + cold build + Tailwind fetch). After that, edits to src/, templates/, or static/css/input.css trigger an incremental rebuild + restart inside the container, typically under 5 s.

Don’t combine this with cargo run on the host — both bind 8080.

Stop just the app (keep pg + ch up):

docker compose -f compose.dev.yml stop uptimepage

Docker prod-shape workflow (full stack via Dockerfile)

docker compose up -d --build uptimepage

The Dockerfile uses cargo-chef to split dependency compile from app compile. The first build is slow; later src-only edits skip the dep cook layer and finish in ~30 s.

If you have the host workflow running and want to switch to docker, stop the native binary first to free port 8080 (or stop the docker service first to free the host port).

Verify it’s up

curl http://localhost:8080/healthz   # liveness
curl http://localhost:8080/readyz    # readiness (DBs reachable)

Browse:

  • http://localhost:8080/ — operator dashboard
  • http://localhost:8080/status — public status page
  • http://localhost:8080/docs — Swagger UI

Operator UI locally

The dev-app container runs the same SaaS code path as production. The host workflow (cargo run against config/default.toml) does too — the binary is multi-tenant SaaS in every environment; a single-tenant deploy is just a SaaS deploy with one signed-up user.

Get an authenticated owner session without GitHub OAuth:

just up-app          # SaaS-mode stack; wait for "api listening"
just dev-login       # seeds user+org+owner+session, prints the cookie

Then, in the browser devtools Console at http://localhost:8080:

document.cookie = "_sm_session=devsession-localtest-0000000000; path=/";

Reload — you’re the owner of “Dev Org”. The public page is at http://devorg.lvh.me:8080/status (*.lvh.me resolves to 127.0.0.1, no /etc/hosts edit). just dev-login also prints a curl snippet that passes the cookie directly, for API-only checks.

After editing a migration in place (pre-launch policy), the dev DB trips sqlx’s “migration N modified” checksum guard — just db-reset drops and recreates it (ClickHouse and the warm build cache are kept). down -v wipes the seeded session; re-run just dev-login.

Seed a target

curl -sS -X POST http://localhost:8080/api/v1/targets \
  -H 'content-type: application/json' \
  -d '{
    "name": "example",
    "check": {"type":"http","url":"https://example.com/","method":"GET",
              "timeout":5000,"follow_redirects":false,"max_redirects":0,
              "expected_status":{"kind":"exact","value":200},
              "headers":{},"verify_tls":true},
    "interval": 60, "enabled": true, "tags": [],
    "public_status": true
  }'

public_status: true makes the target appear on /status and addressable via /api/public/v1/badge.svg?component=<id>.

Seed UI fixtures

For end-to-end UI smoke (every public-page render path, varied check_spec kinds, notification channels, alert bindings, maintenance binding, adversarial title) use the bulk fixture script after just dev-login:

just seed-fixtures

What it seeds (under the seed-fixtures tag, idempotent):

  • 14 monitors — 8 public (covering all 5 component states: Operational / Degraded / Partial outage / Major outage / Maintenance — plus the disabled-target and ungrouped render paths) and 6 internal exercising every check_spec kind (http / tcp / dns / tls_cert / domain_expiry).
  • 161 incidents: 150 resolved across 87 days (clearing the 50-incident cap so the “Older incidents →” archive link renders), 10 active in mixed phases (investigating / identified / monitoring), and 1 adversarial-title incident covering the day-popover JSON-escape path.
  • 90-day ClickHouse history — per-target divergent shape via cityHash64(tid) (each component has a distinct uptime% and outage pattern), an explicit 87-89d “ancient outage” cluster on the first three targets, and a 6-day NoData gap on fix-email.
  • 9 notification channels — one per ChannelConfig variant (slack, webhook, whatsapp, discord, msteams, google_chat enabled; email enabled but unverified; telegram and telegram_app disabled), with alert bindings on fix-api / fix-db / fix-auth mixing notify_recovery on/off and single/multi-channel bindings.
  • 4 maintenance windows — 1 active (bound to fix-db), 2 upcoming, 1 past.

The script ends with a post-seed verification block that prints Postgres row counts, per-component last-5-min counters with an expected-vs-actual state matrix, an HTTP smoke against the public page, the adversarial-title escape check, and a 90-day ASCII day-strip per component. Exits non-zero on any mismatch — safe to chain in CI.

Env overrides: SLUG=<org> (default devorg), RESET_CH=0 to skip ClickHouse purge if you want to layer additional rows on top of a prior seed (default 1).

Then visit:

Logging

docker-compose.yml sets the default level to:

uptimepage=debug,sqlx=warn,hyper=warn,tower_http=info,info

For the host workflow, pass it directly:

RUST_LOG="uptimepage=debug,sqlx=warn" cargo run --bin uptimepage

RUST_LOG always wins over the config file. Anyhow errors are printed with {:#} from the public-status cache, so the full context chain shows up without re-running with backtraces.

Stream container logs:

docker compose logs -f uptimepage

Faster builds

just setup        # once: sccache + cargo-nextest, and the linker
                  # (mold on Linux; macOS prints an lld opt-in snippet)
just check        # primes test-profile artifacts so `just test` skips
                  # the rebuild a `cargo check` -> `cargo test` profile
                  # switch would otherwise force
  • Toolchain: rust-toolchain.toml pins 1.95 for every entrypoint (bare cargo, just, rust-analyzer, CI) — no more ad-hoc cargo +1.95.
  • Linker: .cargo/config.toml selects mold for Linux targets, so just, bare cargo, and rust-analyzer share one build fingerprint (an env RUSTFLAGS that differed between them would double-build target/). A Linux build needs mold installed — just setup. macOS is opt-in (Apple clang needs lld’s machine-specific absolute path; just setup prints the ~/.cargo/config.toml snippet).
  • sccache: compile cache for local dev (just sets RUSTC_WRAPPER only when present) and CI (mozilla-actions/sccache-action, with Swatinem/rust-cache reduced to cache-targets: false so they don’t double-store). Not in the release Dockerfile — cargo-chef already layer-caches deps there and the sccache mount wouldn’t survive CI.
  • CI installs the linker via rui314/setup-mold; the dev-app container via apk add mold + a persistent sccache volume.

Tests

cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
cargo test --release
cargo bench

Postgres-backed tests (e.g. bulk_create_with_ragged_tags) are #[ignore]’d by default and no-op when DATABASE_URL is unset. Bring up the stack and opt in. Validate schema/migration changes against a throwaway DB, not the stale monitor one (the harness auto-applies migrations on first connect):

docker compose -f compose.dev.yml up -d
docker compose -f compose.dev.yml exec -T postgres createdb -U monitor ci_verify

# Whole ignored suite (slow — builds every test binary):
DATABASE_URL=postgres://monitor:monitor@127.0.0.1:5432/ci_verify \
  cargo test -- --ignored

# One suite (fast — scope to a binary; bare `nextest run` rebuilds +
# enumerates all ~48 test binaries and looks frozen for minutes):
DATABASE_URL=postgres://monitor:monitor@127.0.0.1:5432/ci_verify \
  cargo test --test status_page_settings_test -- --ignored --nocapture

Database access

docker compose exec postgres psql -U monitor -d monitor
docker compose exec clickhouse clickhouse-client -u monitor --password monitor -d monitor

Same commands work against compose.dev.yml; the service names are identical.

Web UI

The single binary serves both the /api/v1/* JSON surface and a server-rendered HTML UI at /. Stack:

  • askama 0.16 + askama_web 0.16 — compile-time HTML templates under templates/. Type mismatches fail cargo build.
  • HTMX 2.0.9 + json-enc — bundled under static/js/. Powers partial swaps (filter, paginate, delete) and JSON form submission. No SPA framework.
  • Tailwind CSS 4 — CSS-first config in static/css/input.css (@source, @theme, @layer components). No tailwind.config.js.
  • ECharts 6 — lazy-loaded from page-level <script> tags, only where charts exist (dashboard, target detail).

build.rs runs ./bin/tailwindcss --minify before each cargo build. First build fetches the standalone CLI (~30 MB) via scripts/fetch-tailwind.sh; subsequent builds reuse it. After cargo build --release you have one self-contained executable with every template, CSS byte, and vendored JS file embedded via rust-embed.

Routes

| Path | Owner |
| --- | --- |
| GET / | dashboard (auto-refreshes via HTMX every 5 s) |
| GET /targets | targets list + filters |
| GET /targets/{id} | target detail with charts and time-range nav |
| GET /targets/new, /targets/{id}/edit | forms posting JSON to /api/v1/targets |
| GET /web/targets/list | tbody fragment for filter/paginate swaps |
| GET /web/partials/dashboard | chrome-free fragment for the 5 s refresh region |
| GET /docs | Swagger UI generated from /api/openapi.json |
| GET /static/* | embedded assets (css/, js/, img/) |

Every UI mutation hits an existing /api/v1/* endpoint — there are no /web/* write routes, which keeps the API the single source of truth and makes a future SvelteKit port a templates-only rewrite.

Adding a new page

  1. Add a template under templates/ (extend base.html).
  2. Add a #[derive(Template, WebTemplate)] struct and handler in src/web/views/.
  3. Register the route in src/web/routes.rs.
  4. Tailwind picks up new utility classes automatically via the @source "../../templates/**/*.html" directive.

UI tests

  • Unit (render): every view in src/web/views/ ships a #[test] that renders the template with a fixtures struct and asserts on the output (presence of the HTMX hooks, redaction sentinels, table scaffolding).
  • End-to-end: tests/web_e2e_test.rs drives the merged API+web router via tower::ServiceExt::oneshot, covering dashboard / list / detail / forms / 404 paths and verifying credential redaction never leaks real values into HTML.
cargo test --lib web::          # unit render tests
cargo test --test web_e2e_test  # e2e

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| 503 STATUS_DATA_UNAVAILABLE | Aggregator’s first compute failed. Check uptimepage::public_status::cache ERROR log for the actual SQL/CH error. |
| docker compose up --build takes 5 min on every change | You’re on the pre-cargo-chef Dockerfile. Pull latest. |
| Native cargo run fails with Connection refused | compose.dev.yml isn’t up, or you forgot to release port 8080 from a running container. |

Load test

End-to-end harness. Spawns workers driving the production check executor against in-process mock servers. Different from the micro-benchmarks, which measure single-call cost via Criterion.

cargo run --release --bin loadtest

Linux verification (Docker)

50k concurrent runs need Linux kernel knobs that macOS doesn’t expose. The compose stack ships a loadtest profile that runs the binary inside a Linux container with the required sysctls and ulimits:

docker compose --profile loadtest build loadtest
docker compose --profile loadtest run --rm loadtest

# override on the fly
docker compose --profile loadtest run --rm \
  -e CONCURRENCY=100000 -e DURATION_SECS=60 loadtest

The container sets net.core.somaxconn=8192, net.ipv4.tcp_tw_reuse=1, net.ipv4.ip_local_port_range=10000 65535, and nofile=1048576 — none require --privileged since these sysctls are namespaced.

Env

| Env | Default | Purpose |
| --- | --- | --- |
| CONCURRENCY | 50000 | concurrent virtual workers |
| DURATION_SECS | 30 | how long to drive load |
| TIMEOUT_MS | 5000 | per-check request timeout |
| MOCK_PORTS | 16 | parallel in-process mock listeners; spreads 4-tuple load to avoid loopback ephemeral-port exhaustion |
| RAMP_SECS | 2 | worker start stagger window; avoids thundering-herd SYN bursts at listen() backlog |
| HTTP2 | 0 | when 1, client speaks HTTP/2 with prior knowledge (RFC 7540 §3.4). Single TCP connection multiplexes many streams; necessary to drive 50k workers on macOS where ephemeral src ports cap at ~16k |

What it does

Spawns MOCK_PORTS axum servers returning 200 ok, then drives workers in a tight loop using the same build_clients + check executor the production binary uses. Prints rolling RPS during the run and total / success / rps / p50 / p95 / p99 / error-kind histogram at the end.

macOS notes

  • kern.ipc.somaxconn caps listener backlog at 128 per socket (hard kernel limit)
  • Ephemeral src port range: 49152–65535 = 16,384 ports
  • TIME_WAIT lingers 30 s, holding closed ports

For 50k-concurrency runs use HTTP2=1 to fold many streams onto a few TCP connections. Linux defaults (ephemeral 32-61k, tunable somaxconn) handle 50k HTTP/1 natively.
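The port arithmetic behind these notes can be worked through directly. The sketch below is a back-of-envelope model, not a measurement: the helper names are hypothetical, and dividing the port count by the TIME_WAIT duration is a rule-of-thumb ceiling for connection churn toward a single destination (ip, port).

```rust
/// Number of ports in an inclusive ephemeral range.
fn port_count(low: u16, high: u16) -> u32 {
    u32::from(high) - u32::from(low) + 1
}

/// Rough ceiling on new connections per second to ONE destination
/// (ip, port) when every closed socket lingers in TIME_WAIT.
fn max_conn_rate(ports: u32, time_wait_secs: u32) -> u32 {
    ports / time_wait_secs
}

fn main() {
    let macos_ports = port_count(49152, 65535);
    println!("macOS ephemeral ports: {macos_ports}");
    println!(
        "churn ceiling per destination at 30 s TIME_WAIT: ~{} conn/s",
        max_conn_rate(macos_ports, 30)
    );
    // Each mock listener is a separate destination port, so MOCK_PORTS
    // multiplies the number of distinct 4-tuples available.
    println!("distinct 4-tuples with MOCK_PORTS=16: {}", macos_ports * 16);
    // Range set by the loadtest container on Linux.
    println!("container range 10000-65535: {} ports", port_count(10000, 65535));
}
```

HTTP/2 sidesteps the limit altogether because each connection carries many streams, so the number of sockets no longer tracks the number of workers.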

Reference numbers

Substrate caveat. Every number below was captured on a developer laptop (Apple M1 Pro, 10 cores, 16 GB). Useful for regression detection (“did this change hurt the hot path?”) and for relative comparisons between commits — not for production capacity planning. Treat them as floors, not ceilings: a real Linux host on server hardware will outperform; a constrained VM will underperform. When sizing for production, re-run on the target topology.

macOS host (M1 Pro, 10 cores, loopback)

| Date | Config | Result |
| --- | --- | --- |
| 2026-05-14 | CONCURRENCY=50000 MOCK_PORTS=8 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 252,114 rps · 100% success · 75.7M checks · p50 181 ms · p95 283 ms · p99 393 ms |
| earlier | CONCURRENCY=50000 MOCK_PORTS=8 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 151,614 rps · 100% success · 45.5M checks · p99 579 ms |
| earlier | CONCURRENCY=12000 MOCK_PORTS=24 RAMP_SECS=10 DURATION_SECS=300 (HTTP/1) | 27,894 rps · 99.79% success · p99 2.7 s |

The 2026-05-14 run is the current headline: 252 k rps sustained, p99 393 ms, zero errors over 5 minutes. Captures the hot path with the multi-tenancy work merged. Native macOS loopback on Darwin 25.4 reaches 50 k concurrent HTTP/2 without the docker crutch — the older “macOS can’t do 50 k loopback” note in earlier docs is stale.

Linux container (Docker Desktop VM on Mac)

| Date | Config | Result |
| --- | --- | --- |
| 2026-05-14 | CONCURRENCY=50000 MOCK_PORTS=16 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 (10 vCPU allocated) | 17,391 rps · 100% success · 5.25 M checks · p99 4.2 s · 26 timeouts |
| 2026-05-12 | CONCURRENCY=50000 MOCK_PORTS=16 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 93,350 rps · 100% success · 28.1 M checks · p99 1.8 s · 933 MiB RSS peak |

The 2026-05-14 docker run regressed sharply versus the 2026-05-12 reference on the same hardware. CPU was not the bottleneck (10 vCPU allocated and not pegged); the regression sits in the Docker Desktop networking layer — likely the DOCKER_INSECURE_NO_IPTABLES_RAW flag and iptables-rule changes between DD versions. Same checkout’s native run on the same box hit 252 k rps, so the binary is fine; the VM substrate isn’t.

Docker is no longer the right way to validate this binary’s perf on macOS. Prefer the native run above; reach for a real Linux host (CI runner, staging VM) when you actually need a Linux number.

HTTP/1 vs h2c trade-off

HTTP/1 exercises connect / pool churn, which is closer to the “monitor checks N legacy endpoints” reality. h2c stresses HTTP/2 framing and flow control, which is closer to “monitor checks N gRPC / modern HTTPS endpoints with ALPN”. Production monitors hit both. Default is HTTP/1; flip HTTP2=1 when ephemeral-port exhaustion masks the signal you actually care about.

Benchmarks

Criterion micro-benchmarks under benches/. Measure execute_http_check end-to-end through the same hyper-util client path the service uses in production.

cargo bench --bench http_client
cargo bench --bench public_status_ttfb   # requires `just up` (PG + CH)

Substrate caveat. Every number on this page was captured on a developer laptop (Apple M1 Pro, 10 cores, 16 GB). Useful for regression detection across commits — not for production capacity planning. A real Linux server will outperform; a constrained VM will underperform. When sizing for production, re-run on the target topology.

What the bench measures

| Bench | Unit |
| --- | --- |
| http_check_single | one execute_http_check call against in-process axum mock, h2c prior-knowledge |
| http_check_throughput | c concurrent calls via join_all, varying c ∈ {100, 1000, 10000, 50000} |

Each variant runs under two pinned topologies:

  • 1c — server + client share one OS thread (current_thread runtime). Single-core ceiling.
  • 2c — server on its own thread, client on the bench thread. Two-core ceiling.

Pinning makes results reproducible across machines: no num_cpus() drift.

Single-core results (hyper-util, 2026-05-14)

M1 Pro, loopback h2c, mock returns 200 ok:

| Bench | Latency (median) | Throughput | Δ vs reqwest baseline |
| --- | --- | --- | --- |
| http_check_single/1c | 37 µs | 26.8 K rps | −21% latency · +17% rps |
| http_check_throughput/1c/c_100 | 778 µs | 128 K rps | −35% latency · +54% rps |
| http_check_throughput/1c/c_1000 | 7.45 ms | 134 K rps | −36% latency · +56% rps |
| http_check_throughput/1c/c_10000 | 80.6 ms | 124 K rps | −30% latency · +44% rps |
| http_check_throughput/1c/c_50000 | 422 ms | 118 K rps | −31% latency · +44% rps |

One CPU sustains ~130 K checks/sec. Per-check overhead at saturation = 1/130000 ≈ 7.7 µs.

Saturation reached by c=1000. Larger concurrency = more wall time, same rps — bottleneck shifts to in-thread cooperative scheduling, not parallelism.
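The throughput column appears to follow from the latency column as c divided by the median batch latency, and the per-check figure is the reciprocal of the saturated rate. A small worked calculation, with hypothetical helper names, makes the relationship explicit:

```rust
/// Throughput in requests/sec when `c` concurrent calls complete in `latency_secs`.
fn rps(c: u32, latency_secs: f64) -> f64 {
    f64::from(c) / latency_secs
}

/// Amortized per-check cost in microseconds at a given throughput.
fn per_check_us(rps: f64) -> f64 {
    1_000_000.0 / rps
}

fn main() {
    // (concurrency, median latency in seconds) from the single-core table.
    let rows = [(100, 778e-6), (1000, 7.45e-3), (10000, 80.6e-3), (50000, 422e-3)];
    for (c, latency) in rows {
        println!("c={c:>5}: {:.0} rps", rps(c, latency));
    }
    println!("per-check cost at 130 K rps: {:.2} µs", per_check_us(130_000.0));
}
```

Note that this per-check cost is an amortized figure under saturation; the latency of an isolated call is the http_check_single row, not this number.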

Two-core results (hyper-util, 2026-05-14)

For comparison only — production CPU budget should be sized off 1c.

| Bench | Latency (median) | Throughput |
| --- | --- | --- |
| http_check_single/2c | 47.7 µs | 21 K rps |
| http_check_throughput/2c/c_1000 | 6.52 ms | 153 K rps |
| http_check_throughput/2c/c_10000 | 76.7 ms | 130 K rps |
| http_check_throughput/2c/c_50000 | 440 ms | 114 K rps |

Second core gains ~14% over 1c at saturation. Single-check latency is slower on 2c (48 µs vs 37 µs) — OS context-switch cost dominates when there’s no parallelism to amortize.

Public status page TTFB (50 orgs × 50 components)

benches/public_status_ttfb.rs provisions a 50-org × 50-component × 60-result fixture in PG + CH then times LiveAggregator::build() for one tenant.

| Metric | Value |
| --- | --- |
| Median | 14.0 ms |
| 95% CI | 13.1–15.1 ms |
| Outliers | 6/40 (15%), 3 high severe |
| Spec target (p99) | < 200 ms |

Comfortably under target — the (org_id, target_id, ts) ORDER BY on ClickHouse keeps single-tenant lookups bounded; no full-scan regression. Measures the aggregator only — full HTTP TTFB to the client adds template render + serialize + compression (~5–15 ms).

Where the cycles go (historical — reqwest path)

Snapshot kept for context. samply, 15 s sample at 2c/c_10000 on the previous reqwest stack. The largest reqwest-specific cost — 7.5% on url::parse inside reqwest::redirect::TowerRedirectPolicy — disappeared with the hyper-util migration and explains a big chunk of the +44–56% throughput gain documented above.

| % of client thread | Cost | Notes |
| --- | --- | --- |
| 7.5% | url::parse via reqwest::redirect::TowerRedirectPolicy | URL re-parsed per request even with redirect::Policy::none(); removed post-migration |
| 6.5% | kevent syscall | tokio io driver poll (inherent) |
| 6.3% | _platform_memmove | h2 frame buffer copies (inherent) |
| 5.0% | mach_absolute_time | tokio timer + criterion clock |
| 2.4% | hyper_util::Client::send_request | request dispatch |
| 1.5% | h2::HeaderBlock::into_encoding | HPACK encode |
| 1.5% | pthread_mutex_lock | hyper pool mutex |
| ~10% combined | h2 stream bookkeeping (pop/unlink/clone) | inherent to multiplexing |

Methodology notes

  • target_id is hoisted out of the iter — production uses fixed-per-target UUIDs, so paying Uuid::now_v7’s getentropy syscall per call would add ~10 µs of bench-only noise.
  • Mock returns &'static str — no JSON, no allocation, no body parsing. Isolates client-side cost.
  • No TLS (verify_tls: false, plain http://). TLS handshake amortizes over h2 connection reuse; not in this bench.
  • HTTP/2 prior-knowledge (RFC 7540 §3.4) — single TCP connection multiplexes streams. Without it the bench would exhaust loopback ephemeral ports past c≈10000 on macOS.
  • Loopback only. Real network adds RTT (dominates everything here) plus DNS + TCP connect + TLS on first request per host.

Reproducibility caveats

  • macOS: no CPU isolation; Spotlight / Time Machine / runaway processes show as 5–10% outliers
  • Linux: taskset -c 0 pins the bench process to a single core for cleaner 1c numbers
  • Apple Silicon: P-core vs E-core scheduling is opaque; results can shift ~5% run-to-run

For production capacity planning use the single-core throughput above and multiply by your CPU budget. Empirical scaling stays sub-linear past ~4c due to shared h2 connection state and pool mutex contention.

Troubleshooting

/readyz returns 503

The target store can’t be reached. Check storage.postgres.url and that Postgres is up. The readiness probe pings the store; liveness (/healthz) does not.

No metrics on /metrics

  • Confirm observability.metrics_enabled = true
  • Confirm metrics_bind isn’t blocked by a local firewall
  • uptimepage_build_info is emitted at startup so the endpoint is never truly empty — if it’s also missing, the metrics exporter never bound

Many storage_dropped_total{reason="queue_full"}

The result channel between worker pool and batcher is back-pressured.

  • Raise storage.clickhouse.buffer_size (mpsc capacity)
  • Raise storage.clickhouse.batch_size (fewer round-trips for the same number of results)
  • Lower storage.clickhouse.batch_timeout_ms (more frequent flushes)
  • Or lower check frequency for the busiest targets (interval per target)
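The queue_full condition is what a bounded channel reports when producers outpace the consumer. The sketch below reproduces it with the standard library's `sync_channel`; the real service presumably uses an async mpsc channel, and `dropped_when_full` is a hypothetical helper written only for illustration.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Offers `n` results to a bounded queue of `capacity` with no consumer
/// running, and returns how many were dropped because the queue was full.
fn dropped_when_full(capacity: usize, n: u32) -> u32 {
    // `_rx` is kept alive so sends fail with Full, not Disconnected.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut dropped = 0;
    for result in 0..n {
        match tx.try_send(result) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    dropped
}

fn main() {
    for capacity in [2, 4, 8] {
        println!(
            "capacity {capacity}: {} of 5 results dropped",
            dropped_when_full(capacity, 5)
        );
    }
}
```

A larger buffer only absorbs bursts. If the batcher is slower than the workers on average, the queue fills again, which is why the batch-size, flush-interval, and check-frequency levers are listed alongside it.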

Circuit breaker stuck open

Look at uptimepage_checks_errors_total{kind} filtered by host to find the failure mode, then wait circuit_breaker.open_duration_secs for the breaker to enter half-open and probe.
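For orientation, the closed / open / half-open cycle can be sketched as a small state machine. This is a generic circuit-breaker illustration, not the service's implementation: the `Breaker` type and the `failure_threshold` knob are assumptions, and `open_duration` stands in for circuit_breaker.open_duration_secs.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: State,
    failure_threshold: u32,  // assumed knob, not named in the text
    open_duration: Duration, // stands in for circuit_breaker.open_duration_secs
}

impl Breaker {
    fn new(failure_threshold: u32, open_duration: Duration) -> Self {
        Breaker { state: State::Closed { failures: 0 }, failure_threshold, open_duration }
    }

    /// Whether a check may run at `now`. Moves Open to HalfOpen once the
    /// open window has elapsed, which lets one probe through.
    fn allow(&mut self, now: Instant) -> bool {
        if let State::Open { since } = self.state {
            if now.duration_since(since) >= self.open_duration {
                self.state = State::HalfOpen;
            } else {
                return false;
            }
        }
        true
    }

    /// Records the outcome of a check that was allowed to run.
    fn record(&mut self, ok: bool, now: Instant) {
        self.state = match (self.state, ok) {
            (_, true) => State::Closed { failures: 0 },
            (State::Closed { failures }, false) if failures + 1 < self.failure_threshold => {
                State::Closed { failures: failures + 1 }
            }
            // Threshold reached, or the half-open probe failed: (re)open.
            (_, false) => State::Open { since: now },
        };
    }
}

fn main() {
    let mut breaker = Breaker::new(3, Duration::from_secs(30));
    let start = Instant::now();
    for _ in 0..3 {
        breaker.record(false, start);
    }
    println!("right after failures: allow = {}", breaker.allow(start));
    println!(
        "after the open window: allow = {}",
        breaker.allow(start + Duration::from_secs(30))
    );
}
```

In this model a breaker that looks stuck open is one whose half-open probe keeps failing, so each probe restarts the open window.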

Targets reporting degraded with throttled: host concurrency cap

One tenant has more concurrent monitors at the same (host, port) than checker.per_host_max_inflight allows (default 2). Over-cap checks are recorded degraded instead of running. No alert fires — the upstream is fine. Either spread the targets across more hosts, raise the cap, or rely on jitter to thin the burst. Watch uptimepage_host_throttle_drops_total to size the cap against real traffic.
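A per-(host, port) in-flight cap of this kind can be sketched as a counter map. `HostLimiter` is a hypothetical type; for brevity it ignores the per-tenant dimension mentioned above and is not thread-safe.

```rust
use std::collections::HashMap;

/// Hypothetical per-(host, port) in-flight limiter.
struct HostLimiter {
    cap: usize,
    inflight: HashMap<(String, u16), usize>,
}

impl HostLimiter {
    fn new(cap: usize) -> Self {
        HostLimiter { cap, inflight: HashMap::new() }
    }

    /// Takes a slot and returns true if the host is under its cap;
    /// returns false when the check should be recorded as throttled.
    fn try_acquire(&mut self, host: &str, port: u16) -> bool {
        let n = self.inflight.entry((host.to_string(), port)).or_insert(0);
        if *n >= self.cap {
            return false;
        }
        *n += 1;
        true
    }

    /// Frees a slot when a check finishes.
    fn release(&mut self, host: &str, port: u16) {
        if let Some(n) = self.inflight.get_mut(&(host.to_string(), port)) {
            *n = n.saturating_sub(1);
        }
    }
}

fn main() {
    // Cap of 2 matches the documented default for per_host_max_inflight.
    let mut limiter = HostLimiter::new(2);
    for attempt in 1..=3 {
        let ran = limiter.try_acquire("api.example.test", 443);
        println!("check {attempt}: {}", if ran { "runs" } else { "throttled" });
    }
    limiter.release("api.example.test", 443);
    println!("after one finishes: {}", limiter.try_acquire("api.example.test", 443));
}
```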

domain_expiry results show served_stale: …

The fresh RDAP probe failed (throttle, timeout, registry 5xx, network blip) but the executor served the most recent successful answer from domain_expiry_state instead of flipping the monitor red. The status reflects the cached expiry_at. For Up the error field stays empty (the customer-facing surface shows nothing unusual); for Degraded/Down it carries served_stale: last_verified_age_secs=…; refresh_failed=<kind> plus the cached details so operators can distinguish a stale serve from a fresh probe.

Inspect the failure kind via uptimepage_domain_expiry_stale_served_total{kind}:

  • kind="throttled": the per-TLD RDAP bulkhead rejected this probe. Raise checker.rdap_max_inflight if this happens often, but keep in mind that the cap is also what keeps probe volume IANA-friendly.
  • kind="timeout": the registry took longer than check.timeout (per-target). Either raise the per-check timeout or wait; most registries recover in minutes.
  • kind="lookup_error": the registry returned a non-2xx response (often 404 or 5xx). If a specific TLD is stuck on 5xx, the registry is having an incident; rows keep streaming as served_stale until 7 days have passed.
  • kind="fresh_error": there is no usable last-good answer (first probe, or the cached row is older than 7d). A real CheckStatus::Error is emitted and is alert-eligible.
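
The four kinds above follow from one decision: when a refresh fails, serve the cached answer if it is recent enough, otherwise emit a real error. The sketch below is an assumption-laden illustration, not the executor's code. The function and type names are hypothetical, and whether an age of exactly 7 days still counts as usable is a guess.

```rust
use std::time::Duration;

/// 7-day staleness ceiling from the text above.
const STALENESS_CEILING: Duration = Duration::from_secs(7 * 24 * 60 * 60);

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RefreshFailure {
    Throttled,
    Timeout,
    LookupError,
}

impl RefreshFailure {
    /// Metric label used in uptimepage_domain_expiry_stale_served_total{kind}.
    pub fn kind(self) -> &'static str {
        match self {
            RefreshFailure::Throttled => "throttled",
            RefreshFailure::Timeout => "timeout",
            RefreshFailure::LookupError => "lookup_error",
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Status is derived from the cached expiry; the refresh failure is noted.
    ServedStale { refresh_failed: &'static str, last_verified_age_secs: u64 },
    /// No usable last-good answer: counted as kind="fresh_error".
    FreshError,
}

/// `last_good_age` is None when no successful probe has ever been cached.
pub fn on_refresh_failure(failure: RefreshFailure, last_good_age: Option<Duration>) -> Outcome {
    match last_good_age {
        Some(age) if age <= STALENESS_CEILING => Outcome::ServedStale {
            refresh_failed: failure.kind(),
            last_verified_age_secs: age.as_secs(),
        },
        _ => Outcome::FreshError,
    }
}
```

A timeout with a one-day-old cached row is served stale, while the same timeout with an eight-day-old row or no row at all becomes a fresh error.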

domain_expiry results have flipped to real Error after days of served_stale

The cached row in domain_expiry_state is older than the 7-day staleness ceiling, so the executor stopped masking the registry outage. Either the registry has been down for that long (act on it), or this target’s interval is so long that probes haven’t run in a week. Check last_success_at in domain_expiry_state for the target.

TLS errors against internal hosts

Set verify_tls: false on the offending target. The check executor picks between a verifying and a non-verifying hyper-util client based on the flag — both share the same DNS cache and connection-pool sizing.

400 Bad Request on POST /targets — target address ... is in a blocked range

SSRF guard rejected the target. The URL or TCP host resolves to a private / loopback / link-local / reserved IP. Verify the resolved address is what you expect. To monitor private infrastructure deliberately, set security.allow_private_targets = true and ensure network segmentation prevents abuse.

Check fails with all resolved addresses for 'host' are in blocked ranges

DNS returned only private IPs for a target the API previously accepted (hostname literal). Either the record changed or DNS rebinding is in play. The connect-time guard refuses to continue. Either fix DNS or, deliberately, enable security.allow_private_targets.
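
A connect-time guard of this kind can be approximated with the standard library's address classification. This is a hedged sketch: is_blocked and connectable are hypothetical names, the exact set of blocked ranges in the service may differ, and only the error text and the security.allow_private_targets switch come from the text above.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_documentation()
        || ip.octets()[0] >= 240 // 240.0.0.0/4 reserved, includes broadcast
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // fc00::/7 unique local
        || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
}

/// True for private, loopback, link-local, and reserved addresses.
pub fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // Treat IPv4-mapped IPv6 (::ffff:a.b.c.d) as the embedded IPv4 address.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}

/// Keep only addresses the guard allows; fail if none remain.
pub fn connectable(
    host: &str,
    resolved: &[IpAddr],
    allow_private_targets: bool,
) -> Result<Vec<IpAddr>, String> {
    if allow_private_targets {
        return Ok(resolved.to_vec());
    }
    let allowed: Vec<IpAddr> = resolved.iter().copied().filter(|ip| !is_blocked(*ip)).collect();
    if allowed.is_empty() {
        Err(format!("all resolved addresses for '{host}' are in blocked ranges"))
    } else {
        Ok(allowed)
    }
}
```

Filtering at connect time rather than only at target creation is what catches a hostname whose record changes after the API accepted it.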

credential decryption failed errors in logs

The KEK loaded at startup cannot decrypt rows that were written with a different KEK. Either security.credentials_kek_base64 was rotated without re-encrypting existing rows, or the wrong key was supplied. Compare the configured KEK against the one used to write the affected targets. There is no automatic rotation. Recovery options:

  • Restore the original KEK.
  • Rewrite the affected targets under the new key, either by deleting and re-creating them via POST or by re-supplying the credentials via PATCH. A row decrypts cleanly once it has been overwritten under the new key.

Startup fails with invalid credentials_kek_base64

The supplied key is not 32 bytes after base64 decode, or the string is not valid base64. Generate a fresh key with openssl rand -base64 32. URL-safe and standard base64 both decode.
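
The two failure modes (bad base64, wrong length) can be reproduced with a short validator. This is a self-contained sketch with a hand-rolled lenient decoder so that it needs no dependencies; the service itself presumably uses a base64 crate, and parse_kek, decode_base64, and the error wording after the "invalid credentials_kek_base64" prefix are assumptions.

```rust
use std::convert::TryFrom;

/// Decode standard or URL-safe base64, with or without '=' padding.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let trimmed = input.trim().trim_end_matches('=');
    if trimmed.len() % 4 == 1 {
        return None; // no valid encoding has this length
    }
    let mut out = Vec::with_capacity(trimmed.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits in `acc`
    for c in trimmed.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62, // standard and URL-safe alphabets
            b'/' | b'_' => 63,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

/// Validate a KEK the way the startup check is described: valid base64, exactly 32 bytes.
pub fn parse_kek(b64: &str) -> Result<[u8; 32], String> {
    let bytes = decode_base64(b64)
        .ok_or_else(|| "invalid credentials_kek_base64: not valid base64".to_string())?;
    <[u8; 32]>::try_from(bytes.as_slice()).map_err(|_| {
        format!("invalid credentials_kek_base64: expected 32 bytes, got {}", bytes.len())
    })
}
```

A key produced by openssl rand -base64 32 is 44 characters long including one trailing '='; anything that decodes to a different byte count fails the length check.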

400 Bad Request on PATCH /targets/{id} — basic_auth contains redaction sentinel

A client read the target back (where credentials are returned as "***") and PATCHed the full check body without re-supplying the real credential. Either send the real value, or omit check entirely from the PATCH body if only other fields are changing.
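
The rejection can be pictured as a check for the "***" sentinel in any credential field of the PATCH body. The struct shapes below (TargetPatch, CheckPatch, BasicAuth and their fields) are hypothetical and only illustrate the rule; the real request schema may differ.

```rust
/// Value returned in place of credentials when a target is read back.
pub const REDACTED: &str = "***";

pub struct BasicAuth {
    pub username: String,
    pub password: String,
}

pub struct CheckPatch {
    pub basic_auth: Option<BasicAuth>,
}

/// Hypothetical PATCH body: every field is optional, omitted fields are left unchanged.
pub struct TargetPatch {
    pub name: Option<String>,
    pub check: Option<CheckPatch>,
}

/// Reject a patch that echoes the redaction sentinel back as a credential.
pub fn validate_patch(patch: &TargetPatch) -> Result<(), &'static str> {
    match patch.check.as_ref().and_then(|c| c.basic_auth.as_ref()) {
        Some(auth) if auth.password == REDACTED => {
            Err("basic_auth contains redaction sentinel")
        }
        _ => Ok(()),
    }
}
```

A patch that changes only the name and leaves check unset passes, which matches the advice to omit check when the credentials are not changing.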

429 Too Many Requests on /api/v1/*

Per-IP rate limiter is active and the bucket is empty. Read the Retry-After header for the wait time, or raise api.rate_limit.{per_second, burst}. If every client appears to share one bucket, the service is sitting behind a reverse proxy and the peer IP is the proxy — disable the in-app limiter (api.rate_limit.enabled = false) and let the proxy enforce per-client limits instead.
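
How per_second, burst, and Retry-After relate can be seen in a plain token bucket. This is a generic sketch, assuming the limiter behaves like a standard token bucket; TokenBucket and try_take are hypothetical names and the service's actual limiter may be implemented differently.

```rust
use std::time::{Duration, Instant};

/// One bucket per client IP: refills at `per_second`, holds at most `burst` tokens.
pub struct TokenBucket {
    per_second: f64,
    burst: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(per_second: u32, burst: u32, now: Instant) -> Self {
        assert!(per_second > 0, "per_second must be positive");
        Self {
            per_second: f64::from(per_second),
            burst: f64::from(burst),
            tokens: f64::from(burst), // start full
            last_refill: now,
        }
    }

    /// Ok(()) admits the request. Err(wait) is how long until one token is
    /// available, which is the value a 429 response would report as Retry-After.
    pub fn try_take(&mut self, now: Instant) -> Result<(), Duration> {
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.per_second).min(self.burst);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(Duration::from_secs_f64((1.0 - self.tokens) / self.per_second))
        }
    }
}
```

Raising burst lets a client absorb short spikes, while raising per_second shortens the wait once the bucket is empty. When all traffic arrives from one proxy IP, every client drains the same bucket, which is the shared-bucket symptom described above.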

ClickHouse insert fails with SchemaMismatch

Almost always a Row-derive mismatch on UUID, Enum8, or DateTime64 column types:

  • UUID columns require #[serde(with = "clickhouse::serde::uuid")] on the field
  • Enum8 columns require an i8 field, not &str
  • DateTime64 filter binds in WHERE clauses need wrapping in fromUnixTimestamp64Milli(?) — raw i64 won’t coerce to DateTime64 in CH expressions

Loadtest reports connect errors at high concurrency

Loopback ephemeral port exhaustion or kernel SYN backlog overflow. See loadtest.md — set MOCK_PORTS=64, RAMP_SECS=30, or enable HTTP2=1.