uptimepage
Async Rust service that runs HTTP and TCP health checks against a configurable set of targets, applies per-host circuit breaking, batches results, and ships them to durable storage. Targets persist in PostgreSQL; check results land in ClickHouse for high-cardinality time-series queries. Exposes a REST API for target CRUD and result queries, a server-rendered operator UI on the same port, and Prometheus metrics on a separate port.
Built on Rust 1.95 (edition 2024), Tokio, Axum, hyper-util (custom phase-timing connector + tokio-rustls), sqlx, and the official clickhouse crate. UI layer uses askama 0.16 + HTMX 2 + Tailwind 4 + ECharts 6, all served from the same binary. Designed for low-overhead checks at ~50k concurrent in-flight.
Where to start
- New to the project → Architecture for the big picture
- Integrating → REST API, or the MCP server for LLM clients
- Browsing the data → Web UI
- Running it → Deployment and Configuration
- Operating it → Metrics & tracing and Troubleshooting
- Benchmarking → Benchmarks (per-check micro) and Load test (end-to-end)
Source
github.com/uptimepage/uptimepage
Architecture
Goals
- Run periodic HTTP + TCP health checks against an arbitrary, mutable set of targets
- Stay below 50 ms p99 overhead per check (excluding network)
- Sustain ~50k concurrent in-flight checks per node
- Survive transient target failures (per-host circuit breakers) and storage flaps (in-process retry + batching)
- Graceful shutdown within 10 s without losing in-flight results
Module layout
src/
├── api/ REST handlers, router, OpenAPI doc, middleware
│ ├── docs.rs utoipa OpenApi descriptor (/api/openapi.json + /docs SwaggerUI)
│ ├── error.rs ApiError envelope + stable error code constants
│ ├── handlers/ one module per resource (targets, results, tags, dashboard, health)
│ ├── idempotency.rs DashMap-backed 24h cache + middleware for bulk + bulk-action
│ ├── middleware.rs charset=utf-8 rewriter
│ ├── page.rs PageEnvelope<T> + PageOfTarget / PageOfCheckResult / PageOfIncident / PageOfTagCount
│ ├── redaction.rs credential redaction wrapper
│ ├── routes.rs build_router + per-route layer wiring
│ └── types.rs wire types not in domain/ (TagCount, DashboardSummary, BulkActionRequest, TestRequest, ...)
├── app.rs AppState (storage + worker pool + caches)
├── bin/loadtest.rs in-process load test driver
├── config.rs typed configuration + env override loader
├── domain/ Target, CheckSpec, CheckResult, Incident + coalescing helper
├── error.rs AppError + IntoResponse → ApiError envelope
├── http_client/ custom hyper-util client + phase-timing connector + hickory resolver
├── observability/ tracing + Prometheus + OTLP setup
├── pipeline/ result batcher
├── scheduler/ target registry + per-target tick loop
├── storage/ Postgres (targets) + ClickHouse (results) + in-memory test doubles
├── web/ askama 0.16 + askama_web HTML routes (dashboard, targets, forms, error pages)
│ ├── routes.rs Router<AppState> merged into the main router in main.rs
│ ├── assets.rs rust-embed handler for /static/* with cache-control
│ ├── auth.rs session cookie scaffolding (v1.1 — no-op today)
│ ├── error.rs AppError → HTML error page mapper (not the JSON envelope)
│ └── views/ one module per page (dashboard, targets_list, targets_detail, targets_form)
└── worker/ worker pool + circuit breaker + check executors
templates/ askama HTML (compiled into the binary)
└── ... base.html, dashboard{,/region}.html, targets/{list,detail,form}.html, error/{404,500,503}.html
static/ rust-embed bundle
├── css/ Tailwind 4 build output (built by build.rs)
└── js/ HTMX 2 + json-enc + ECharts 6 + tiny UI/chart modules under ui/ and charts/
The web layer is a thin server-rendered surface on top of the existing JSON API: every UI mutation hits /api/v1/* (forms post JSON, list/detail uses HTMX swaps of partials). See ui.md for operator-level details.
Data flow
┌────────────────┐
│ REST API │ target CRUD
│ (axum + AppState)
└────────┬───────┘
│ writes
▼
┌────────────────┐
│ PostgreSQL │ target metadata
└────────┬───────┘
│ TargetRegistry.refresh() every N seconds
▼
┌────────────────┐
│ Scheduler │ one task per target, jittered tick
└────────┬───────┘
│ dispatch
▼
┌────────────────┐
│ WorkerPool │ semaphore-bounded, circuit-breaker-gated
│ ├── http_check (hyper-util + hickory DNS)
│ └── tcp_check (tokio::net::TcpStream)
└────────┬───────┘
│ CheckResult on mpsc channel
▼
┌────────────────┐
│ ResultBatcher │ size + timeout flush
└────────┬───────┘
│ write_batch
▼
┌────────────────┐
│ ClickHouse │ check_results + 1-min agg MV
└────────────────┘
On-demand checks (POST /targets/{id}/check-now and POST /targets/test) are dispatched
to an agent in the target’s region over the agent’s held long-poll, and the request waits
for the result. The agent persists check-now results (test results are returned but not
stored). If no agent is currently serving the region the request returns 503 PROBE_UNAVAILABLE.
Key design choices
- Two storage backends. Targets are low-cardinality, mutated by API operations → relational (Postgres) is the right fit. Results are append-only, high-cardinality, queried by time range → columnar (ClickHouse) keeps queries fast at 90-day retention.
- Fresh-connect HTTP checks, two TLS modes. HttpClients holds two rustls TlsConnectors (verifying and insecure) plus the shared DNS cache and SSRF guard. There is no connection pool: a monitor probes each target once per interval (a pool rarely reused a socket), and connecting fresh per check is what lets the probe time DNS resolve, TCP connect, and TLS handshake separately (timed_connect in src/http_client/connector.rs) and write those phases into each result. The request runs over hyper::client::conn (h1/h2 by ALPN); the connection task is aborted once the body is read. Per-target verify_tls picks the connector at dispatch time.
- Per-host circuit breakers. Failing hosts open their breaker quickly; subsequent checks fail fast with error=circuit_open without consuming a worker slot. Half-open probes after open_duration_secs.
- Per-tenant host throttle (bulkhead). A fail-fast semaphore caps how many in-flight checks one tenant can run against the same (host, port). Bursts beyond the cap are recorded as degraded with error="throttled: host concurrency cap" and do not fire alerts: the upstream is fine, the back-pressure is operator-side. The cap is keyed per-tenant so one customer’s burst can never starve another’s monitor of the same host. RDAP carries its own per-TLD cap so one slow registry can’t correlate failures across every customer’s daily domain-expiry check.
- Sticky last-good for domain-expiry probes. Each successful RDAP probe writes (expiry_at, registrar, last_success_at) to domain_expiry_state (PK target_id, denormalised org_id, FK CASCADE on the target). Every trait method requires OrgId and the row is filtered by both keys, so a handler taking target_id from request input cannot read another tenant’s row. A subsequent transient failure (RDAP timeout, throttle drop, registry 5xx, 404) does not flip the monitor: the executor reads the cached row and emits a CheckResult with the cached verdict. For Up the error field stays empty; for Degraded/Down it carries a served_stale: … annotation, so operators can tell the surface from a fresh probe. Cached rows older than 7d (measured against last_success_at, never advanced by failures) escalate to Error, which is alert-eligible. Cross-tenant singleflight (keyed by canonical domain) collapses concurrent probes for the same domain to one outbound request; RDAP is public registry data, so coalescing across tenants is safe and IANA-friendly.
- Bounded result channel. The mpsc between worker pool and batcher has a fixed buffer (storage.clickhouse.buffer_size). When full, the worker increments storage_dropped_total{reason="queue_full"} and drops the result. Back-pressure is explicit, not hidden.
- Idempotent migrations. Postgres uses sqlx::migrate! (tracked in _sqlx_migrations). ClickHouse migrations are bare CREATE TABLE IF NOT EXISTS statements run at startup. No external migrator.
- Shared DNS cache. A single hickory resolver instance is invoked directly by timed_connect; lookups cache per RFC TTL plus configurable bounds. Per-resolution latency is recorded into check_dns_ms.
- Cancellation tokens for shutdown. The root token is cloned to scheduler, batcher, sampler, idempotency pruner, and graceful axum shutdown. SIGINT/SIGTERM cancels root; subsystems drain in tokio::join!.
- Self-describing API. utoipa derives an OpenAPI 3.1 document at compile time, exposed at /api/openapi.json and rendered at /docs via Swagger UI. Every handler annotation carries at least one example. The 4xx/5xx error envelope and the list PageEnvelope are unified across every endpoint.
- In-process caches with bounded TTL. The dashboard summary holds a 5-second parking_lot::Mutex<Option<(Instant, DashboardSummary)>> to absorb operator polling. The Idempotency-Key cache is a DashMap keyed by (header, body-hash) with a 24-hour TTL; a background pruner sweeps expired entries hourly.
- Incident coalescing. A shared helper in domain/incident.rs consumes ordered (timestamp, status, error) tuples and emits Incident rows. Memory + ClickHouse storage call into the same logic; the ClickHouse path uses a narrow column projection to keep bandwidth low.
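The bounded result channel described in the list above can be sketched with the standard library alone. This is an illustration, not the project's code: the service runs on Tokio, so its channel is presumably a Tokio mpsc, and std::sync::mpsc::sync_channel is swapped in here only so the sketch needs no crates. The CheckResult fields, the DROPPED_QUEUE_FULL counter, and the function names are hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the real CheckResult type.
#[derive(Debug, Clone, PartialEq)]
pub struct CheckResult {
    pub target_id: u64,
    pub status: &'static str,
}

/// Stand-in for the storage_dropped_total{reason="queue_full"} counter.
pub static DROPPED_QUEUE_FULL: AtomicU64 = AtomicU64::new(0);

/// Creates the bounded channel; `buffer_size` plays the role of
/// storage.clickhouse.buffer_size.
pub fn result_channel(buffer_size: usize) -> (SyncSender<CheckResult>, Receiver<CheckResult>) {
    sync_channel(buffer_size)
}

/// Enqueues without blocking. Returns true if the result was queued,
/// false if it was dropped (queue full or batcher gone).
pub fn submit(tx: &SyncSender<CheckResult>, result: CheckResult) -> bool {
    match tx.try_send(result) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            // Explicit back-pressure: count the drop instead of waiting.
            DROPPED_QUEUE_FULL.fetch_add(1, Ordering::Relaxed);
            false
        }
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

The worker never awaits channel capacity, so a slow storage backend shows up as a rising drop counter rather than as stalled checks.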
Concurrency model
- One Tokio runtime, multi-threaded scheduler (default worker_threads = num_cpus)
- One Tokio task per active target in the scheduler: sleeps interval ± jitter, dispatches, sleeps again
- WorkerPool::execute spawns a new task per dispatch, gated by Arc<Semaphore> sized to max_concurrent_checks
- Batcher is a single task with tokio::select! over channel-recv, timeout, and cancellation
- Sampler is a single task that periodically reads gauge sources (pool semaphore counts, target count, breaker counts) and records into the metrics registry
Multi-region probes
By default one process is the whole system: it schedules and runs every check itself, in one region. A deployment can add regions by running extra processes as agents ([agent] enabled = true) — stateless probes with no database, web, or alerting. Each agent pulls its region’s decrypted monitor config from the control plane and POSTs results back; region is the partition key, so one agent per region needs no coordination. The control plane’s own region is a normal region row (scheduler.region), not a sentinel. Results carry their region + agent through both ClickHouse rollups, so reads can slice by region. Regions and agents are provisioned through the instance-admin /operator/* surface. See Multi-region probes for the full model, operator surface, and read-path behaviour.
REST API
Mounted under /api/v1 on the configured API bind. JSON in, JSON out. No authentication in v1 — bind to loopback or front it with a reverse proxy you trust.
OpenAPI 3.1 document at GET /api/openapi.json; Swagger UI at GET /docs.
All responses use Content-Type: application/json; charset=utf-8.
Response headers
- POST /api/v1/targets (201) sets Location: /api/v1/targets/{id} so clients can follow up without re-deriving the path.
- Cache-Control is stamped on every /api/v1/* response:
  - mutations (POST / PATCH / DELETE) → no-store
  - /api/v1/dashboard/summary → private, max-age=5 (matches the server-side cache)
  - all other reads → private, max-age=10
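The Cache-Control policy above amounts to a small lookup. The function below is a sketch of that mapping, not the project's middleware; the name cache_control is hypothetical, and treating every method other than GET and HEAD as a mutation is an assumption (the source lists only POST, PATCH, and DELETE).

```rust
/// Returns the Cache-Control value for an /api/v1/* response,
/// or None for paths outside that prefix.
pub fn cache_control(method: &str, path: &str) -> Option<&'static str> {
    if !path.starts_with("/api/v1/") {
        return None;
    }
    match method {
        // The dashboard summary matches its 5-second server-side cache.
        "GET" | "HEAD" if path == "/api/v1/dashboard/summary" => Some("private, max-age=5"),
        // All other reads.
        "GET" | "HEAD" => Some("private, max-age=10"),
        // Assumption: anything that is not a read is a mutation.
        _ => Some("no-store"),
    }
}
```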
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/targets | create one target |
| POST | /api/v1/targets/bulk | bulk-create up to 10,000 targets |
| POST | /api/v1/targets/bulk-action | enable / disable / delete / tag-add / tag-remove on many ids |
| POST | /api/v1/targets/test | run a one-shot check against a CheckSpec without persisting |
| POST | /api/v1/targets/{id}/check-now | run an immediate check using the target’s stored credentials |
| GET | /api/v1/targets | list targets (limit, offset, tag, enabled, q), paginated |
| GET | /api/v1/targets/{id} | get one target |
| PATCH | /api/v1/targets/{id} | update name, check spec, interval, enabled, tags |
| DELETE | /api/v1/targets/{id} | delete a target |
| GET | /api/v1/targets/{id}/results | recent check results (from, to, limit, offset, region), paginated |
| GET | /api/v1/targets/{id}/latency | bucketed latency series (from, to, region): server-side quantiles + per-phase means |
| GET | /api/v1/targets/{id}/latency/by-region | per-region latency series (from, to): one series per region, for overlay charts |
| GET | /api/v1/targets/{id}/uptime | uptime summary over a range (from, to, region) |
| GET | /api/v1/targets/{id}/regions | list the regions a monitor probes from |
| PUT | /api/v1/targets/{id}/regions | set the regions a monitor probes from |
| GET | /api/v1/regions | list the enabled probe-region catalog (id, name, location) |
| GET | /api/v1/targets/{id}/incidents | coalesced incident periods (from, to, ongoing_only), paginated |
| POST | /api/v1/targets/{id}/shares | mint a read-only share link; returns the share (token included) |
| GET | /api/v1/targets/{id}/shares | list a monitor’s live share links (token included, re-copyable) |
| DELETE | /api/v1/targets/{id}/shares/{share_id} | revoke a share link |
| GET | /api/v1/tags | tag inventory with target counts (q prefix), paginated |
| GET | /api/v1/dashboard/summary | per-org rollup (5-second in-process cache, keyed by OrgId) |
| GET | /healthz | liveness: always 200 once the process is up |
| GET | /readyz | readiness: pings the target store; 503 if unreachable |
| GET | /api/openapi.json | OpenAPI 3.1 document |
| GET | /docs | Swagger UI |
Instance-admin and agent surfaces
Two surfaces sit outside /api/v1 with their own auth, used only for multi-region deployments:
- /operator/*: instance-admin regions + agents CRUD, gated by a static bearer secret (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); 404s when unset.
- /api/agent/*: the pull/ingest endpoints an agent uses, authenticated by its sm_agent_… token (not a tenant api_token).
Both are documented in Multi-region probes.
Operator endpoints (maintenance + incident narration)
These mutate the public surface; they live under the same auth boundary as
/api/v1/targets. Operator workflow + validation rules in
Public status page.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/maintenance | schedule a maintenance window |
| GET | /api/v1/maintenance | list windows (status=active\|upcoming\|past\|all, paginated) |
| GET | /api/v1/maintenance/{id} | get one window |
| PATCH | /api/v1/maintenance/{id} | edit title / description / time range / components (rejected after ends_at) |
| DELETE | /api/v1/maintenance/{id} | cancel a window |
| PATCH | /api/v1/incidents/{id} | update narration: public_title, public_description, severity (JSON null clears, omit to leave alone) |
| POST | /api/v1/incidents/{id}/updates | append a status update; phase ∈ investigating/identified/monitoring/resolved/postmortem, message ≤ 2 000 chars |
Operator endpoints (status pages)
An org owns one or more public status pages, each with its own slug, branding,
and curated set of monitors. Reads are open to any active member; every mutation
is owner-only. Scoped to the caller’s active org (a foreign page id is 404).
Adding a monitor already on the page returns 409 COMPONENT_ALREADY_ON_PAGE —
edit it with PATCH. Model + caps in Per-org status pages.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/status-pages | list this org’s pages |
| POST | /api/v1/status-pages | create a page (capped at max_status_pages; slug globally unique) |
| GET | /api/v1/status-pages/{id} | one page + its live URL and logo URL |
| PATCH | /api/v1/status-pages/{id} | rename, change slug, publish/unpublish, edit branding |
| DELETE | /api/v1/status-pages/{id} | delete the page |
| GET | /api/v1/status-pages/{id}/components | the monitors curated onto the page |
| POST | /api/v1/status-pages/{id}/components | add a monitor (distinct-target cap max_public_components) |
| PATCH | /api/v1/status-pages/{id}/components/{target_id} | per-page public_name / public_description / public_group (JSON null clears) |
| DELETE | /api/v1/status-pages/{id}/components/{target_id} | remove a monitor from the page |
| POST | /api/v1/status-pages/{id}/components/reorder | set component order |
| POST | /api/v1/status-pages/{id}/logo | upload a logo (multipart) |
| DELETE | /api/v1/status-pages/{id}/logo | remove the logo |
Public status endpoints
Unauthenticated; mounted at /api/public/v1/* and bypassed at Caddy via the
@public matcher (see Deployment).
Each response carries Cache-Control: public, max-age=10, stale-while-revalidate=30. A monitor not curated onto the page being
served is invisible on every public surface — direct lookups return 404
and it never appears in any list. Wire types literally cannot serialise
sensitive target fields (url, headers, basic_auth, bearer_token).
| Method | Path | Purpose |
|---|---|---|
| GET | /status | server-rendered HTML status page (?fragment=1 returns the dynamic region only) |
| GET | /status/incidents/{id} | per-incident detail page |
| GET | /api/public/v1/status | the same data as /status in JSON |
| GET | /api/public/v1/components/{id}/history | per-component 90-day history (days query, default 90, max 90) |
| GET | /api/public/v1/incidents | recent public incidents (paginated) |
| GET | /api/public/v1/incidents/{id} | one public incident with its update timeline |
| GET | /api/public/v1/incidents.rss | RSS 2.0 feed of recent incidents |
| GET | /api/public/v1/maintenance | active + upcoming maintenance windows |
| GET | /api/public/v1/badge.svg | embeddable SVG status badge (overall, or ?component={id}) |
See Public status page for the operator workflow and
the per-page component fields (public_name, public_description,
public_group, sort_order) that drive what’s published.
Operator endpoints (share links)
A share link is a capability URL that renders one monitor’s full read-only detail view to anyone who has it, with no account required. Managing share links (mint, list, revoke) is a monitor action gated on member-level targets:write (not owner-only). Listing returns the live token, so the write gate is what keeps a read-only caller from harvesting working public links. Scoped to the caller’s active org (a foreign monitor id is 404). expires_at is optional; omit it for a link that never expires. The public surface those tokens unlock is documented in Share links.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/targets/{id}/shares | mint a share; body { "label"?, "expires_at"? }, returns the MonitorShare |
| GET | /api/v1/targets/{id}/shares | list live (non-revoked) shares |
| DELETE | /api/v1/targets/{id}/shares/{share_id} | revoke immediately; the link 404s on its next request |
Both POST and GET return the token; build the link as /m/{token} (prepend your origin). The token stays re-copyable — it is stored encrypted at rest (the app KEK, same as basic_auth/bearer_token); the public resolve path matches on a separate hash, so a hot link never triggers a decrypt. token is null only when a row was sealed under a KEK that is no longer configured. Two plan caps apply (columns on plans, overridable per-org via plan_overrides): max_share_links_per_monitor (active links on one monitor) and max_shared_monitors (distinct monitors in the org that have any link). The free plan is 1 and 2. Exceeding either is 422 QUOTA_EXCEEDED (the body names the quota). A label longer than 80 characters is 400 SHARE_LABEL_INVALID; an expires_at in the past is 400 INVALID_EXPIRY.
Check specs
Tagged enum, type discriminator.
HTTP
{
"type": "http",
"url": "https://example.com/healthz",
"method": "GET",
"timeout": 5000, // ms, total request budget
"follow_redirects": false,
"max_redirects": 0,
"expected_status": { "kind": "exact", "value": 200 },
"expected_body_contains": null, // optional substring match
"headers": {},
"body": null,
"verify_tls": true,
"basic_auth": null, // ["user", "pass"] or null
"bearer_token": null
}
Credential redaction
GET, POST, PATCH, and bulk responses replace populated basic_auth / bearer_token fields with the sentinel "***". A null field stays null, so clients can distinguish “auth is configured” from “no auth”. When you PATCH a target’s check, you must re-supply the real credential — a body that contains "***" is rejected with 400 Bad Request. If you only need to change other fields (name, tags, enabled, interval), omit check from the PATCH body. Encryption at rest is gated on security.credentials_kek_base64; the redaction behavior applies in either mode.
expected_status variants:
{ "kind": "exact", "value": 200 }
{ "kind": "range", "value": { "min": 200, "max": 299 } }
{ "kind": "one_of", "value": [200, 204] }
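On the Rust side, the three variants map naturally onto an enum. The sketch below shows only the matching logic, with serialization left out; the type and method names are hypothetical, and treating the range bounds as inclusive is an assumption suggested by the 200 to 299 example.

```rust
/// Hypothetical mirror of the expected_status wire variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ExpectedStatus {
    Exact(u16),
    Range { min: u16, max: u16 },
    OneOf(Vec<u16>),
}

impl ExpectedStatus {
    /// True when the response status code satisfies the expectation.
    pub fn matches(&self, code: u16) -> bool {
        match self {
            ExpectedStatus::Exact(value) => code == *value,
            // Assumption: both bounds are inclusive.
            ExpectedStatus::Range { min, max } => (*min..=*max).contains(&code),
            ExpectedStatus::OneOf(codes) => codes.contains(&code),
        }
    }
}
```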
Rate-limited responses
A response with 429 Too Many Requests or 503 Service Unavailable is recorded as degraded, not down — the upstream is telling us “I’m here, back off.” The error field carries rate-limited <code> (Retry-After: <value>) when the header is present so operators can size the polling interval against what the upstream actually wants. A check that explicitly accepts 429 / 503 via expected_status is honored first and stays up.
Some third-party APIs rate-limit by source IP regardless. GitHub’s unauthenticated REST API is the canonical case: 60 req/h per IP, 5 000 req/h with a token in the Authorization header. Poll those endpoints at ≥ 300 s, or attach the token via a header in this spec.
Per-host throttle
The worker side caps the number of concurrent checks one tenant can run against the same (host, port) so a burst of monitors against one upstream doesn’t look like a probe. When the cap is reached, the over-cap check is recorded as degraded with error="throttled: host concurrency cap" and no alert fires: the upstream is fine, the back-pressure is operator-side. The cap is per-tenant: one customer’s burst never starves another customer’s monitor of the same host. Default cap is two in-flight per (org, host, port); tune via checker.per_host_max_inflight. RDAP queries (domain expiry) carry their own per-TLD cap via checker.rdap_max_inflight.
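A fail-fast cap of this kind can be sketched as a counter map with an RAII permit. This is an illustration under stated assumptions, not the worker's implementation: HostThrottle, Permit, and try_acquire are hypothetical names, and a mutex-guarded HashMap stands in for whatever semaphore structure the service uses.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// (org, host, port): the throttle key named in the text.
type Key = (String, String, u16);

#[derive(Clone)]
pub struct HostThrottle {
    cap: usize,
    inflight: Arc<Mutex<HashMap<Key, usize>>>,
}

/// Held for the duration of one check; releases its slot on drop.
pub struct Permit {
    key: Key,
    inflight: Arc<Mutex<HashMap<Key, usize>>>,
}

impl HostThrottle {
    pub fn new(cap: usize) -> Self {
        Self { cap, inflight: Arc::new(Mutex::new(HashMap::new())) }
    }

    /// Fail-fast: returns None at the cap instead of waiting. The caller
    /// would then record the check as degraded ("throttled: host concurrency cap").
    pub fn try_acquire(&self, org: &str, host: &str, port: u16) -> Option<Permit> {
        let key = (org.to_string(), host.to_string(), port);
        let mut map = self.inflight.lock().unwrap();
        let current = map.get(&key).copied().unwrap_or(0);
        if current >= self.cap {
            return None;
        }
        map.insert(key.clone(), current + 1);
        Some(Permit { key, inflight: Arc::clone(&self.inflight) })
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        if let Ok(mut map) = self.inflight.lock() {
            let now_idle = match map.get_mut(&self.key) {
                Some(count) => {
                    *count -= 1;
                    *count == 0
                }
                None => false,
            };
            // Remove idle keys so the map does not grow without bound.
            if now_idle {
                map.remove(&self.key);
            }
        }
    }
}
```

Because the org is part of the key, two tenants probing the same host draw from separate budgets, which is the isolation property the paragraph above describes.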
TCP
{ "type": "tcp", "host": "db.internal", "port": 5432, "timeout": 2000 }
TLS certificate expiry
{
"type": "tls_cert",
"host": "example.com",
"port": 443,
"server_name": null, // optional SNI override; defaults to `host`
"warn_days": 14,
"critical_days": 7,
"timeout": 5000
}
Opens a TCP connection, performs a TLS handshake against the host (accepting any presented chain so that expired or self-signed certs can still be inspected), and parses the leaf certificate’s notAfter. Status mapping:
- days_remaining < 0 (expired) → down
- days_remaining < critical_days → down
- days_remaining < warn_days → degraded
- otherwise → up
error carries a JSON document with days_remaining, not_after, subject_common_name, issuer_common_name. A handshake failure (plain-TCP host, network error) returns error status with the underlying message. warn_days must be strictly greater than critical_days. Floor is interval >= 3600 (enforced); default for a new monitor is 86400 (daily).
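The status mapping and the threshold rule above fit in a few lines. The sketch below is illustrative; Status, expiry_status, and validate_thresholds are hypothetical names, and the error status for handshake failures is left out because it does not depend on days_remaining.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Up,
    Degraded,
    Down,
}

/// Maps days until notAfter onto a check status, in the order the
/// rules are listed: expired, critical, warning, otherwise up.
pub fn expiry_status(days_remaining: i64, warn_days: i64, critical_days: i64) -> Status {
    if days_remaining < 0 || days_remaining < critical_days {
        Status::Down
    } else if days_remaining < warn_days {
        Status::Degraded
    } else {
        Status::Up
    }
}

/// warn_days must be strictly greater than critical_days.
pub fn validate_thresholds(warn_days: i64, critical_days: i64) -> Result<(), String> {
    if warn_days > critical_days {
        Ok(())
    } else {
        Err("warn_days must be > critical_days".to_string())
    }
}
```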
Domain expiration
{
"type": "domain_expiry",
"domain": "example.com",
"warn_days": 30,
"critical_days": 7,
"timeout": 10000
}
Queries the IANA RDAP bootstrap registry to find the authoritative RDAP server for the domain’s TLD, then fetches /domain/<domain> and reads the events[?eventAction == "expiration"] entry. Status mapping is the same as TLS cert: < critical_days → down, < warn_days → degraded, else up. Non-up results carry a JSON error body with domain, days_remaining, expiration_date, and (when present) registrar.
The bootstrap registry is fetched lazily on the first lookup and cached for the lifetime of the process. The SSRF guard does not apply — the check’s network destination is an IANA-published RDAP server, not the user-supplied domain. Floor is interval >= 3600 (enforced); default for a new monitor is 86400 (daily). RDAP servers rate-limit clients — keep this near daily, not hourly. warn_days must be strictly greater than critical_days.
Target payload
{
"name": "internal-api",
"check": { /* check spec */ },
"interval": 60, // seconds between ticks; effective floor is
// max(plan.min_check_interval_secs, kind_min).
// kind_min is 10 for http/tcp/dns and 3600 for
// tls_cert/domain_expiry. Plan-free min = 60.
// 10 is the absolute DB CHECK hard floor.
"enabled": true,
"tags": ["prod", "tier1"],
"alerts": { /* optional, see below */ }
}
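The interval floor described in the payload comments is a max of two minimums. A minimal sketch, assuming the server rejects a too-small interval rather than clamping it (the source says only that the floor is enforced); CheckKind and the function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckKind {
    Http,
    Tcp,
    Dns,
    TlsCert,
    DomainExpiry,
}

/// Per-kind minimum interval in seconds, as listed in the payload comments.
pub fn kind_min_secs(kind: CheckKind) -> u64 {
    match kind {
        CheckKind::Http | CheckKind::Tcp | CheckKind::Dns => 10,
        CheckKind::TlsCert | CheckKind::DomainExpiry => 3600,
    }
}

/// Effective floor = max(plan.min_check_interval_secs, kind_min).
pub fn effective_floor_secs(plan_min_secs: u64, kind: CheckKind) -> u64 {
    plan_min_secs.max(kind_min_secs(kind))
}

/// Assumption: an interval below the floor is rejected, not clamped.
pub fn validate_interval(interval: u64, plan_min_secs: u64, kind: CheckKind) -> Result<u64, String> {
    let floor = effective_floor_secs(plan_min_secs, kind);
    if interval >= floor {
        Ok(interval)
    } else {
        Err(format!("interval must be >= {floor}"))
    }
}
```

With the free plan's minimum of 60, an http monitor bottoms out at 60 seconds and a tls_cert monitor at 3600.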
Server returns the full Target including id (UUIDv7), created_at, updated_at, and write_source.
write_source is a read-only field recording where the resource was last
written from: ui, api, or terraform (decided server-side from the
request, never the body — sending it is ignored). It also appears on
notification channels and maintenance windows, and drives the “managed by”
badge in the web UI. A write through any endpoint restamps it, so it reflects
the most recent author.
Alert config
alerts is an optional array of channel bindings. Each binding is just a
reference to a notification channel (see
Notification channels); the firing policy lives on
the monitor itself. An empty/omitted array disables channel alerting for that
target (incidents still open and show on status pages).
"alerts": [
{ "channel_id": "0192a1ce-0000-7000-8000-000000000001" },
{ "channel_id": "0192a1ce-0000-7000-8000-000000000002" }
],
"alert_confirmations": 3,
"notify_recovery": true,
"renotify_interval_secs": 3600,
"region_policy": "majority"
- channel_id: id of a notification channel owned by the same org. A binding to an unknown or another tenant’s channel is rejected.
- alert_confirmations: consecutive failing checks before an incident opens (and the same number of passing checks before it closes, which damps flapping). Default 2, must be >= 1.
- notify_recovery: when true (default), the recovery is announced to the monitor’s channels. When false, recovery is silent.
- renotify_interval_secs: seconds between reminder notifications while an outage stays unacknowledged. 0 disables reminders; otherwise must be >= 60. Default 3600. Acknowledging or resolving the incident stops the reminders.
- region_policy: how many probe regions must agree the target is down before an incident opens. One of "any", "majority" (default), "all", or { "count": N }.
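How region_policy might be evaluated can be sketched as follows. The names are hypothetical, and reading "majority" as strictly more than half of the probing regions is an assumption; the source does not say how ties are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionPolicy {
    Any,
    Majority,
    All,
    Count(usize),
}

/// Decides whether enough regions agree the target is down.
/// `down` is the number of regions reporting down, `total` the number probing.
pub fn regions_agree_down(policy: RegionPolicy, down: usize, total: usize) -> bool {
    if total == 0 {
        return false;
    }
    match policy {
        RegionPolicy::Any => down >= 1,
        // Assumption: strictly more than half; an even split does not open.
        RegionPolicy::Majority => down * 2 > total,
        RegionPolicy::All => down == total,
        // N = 0 or N > total is rejected at validation time, so N >= 1 here.
        RegionPolicy::Count(n) => down >= n,
    }
}
```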
Notifications are driven by the incident engine: one notification per
incident open (then reminders per renotify_interval_secs), one on recovery.
Failed deliveries retry on exponential backoff and dead-letter after the
attempt cap; per-incident delivery state is visible at
GET /api/v1/incidents/{id}/notifications.
Alert validation errors
POST and PATCH return 400 Bad Request (INVALID_ALERT_CONFIG) for:
- a duplicate channel_id in the array
- notification channel <id> does not exist (unknown id, or one owned by another org)
- alert_confirmations must be >= 1
- renotify_interval_secs must be 0 (off) or at least 60
A region_policy of { "count": N } where N is 0 or exceeds the
available regions is 422 INVALID_REGION_POLICY.
Validation errors
POST and PATCH return 400 Bad Request for:
- Unsupported URL scheme (url scheme '...' not allowed; only http and https)
- Missing URL host, empty TCP host, or TCP/TLS port 0
- tls_cert warn_days must be > critical_days
- domain_expiry domain must contain a TLD label (no dot in domain)
- domain_expiry warn_days must be > critical_days
- SSRF guard: target address ... is in a blocked range. Triggered when the URL or TCP host is an IP literal that resolves to loopback / private / link-local / reserved space (see Configuration → security.allow_private_targets). Hostname literals are checked again at connect time after DNS resolution, so DNS rebinding cannot bypass the guard.
- Redaction sentinel: basic_auth contains redaction sentinel — re-supply the real credential, or the equivalent for bearer_token. Rejected to prevent a GET → PATCH round-trip from silently overwriting the stored credential with "***".
- TLS verification + credentials: verify_tls = false cannot be combined with basic_auth or bearer_token over https. When verification is disabled any host presenting a forged certificate can collect the stored credential on every check interval. Set verify_tls = true (recommended) or remove the credential from the target.
Notification channels
Per-org delivery destinations that targets bind to via their alerts array.
Org scoping is implicit in the caller’s authenticated context — one tenant can
never read, mutate, or test another’s channels.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/notification-channels | Create a channel (201 + Location) |
| GET | /api/v1/notification-channels | List the org’s channels |
| GET | /api/v1/notification-channels/{id} | Get one |
| PATCH | /api/v1/notification-channels/{id} | Partial update |
| DELETE | /api/v1/notification-channels/{id} | Delete (204); also removes the channel’s alert bindings from every monitor |
| POST | /api/v1/notification-channels/test | Test an unsaved transport config |
| POST | /api/v1/notification-channels/{id}/test | Send a synthetic test alert through a saved channel |
| POST | /api/v1/notification-channels/{id}/resend-verification | Resend the verification mail for an unverified email channel |
{
"name": "Ops Slack",
"enabled": true,
"config": { "type": "slack", "webhook_url": "https://hooks.slack.com/services/T/B/XXXX" }
}
config is type-tagged. Supported transports:
- slack: { "type": "slack", "webhook_url": "https://…" } (incoming webhook; posts { "text": "…" })
- discord: { "type": "discord", "webhook_url": "https://discord.com/api/webhooks/…" } (channel webhook; posts { "content": "…" } with ?wait=true so delivery failures surface synchronously; text capped at 2000 chars)
- msteams: { "type": "msteams", "webhook_url": "https://….logic.azure.com/…" } (Teams Workflows webhook; posts an Adaptive Card. Retired O365 connector URLs are not accepted)
- google_chat: { "type": "google_chat", "webhook_url": "https://chat.googleapis.com/v1/spaces/…" } (space webhook; posts { "text": "…" }, capped at 4096 chars)
- webhook: { "type": "webhook", "url": "https://…", "headers": { … }, "secret": "…" } (POSTs the alert JSON; optional custom headers; optional signing secret, see below). The escape hatch: no host restrictions, for services the named kinds don’t cover
- telegram: { "type": "telegram", "bot_token": "…", "chat_id": "…" } (bring-your-own bot)
- telegram_app: { "type": "telegram_app", "chat_id": "…", "chat_title": "…" }, linked through the platform’s central bot. Not creatable from request bodies: a POST/PATCH/test carrying this kind returns 422 CHANNEL_KIND_MANAGED (the chat id rides the operator bot’s credentials, so accepting one would let any caller page an arbitrary chat). Channels of this kind are created only by the link-code flow below.
- whatsapp: { "type": "whatsapp", "access_token": "…", "phone_number_id": "…", "to": "…", "template_name": "…", "language_code": "en" } (Business Cloud API; language_code optional, default en)
- whatsapp_app: { "type": "whatsapp_app", "phone": "…", "profile_name": "…" }, linked through the platform’s WhatsApp number. Not creatable from request bodies (422 CHANNEL_KIND_MANAGED, same rationale as telegram_app); created only by the WhatsApp link-code flow below.
- pagerduty: { "type": "pagerduty", "routing_key": "…" } (the 32-character Events API v2 integration key of a PagerDuty service). The only transport that drives the destination’s own incident lifecycle: opens/reopens/escalations send trigger and resolution sends resolve, all correlated by dedup_key = the incident id, so one uptimepage incident maps to exactly one PagerDuty alert that opens and closes with it. Severity maps Critical→critical, Major→error, Minor→warning. A test send fires a trigger + resolve pair on a throwaway dedup key and never leaves an open PagerDuty incident
- ntfy: { "type": "ntfy", "server_url": "https://ntfy.sh", "topic": "…", "access_token": "tk_…" } (JSON publish to the server root; server_url optional, defaults to ntfy.sh, must be the bare server root; access_token optional, sent as a Bearer token). High-urgency opens publish at priority 4, the rest at 3; resolves tag white_check_mark, opens rotating_light. On ntfy.sh an unprotected topic’s name is its only access control
- pushover: { "type": "pushover", "token": "…", "user": "…", "device": "…" } (30-character application token and user/group key, both treated as secrets; device optional). High-urgency alerts go out at priority 1 (bypasses quiet hours), low at 0, resolves at −1 (no sound). Emergency priority 2 is not used
- sms: { "type": "sms", "provider": "twilio", "to": "+15551234567", "from": "…", … }, a bring-your-own SMS gateway; one text message per alert, body trimmed to a few segments to bound per-segment cost. to is E.164; from is an E.164 number or sender id. The provider-specific credentials are: twilio → account_sid + auth_token; telnyx → api_key (+ optional messaging_profile_id); vonage → api_key + api_secret; plivo → auth_id + auth_token; sinch → service_plan_id + api_token + region (us/eu/au/br/ca, default us). Only the gateway secret is treated as a secret (Twilio/Plivo auth_token, Telnyx api_key, Vonage api_secret, Sinch api_token); account identifiers stay visible
- email: { "type": "email", "to": "oncall@example.com" }, one lowercase address per channel, delivered through the platform’s transactional sender. Verification-gated: the channel is created unverified and a mail with a single-use 24 h link is sent to the address; until the link is confirmed every delivery (incident page or test send) fails with email address not verified. Replacing the config resets the gate and re-sends the mail. POST /api/v1/notification-channels/{id}/resend-verification re-sends it (capped per channel and per org per day → 422 CHANNEL_VERIFICATION_LIMIT; on a non-email channel → 422 CHANNEL_NOT_VERIFIABLE); a test against an unverified or unsaved email config is 422 CHANNEL_UNVERIFIED.
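The per-transport mappings described in the pagerduty, ntfy and pushover entries above fit in a few match arms. This is a minimal sketch, not the service's code: the Severity and Transition enums and every function name are hypothetical, and treating only a first open (not a reopen or escalation) as an "open" for ntfy is an assumption.

```rust
/// Hypothetical alert vocabulary; the names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Severity { Critical, Major, Minor }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Transition { Open, Reopen, Escalate, Resolve }

/// PagerDuty severity: Critical→critical, Major→error, Minor→warning.
pub fn pagerduty_severity(s: Severity) -> &'static str {
    match s {
        Severity::Critical => "critical",
        Severity::Major => "error",
        Severity::Minor => "warning",
    }
}

/// Opens, reopens and escalations send `trigger`; resolution sends `resolve`.
pub fn pagerduty_event_action(t: Transition) -> &'static str {
    match t {
        Transition::Resolve => "resolve",
        _ => "trigger",
    }
}

/// ntfy priority: 4 for high-urgency opens, 3 for everything else.
/// Assumption: only `Transition::Open` counts as an "open" here.
pub fn ntfy_priority(t: Transition, high_urgency: bool) -> u8 {
    if t == Transition::Open && high_urgency { 4 } else { 3 }
}

/// Pushover priority: 1 for high urgency, 0 for low, -1 for resolves.
pub fn pushover_priority(t: Transition, high_urgency: bool) -> i8 {
    match (t, high_urgency) {
        (Transition::Resolve, _) => -1,
        (_, true) => 1,
        (_, false) => 0,
    }
}
```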
Webhook signing. When a webhook channel carries a secret (≥ 16
characters), every delivery is signed: the request includes
X-Uptimepage-Timestamp (unix seconds) and
X-Uptimepage-Signature: sha256=<hex>, where the hex is
HMAC-SHA256(secret, "{timestamp}.{body}") over the exact bytes sent.
Receivers should recompute the digest and reject stale timestamps (e.g.
older than 5 minutes) to block replays. Channels without a secret deliver
unsigned.
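On the receiving side, the checks around the digest need only the standard library. The sketch below covers building the signed bytes, the staleness check, parsing the header, and a constant-time comparison. The HMAC-SHA256 step itself is deliberately left out because std has no SHA-256; a receiver would typically compute it with crates such as hmac and sha2. All function names are hypothetical, and rejecting timestamps from the future is an assumption of this sketch.

```rust
/// Bytes the sender signs: "{timestamp}.{body}" over the exact body bytes.
pub fn signed_payload(timestamp: u64, body: &[u8]) -> Vec<u8> {
    let mut out = timestamp.to_string().into_bytes();
    out.push(b'.');
    out.extend_from_slice(body);
    out
}

/// True when the timestamp is within `max_age_secs` of `now`.
/// Assumption: timestamps too far in the future are rejected as well.
pub fn is_fresh(timestamp: u64, now: u64, max_age_secs: u64) -> bool {
    now.abs_diff(timestamp) <= max_age_secs
}

/// Parse an `X-Uptimepage-Signature` value of the form `sha256=<hex>`
/// into the 32 raw digest bytes; None on malformed input.
pub fn parse_signature(header: &str) -> Option<Vec<u8>> {
    let hex = header.strip_prefix("sha256=")?;
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Compare two digests without an early exit on the first differing byte.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}
```

A receiver would check is_fresh first, compute the HMAC over signed_payload(timestamp, body) with the shared secret, and compare it to parse_signature(header) with constant_time_eq.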
WhatsApp templates. Create a one-parameter utility template (body
{{1}}) in the WhatsApp Business Manager and set template_name (plus
language_code, which must match the template’s exact language — en
and en_US are distinct). The alert text is sent as that single
parameter, collapsed to one line. A template is required: WhatsApp
accepts free-form text only within 24 hours of the recipient’s last
message, and out-of-window sends are accepted by the API yet dropped
asynchronously — a silent-loss mode an alerting channel must not have.
Behaviour:
- Secrets sealed at rest with the credentials KEK; never echoed back. Every read path masks secret-bearing fields with *** (the webhook URL is masked whole, since it can carry a token; header names and chat_id are kept so the UI stays useful).
- Redaction-sentinel guard: submitting a config that still contains *** returns 400 REDACTION_SENTINEL. Omit config on PATCH to keep the stored secret unchanged.
- Validation (400): every webhook URL must be https; the provider-branded kinds are additionally host-pinned (discord → discord.com / discordapp.com with an /api/webhooks/ path, msteams → *.logic.azure.com / *.powerplatform.com, google_chat → chat.googleapis.com) and a URL elsewhere is rejected with a hint to use the generic webhook kind; telegram requires non-empty bot_token and chat_id; whatsapp requires access_token, a numeric phone_number_id, an international-format to, and a template_name (lowercase/digits/underscore); email requires a lowercase single-address to; pagerduty requires a 32-char alphanumeric routing_key; ntfy requires an https root-only server_url and a 1–64 char topic (letters/digits/_/-); pushover requires 30-char alphanumeric token and user; sms requires an E.164 to, a from, and the selected provider’s credentials (Twilio account_sid is AC + 32 hex; Plivo auth_id and Sinch service_plan_id are alphanumeric; Sinch region is one of us/eu/au/br/ca); channel name is required and ≤ 100 chars.
- Destination deny-list: the customer-controlled outbound URL (slack/discord/msteams/google_chat/webhook/ntfy’s server_url) is checked against the platform’s abuse deny-list on create, update, and both test endpoints; a match is rejected (ABUSE_BLOCKED/DOMAIN_DENYLISTED). telegram/whatsapp/email/pagerduty/pushover/sms deliver to fixed vendor endpoints.
- Quota: capped per org by the plan’s max_notification_channels (atomic, advisory-locked). A duplicate name within the org is 422 CHANNEL_NAME_TAKEN; the cap is 422 CHANNEL_QUOTA_EXCEEDED.
- Test sends deliver one clearly-labelled synthetic alert. The per-channel form tests the stored config (works on a disabled channel too); the collection-level POST …/test takes { "config": { … } } in the body, validates it exactly as create would, and persists nothing; the UI uses it for “test now” before a channel is saved. A transport failure is 422 CHANNEL_TEST_FAILED. Both count against the test_now rate-limit bucket.
- Platform disables: when a linked Telegram chat unlinks from its side (the bot is removed, or the chat sends /stop), every channel linked to that chat is disabled with a disabled_reason the UI shows. Re-enabling the channel clears the note.
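A few of the format rules from the validation bullet above can be written as plain predicates. This is a rough sketch under stated assumptions: the function names are hypothetical, "letters" and "alphanumeric" are read as ASCII, and the real validator may apply further checks.

```rust
/// pagerduty: routing_key is 32 alphanumeric characters (ASCII assumed).
pub fn valid_pagerduty_routing_key(key: &str) -> bool {
    key.len() == 32 && key.bytes().all(|b| b.is_ascii_alphanumeric())
}

/// pushover: token and user are each 30 alphanumeric characters.
pub fn valid_pushover_key(key: &str) -> bool {
    key.len() == 30 && key.bytes().all(|b| b.is_ascii_alphanumeric())
}

/// ntfy: topic is 1 to 64 characters of letters, digits, `_` or `-`.
pub fn valid_ntfy_topic(topic: &str) -> bool {
    (1..=64).contains(&topic.len())
        && topic
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}

/// sms/twilio: account_sid is "AC" followed by 32 hex digits.
pub fn valid_twilio_account_sid(sid: &str) -> bool {
    match sid.strip_prefix("AC") {
        Some(rest) => rest.len() == 32 && rest.bytes().all(|b| b.is_ascii_hexdigit()),
        None => false,
    }
}
```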
Telegram one-tap linking
Deployments running the central bot expose a link-code flow; on deployments without it these endpoints return 404 TELEGRAM_LINK_NOT_FOUND:
- POST /api/v1/notification-channels/telegram-link (channels:write) with an optional { "name": "…" } hint mints a single-use code (15-minute expiry, capped outstanding codes per org → 422 TELEGRAM_LINK_LIMIT). The response carries the raw code (shown once, only its hash is stored), a deep_link (t.me/<bot>?start=<code>, private chat) and a group_deep_link (?startgroup=<code>, picks a group). The same code works for either destination.
- Sending the code to the bot (tap Start, or /link <code> in a group) creates the telegram_app channel for the minting org. The org is resolved only from the code, never from the Telegram payload.
- GET /api/v1/notification-channels/telegram-link/{id} (channels:read) polls the code: pending, consumed (with channel_id), or expired.
- Unlink = delete the channel; deleting the last channel linked to a group also walks the bot out of that group. From the chat side, /stop or removing the bot disables the channel (see platform disables above).
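The three states the polling endpoint reports can be derived from the mint time and whether a channel was created. A minimal sketch with hypothetical names; that a consumed code keeps reporting consumed after its 15 minutes are up, and that expiry is inclusive at exactly 15 minutes, are both assumptions.

```rust
#[derive(Debug, PartialEq)]
pub enum LinkStatus {
    Pending,
    Consumed { channel_id: String },
    Expired,
}

/// 15-minute expiry for link codes.
pub const LINK_TTL_SECS: u64 = 15 * 60;

/// `minted_at` and `now` are unix seconds; `consumed_by` is the id of the
/// channel created when the code was redeemed, if any.
pub fn link_status(minted_at: u64, now: u64, consumed_by: Option<&str>) -> LinkStatus {
    match consumed_by {
        // Assumption: consumption wins over expiry.
        Some(id) => LinkStatus::Consumed { channel_id: id.to_string() },
        None if now >= minted_at.saturating_add(LINK_TTL_SECS) => LinkStatus::Expired,
        None => LinkStatus::Pending,
    }
}
```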
WhatsApp one-tap linking
Deployments with the operator WhatsApp number enabled expose the same flow; on deployments without it these endpoints return 404 WHATSAPP_LINK_NOT_FOUND:
- POST /api/v1/notification-channels/whatsapp-link (channels:write) with an optional { "name": "…" } hint mints a single-use code (15-minute expiry, capped per org → 422 WHATSAPP_LINK_LIMIT). The response carries the raw code and a deep_link (wa.me/<number>?text=<code>) that opens WhatsApp with the code prefilled.
- Sending the prefilled message creates the whatsapp_app channel for the minting org, bound to the sender’s number. The org is resolved only from the code, never from the webhook payload.
- GET /api/v1/notification-channels/whatsapp-link/{id} (channels:read) polls the code: pending, consumed (with channel_id), or expired.
- Unlink = delete the channel; from the phone side, sending stop disables every channel bound to the number (platform disable, reason shown in the UI).
Delegation links
The person who owns the Slack workspace / Telegram group / inbox often isn’t the person configuring monitors — a delegation link hands off just the connect step.
- POST /api/v1/notification-channels/delegate (channels:write) with optional { "name": "…", "kind": "…" } hints mints a single-use /c/<code> URL (7-day expiry, capped outstanding links per org → 422 DELEGATE_LINK_LIMIT; unknown kind → 400 DELEGATE_KIND_INVALID). Only the code’s hash is stored.
- GET /c/<code> is public and chrome-less: it offers exactly the connect-capable transports of the deployment, namely the telegram one-tap link + QR (the delegation code doubles as the t.me start payload), “add to Slack” / “add to Discord” when the operator OAuth apps are configured, and a manual webhook/address form. The link can create one channel in the inviting org and read nothing; expired, revoked, and spent codes all render the same 404 page. Every delegated create lands in the org audit log.
- GET /api/v1/notification-channels/delegate (channels:read) lists the org’s links (pending/consumed/expired); DELETE /api/v1/notification-channels/delegate/{id} (channels:write) revokes an unconsumed one (revoked links read as expired).
Rate limiting
/api/v1/* is rate-limited per authenticated subject — by (org, category) and by (user, category), whichever trips first — with the per-minute budgets taken from the org’s plan. Categories: api_writes (POST/PATCH/DELETE), api_reads (GET/HEAD/OPTIONS), bulk_ops (/bulk*), test_now (/test), check_now (/check-now). Exceeding a budget returns 429 Too Many Requests with a Retry-After header (seconds until the next token) and code: RATE_LIMITED. /healthz and /readyz are never throttled. Unauthenticated and per-IP limiting is the reverse proxy’s job (see Deployment). Full model: Quotas & rate limits.
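A Retry-After expressed as "seconds until the next token" suggests a token bucket. The sketch below shows one way such a bucket could derive that value; it is an illustration under assumptions, not the service's limiter. TokenBucket and its methods are hypothetical, the clock is passed in as seconds, budgets are assumed to be greater than zero, and one bucket would exist per (org, category) and per (user, category).

```rust
/// Token bucket refilled continuously at `per_minute` tokens per minute.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    last_refill: f64, // seconds on a clock supplied by the caller
}

impl TokenBucket {
    /// Starts full, so a fresh subject can spend its whole budget at once.
    pub fn new(per_minute: u32, now: f64) -> Self {
        let capacity = f64::from(per_minute);
        Self { capacity, tokens: capacity, last_refill: now }
    }

    /// Ok(()) consumes one token; Err(secs) is the Retry-After value,
    /// rounded up to whole seconds and never less than 1.
    pub fn try_acquire(&mut self, now: f64) -> Result<(), u64> {
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.capacity / 60.0).min(self.capacity);
        self.last_refill = self.last_refill.max(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            let wait = (1.0 - self.tokens) * 60.0 / self.capacity;
            Err(wait.ceil().max(1.0) as u64)
        }
    }
}
```

With a budget of 2 per minute, two immediate calls succeed and the third is told to retry in 30 seconds, which is when the next token arrives.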
CORS
Disabled by default. When api.cors.enabled = true, /api/v1/* answers preflight OPTIONS with Access-Control-Allow-Origin (matching allowed_origins or * when allow_any_origin = true), Access-Control-Allow-Methods (the configured list), and Access-Control-Allow-Headers: content-type. /healthz and /readyz carry no CORS headers regardless.
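The origin decision described above reduces to a small function. A minimal sketch with a hypothetical name; exact string matching against allowed_origins and echoing the matched origin back are assumptions, since the matching rules are not spelled out here.

```rust
/// Value for Access-Control-Allow-Origin, or None when the origin is not allowed.
pub fn allow_origin_header(
    origin: &str,
    allowed_origins: &[&str],
    allow_any_origin: bool,
) -> Option<String> {
    if allow_any_origin {
        Some("*".to_string())
    } else if allowed_origins.iter().any(|o| *o == origin) {
        // Assumption: the matched origin is echoed back verbatim.
        Some(origin.to_string())
    } else {
        None
    }
}
```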
Error envelope
Every 4xx and 5xx response uses one wire shape:
{
  "error": {
    "code": "INVALID_URL_SCHEME",
    "message": "url scheme 'ftp' not allowed",
    "field": "check.url",
    "details": null,
    "trace_id": null
  }
}
- code is stable, machine-readable, UPPER_SNAKE_CASE. Never repurposed once published.
- field is a dotted path to the offending input for 400s (check.url in the example above); null for non-field errors.
- details carries optional structured context (e.g., { "range": "127.0.0.0/8" } for SSRF rejections).
- trace_id is the W3C traceparent when tracing is enabled.
Common codes: INVALID_URL_SCHEME, INVALID_URL_FORMAT, SSRF_BLOCKED, INVALID_INTERVAL, INVALID_TIMEOUT, INVALID_TCP_PORT, INVALID_TCP_HOST, INVALID_STATUS_RANGE, INVALID_TLS_CERT_PARAMS, INVALID_DOMAIN_PARAMS, INVALID_TLS_CRED_COMBO, INVALID_ALERT_CONFIG, REDACTION_SENTINEL, BULK_EMPTY, BULK_TOO_LARGE, BAD_TIME_RANGE, TARGET_NOT_FOUND, CHANNEL_NOT_FOUND, CHANNEL_NAME_TAKEN, CHANNEL_NAME_INVALID, CHANNEL_QUOTA_EXCEEDED, INVALID_CHANNEL_CONFIG, CHANNEL_TEST_FAILED, CIRCUIT_OPEN, DEPENDENCY_DOWN, INTERNAL.
Quota, rate-limit and abuse codes
| Code | HTTP | Meaning |
|---|---|---|
| QUOTA_EXCEEDED | 422 | A plan quota would be exceeded. details carries quota (e.g. max_targets, max_members, max_public_components), current, limit, plan. |
| MIN_CHECK_INTERVAL | 422 | Requested check interval is below the effective floor (max(plan.min_check_interval_secs, kind_min)), where kind_min is 3600 for tls_cert / domain_expiry and 10 for http / tcp / dns. Enforced on create, bulk, and PATCH. |
| INVITATIONS_LIMIT | 409 | The org is at its pending-invitation cap. |
| RATE_LIMITED | 429 | A per-minute rate budget was exceeded. Retry-After (seconds) is set; details.scope names the tier, e.g. per_org_api_writes. |
| ABUSE_BLOCKED | 400 | Target blocked by abuse protection. details.reason explains. |
| URL_PATTERN_BLOCKED | 400 | Target URL matched an abuse pattern (recon path). |
| DOMAIN_DENYLISTED | 400 | Target domain (or a parent) is on the deny-list. |
See Quotas & rate limits for the quota model, the per-minute categories, and the deny-list policy.
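The MIN_CHECK_INTERVAL floor from the table above is a one-line max. A minimal sketch; the CheckKind enum and the function names are hypothetical, while the numbers come from the table.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CheckKind { Http, Tcp, Dns, TlsCert, DomainExpiry }

/// Per-kind minimum in seconds: 3600 for tls_cert / domain_expiry, 10 otherwise.
pub fn kind_min_secs(kind: CheckKind) -> u64 {
    match kind {
        CheckKind::TlsCert | CheckKind::DomainExpiry => 3600,
        CheckKind::Http | CheckKind::Tcp | CheckKind::Dns => 10,
    }
}

/// Effective floor: max(plan.min_check_interval_secs, kind_min).
pub fn effective_floor_secs(plan_min_check_interval_secs: u64, kind: CheckKind) -> u64 {
    plan_min_check_interval_secs.max(kind_min_secs(kind))
}

/// Ok(interval) when the request clears the floor, else the error code.
pub fn check_interval(requested: u64, plan_min: u64, kind: CheckKind) -> Result<u64, &'static str> {
    if requested < effective_floor_secs(plan_min, kind) {
        Err("MIN_CHECK_INTERVAL")
    } else {
        Ok(requested)
    }
}
```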
Pagination envelope
Every list endpoint returns:
{ "items": [ /* ... */ ], "total": 1240, "limit": 50, "offset": 0 }
limit defaults to 50 for /targets and /tags, 1000 for /results, 100 for /incidents. limit is silently capped server-side (10,000 for results, 1,000 for incidents/tags). total reflects rows matching the filters, ignoring limit/offset.
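The defaults and caps above combine into a single clamp. A minimal sketch with hypothetical names; no cap is stated for /targets here, so the sketch leaves it uncapped rather than guessing one.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Resource { Targets, Tags, Results, Incidents }

/// Default page size when `limit` is omitted.
pub fn default_limit(r: Resource) -> u32 {
    match r {
        Resource::Targets | Resource::Tags => 50,
        Resource::Results => 1_000,
        Resource::Incidents => 100,
    }
}

/// Server-side cap, where one is documented.
pub fn max_limit(r: Resource) -> Option<u32> {
    match r {
        Resource::Results => Some(10_000),
        Resource::Incidents | Resource::Tags => Some(1_000),
        Resource::Targets => None, // not stated in this section
    }
}

/// The limit actually applied: the default when omitted, silently capped.
pub fn effective_limit(r: Resource, requested: Option<u32>) -> u32 {
    let wanted = requested.unwrap_or(default_limit(r));
    match max_limit(r) {
        Some(cap) => wanted.min(cap),
        None => wanted,
    }
}
```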
Results query
GET /api/v1/targets/{id}/results?from=2026-05-12T00:00:00Z&to=2026-05-12T23:59:59Z&limit=100&offset=0
- from/to default to the last 24 h; to must be strictly greater than from (400 BAD_TIME_RANGE otherwise).
- Returns a PageEnvelope of CheckResult ordered by timestamp DESC.
Latency series
GET /api/v1/targets/{id}/latency?from=…&to=…
Pre-bucketed quantiles and per-phase means read straight from the per-minute rollup — powers the monitor-detail latency line and phase-breakdown area charts. The server divides the range into ~60 slices (floored to the 60-second rollup grain), so any range returns a comparably dense series and the cost stays O(buckets), not O(samples). Switching range re-scales the buckets.
- from/to default to the last 24 h; to must be strictly greater than from (400 BAD_TIME_RANGE).

{
  "bucket_seconds": 1440,
  "buckets": [
    {
      "t": 1747137600000,                                  // unix-ms at bucket start (JS new Date(t))
      "p50": 120, "p95": 180, "p99": 240,
      "avg": 130,                                          // mean total; breakdown chart derives "processing" = avg − (dns+connect+tls+ttfb)
      "dns": 12, "connect": 20, "tls": 35, "ttfb": 60,     // mean per-phase ms; 0 for kinds that skip the phase
      "samples": 24                                        // 0 marks a gap the chart leaves unconnected
    }
  ]
}
bucket_seconds is always a multiple of 60 (1h→60, 24h→1440, 7d→10080, 30d→43200).
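The bucket width follows from dividing the range into 60 slices and flooring to the rollup grain. A minimal sketch that reproduces the four values listed above; the function name is hypothetical, and clamping ranges shorter than an hour to one 60-second bucket is an assumption.

```rust
pub const ROLLUP_GRAIN_SECS: u64 = 60;
pub const TARGET_SLICES: u64 = 60;

/// Bucket width in seconds for a query range of `range_secs`.
pub fn bucket_seconds(range_secs: u64) -> u64 {
    let raw = range_secs / TARGET_SLICES;
    // Floor to a multiple of the 60-second rollup grain.
    let floored = raw / ROLLUP_GRAIN_SECS * ROLLUP_GRAIN_SECS;
    // Assumption: never go below one rollup row per bucket.
    floored.max(ROLLUP_GRAIN_SECS)
}
```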
Region filter
results, latency, and uptime accept an optional region= query parameter to scope the read to one probe region; omit it for an all-regions view. Region ids are the slugs registered via the operator surface. See Multi-region probes.
Per-region latency series
GET /api/v1/targets/{id}/latency/by-region?from=…&to=…
Same bucketing and cost as /latency, but split by region so each can be overlaid as its own line — powers the monitor-detail overlay chart. One entry per region that has samples in the range; each region’s buckets use the same shape as /latency.
{
  "bucket_seconds": 1440,
  "regions": [
    { "region": "default", "buckets": [ /* LatencyBucket… */ ] },
    { "region": "eu-west", "buckets": [ /* LatencyBucket… */ ] }
  ]
}
Uptime query
GET /api/v1/targets/{id}/uptime?from=…&to=…
{ "total": 8640, "up": 8635, "down": 0, "degraded": 0, "error": 5, "uptime_pct": 99.94 }
Incidents query
GET /api/v1/targets/{id}/incidents?from=…&to=…&ongoing_only=false&limit=100&offset=0
Returns coalesced down / error periods. A contiguous run of bad statuses becomes one incident; an up result between two bad runs splits them. Ongoing incidents return ended_at: null and duration_secs: null.
{
  "items": [
    {
      "id": "01h7m8z4n6v0e1m7v7y6x8x8x8",
      "target_id": "01h7m...",
      "started_at": "2026-05-13T11:30:00.000Z",
      "ended_at": "2026-05-13T11:35:00.000Z",
      "status": "down",
      "duration_secs": 300,
      "check_count": 5,
      "error_sample": "connection refused"
    }
  ],
  "total": 1, "limit": 100, "offset": 0
}
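The coalescing rule can be sketched as a single pass over time-ordered results. This is an illustration, not the service's query: the types and the function name are hypothetical, and three details are assumptions. Any non-bad status (including degraded) closes an incident, the incident takes the status of its first bad result, and ended_at is the timestamp of the recovering result, which matches the example above (five bad checks from 11:30, recovery at 11:35).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Status { Up, Down, Degraded, Error }

#[derive(Debug, PartialEq)]
pub struct Incident {
    pub started_at: u64,
    pub ended_at: Option<u64>, // None while the incident is ongoing
    pub status: Status,
    pub check_count: u32,
}

/// `results` must be sorted by timestamp ascending: (unix seconds, status).
pub fn coalesce(results: &[(u64, Status)]) -> Vec<Incident> {
    let mut incidents = Vec::new();
    let mut open: Option<Incident> = None;
    for &(ts, status) in results {
        let bad = matches!(status, Status::Down | Status::Error);
        if bad {
            match open.as_mut() {
                // Extend the current run of bad results.
                Some(inc) => inc.check_count += 1,
                // First bad result after a good one starts a new incident.
                None => {
                    open = Some(Incident { started_at: ts, ended_at: None, status, check_count: 1 })
                }
            }
        } else if let Some(mut inc) = open.take() {
            // A good result closes the run and splits it from any later one.
            inc.ended_at = Some(ts);
            incidents.push(inc);
        }
    }
    incidents.extend(open); // a trailing run is still ongoing
    incidents
}
```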
Tags inventory
GET /api/v1/tags?q=prod&limit=100
Returns every tag currently in use across the caller’s targets (enabled or disabled), with target count, sorted by descending count then alphabetical. q is a prefix filter for autocomplete. Scoped to the active org — in SaaS mode another org’s tags are invisible.
{ "items": [ { "name": "prod", "count": 12 }, { "name": "staging", "count": 4 } ],
"total": 2, "limit": 100, "offset": 0 }
Dashboard summary
GET /api/v1/dashboard/summary — per-org rollup cached in-process for 5 seconds (keyed by OrgId, so two tenants never share an entry).
{
  "targets": { "total": 42, "enabled": 40, "disabled": 2 },
  "current_status": { "up": 38, "down": 1, "degraded": 1, "error": 0, "unknown": 2 },
  "last_24h": { "checks_total": 50400, "checks_up": 50360, "uptime_pct": 99.92, "incidents": 3 },
  "system": { "in_flight_checks": 5, "result_queue_depth": 12, "dropped_results_last_5m": 0, "circuit_breakers_open": 0 }
}
On-demand operations
- POST /api/v1/targets/test: runs one check against a raw CheckSpec, no persistence. Same SSRF / URL-scheme / port validation as POST /targets. Returns TestResponse { result, matched_expectations, warnings }.
- POST /api/v1/targets/{id}/check-now: runs one check against an existing target using its stored credentials, dispatched to an agent in the target’s region. Result is persisted. Returns 503 PROBE_UNAVAILABLE if no agent is currently serving the region.
- POST /api/v1/targets/bulk-action: applies one action to up to 10,000 ids in a single request. The batch is not all-or-nothing: partial failure is allowed, and the response lists succeeded and failed separately, with per-id code + message.
{
  "ids": ["01h7m...", "01h7n..."],
  "action": { "type": "disable" }
  // alternatives: { "type": "enable" }, { "type": "delete" },
  //               { "type": "tag_add", "tags": ["frozen"] },
  //               { "type": "tag_remove", "tags": ["frozen"] }
}
Idempotency
POST /api/v1/targets/bulk and POST /api/v1/targets/bulk-action accept an optional Idempotency-Key header. The server stores the response for 24 hours keyed by (header value, body hash). A retry with the same key and body returns the original response without re-executing. A retry with the same key but a different body executes normally — the body hash is part of the cache key. The cache is in-process; entries are lost on restart.
POST /api/v1/targets/bulk-action HTTP/1.1
Idempotency-Key: 01h7m8z4n6v0e1m7v7y6x8x8x8
Content-Type: application/json

{ "ids": ["..."], "action": { "type": "disable" } }
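The replay semantics above amount to a map keyed by (header value, body hash) with a 24-hour lifetime. The sketch below models that behaviour with the standard library only. It is not the service's implementation: the module layout says the real cache is DashMap-backed, the actual body-hash algorithm is not specified here, and IdempotencyCache, handle and the use of DefaultHasher are all stand-ins.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

pub const TTL_SECS: u64 = 24 * 60 * 60;

/// Stand-in body hash; the real algorithm is an unknown.
fn body_hash(body: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    body.hash(&mut h);
    h.finish()
}

#[derive(Default)]
pub struct IdempotencyCache {
    // (Idempotency-Key, body hash) -> (stored_at unix seconds, response)
    entries: HashMap<(String, u64), (u64, String)>,
}

impl IdempotencyCache {
    /// Returns the stored response for a replay within 24 h, otherwise
    /// runs `execute` once and stores what it returned.
    pub fn handle(
        &mut self,
        key: &str,
        body: &[u8],
        now: u64,
        execute: impl FnOnce() -> String,
    ) -> String {
        let cache_key = (key.to_string(), body_hash(body));
        if let Some((stored_at, response)) = self.entries.get(&cache_key) {
            if now.saturating_sub(*stored_at) < TTL_SECS {
                return response.clone();
            }
        }
        let response = execute();
        self.entries.insert(cache_key, (now, response.clone()));
        response
    }
}
```

Because the body hash is part of the key, the same Idempotency-Key with a different body misses the cache and executes normally.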
Terraform
Manage your monitors and notification channels as code with the official
Terraform provider,
uptimepage/uptimepage.
The Terraform Registry page is the full reference — every resource, attribute, and data source, regenerated from the provider on each release. This page is a quick start; it links out rather than duplicating that reference.
Quick start
terraform {
  required_providers {
    uptimepage = {
      source = "uptimepage/uptimepage"
    }
  }
}

provider "uptimepage" {
  token = var.uptimepage_token # or set UPTIMEPAGE_TOKEN
  org   = "your-org-slug"      # or set UPTIMEPAGE_ORG
  # endpoint defaults to https://app.uptimepage.dev; set it for a self-hosted instance
}

resource "uptimepage_target" "api" {
  name     = "api prod"
  interval = 60
  check = {
    type = "http"
    http = {
      url             = "https://example.com/healthz"
      expected_status = { kind = "exact", exact = 200 }
    }
  }
}
Credentials
- Token: create one at Settings → API tokens (/settings/api-tokens; requires a verified email). Supply it via the token attribute or the UPTIMEPAGE_TOKEN environment variable. The full token is shown once. Grant the least scope the provider needs: targets:write + channels:write covers both managed resources (write implies read). Terraform deletes only when it destroys a resource, which happens on terraform destroy or when an apply follows the removal of a resource from your configuration, so add targets:delete + channels:delete only if you do either. For defence in depth, bind the token to the org you manage so a leak can’t reach your other orgs.
- Org: API tokens are user-scoped, so every request must name an organization. Set org (the org slug) or UPTIMEPAGE_ORG; it is sent as the X-Uptimepage-Org header. Without it the API returns 400 ORG_REQUIRED. Find your slug from GET /api/v1/orgs or your dashboard URL. A token bound to an org requires org to match it (else 403 ORG_HEADER_MISMATCH).
- Endpoint: defaults to the hosted API at https://app.uptimepage.dev. For a self-hosted instance, set endpoint to your host (the apex marketing domain does not serve /api/v1).
Resources & data sources
| Name | Kind | Manages |
|---|---|---|
| uptimepage_target | resource | Monitors: http, tcp, tls_cert, domain_expiry, dns checks |
| uptimepage_notification_channel | resource | Alert destinations: webhook, slack, telegram, whatsapp. The pagerduty/ntfy/pushover/sms kinds land in a provider release after the API ships them. The one-tap telegram_app and whatsapp_app kinds are not manageable: their configs are minted by the link flows and the API rejects them in request bodies (CHANNEL_KIND_MANAGED) |
| uptimepage_target | data source | Look up an existing target by id |
For the full attribute reference and an example per check type, see the provider docs on the Terraform Registry.
Managed-by badge
Resources the provider creates or updates carry a terraform source marker
(the provider identifies itself on every request). The web UI shows a small
terraform chip next to those monitors and channels, plus a banner on the
monitor detail page, so anyone browsing knows the resource is managed as code.
The marker is informational — the UI does not lock the resource. But an edit
made in the UI flips its badge to ui and will be overwritten the next time
you run terraform apply, since your .tf files remain the source of truth.
Change managed resources in Terraform, not the UI.
Source
Provider source and issue tracker: https://github.com/uptimepage/terraform-provider-uptimepage.
Web UI
The same Rust binary that serves /api/v1/* also serves a server-rendered HTML UI on the same port. Open http://<host>:<api-port>/ in a browser.
Stack
| Layer | What | Where |
|---|---|---|
| Templates | askama 0.16 + askama_web 0.16 (compile-time HTML, type-checked by cargo build) | templates/ |
| Interactivity | HTMX 2.0.9 + json-enc (partial swaps, JSON form submission — no SPA framework) | static/js/htmx.min.js, static/js/json-enc.js |
| Charts | ECharts 6 (lazy-loaded only on pages that need it) | static/js/echarts.min.js, static/js/charts/ |
| CSS | Tailwind 4.3 (CSS-first config via @import, @source, @theme, @layer) | static/css/input.css → app.css |
| Asset serving | rust-embed — assets are baked into the binary at compile time | src/web/assets.rs |
After cargo build --release you have one ~23 MB executable that contains every template, every CSS byte, and every vendored JS file. No node, no bundler, no separate frontend service.
Routes
| Path | Purpose |
|---|---|
| GET / | Dashboard. Auto-refreshing region polls /web/partials/dashboard every 5 s; donut + 24h bar pull from /api/v1/dashboard/summary. |
| GET /targets | Targets list. Filter by name (client-side), tag, enabled. Row delete + paginate via HTMX. Rows authored by an API token or Terraform carry a managed-chip (api / terraform); UI-authored rows show none. |
| GET /targets/{id} | Target detail. Status badge, four time-range presets (1h/24h/7d/30d), uptime KPIs, latency p50/p95/p99 line, DNS/connect/TLS/TTFB stacked area, recent-results table, redacted JSON config. Externally-managed monitors also get a managed-by chip and a banner warning that UI edits may be overwritten on the next apply. |
| GET /targets/new | Create form. Posts JSON to /api/v1/targets. Detection (open-incident-after-N-fails, region quorum) and Notifications (channel bindings, remind-while-down cadence, notify-on-recovery) are separate sections; the notification controls only render when the org has channels. |
| GET /targets/{id}/edit | Edit form. Same template as new but data-mode="edit"; credential fields land in redacted mode and the operator must explicitly toggle “Replace credentials” before new values are sent. |
| GET /web/targets/list | HTMX partial (<tbody> fragment) for filter/paginate swaps on the targets list. |
| GET /settings/notifications | Notification-channel list. Send-test / edit / delete are HTMX row actions against /api/v1/notification-channels; the table body polls /web/partials/settings/notifications every 60 s. |
| GET /settings/notifications/new, …/{id}/edit | Channel create/edit form (Slack / Discord / Teams / Google Chat / generic webhook / Telegram / WhatsApp / SMS; the provider-branded cards take just the provider’s webhook URL, host-checked on create; the SMS card carries a gateway sub-selector for Twilio / Vonage / Telnyx / Plivo / Sinch and takes that gateway’s own credentials). With provider OAuth configured, the slack/discord panel grows an “add to Slack” / “add to Discord” button (plus a QR variant for a signed-in phone): the provider’s consent screen picks the destination channel and the callback lands on the ready-made channel’s edit page; cancelling, a failed exchange, or the plan’s channel limit bounce back to the form with a quiet note. On deployments running the central Telegram bot, a one-tap telegram card joins the lineup (the BYO card reads telegram bot): “connect telegram” mints a single-use code, shows it as a t.me link + QR with a private-chat/group toggle, polls until the chat presses Start, then opens the channel the webhook created. Linked channels are display-only (chat title + id, no secrets, no replace toggle); unlink = delete. If the chat side unlinks first (bot removed, /stop), the channel is disabled with a visible “unlinked from the Telegram side” note that re-enabling clears. The Telegram panel has a setup helper: a t.me QR for the bot (scan, press Start) and a one-click chat-id probe, both talking to the Bot API straight from the browser. “Test now” delivers a synthetic alert before saving (create posts the form config to …/test; a locked edit tests the stored channel by id). On edit the stored secret stays masked behind a “Replace transport config” toggle; leaving it off omits config from the PATCH and locks the type cards (the kind rides the config). The edit page also lists the monitors bound to the channel, lets a “+ add monitor” picker bind more (it updates the monitor’s alert bindings through PATCH /api/v1/targets/{id}), and offers delete with that blast radius spelled out: deleting a channel also removes its bindings from every monitor. |
| GET /settings/pages | Status-pages list: create / rename / publish / delete pages (free plan: one). Create posts to /api/v1/status-pages; the list body refreshes via /web/partials/settings/pages. |
| GET /settings/pages/{id} | Per-page editor: URL slug (own save; a rename is a hard cutover), branding, logo, and the component curation list (per-monitor on-page toggle, public name/group). Each edit autosaves via the /api/v1/status-pages/{id} + /components endpoints. |
| GET /settings/team | Team management (owner-only): invite by email + role, pending-invitation revoke, member remove / leave, owner⇄member role toggle. All row actions confirm via modal and hit /api/v1/orgs/{id}/members + /api/v1/orgs/{id}/invitations. Non-owner members see a read-only note. |
| GET /web/partials/settings/team | HTMX partial: seats line + members + pending-invitations tables; re-pins the target org id on every refresh. |
| GET /web/partials/settings/pages | HTMX partial: the page rows for the list above. |
| GET /web/partials/dashboard | HTMX partial: chrome-free dashboard region; self-rearms so each refresh still carries hx-trigger="every 5s". |
| GET /m/{token} | Public read-only share of one monitor: same detail dashboard, no operator chrome, credentials redacted, no login. Sub-resources (/live, /incidents, /latency, /results) are twinned under the token so the page never calls an operator URL. See Share links. |
| GET /docs | Swagger UI generated from /api/openapi.json. |
| GET /static/{path} | Embedded assets (css/, js/, img/). |
Every mutation goes through /api/v1/*. There are no /web/* write routes — the JSON API stays the single source of truth, which means a future SvelteKit port is a templates-only rewrite. The /m/{token} share surface is read-only and serves no write method.
Build pipeline
cargo build [--release]
  └─► build.rs
        ├─► (first build only) scripts/fetch-tailwind.sh: downloads the Tailwind
        │     standalone CLI (~30 MB, not committed) for the host platform into bin/
        └─► ./bin/tailwindcss --minify
              --input static/css/input.css
              --output static/css/app.css
  └─► rustc
        └─► rust-embed bakes static/ + templates/ into the binary
build.rs declares rerun-if-changed on templates/, src/, static/css/input.css, and scripts/fetch-tailwind.sh. Editing any of them triggers a Tailwind rebuild on the next cargo build.
Tailwind 4 scans both templates/**/*.html and src/**/*.rs for utility class names (declared via @source in input.css), so utility classes written inside Rust strings are preserved through tree-shaking.
Styling: the semantic layer
static/css/input.css is layered: design tokens (@theme, e.g. --color-ink) → primitives (.sticker-card, .sticker-btn, .sticker-pill) → semantic classes (.page-title, .panel-label, .kpi-value, .stat-tile, .status-badge--*, .btn-ghost, .sticker-btn--primary/--danger, .nav-link, .day-cell). Templates reference only the semantic names — no raw colour/shape utility clusters. State is one --modifier (.status-badge--down, .stat-tile--ok). Result: re-skinning the internal app is an input.css-only edit, no template touched. When adding UI, reuse/extend a semantic class rather than inlining bg-*/rounded-*/heading-scale clusters. The public status page is deliberately exempt — it’s a flat, brand-themed surface with its own view-supplied palette (public_status.rs), not the cartoon sticker system.
Dashboard refresh model
The dashboard splits into three regions:
- Chrome (nav, page header) — rendered once.
- Auto-refresh region (<div id="dashboard-region">): KPI cards + system-health card. Polls /web/partials/dashboard every 5 s and swaps its own outer HTML so the trigger remains armed.
- Charts (donut + 24h composition bar): placed outside the refresh region so the ECharts instances persist across polls. The chart wrapper listens for htmx:afterSettle on the region and re-fetches /api/v1/dashboard/summary once per cycle, fanning out to both charts (single network round-trip, not one per chart).
The dashboard_summary handler caches its result in state.dashboard_cache for 5 s, so the polling load on Postgres + ClickHouse is bounded to one query set per 5 s regardless of how many tabs are open.
Credential redaction
For basic_auth and bearer_token the form runs a three-state machine in static/js/ui/auth_field.js:
| data-mode | Inputs | Submit behaviour |
|---|---|---|
| create | enabled, empty | Field included in POST body if filled. |
| redacted | disabled, sentinel *** shown | Field omitted from PATCH body. |
| replacing | enabled, empty | Field included with the real value. |
The API rejects the *** sentinel on write as defence-in-depth — but the state machine prevents the form from ever submitting it. End-to-end coverage in tests/web_e2e_test.rs::edit_form_shows_redacted_auth_state_for_existing_target asserts that real credentials never appear in the rendered edit form.
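The submit behaviour in the table, together with the server-side guard, can be modelled in a few lines. The real state machine lives in static/js/ui/auth_field.js; the Rust below is only a model of the table, with hypothetical names, and it assumes the guard compares a single credential value against the sentinel.

```rust
pub const SENTINEL: &str = "***";

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Mode { Create, Redacted, Replacing }

/// Value to put in the request body, or None to omit the field.
pub fn field_for_submit(mode: Mode, input: &str) -> Option<&str> {
    match mode {
        // Redacted fields are never sent, so the stored secret is kept.
        Mode::Redacted => None,
        // On create the field is included only if filled.
        Mode::Create => (!input.is_empty()).then_some(input),
        // When replacing, the typed value is sent as-is.
        Mode::Replacing => Some(input),
    }
}

/// Server-side defence in depth: the sentinel is never a valid credential.
pub fn reject_sentinel(value: &str) -> Result<(), &'static str> {
    if value == SENTINEL { Err("REDACTION_SENTINEL") } else { Ok(()) }
}
```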
Tests
| Layer | What |
|---|---|
| Unit (template render) | Every view in src/web/views/ ships a #[test] that renders the template with a fixtures struct and asserts on the output: HTMX hooks, redaction sentinels, chart data-endpoints, table scaffolding. |
| End-to-end | tests/web_e2e_test.rs drives the merged api+web router via tower::ServiceExt::oneshot, covering dashboard (full + partial), list (full + partial), forms (create + redacted-edit), target detail with chart anchors + time-range nav, 404 paths, and the immutable cache header on /static/*. |
| Build-time | cargo build rejects template type mismatches — askama checks templates against the corresponding Rust struct at compile time. |
cargo test --lib web:: # unit render tests
cargo test --test web_e2e_test # end-to-end
Adding a new page
- Add a template under templates/ extending base.html.
- Add a #[derive(Template, WebTemplate)] struct and an axum handler in src/web/views/.
- Register the route in src/web/routes.rs.
- Tailwind picks up new utility classes automatically (the @source directive scans templates/**/*.html + src/**/*.rs).
- Add a render test next to the view and, if there’s a route worth covering end-to-end, append a case to tests/web_e2e_test.rs.
Troubleshooting
| Symptom | Likely cause |
|---|---|
| failed to spawn ./bin/tailwindcss during cargo build | First-build fetch failed. Run bash scripts/fetch-tailwind.sh manually and confirm bin/tailwindcss is executable. |
| Page renders unstyled HTML | static/css/app.css empty or stale. Touch static/css/input.css and rebuild; the build script runs Tailwind with --minify. |
| Charts render blank | Open DevTools console. Most likely a fetch to /api/v1/dashboard/summary or /api/v1/targets/{id}/results failed — the chart module logs chart load failed with the URL and status. |
| Dashboard never refreshes | Confirm <script defer src="/static/js/htmx.min.js"> is in the page source. The HTMX bundle is loaded from base.html. |
| Edit form submitted credentials despite the toggle being off | Look for a console error from auth_field.js. The submit handler reads data-mode from the credential <fieldset> — if the fieldset is missing the data attributes, it will fall back to “include”. |
Migrating to a SPA later
The design keeps a SPA port cheap. Every templates/*.html maps one-to-one to a Svelte (or React) component, every chart module under static/js/charts/ is already a pure (element, endpoint) → disposer function that imports unchanged into onMount, and there are zero /web/* write endpoints to refactor — only read partials. To swap frameworks:
- Generate a typed JSON client from /api/openapi.json.
- Port the templates page-by-page; keep /api/v1/* unchanged.
- Drop src/web/views/ (keep src/web/assets.rs pointing at the new bundle).
- Delete templates/ and static/js/{htmx,json-enc,ui}, which are no longer needed.
The backend (src/api/, src/storage/, src/scheduler/, src/worker/) stays untouched.
Public status page
The public status page is the customer-facing surface — an unauthenticated
HTML page at /status plus a small JSON + RSS API under
/api/public/v1/*. It’s the only part of uptimepage that’s safe to
expose on the open internet without basic auth in front of it.
This chapter is for operators: how to publish a component, narrate an incident, and schedule a maintenance window. For the wire-level details of the underlying endpoints see REST API. For Caddy + the rate-limit plugin see Deployment.
Multi-tenant operators read this first. This chapter describes the page itself; the workflow is identical on every page. In a multi-tenant deployment each org runs one or more pages at {slug}.{base_domain}: set tenancy.subdomain_public_routes = true and leave tenancy.path_based_public_routes off. The path-based /status surface is single-org and is for single-tenant deploys only (the default). See Per-org status pages for the routing, branding, and isolation model, and Public status routing for the flag matrix.
What’s published vs what’s private
By default every target is private. A monitor becomes a “component”
on a status page only when it is curated onto that page — there is no
per-target “public” flag. The aggregator filters at the SQL layer (a
page renders only the monitors bound to it) and the wire types literally
cannot serialise sensitive fields (url, headers, basic_auth,
bearer_token are not part of any public schema), so a misconfiguration
cannot leak credentials.
A monitor is published by adding it to a page; the per-page presentation lives on that binding, so the same monitor can appear on several pages under different names:
| Per-page field | Purpose |
|---|---|
| (binding exists) | the monitor appears as a component on that page |
| public_name | display name on this page; falls back to the operator-side monitor name when unset |
| public_description | optional one-liner shown under the component name |
| public_group | optional group label; components with the same value cluster together. Ungrouped components render last |
| sort_order | integer sort key within a group (ASC); the reorder endpoint rewrites it |
A page belongs to an org and is managed by that org’s owner; see
Per-org status pages for the page model, the
max_status_pages / max_public_components caps, and isolation.
Enabling a component
The quickest path is the UI: open the page in Settings → Pages → {your page}. The editor lists every monitor in the org; toggle one on page, optionally set a Public name (blank shows the real monitor name) and a Group. Each edit autosaves via the components API below.
For scripting, add the monitor to the page, then set its per-page curation:
# Add monitor $TARGET_ID to page $PAGE_ID
curl -X POST http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components \
-H 'content-type: application/json' \
-d '{"target_id": "'$TARGET_ID'", "public_name": "Public API", "public_group": "Core APIs"}'
# Edit the per-page name / description / group later
curl -X PATCH http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components/$TARGET_ID \
-H 'content-type: application/json' \
-d '{"public_description": "Primary REST surface, all regions."}'
# Remove it from the page
curl -X DELETE http://127.0.0.1:8080/api/v1/status-pages/$PAGE_ID/components/$TARGET_ID
On the PATCH, public_name, public_description, and public_group
use the same three-state semantics as incident narration: omit the
field to leave it unchanged, send a string to set it, or send JSON
null to clear it back to the default (real monitor name / no group).
Blanking the field in the UI clears it for you.
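The three-state semantics can be modelled with a small enum. This is a minimal sketch, not the service's actual types: `Patch`, `apply`, and `display_name` are hypothetical names chosen for illustration, and JSON decoding is left out so the block stays dependency-free.

```rust
/// Three-state update for a nullable field in a PATCH body.
/// (Hypothetical type; the real handler's representation may differ.)
#[derive(Debug, Clone, PartialEq)]
enum Patch<T> {
    /// Field omitted from the request: leave the stored value unchanged.
    Keep,
    /// Field sent as JSON null: clear it back to the default.
    Clear,
    /// Field sent with a value: overwrite the stored value.
    Set(T),
}

impl<T> Patch<T> {
    /// Fold the patch into the currently stored value.
    fn apply(self, current: Option<T>) -> Option<T> {
        match self {
            Patch::Keep => current,
            Patch::Clear => None,
            Patch::Set(value) => Some(value),
        }
    }
}

/// Name shown on the page: the per-page override if set,
/// otherwise the operator-side monitor name.
fn display_name<'a>(public_name: &'a Option<String>, monitor_name: &'a str) -> &'a str {
    public_name.as_deref().unwrap_or(monitor_name)
}
```

Keeping "omitted" and "null" as distinct variants is what lets a client clear a field without having to resend every other field.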
Adding a monitor that’s already on the page is an idempotent no-op.
Adding a brand-new monitor when the org is at its max_public_components
cap is a quota error; a monitor already published on another page costs
nothing to add here.
The page is cached for 10 s in-process (moka single-flight, with a second moka last-known-good cache so transient ClickHouse failures don’t break the page). Changes appear on the next refresh.
Narrating an incident
The background incident writer opens an incident automatically when a
public target trips the threshold; it closes it again when checks
recover. Both events happen without operator action. What’s manual is
the narration — the human-readable title, description, severity,
and the running timeline of “investigating → identified → monitoring →
resolved” entries that show up on /status and in the RSS feed.
Update the title + severity:
curl -X PATCH http://127.0.0.1:8080/api/v1/incidents/$INCIDENT_ID \
-H 'content-type: application/json' \
-d '{
"public_title": "Elevated 5xx in EU-WEST",
"public_description": "Origin rollout regression — rolling back.",
"severity": "major"
}'
Sending JSON null for public_title or public_description clears
the field and lets the page fall back to its auto-generated wording.
Omitting the field leaves it unchanged.
Append a status update to the timeline:
curl -X POST http://127.0.0.1:8080/api/v1/incidents/$INCIDENT_ID/updates \
-H 'content-type: application/json' \
-d '{
"phase": "identified",
"message": "Rolled back the offending deploy. Verifying recovery."
}'
phase is one of investigating, identified, monitoring,
resolved, postmortem. Posting resolved does not end the
incident — the incident lifecycle is driven by check results, so manual
“resolved” entries are advisory only. Posting an update to an
already-ended incident is allowed (useful for postmortems).
Validation rules:
| Field | Rule | Error code |
|---|---|---|
| public_title | non-whitespace, ≤ 200 chars (use JSON null to clear) | EMPTY_TITLE / TITLE_TOO_LONG |
| public_description | ≤ 5 000 chars (use null to clear) | DESCRIPTION_TOO_LONG |
| message (update) | non-whitespace, ≤ 2 000 chars | EMPTY_MESSAGE / MESSAGE_TOO_LONG |
| phase (update) | exactly one of the five values above | 400 / 422 from the JSON extractor |
Scheduling maintenance
A maintenance window is a planned outage. While the window is active,
the page renders affected components as Maintenance (the truth-table
rule is: maintenance dominates outage, so a real failure during the
window still classifies as Maintenance, not MajorOutage). On the
90-day history strip, any day that overlapped a maintenance window
renders as a maintenance cell rather than an outage cell.
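The precedence rule can be written as a small truth table. The sketch below is illustrative only: it reduces check health to a single boolean, treats a window as the half-open interval [starts_at, ends_at), and buckets history by 86 400-second days. All three are assumptions, and the function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ComponentState {
    Operational,
    MajorOutage,
    Maintenance,
}

/// Assumption: a window covers the half-open interval [starts_at, ends_at).
fn window_active(starts_at: u64, ends_at: u64, now: u64) -> bool {
    starts_at <= now && now < ends_at
}

/// Maintenance dominates outage: a failing check inside an active
/// window still classifies as Maintenance.
fn classify(check_failing: bool, in_maintenance: bool) -> ComponentState {
    match (in_maintenance, check_failing) {
        (true, _) => ComponentState::Maintenance,
        (false, true) => ComponentState::MajorOutage,
        (false, false) => ComponentState::Operational,
    }
}

/// History strip: the day [day_start, day_start + 86_400) renders as a
/// maintenance cell when it overlaps the window at all.
fn day_is_maintenance(day_start: u64, starts_at: u64, ends_at: u64) -> bool {
    starts_at < day_start + 86_400 && ends_at > day_start
}
```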
Create:
curl -X POST http://127.0.0.1:8080/api/v1/maintenance \
-H 'content-type: application/json' \
-d '{
"title": "PG13 → PG16 cutover",
"description": "Read-only for ~30 minutes.",
"starts_at": "2026-05-14T22:00:00Z",
"ends_at": "2026-05-14T23:00:00Z",
"component_ids": ["01a7b1ce-0000-7000-8000-000000000001"]
}'
List, edit, delete:
curl 'http://127.0.0.1:8080/api/v1/maintenance?status=upcoming&limit=10'
curl -X PATCH http://127.0.0.1:8080/api/v1/maintenance/$ID \
-H 'content-type: application/json' \
-d '{"title": "PG cutover (postponed)"}'
curl -X DELETE http://127.0.0.1:8080/api/v1/maintenance/$ID
Validation rules:
| Field | Rule | Error code |
|---|---|---|
| title | non-whitespace, ≤ 200 chars | EMPTY_TITLE / TITLE_TOO_LONG |
| description | ≤ 5 000 chars | DESCRIPTION_TOO_LONG |
| ends_at | strictly after starts_at | INVALID_TIME_RANGE |
| ends_at - starts_at | ≤ 30 days | INVALID_DURATION |
| component_ids | every id must reference an existing target | INVALID_COMPONENT_ID |
| PATCH on a window whose ends_at is already past | rejected | 422 MAINTENANCE_COMPLETED |
For audit, prefer PATCHing a cancelled window’s title (e.g. "[cancelled] PG cutover") over hard-deleting historical entries.
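The field rules from the maintenance validation table can be sketched as one function. The error codes come from the table; everything else is an assumption made for illustration: the function name, the order in which checks run, counting length in Unicode scalar values, and representing timestamps as Unix seconds. The component_ids and completed-window checks need storage access and are left out.

```rust
const MAX_TITLE_CHARS: usize = 200;
const MAX_DESCRIPTION_CHARS: usize = 5_000;
const MAX_WINDOW_SECS: u64 = 30 * 24 * 60 * 60; // 30 days

/// Validate a maintenance window; returns the error code on failure.
/// (Hypothetical helper; check order is an assumption.)
fn validate_window(
    title: &str,
    description: &str,
    starts_at: u64,
    ends_at: u64,
) -> Result<(), &'static str> {
    if title.trim().is_empty() {
        return Err("EMPTY_TITLE");
    }
    if title.chars().count() > MAX_TITLE_CHARS {
        return Err("TITLE_TOO_LONG");
    }
    if description.chars().count() > MAX_DESCRIPTION_CHARS {
        return Err("DESCRIPTION_TOO_LONG");
    }
    // ends_at must be strictly after starts_at.
    if ends_at <= starts_at {
        return Err("INVALID_TIME_RANGE");
    }
    if ends_at - starts_at > MAX_WINDOW_SECS {
        return Err("INVALID_DURATION");
    }
    Ok(())
}
```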
What the public page renders
- Banner: one of All Systems Operational, Maintenance in progress, Minor Service Disruption, Partial System Outage, Major System Outage. Driven by the worst component state, with maintenance precedence as described above.
- Component groups: each component shows its current state, a 90-day history strip (one cell per day, oldest-first), and the operator-supplied description.
- Active and recent incidents: operator-set public_title if present, otherwise an auto-generated "<component> <status>" string. Each incident links to a permalink at /status/incidents/{id} with the full timeline.
- Maintenance: active + the next 7 days of upcoming windows.
- RSS feed: /api/public/v1/incidents.rss. RSS 2.0; each item is a public incident with the latest update as the description.
Refresh behaviour
The page is statically rendered and works without JavaScript. With JS
enabled, an HTMX hx-trigger="every 30s" swaps the dynamic region (the
banner, the component grid, and the incident lists) without a full
page reload. The chrome around it — header, footer, RSS link — stays
put. A small (~35 LoC) static/js/public/tz.js helper rewrites
ISO timestamps into the visitor’s local timezone tooltip; everything
else is plain HTML.
Caddy and the rate-limit plugin
The public surface bypasses basic auth at the Caddy layer through an
@public matcher in deployment/Caddyfile. The matcher also applies a
per-IP rate limit (60 requests / minute), which requires the
caddy-ratelimit plugin.
The stock caddy:2-alpine image doesn’t include it — build a
custom-caddy:2 image once via xcaddy. The procedure is in
Deployment and
deployment/README.md.
If you’d rather not maintain a custom Caddy image, comment out the
rate_limit { … } block in the Caddyfile. The public surface still
serves; you just lose per-IP throttling. Putting Cloudflare in front of
Caddy is the other option.
Embeddable status badge
GET /api/public/v1/badge.svg returns a shields.io-style SVG badge that
operators can embed in README files or external dashboards. Two modes:
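<!-- Overall page status (acme.example.com is a placeholder host) -->
![Status](https://acme.example.com/api/public/v1/badge.svg)
<!-- Single component -->
![Component status](https://acme.example.com/api/public/v1/badge.svg?component=<uuid>)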
The badge reuses the cached page payload, so it tracks the /status
view inside the 10-second cache window. Unknown component ids return
404 with the public error envelope; only style=flat is recognised
(others return 400).
The page editor renders ready-to-copy markdown for the overall badge and
each on-page component. The copyable URL is built from the page’s public
origin, so on path-based/self-host deploys set auth.public_base_url to the
externally reachable URL (the same value subscriber links need); otherwise the
badge URL points at localhost.
?component=<uuid> works for any public component regardless of check type —
an HTTP, DNS, or TLS-certificate monitor each gets its own badge that reflects
that component’s current status.
Common questions
Can I have a component that’s public but doesn’t trigger incidents?
No. Incident materialisation walks the same binding the page does — a
monitor on any enabled page is eligible for incidents. If you want a
check that’s published but not alerting, set enabled = false on the
alert channels — the incident will still open, but no notification
fires.
Can I publish a maintenance window without listing the affected
components? No. component_ids may be empty in the request body, but
the aggregator filters maintenance windows that touch zero public
components out of the page (and out of the JSON), so they wouldn’t
appear anywhere. List at least one public component.
What’s the cache TTL? 10 s. Single-flight: only one task computes the page when the entry expires; others wait for the result. On ClickHouse failure the last-known-good snapshot serves until the next successful compute.
How long does the 90-day history go back? Exactly 90 days, oldest
day on the left. Cells with no recorded checks render as NoData
(grey); the aggregator does not fabricate data.
Is there an Atom feed? No, RSS 2.0 only. Most feed readers consume both.
Per-org status pages
Each org owns one or more public status pages. A page lives at {slug}.{base_domain} in SaaS mode (acme.example.com, status.acme.example.com, …, apex-wildcard shape) and renders only the monitors that org has curated onto it, with that page’s branding, incidents, and maintenance. A new org starts with one default page (slug = the org slug) created at signup; the owner can rename it, add more pages, or take any page offline.
The number of pages an org can run is plan-capped (max_status_pages); the free plan gets one. Multiple pages let an org split surfaces — e.g. a public page and a separate internal-stakeholder page — each showing a different subset of monitors under a different URL.
This chapter is the per-org / per-page model. For the component, incident, and maintenance workflow (identical on every page) see Public status page. For the wildcard cert and reverse-proxy setup see Deployment and the full runbook in deployment/README.md.
When it applies
| Shape | Config | Public surface |
|---|---|---|
| Single-tenant | tenancy.path_based_public_routes = true (default) | the lone org’s default page, served path-based at /status on the operator host |
| Multi-tenant SaaS | tenancy.subdomain_public_routes = true, tenancy.path_based_public_routes = false | every enabled page at {slug}.{base_domain} |
Single-tenant deploys never pay the subdomain path: there is one live org, so its default page is mounted on the operator host at /status.
Path-based and subdomain public routes are mutually exclusive — serving /status on the operator host alongside subdomains would publish one page’s data at every tenant’s expected URL. Pick one.
Host routing
A page is resolved from the request Host header, not the path. The slug names a page, not an org; the lookup admits only enabled pages whose org is not soft-deleted.
| Host | Result |
|---|---|
| acme.example.com, page enabled | that page |
| acme.example.com, page disabled (draft) or org soft-deleted | 404 |
| nope.example.com, no such page slug | 404 |
| a.b.example.com (extra label) | 404 |
| example.com (no slug label, bare base) | 404 |
| missing Host header | 404 |
A page slug is globally unique (it routes a subdomain), so two orgs can never claim the same slug. base_domain must be a multi-label domain (it needs at least one dot); the boot assertion refuses an empty or single-label value, because a loose base would let the slug extractor match arbitrary Host headers.
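The routing table above can be approximated by a pure function over the Host header. This is a sketch under stated assumptions, not the service's resolver: `slug_from_host` and `is_valid_slug` are hypothetical names, the host is assumed to be lowercased already, and an optional `:port` suffix is assumed to be ignored. A returned slug still has to pass the enabled-page lookup, so an unknown slug such as nope.example.com ends in a 404 at that later step.

```rust
/// Slug rule from the branding table: 3-30 chars, lowercase letters,
/// digits, or hyphens, starting with a letter.
fn is_valid_slug(s: &str) -> bool {
    (3..=30).contains(&s.len())
        && s.starts_with(|c: char| c.is_ascii_lowercase())
        && s.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

/// Extract the page slug from a Host header value; None means 404.
fn slug_from_host<'a>(host: Option<&'a str>, base_domain: &str) -> Option<&'a str> {
    let host = host?; // missing Host header
    // Assumption: drop an optional ":port" suffix.
    let host = host.split(':').next().unwrap_or(host);
    // "acme.example.com" minus "example.com" leaves "acme."
    let rest = host.strip_suffix(base_domain)?;
    // The bare base leaves "", and "evilexample.com" leaves "evil":
    // neither ends with the label separator, so both are rejected.
    let slug = rest.strip_suffix('.')?;
    if slug.contains('.') {
        return None; // extra label, e.g. a.b.example.com
    }
    is_valid_slug(slug).then_some(slug)
}
```

Requiring the dot before the base domain is what keeps a look-alike host such as evilexample.com from matching.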
The apex wildcard *.{base_domain} DNS record plus a wildcard TLS cert (Let’s Encrypt via the Hetzner DNS-01 challenge) means a new page works the instant it is enabled — no per-page DNS or cert step. Operator subdomains (app.{base_domain}, mail.{base_domain}, …) use explicit DNS records that take precedence over the wildcard, and the operator host is kept on its own per-host cert.
Managing pages
The org owner manages pages from the operator UI at /settings/pages (a list to create / rename / publish / delete pages) and the per-page editor at /settings/pages/{id} (URL slug, branding, logo, and which monitors appear). The same operations are available over the API:
| Endpoint | Purpose |
|---|---|
| GET /api/v1/status-pages | list this org’s pages |
| POST /api/v1/status-pages | create a page (capped at max_status_pages) |
| GET /api/v1/status-pages/{id} | one page + its live URL and logo URL |
| PATCH /api/v1/status-pages/{id} | rename, change slug, publish/unpublish, edit branding |
| DELETE /api/v1/status-pages/{id} | delete the page (its component bindings cascade) |
| GET /api/v1/status-pages/{id}/components | the monitors curated onto the page |
| POST /api/v1/status-pages/{id}/components | add a monitor to the page |
| PATCH /api/v1/status-pages/{id}/components/{target_id} | set per-page name / description / group |
| DELETE /api/v1/status-pages/{id}/components/{target_id} | remove a monitor from the page |
| POST /api/v1/status-pages/{id}/components/reorder | set the component order |
| POST /api/v1/status-pages/{id}/logo | upload a logo (multipart) |
| DELETE /api/v1/status-pages/{id}/logo | remove the logo |
Every route is scoped to the caller’s active org: a page id that isn’t in that org resolves to 404 (the same cloak as the rest of the API), so an owner of one org can neither see nor mutate another org’s page.
Page identity and branding
| Field | Rule | Default when unset |
|---|---|---|
| name | 1–80 chars; the operator-facing label in the Pages list (not shown publicly) | (required) |
| slug | globally-unique subdomain slug; 3–30 chars, lowercase letters / digits / hyphens, starts with a letter. A rename is a hard cutover: the old URL stops working immediately | (required) |
| enabled | published? a draft (false) 404s on its public host | off on create via the API; the signup default page is on |
| public_display_name | 1–80 chars | the org’s name |
| public_brand_color | #RRGGBB (6-digit hex) | #3b82f6 |
| public_about | Markdown, ≤ 500 chars, rendered to sanitised HTML | omitted |
| public_style | one of the named themes | default |
| public_show_powered_by | footer attribution toggle | on |
| logo | PNG / JPEG / WebP, ≤ 1 MB, ≤ 1200 px; larger images are downscaled. Format is sniffed from the bytes (declared content-type ignored — a script/SVG can’t masquerade as an image) and the decoder is allocation- and dimension-bounded against decompression bombs | header shows the display name as text |
A PATCH with a branding object replaces the display fields wholesale; name, slug, and enabled are independent partial fields. The logo has its own endpoints and is never touched by a branding edit. The editor shows the live URL so the owner can preview exactly what visitors see.
Curating components
A monitor appears on a page only while a status_page_components binding exists for that (page, target) pair. Adding the monitor in the editor creates the binding; removing it deletes the binding. The per-page curation lives on the binding, so the same monitor can sit on several pages under different names:
| Per-page field | Purpose |
|---|---|
| public_name | display name on this page; falls back to the operator-side monitor name when unset (1–80 chars) |
| public_description | optional one-liner under the component name (≤ 200 chars) |
| public_group | optional group label; same value clusters together, ungrouped renders last (≤ 50 chars) |
| sort_order | integer sort key within a group (ASC); the reorder endpoint rewrites it |
The per-org distinct-target cap is max_public_components: it counts unique monitors across all of the org’s pages. A monitor already published on one page costs nothing to add to another; a brand-new monitor at the cap is rejected with a quota error. Adding a monitor already on the page is an idempotent no-op; adding a page or target that isn’t in the caller’s org is a 404, not a quota error.
About text
public_about is Markdown. It is parsed and then run through an HTML sanitiser before it ever reaches a template: only p, strong, em, a, br, ul, ol, li survive, links get rel="noopener nofollow", and there is no raw-HTML escape hatch. Scripts and inline styles are stripped.
Brand colour
The colour is validated at three independent layers — the database constraint, the application validator, and again in the template right before it is written into the page’s <style>. Any value that isn’t a strict 6-digit hex falls back to the default at render time, so a relaxed constraint at one layer can’t open a CSS-injection path on its own.
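The render-time layer of that check could look like the sketch below. It assumes only what the text states (strict #RRGGBB, default #3b82f6); the function names are hypothetical and the database and application layers are not shown.

```rust
const DEFAULT_BRAND_COLOR: &str = "#3b82f6";

/// True only for a strict 6-digit hex colour such as "#3b82f6".
fn is_strict_hex_color(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 7 && bytes[0] == b'#' && bytes[1..].iter().all(|b| b.is_ascii_hexdigit())
}

/// Value that is safe to write into the page's <style>: the stored
/// colour if it validates, otherwise the default.
fn render_color(stored: Option<&str>) -> &str {
    match stored {
        Some(color) if is_strict_hex_color(color) => color,
        _ => DEFAULT_BRAND_COLOR,
    }
}
```

Because the check is an allowlist on exact shape, a value like `red;} body{display:none` never reaches the stylesheet even if an earlier layer let it through.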
Logo storage
An uploaded image’s format is detected from its bytes, not its declared content type. The on-disk filename is derived from the page and a hash of the content, never from anything the client sends, so a crafted filename can’t escape public_status.logo_dir. Replacing or removing a logo deletes the previous file.
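Byte sniffing for the three accepted formats can be sketched with their well-known file signatures. The function and type names here are hypothetical, the content hash is assumed to be computed elsewhere and passed in as a hex string, and the exact filename layout is an illustration rather than the service's scheme.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ImageFormat {
    Png,
    Jpeg,
    WebP,
}

/// Detect the format from leading bytes; the declared content type is never consulted.
fn sniff_image(bytes: &[u8]) -> Option<ImageFormat> {
    const PNG_MAGIC: &[u8] = b"\x89PNG\r\n\x1a\n";
    if bytes.starts_with(PNG_MAGIC) {
        return Some(ImageFormat::Png);
    }
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        return Some(ImageFormat::Jpeg);
    }
    // WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        return Some(ImageFormat::WebP);
    }
    None // SVG, scripts, and anything else are rejected
}

fn extension(format: ImageFormat) -> &'static str {
    match format {
        ImageFormat::Png => "png",
        ImageFormat::Jpeg => "jpg",
        ImageFormat::WebP => "webp",
    }
}

/// Build the on-disk name from server-side values only (hypothetical layout).
fn logo_filename(page_id: &str, content_hash_hex: &str, format: ImageFormat) -> String {
    format!("{page_id}-{content_hash_hex}.{}", extension(format))
}
```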
Caching and turning a page off
Each rendered page is cached for public_status.cache_ttl_secs (default 10 s), keyed by page id. A separate last-known-good layer keeps the most recent successful render per page so a transient Postgres/ClickHouse blip serves slightly stale data instead of an error. That layer is bounded by cache_max_orgs and idle-evicts after last_good_ttl_secs, so churn through many pages can’t grow it without limit.
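The interaction between the TTL cache and the last-known-good layer can be sketched with two maps. This is a deliberately simplified model, not the moka-based implementation: it is single-threaded (no single-flight), has no size bound or idle eviction, takes the clock as an explicit `now` argument in seconds, and uses hypothetical names throughout.

```rust
use std::collections::HashMap;

/// Simplified two-layer page cache keyed by page id.
struct PageCache {
    ttl_secs: u64,
    fresh: HashMap<u32, (u64, String)>, // page id -> (computed_at, html)
    last_good: HashMap<u32, String>,    // page id -> last successful render
}

impl PageCache {
    fn new(ttl_secs: u64) -> Self {
        PageCache { ttl_secs, fresh: HashMap::new(), last_good: HashMap::new() }
    }

    /// Serve from cache while fresh; otherwise recompute. If the compute
    /// fails, fall back to the last successful render for that page.
    fn get<F>(&mut self, page_id: u32, now: u64, compute: F) -> Option<String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if let Some((computed_at, html)) = self.fresh.get(&page_id) {
            if now.saturating_sub(*computed_at) < self.ttl_secs {
                return Some(html.clone());
            }
        }
        match compute() {
            Ok(html) => {
                self.fresh.insert(page_id, (now, html.clone()));
                self.last_good.insert(page_id, html.clone());
                Some(html)
            }
            // Storage blip: slightly stale data beats an error page.
            Err(_) => self.last_good.get(&page_id).cloned(),
        }
    }
}
```

A page with no successful render yet has nothing to fall back to, so in this sketch a failed first compute returns None.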
Unpublishing a page (enabled → false) makes the host resolver stop resolving its slug; the cache entry idles out, so the page is a 404 within one TTL window at most. Deleting a page or soft-deleting the org has the same effect (the purge worker handles the org case).
Security model
- Published only. The public host resolver admits a page only when it is enabled and its org is not soft-deleted. A draft or deleted page’s slug resolves to 404 even though the string still exists. The authenticated org lookup is a separate function and is never used on the public path.
- Operator sessions never reach status subdomains. The session cookie is host-only (auth.session.cookie_domain = ""), so the browser scopes it to the operator host and never sends it to *.{base_domain}. The binary refuses to boot if cookie_domain is set to a parent zone that would overlap the apex wildcard.
- No operator surface on the page. The status page renders no operator UI, sets no cookies, and never echoes request auth headers.
- Tenant isolation. A request for one page returns only that page’s curated monitors; the page cache and every data source are keyed by page id, and the underlying queries bind the org id, end to end. A monitor not bound to the page is never queried for it, so its operator-side name can’t leak.
Configuration
The [public_status] block and the split tenancy flags are documented in Configuration → Public status page and Configuration → Multi-tenancy mode.
Coming later: custom domains
Today every page is served under the shared *.{base_domain} apex wildcard. A future release will let an org point its own hostname (e.g. status.theirbrand.com) at a specific page:
- the org adds a CNAME to {slug}.{base_domain} and registers the custom hostname on the page’s settings;
- the reverse proxy issues a per-hostname certificate on demand (no wildcard for custom domains, since each is a distinct name);
- host resolution gains a custom-domain → page lookup ahead of the subdomain parser; everything downstream (cache, branding, isolation) is unchanged.
This is intentionally additive: the subdomain path keeps working as the always-available default, and nothing in the current data model blocks it. Custom domains are not available yet — track the roadmap before promising a customer a vanity status URL.
Share links
A share link gives anyone a read-only window into a single monitor — no account, no login. Open /m/{token} and you get the same detail view a logged-in member sees: live status, uptime, latency and response-time charts, recent check results, and the incident history. Paste the link in a chat channel, drop it in a ticket, or send it to a customer who needs to watch one endpoint without access to your org.
It is distinct from a status page: a status page is a branded, curated, multi-monitor public surface on its own subdomain; a share link is a capability URL to one monitor’s full dashboard.
What a viewer sees
Everything the operator detail view shows, with two deliberate differences:
- Read-only. No edit, delete, run-check-now, enable/disable, or navigation to the rest of the app. The page is its own shell with none of the operator chrome.
- Credentials redacted. The monitor’s check configuration is shown (so a viewer can see what is being checked and how), but any bearer_token or basic_auth is replaced with ***. The live credential never reaches the page.
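The redaction step can be pictured as building a separate view value before rendering. The struct below is a hypothetical stand-in for the real check configuration and carries only the fields named in the text; the point is that the template receives the redacted copy rather than the original.

```rust
/// Hypothetical subset of a monitor's check configuration.
#[derive(Debug, Clone, PartialEq)]
struct CheckConfig {
    url: String,
    bearer_token: Option<String>,
    basic_auth: Option<String>,
}

const REDACTED: &str = "***";

/// Copy the config for the share view, masking any credential that is set.
/// An unset credential stays unset, so the viewer can still tell
/// whether the check authenticates at all.
fn redact_for_share(cfg: &CheckConfig) -> CheckConfig {
    CheckConfig {
        url: cfg.url.clone(),
        bearer_token: cfg.bearer_token.as_ref().map(|_| REDACTED.to_string()),
        basic_auth: cfg.basic_auth.as_ref().map(|_| REDACTED.to_string()),
    }
}
```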
The page auto-refreshes its live region and charts just like the operator view, scoped entirely to the token — it never calls an operator or API URL.
The token
Minting a link returns a 256-bit random token; the URL is /m/{token}. The token is the capability — anyone holding it can view the monitor, and forwarding the link grants access. The controls are revoke (kill it now) and an optional expiry (kill it at a set time); a link with no expiry lives until revoked or the monitor is deleted.
The link is re-copyable, like a Google Docs or Dropbox share link: open the Share modal (or the list endpoint) any time to copy the same URL again. Lost the chat you posted it in? Copy it again — you only get a new token when you revoke and create one.
Limits come from the org’s plan (plans columns, overridable per-org): the free plan allows 1 active link per monitor and shares on at most 2 distinct monitors per org. Revoke a link to free a slot.
To make re-copying possible, the token is stored encrypted at rest with the app KEK (the same Cipher that protects basic_auth/bearer_token), so a raw database or backup dump without the key yields nothing usable. The public lookup matches on a separate one-way hash, so a hot link never triggers a decrypt. With no KEK configured the token is stored in plaintext (same fallback as target credentials); if a token was sealed under a key that is later removed, the link shows as un-copyable rather than broken.
Bad, expired, revoked, and deleted-monitor tokens all return the same 404. Nothing distinguishes “wrong token” from “revoked token”, so the surface cannot be enumerated.
Managing links
From the API (member-level targets:write):
# Mint a link (optionally labelled, optionally expiring)
curl -X POST https://app.example.com/api/v1/targets/$ID/shares \
-H 'Content-Type: application/json' \
-d '{"label":"Slack #ops","expires_at":"2026-12-31T00:00:00Z"}'
# → { "id": "...", "label": "Slack #ops", "token": "…", "view_count": 0, ... }
# build the link as /m/{token}
# List the monitor's live links (each carries its token for re-copy)
curl https://app.example.com/api/v1/targets/$ID/shares
# Revoke one
curl -X DELETE https://app.example.com/api/v1/targets/$ID/shares/$SHARE_ID
The same actions are available from the monitor’s detail page in the UI. See the REST API for the endpoint contract.
Where links live
Share links resolve on the operator app host, not on a per-tenant status subdomain. A monitor’s deletion cascades to its shares, so removing a monitor revokes every link to it.
Abuse
The surface is anonymous, so per-IP request throttling is handled at the reverse proxy. App-side, the live region is served from a short-lived shared cache and every data read inherits the same time-window and page-size limits as the operator API, bounding the cost of any single request.
Incident management
uptimepage turns a failing check into a first-class operational incident: a tracked lifecycle with acknowledgement, ownership, paging, on-call rotations, escalation, and a retrospective — not just a banner on a status page. This chapter is for operators running incident response. For the customer-facing surface it publishes to, see Public status page; for the wire-level endpoints see REST API.
The core idea: internal state is not public phase
The single most important distinction is that what your responders see is orthogonal to what your customers see. Conflating the two is the classic incident-tooling bug, so uptimepage keeps three independent axes on one incident:
| Axis | Values | Audience | Changed by |
|---|---|---|---|
| Internal state | triggered → acknowledged → resolved | Responders | Acknowledge / resolve / reopen actions |
| Public phase | investigating / identified / monitoring / resolved / postmortem | Customers on a status page | Operator-posted public updates only |
| Visibility | internal / public | — | An explicit publish action |
Acknowledging an incident stops escalation and records who took it — it posts nothing to a status page. Customers see something only when you publish the incident and post a public update. An incident can run its whole internal lifecycle while staying internal.
How an incident opens
A background writer scans every enabled monitor (not only status-page components). When a monitor sustains a bad state — down, error, or degraded — it opens one incident; a sustained recovery to up resolves it automatically (with no human resolver recorded). One open incident per monitor at a time; duplicate failures fold into it.
Visibility is derived at open time: if the monitor is a component of an enabled status page the incident opens public, otherwise internal. A monitor on no page still gets a fully tracked internal incident.
You can also declare an incident by hand from the console (/incidents/declare) — for a problem a monitor can’t see, like a customer report or a partner outage. A manual incident may stand alone or link to a monitor, and opens internal.
Each incident carries a severity (minor / major / critical) and an urgency (high pages on-call, low notifies only). A declared incident takes the severity you choose; an auto-opened one currently defaults to major until an operator changes it.
The console
/incidents is the operator console — a management surface distinct from the dashboard’s at-a-glance banner. It lists incidents with severity, state, monitor, assignee, and age, filterable by state. /incidents/{id} is the detail view: header, the action bar, the trigger sample, and the activity log.
The action bar drives the lifecycle:
| Action | Effect |
|---|---|
| Acknowledge | state = acknowledged, records the first acker, stops escalation. Re-acking keeps the original acker and time. |
| Resolve | state = resolved, records the resolver. (A sustained recovery auto-resolves with no resolver.) |
| Reopen | A resolved incident returns to triggered and re-arms escalation. |
| Assign / unassign | Set or clear the owning responder. |
| Add note | Free-text entry on the internal timeline. |
Acknowledge and resolve prompt for an optional note so you can capture the why at the moment you act.
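The lifecycle actions in the table amount to a small state machine. The sketch below uses hypothetical types and makes two assumptions the text does not settle: acknowledging an already-resolved incident is ignored, and reopening keeps the original acker on record.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Triggered,
    Acknowledged,
    Resolved,
}

#[derive(Debug)]
struct Incident {
    state: State,
    acked_by: Option<String>,
    resolved_by: Option<String>,
    escalating: bool,
}

impl Incident {
    fn open() -> Self {
        Incident { state: State::Triggered, acked_by: None, resolved_by: None, escalating: true }
    }

    /// Records the first acker only; re-acking keeps the original.
    fn acknowledge(&mut self, who: &str) {
        if self.state == State::Resolved {
            return; // assumption: a resolved incident must be reopened first
        }
        if self.acked_by.is_none() {
            self.acked_by = Some(who.to_string());
        }
        self.state = State::Acknowledged;
        self.escalating = false;
    }

    /// `who` is None for an automatic recovery (no human resolver).
    fn resolve(&mut self, who: Option<&str>) {
        self.state = State::Resolved;
        self.resolved_by = who.map(str::to_string);
        self.escalating = false;
    }

    /// Only a resolved incident can be reopened; escalation re-arms.
    fn reopen(&mut self) {
        if self.state == State::Resolved {
            self.state = State::Triggered;
            self.resolved_by = None;
            self.escalating = true;
        }
    }
}
```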
The activity log
Every lifecycle action writes an append-only event to the incident’s internal timeline. Each entry answers who, when, and what: the acting member’s email (system-driven transitions show system; an action taken through the MCP server is badged via MCP), an exact timestamp, and any note. This is the audit trail. Tracking a response well depends on a habit of leaving notes, and the log makes that habit visible.
Paging and escalation
When an incident opens, the escalation engine pages the responsible channels. Paging reuses the existing Slack / Discord / Teams / Google Chat / Telegram (one-tap linked or bring-your-own bot) / WhatsApp / Webhook transports (see Configuration); email and SMS are not wired yet. Telegram rate-limit responses are honoured: a 429 with retry_after pushes the retry out at least that far.
An escalation policy is an ordered ladder of levels. Each level waits a delay, then pages its targets; if no one acknowledges, the engine advances to the next level, and can repeat the ladder a configured number of times before giving up. Acknowledging the incident halts the walk.
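To make the ladder walk concrete, the sketch below computes when each level would page, as offsets in seconds from incident open. It rests on assumptions the text leaves open: delays accumulate from one level to the next, `repeats` counts extra passes after the first, and an acknowledgement at or before a level's fire time suppresses that level. `Level` and `page_schedule` are hypothetical names.

```rust
/// One rung of the ladder; targets are omitted for brevity.
struct Level {
    delay_secs: u64,
}

/// Returns (offset_secs, level_index) for every page sent before the
/// incident is acknowledged or the ladder is exhausted.
fn page_schedule(levels: &[Level], repeats: u32, acked_at: Option<u64>) -> Vec<(u64, usize)> {
    let mut fired = Vec::new();
    let mut t = 0u64;
    // One initial pass plus `repeats` further passes.
    for _ in 0..=repeats {
        for (index, level) in levels.iter().enumerate() {
            t += level.delay_secs;
            // Acknowledging halts the walk.
            if acked_at.is_some_and(|ack| ack <= t) {
                return fired;
            }
            fired.push((t, index));
        }
    }
    fired
}
```

With levels of 0 s, 300 s, and 600 s and no repeats, the pages go out at 0, 300, and 900 seconds; an acknowledgement at 400 seconds stops the walk before the third level.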
A policy’s targets can be:
- a channel — pages that notification channel directly;
- a user — pages the channels that member has chosen to be reached on (see on-call below);
- a schedule — resolves who is on call right now and pages them.
Policies are owner-managed at /settings/escalation: build the ladder, set per-level targets, and pick an org-default policy. Bind a specific policy to a monitor from the monitor’s edit form. Resolution at page time is: the monitor’s own policy, else the org default, else simple mode — the monitor’s bound notification channels are paged directly, with no laddered re-paging.
One notification source. Every down/up notification flows through the incident engine; there is no separate per-monitor alert dispatch, so a monitor can never double-page. The escalation.enabled switch gates only the policy machinery (ladder walk, policy UI); with it off, monitors still page their bound channels in simple mode.
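The resolution order reduces to a short fallback chain. The sketch below is illustrative: policies are represented by name only, and `Paging` and `resolve_paging` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Paging<'a> {
    /// Walk the named escalation policy's ladder.
    Policy(&'a str),
    /// Page the monitor's bound channels directly, with no ladder.
    Simple,
}

/// Monitor's own policy, else the org default, else simple mode.
/// With escalation.enabled off, the policy machinery is skipped entirely.
fn resolve_paging<'a>(
    escalation_enabled: bool,
    monitor_policy: Option<&'a str>,
    org_default: Option<&'a str>,
) -> Paging<'a> {
    if !escalation_enabled {
        return Paging::Simple;
    }
    match monitor_policy.or(org_default) {
        Some(policy) => Paging::Policy(policy),
        None => Paging::Simple,
    }
}
```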
While an incident stays unacknowledged, the engine re-sends a reminder on the monitor’s renotify_interval_secs cadence (default hourly, 0 disables); acknowledging or resolving stops both the reminders and any escalation walk. Failed deliveries retry on exponential backoff and are dead-lettered after the attempt cap. Every attempt is auditable: the incident detail page has a Delivery section, and GET /api/v1/incidents/{id}/notifications returns the same log.
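The retry schedule for a failed delivery might be computed as below. The base delay, ceiling, and attempt cap are placeholder values picked for the example, not the service's settings; only the overall shape (exponential backoff, a dead-letter after the cap, and a Telegram retry_after acting as a lower bound) comes from the text.

```rust
// Placeholder tuning values, chosen for illustration only.
const BASE_SECS: u64 = 5;
const MAX_SECS: u64 = 300;
const MAX_ATTEMPTS: u32 = 8;

/// Seconds to wait before the next attempt, or None to dead-letter.
/// `failed_attempts` counts deliveries that have already failed.
fn next_retry_secs(failed_attempts: u32, retry_after_secs: Option<u64>) -> Option<u64> {
    if failed_attempts >= MAX_ATTEMPTS {
        return None; // attempt cap reached: dead-letter
    }
    // 5 s, 10 s, 20 s, ... capped at MAX_SECS. The shift is at most 6
    // because failed_attempts is below MAX_ATTEMPTS here.
    let exponent = failed_attempts.saturating_sub(1);
    let backoff = (BASE_SECS << exponent).min(MAX_SECS);
    // A 429 retry_after pushes the retry out at least that far.
    Some(backoff.max(retry_after_secs.unwrap_or(0)))
}
```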
On-call schedules
On-call schedules (owner-managed at /settings/on-call) decide which human a user or schedule target pages.
A schedule has a timezone and one or more layers. Higher layers win when stacked. Within a layer, participants rotate in listed order on a cadence:
| Rotation | Handoff |
|---|---|
| daily / weekly | Hands off at the same wall-clock time each period, in the schedule’s timezone; stable across daylight-saving changes. |
| custom | A fixed number of seconds. |
Overrides cover a specific window with a chosen person (vacations, swaps) and beat the rotation while active. The editor’s calendar builds one by clicking a start day, then an end day, then choosing who covers. A “who’s on call now” widget resolves the current responder, and GET /api/v1/on-call/who answers it programmatically.
Resolution at page time, for a given instant: an override covering that instant wins; otherwise the highest layer that has participants, advanced by its rotation. The result is a set of users.
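The resolution order above can be sketched in a few lines. This is a minimal illustration, not the project's code: the `Override` and `Layer` types, their field names, and the `resolve_on_call` function are hypothetical, timestamps are plain Unix seconds, and only the custom (fixed-seconds) rotation is modelled, since daily/weekly handoffs would also need timezone handling.

```rust
/// A temporary substitution that beats the rotation while active (hypothetical type).
#[derive(Debug, Clone)]
pub struct Override {
    pub start: u64, // inclusive, Unix seconds
    pub end: u64,   // exclusive, Unix seconds
    pub user: String,
}

/// One rotation layer (hypothetical type). Only the fixed-seconds cadence is modelled.
#[derive(Debug, Clone)]
pub struct Layer {
    pub participants: Vec<String>,
    pub rotation_start: u64, // instant at which participant 0 first went on call
    pub shift_secs: u64,     // length of one shift
}

/// Resolve who is on call at `now`. `layers` is ordered from lowest to highest.
pub fn resolve_on_call(overrides: &[Override], layers: &[Layer], now: u64) -> Vec<String> {
    // 1. An override covering this instant wins outright.
    let covering: Vec<String> = overrides
        .iter()
        .filter(|o| o.start <= now && now < o.end)
        .map(|o| o.user.clone())
        .collect();
    if !covering.is_empty() {
        return covering;
    }
    // 2. Otherwise use the highest layer that has participants, advanced by its rotation.
    for layer in layers.iter().rev() {
        if layer.participants.is_empty() || layer.shift_secs == 0 {
            continue;
        }
        let elapsed = now.saturating_sub(layer.rotation_start);
        let idx = (elapsed / layer.shift_secs) as usize % layer.participants.len();
        return vec![layer.participants[idx].clone()];
    }
    // No override and no populated layer: nobody resolves.
    Vec::new()
}
```

Returning a list rather than a single name mirrors the statement that the result is a set of users; an empty list means the target resolves to nobody.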
Contact channels
A resolved user is paged through the org channels they have opted into — each member picks, on the on-call page, which notification channels reach them. A user/schedule target therefore resolves to people, then to their chosen channels; the paging log records the targeted user alongside the channel. If a member has chosen no channels, they resolve but cannot be paged.
Publishing to a status page
Internal incidents never reach customers. Publishing is the explicit gate.
Every public read — the status page, its JSON API, the RSS feed, and the history markers — filters on visibility = 'public', so an internal incident on a public-component monitor never leaks. Monitors that sit on an enabled status page open public automatically; everything else (manual incidents, monitors not on a page) stays internal until you publish.
From the incident detail page, publish flips visibility to public (optionally seeding a public title) and unpublish hides it again. A published incident appears on any status page whose components include its monitor. Narrate it for customers with public updates (the investigating → monitoring → resolved timeline); posting an update is separate from the internal state, exactly as the two-axis model intends.
Postmortems
A resolved incident can carry one postmortem — a retrospective with a summary, root cause, impact, and a list of action items (each with optional owner and a done flag). Write it from the incident detail page (write / edit postmortem).
Publishing a postmortem surfaces it on the public incident page: customers see the summary, root cause, impact, and the action-item text and done state. Internal detail — the action-item owner — is never exposed publicly. A draft stays private until you publish, and publish/unpublish are recorded on the incident’s activity timeline with the acting member, so the retrospective’s own history is auditable.
Metrics and reporting
/incidents/reports is a metrics dashboard over a trailing window (7 / 30 / 90 days):
- MTTA: mean time to acknowledge (acknowledged_at − started_at).
- MTTR: mean time to resolve (ended_at − started_at).
- Total incidents, counts by severity and by state, auto-resolved vs human-resolved, and the noisiest monitors.
The same numbers are available to automation through the MCP get_incident_metrics tool.
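As a worked illustration of the two means, here is a small sketch. The `Incident` struct and function names are hypothetical, timestamps are Unix seconds, and it assumes that incidents never acknowledged (or not yet resolved) are simply left out of the respective mean; the docs do not say how the real dashboard treats them.

```rust
/// Hypothetical, pared-down incident record (timestamps in Unix seconds).
pub struct Incident {
    pub started_at: u64,
    pub acknowledged_at: Option<u64>,
    pub ended_at: Option<u64>,
}

/// Arithmetic mean of a list of durations, or None when the list is empty.
fn mean(durations: &[u64]) -> Option<f64> {
    if durations.is_empty() {
        return None;
    }
    Some(durations.iter().sum::<u64>() as f64 / durations.len() as f64)
}

/// MTTA: mean of (acknowledged_at - started_at) over acknowledged incidents.
pub fn mtta_secs(incidents: &[Incident]) -> Option<f64> {
    let d: Vec<u64> = incidents
        .iter()
        .filter_map(|i| i.acknowledged_at.map(|a| a.saturating_sub(i.started_at)))
        .collect();
    mean(&d)
}

/// MTTR: mean of (ended_at - started_at) over resolved incidents.
pub fn mttr_secs(incidents: &[Incident]) -> Option<f64> {
    let d: Vec<u64> = incidents
        .iter()
        .filter_map(|i| i.ended_at.map(|e| e.saturating_sub(i.started_at)))
        .collect();
    mean(&d)
}
```

For example, two acknowledged incidents with acknowledgement delays of 60 s and 120 s give an MTTA of 90 s.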
MCP tools
An LLM connected through the MCP server can triage and operate incidents within its granted scopes: read the incident list and detail, read metrics, and — with write scope — acknowledge, resolve, and post public updates. Customer-supplied incident text is always returned as labelled data, never as instructions. See MCP server for the full tool table and scopes.
Auth and scopes
| Surface | Requirement |
|---|---|
| Incident lifecycle (ack / assign / resolve / note / publish / declare) | incidents:write — any member; responders are not owners |
| Reading incidents and metrics | incidents:read |
| Escalation policies + on-call schedules (config) | oncall:write (owner-only); oncall:read to view |
There is no incident-delete: incidents are resolved, never deleted, to keep the audit trail intact. Owner and member are the only roles — any member can be assigned, put on a schedule, paged, and can operate an incident; owners manage the escalation/on-call configuration.
Configuration
The [escalation] block (env prefix UPTIMEPAGE_ESCALATION__*) controls the engine:
| Key | Default | Purpose |
|---|---|---|
| enabled | false | Enable escalation policies (ladder walk + policy/on-call UI). Off, incidents still page the monitor’s bound channels directly (simple mode). |
| tick_interval_secs | 15 | How often the engine sweeps for due escalations and failed-page retries. |
| max_pages_per_tick | 500 | Backpressure cap on pages re-sent per sweep. |
| max_attempts | 5 | Give up paging a channel after this many failed attempts. |
Per-org limits (max_escalation_policies, max_on_call_schedules, on_call_enabled) are plan quotas; see Quotas & rate limits.
Multi-tenancy
uptimepage runs as a multi-tenant SaaS from a single binary. The active org is always resolved from the authenticated session; there is no compile-time “self-host vs SaaS” mode and no ambient default org.
A single-tenant deployment is just a SaaS deployment where you sign up as the first user — the OAuth callback creates the user, an auto-provisioned org and the owner membership in one transaction. Teams who would rather skip the OAuth round-trip can seed users + organizations + memberships directly with a one-shot SQL script.
The org model
Three tables form the access-control core:
organizations ── memberships ── users
│
└── role: 'owner' | 'member'
Every tenant-scoped table (targets, incidents, incident_updates, maintenance_windows, maintenance_window_components, notification_channels, …) carries org_id NOT NULL and an ON DELETE CASCADE foreign key to organizations. ClickHouse check_results and check_results_1m are partitioned by (org_id, target_id, ts) so single-org queries never full-scan the table.
Slugs
Org slugs are case-insensitive (CITEXT), 3–30 characters, must start with a lowercase letter, and otherwise contain [a-z0-9-] only — no leading or trailing hyphen and no consecutive hyphens. A static reserved list (api, admin, login, …) is rejected at creation.
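The slug rules translate directly into a validator. The sketch below is illustrative only: `validate_slug`, `SlugError`, and the three-entry `RESERVED` list are hypothetical names, and the reserved list holds just the entries named above because the full list is elided in these docs.

```rust
/// Only the entries named in the docs; the real reserved list is longer.
const RESERVED: &[&str] = &["api", "admin", "login"];

#[derive(Debug, PartialEq, Eq)]
pub enum SlugError {
    Length,
    BadStart,
    BadChar,
    TrailingHyphen,
    DoubleHyphen,
    Reserved,
}

/// Check a candidate org slug against the documented rules.
pub fn validate_slug(slug: &str) -> Result<(), SlugError> {
    // 3 to 30 characters.
    let len = slug.chars().count();
    if !(3..=30).contains(&len) {
        return Err(SlugError::Length);
    }
    // Must start with a lowercase letter (this also rules out a leading hyphen).
    if !slug.starts_with(|c: char| c.is_ascii_lowercase()) {
        return Err(SlugError::BadStart);
    }
    // Otherwise [a-z0-9-] only.
    if !slug
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
    {
        return Err(SlugError::BadChar);
    }
    if slug.ends_with('-') {
        return Err(SlugError::TrailingHyphen);
    }
    if slug.contains("--") {
        return Err(SlugError::DoubleHyphen);
    }
    if RESERVED.iter().any(|r| *r == slug) {
        return Err(SlugError::Reserved);
    }
    Ok(())
}
```

Uniqueness is not checked here; in the real service that is the database's job, where CITEXT makes the comparison case-insensitive.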
The placeholder slug a brand-new user’s first org gets at signup takes the shape {adj}-{noun}-{6char} from inline word lists in src/domain/word_lists.rs. The signup transaction returns Ok(None) on a slug collision so the caller wraps the generate-and-insert pair in a 5-attempt retry loop; the birthday-paradox tail above 5 retries is astronomically small. Users typically rename the slug after signup from settings; the org’s default status page is created with the same slug, which the owner can change independently in the page editor.
Three-org owner limit
A user can be owner of at most free_tier_owner_org_limit (default 3) active organisations. Enforced in a single SQL statement that puts the count subquery inside the INSERT … WHERE … so two concurrent creates cannot both win. Soft-deleted orgs do not count against the cap. Invited memberships (role member) are unlimited.
Soft delete and the 30-day purge
Deletion is two-phase to give operators a recovery window and to keep ClickHouse rows out of forever-orphan state.
- Soft delete. DELETE /api/v1/orgs/{id} flips organizations.deleted_at = now(). The org disappears from the user’s switcher and every URL referencing it returns 404; is_active_member short-circuits on deleted_at IS NULL.
- Restore window. The original deleter can call POST /api/v1/orgs/{id}/restore within deletion_grace_period_days (default 30); the slug stays held to prevent squatting during this window.
- Purge. A daily job (src/jobs/retention.rs) runs at 03:00 UTC. It first runs the soft-delete purge (src/jobs/purge_deleted.rs::purge_tick):
  - Selects up to 10 orgs whose deleted_at is past the grace window.
  - Per org, in one PG transaction: insert into clickhouse_purge_queue (idempotent via ON CONFLICT (org_id) DO NOTHING), then DELETE FROM organizations; ON DELETE CASCADE empties every tenant table.
  - Drains pending queue rows by issuing ALTER TABLE check_results DELETE WHERE org_id = ? against ClickHouse for each. The mutation is idempotent; a process restart between halves replays cleanly.
  - Then hard-deletes up to 10 soft-deleted users past the grace window that hold no live (unexpired, unused) recovery token. The users ON DELETE CASCADE erases memberships, oauth_identities, api_tokens, invitations, sessions and recovery tokens; rows referencing the user as an actor (login_attempts, org_audit_log, quota_events, plan_overrides) are kept with the actor nulled.
The same daily job then enforces long-horizon data retention from the [retention] config: it deletes login_attempts, quota_events and org_audit_log rows past their windows and reaps sessions that are absolute-expired or idle past auth.session.idle_timeout_days. ClickHouse check_results retention is the table’s own TTL (background merge), kept equal to retention.check_results_days. Short-cadence security sweeps (OAuth-state, magic-link) keep their own faster loops — their frequency is the property.
The outbox table is the load-bearing piece. A naive “DELETE in PG, then DELETE in CH” sequence leaves CH rows orphaned if the worker dies between calls — invisible to queries but on disk forever, breaking the “data fully erased within 30 days” privacy claim.
Per-org caches
AppState keeps tenant-derived caches keyed by OrgId so one tenant’s data cannot leak into another’s response:
| Cache | Type | TTL |
|---|---|---|
| dashboard_cache | moka::sync::Cache<OrgId, Arc<DashboardSummary>> | 5 s |
| public_status::cache::PageCache | moka::future::Cache<StatusPageId, Arc<PageData>> | 10 s |
| PageCache::last_good | moka::sync::Cache<StatusPageId, Arc<PageData>> | retained across inner’s TTL eviction for stale-fallback |
The public-page caches are keyed by StatusPageId, not OrgId: an org can run several pages, each rendering a different subset of monitors, so the cache unit is the page. The underlying aggregator query still binds the org id, so a page only ever sees its own org’s data. PageCache::get_or_compute does per-page single-flight via moka’s try_get_with, so a thundering herd against one page doesn’t fan out into N expensive aggregator builds.
Public status routes gating
Public-status routing has two shapes, gated by tenancy.path_based_public_routes and tenancy.subdomain_public_routes. Path-based routing (/status, /api/public/v1/* on the operator host, scoped to the single live org) is the default and is correct only for a single-tenant deploy. Multi-tenant deployments must flip to subdomain routing ({slug}.{base_domain}) — otherwise every visitor sees the lone org’s data regardless of which slug they expected. The binary panics at boot on the dangerous combinations (subdomain routes with an empty base_domain, or a cookie_domain that overlaps the status wildcard); see Public status routing for the full flag matrix.
Tenant-isolation invariants
These are checked in CI:
- Every runtime SQL statement against a tenant table must include org_id in its WHERE clause. Enforced by scripts/check_tenant_isolation.sh via an ast-grep rule. The only allow-listed call sites are src/storage/admin.rs (AdminRepo, cross-tenant by design) and src/storage/orgs.rs (operates on the organizations table itself), plus src/jobs/purge_deleted.rs (drains soft-deleted orgs and users across tenants).
- Every ClickHouse SELECT … WHERE target_id = … must have a sibling org_id = ? term. Enforced by scripts/check_clickhouse_org_scope.sh.
- A Postgres trigger on every child table (incident_updates, maintenance_window_components) raises on org_id mismatch between child and parent rows.
- An integration test (tests/tenant_isolation_test.rs) provisions two orgs and asserts every per-org store backed by Postgres or ClickHouse only sees its own org’s rows.
If you add a new tenant-scoped table or a new repository, make sure both ast-grep rules cover it before merge.
Org-management API
See REST API for full schemas. The catalogue:
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/orgs | Create org (slug, name); caller becomes owner |
| GET | /api/v1/orgs | List orgs the caller is a member of |
| GET | /api/v1/orgs/{id} | Get one org (member-only) |
| PATCH | /api/v1/orgs/{id} | Edit org (owner-only) |
| DELETE | /api/v1/orgs/{id} | Soft-delete (owner-only) |
| POST | /api/v1/orgs/{id}/restore | Restore within the grace window (only by the deleter) |
| GET | /api/v1/orgs/check-slug?slug=… | Slug availability for signup forms |
| GET | /api/v1/orgs/{id}/members | List members (owner-only) |
| DELETE | /api/v1/orgs/{id}/members/{user_id} | Remove a member (owner-only) |
| PATCH | /api/v1/orgs/{id}/members/{user_id} | Change a member’s role (owner-only; refuses to demote the last owner) |
| POST | /api/v1/me/active-org | Switch the session’s active org |
| GET | /api/v1/me/orgs | Active (non-deleted) orgs |
| GET | /api/v1/me/deleted-orgs | Soft-deleted orgs you deleted (restore UI) |
Multi-region probes
Run checks from more than one location and keep every result attributed to the region that produced it. A single control plane owns all state (Postgres, ClickHouse, the web UI, alerting, and a scheduler for its own region); additional boxes run as stateless agents that pull their region’s monitor config and ship results back.
This is opt-in. A default deployment is a single region — the control plane checks everything itself and nothing below changes.
Model
- Control plane: one process holding Postgres + ClickHouse + the web UI + alerting + a scheduler. Its own region is a normal region row identified by scheduler.region (default "default"); rename it to a real location, it is not a sentinel.
- Agent: a process started with [agent] enabled = true. It runs no database, web UI, or alerting. It pulls its region’s decrypted monitor config from the control plane over authenticated HTTPS, runs the checks locally, and POSTs results back to the central ingest API. Agents never touch ClickHouse or fire alerts.
- Region is the partition key. One agent per region needs no coordination; there is no leader election. (Running more than one agent in the same region, or more than one control plane, is out of scope for this version.)
New targets are assigned to scheduler.default_region (empty falls back to scheduler.region). At boot the control plane reconciles the configured region rows and backfills any unassigned target to the default region, so enabling regions never leaves a target unchecked.
Running an agent
On the agent box, point at the control plane and name the region. The token carries the agent’s capability — supply it by environment variable, never in a committed file:
[agent]
enabled = true
control_plane_url = "https://app.example.com"
region = "eu-west"
pull_interval_secs = 30
flush_interval_secs = 5
buffer_capacity = 10000
UPTIMEPAGE_AGENT__TOKEN=sm_agent_… # the token minted by POST /operator/agents
The agent must reference a region and a token that already exist (see the operator surface below). Pull and ingest behaviour:
- Pull (GET /api/agent/targets). 401/403 is terminal: the agent clears its cached config and pauses, so revoking or disabling the agent stops the probe. 5xx/timeout is transient: it keeps serving the last-known config. Responses are content-hashed with an ETag, so a credential re-encrypt invalidates the cache even without a config change.
- Ingest (POST /api/agent/results). Region and agent id are taken from the token, never trusted from the body. Rows that are clock-skewed or belong to a region the agent isn’t assigned are dropped per-row (the rest of the batch still lands) and counted, rather than rejecting the whole batch. Cross-process de-duplication is authoritative in ClickHouse; a re-sent identical batch is idempotent.
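The pull-side error handling amounts to a small classification of the HTTP outcome. The sketch below is an assumption-heavy illustration: `PullOutcome` and `classify_pull` are hypothetical names, a timeout is represented as `None`, and the treatment of 304 and of other 4xx codes (keep the cached config) is a guess, since only 401/403 and 5xx/timeout are specified above.

```rust
/// What the agent does with its cached monitor config after a pull attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum PullOutcome {
    /// 2xx: replace the cached config with the response body.
    ApplyNewConfig,
    /// Transient failure or nothing new: keep serving the last-known config.
    KeepCachedConfig,
    /// 401/403: credential revoked or agent disabled; clear the cache and pause.
    ClearAndPause,
}

/// Classify a pull result. `status` is None when the request timed out
/// or never produced an HTTP response.
pub fn classify_pull(status: Option<u16>) -> PullOutcome {
    match status {
        Some(401) | Some(403) => PullOutcome::ClearAndPause,
        Some(s) if (200..300).contains(&s) => PullOutcome::ApplyNewConfig,
        // 5xx and timeouts are documented as transient. 304 and other
        // status codes are treated the same way here, as an assumption.
        _ => PullOutcome::KeepCachedConfig,
    }
}
```

Keeping the terminal branch first makes the fail-closed behaviour explicit: a revoked token stops probing even if the agent still holds a valid cached config.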
Operator surface
Regions and agents are managed instance-wide (across all tenants) under /operator/*, gated by a static bearer secret. Set it by environment variable; an empty value disables the surface entirely (it 404s, so it is invisible when off):
UPTIMEPAGE_OPERATOR__ADMIN_TOKEN=…
Authorization: Bearer <that-secret>
| Method | Path | Purpose |
|---|---|---|
| GET | /operator/regions | list regions |
| POST | /operator/regions | create a region (id is a [a-z0-9-] slug, name, optional location) |
| PATCH | /operator/regions/{id} | rename / relocate, or enable / disable a region (enabled) |
| DELETE | /operator/regions/{id} | delete a region; 409 while it still holds agents or assigned targets |
| GET | /operator/agents | list agents |
| POST | /operator/agents | mint an agent; the response carries its sm_agent_… token once |
| PATCH | /operator/agents/{id} | rename / enable / disable an agent |
| DELETE | /operator/agents/{id} | delete an agent |
The agent token is shown only at creation; store it when it is minted. Disabling an agent is immediately enforced on its next pull. There is no token-rotation endpoint yet — rotate by deleting and re-creating the agent.
Disabling a region stops it being scheduled and stops config-pull for it (its agents receive no targets) while keeping its stored history — a reversible alternative to deleting, which the foreign keys block while the region is in use.
A typical bring-up: create the region, mint an agent in it, copy the token to the agent box’s UPTIMEPAGE_AGENT__TOKEN, start the agent.
Viewing per-region data
Once results carry a region, the operator surfaces let you slice by it:
- Dashboard: a region: filter in the subhead (shown only when the org spans more than one region) scopes every fleet metric to one region. ?region= is reflected in the URL.
- Monitor detail: a region selector scopes the KPI cards, latency and breakdown charts, and recent results. In the all-regions view the latency chart overlays one p95 line per region, and a by region table summarises uptime, p50, p95, and last status per region. Pick a region to drill into a single line.
- REST API: /api/v1/targets/{id}/results, /latency, and /uptime accept an optional region= query parameter; /api/v1/targets/{id}/latency/by-region returns one series per region. GET /api/v1/regions lists the enabled region catalog and GET/PUT /api/v1/targets/{id}/regions read and set a monitor’s assignment, all under targets:read / targets:write. See REST API.
What deliberately blends across regions: the public status page’s component status (the public “is it up” answer is region-agnostic by design), the monitors list, and incident timelines. Those aggregate every region so a viewer sees one verdict.
Incident detection across regions
Detection evaluates each region’s recent run independently and then combines the verdicts, so one region’s transient network blip can’t corrupt the picture for a target probed from several places. There is always exactly one incident per target — its region is unset.
How the per-region verdicts combine is a per-monitor policy, set on the monitor form (default majority):
- any — open as soon as a single region is sustained-unhealthy.
- majority — open once more than half the regions agree it’s down (the standard defence against a single-location false positive).
- all — open only when every region is down.
- count: N — open once at least N regions are down.
A monitor probed from a single region behaves the same under every policy.
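The four policies reduce to one predicate over "regions currently down" and "regions probing the target". The sketch below uses hypothetical names (`RegionPolicy`, `should_open`) and assumes that count: N is clamped to the number of probing regions, which is one way to make a single-region monitor behave identically under every policy; the actual clamping rule is not documented here.

```rust
/// Per-monitor policy for combining per-region verdicts (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionPolicy {
    Any,
    Majority,
    All,
    Count(usize),
}

/// Decide whether to open an incident, given how many of the `total`
/// probing regions are sustained-unhealthy (`down`).
pub fn should_open(policy: RegionPolicy, down: usize, total: usize) -> bool {
    if total == 0 {
        return false; // nothing is probing the target
    }
    match policy {
        RegionPolicy::Any => down >= 1,
        // Strictly more than half of the regions.
        RegionPolicy::Majority => down * 2 > total,
        RegionPolicy::All => down == total,
        // Assumption: N is clamped to 1..=total so it can always be met.
        RegionPolicy::Count(n) => down >= n.min(total).max(1),
    }
}
```

With three regions and one of them down, only `Any` opens an incident; `Majority` and `Count(2)` wait for a second region, and `All` waits for the third.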
See Configuration for the [scheduler], [agent], and [operator] keys, and Architecture for where the pieces sit.
Authentication
uptimepage ships with an in-binary auth stack: GitHub OAuth for the operator UI, opaque per-user API tokens for the REST surface, and optional magic-link sign-in for users without a GitHub identity. The binary always runs as multi-tenant SaaS — single-tenant deployments are just SaaS with one signed-up user; see Multi-tenancy for the full model.
Concepts
- User. A row in users, keyed by id. Email is CITEXT. A user can belong to multiple orgs.
- Session. A 32-byte random id (43 base64url chars) stored in a HttpOnly; Secure; SameSite=Lax cookie, default _sm_session. Backed by a sessions row with idle + absolute timeouts.
- API token. An opaque bearer token (sm_live_…) presented in the Authorization: Bearer … header. Stored as an argon2id hash plus a 16-char prefix for indexed lookup. Returned once at create time and never again.
- Org. Container for the user-visible data (targets, incidents, maintenance, …). Memberships carry a role: Owner, Member.
- Invitation. A pending row in invitations carrying an argon2id hash of a single-use token sent to a prospective member’s email.
- Magic-link token. A single-use row in magic_link_tokens (auth.magic_link.expiry_minutes, default 15). Enabled by default; gated by auth.enabled_methods.
Flows
OAuth sign-in (GitHub, Google)
Both providers share one callback runner; only the upstream identity fetch differs. The callback is split into three strict phases:
- Phase A: DELETE … RETURNING consumes the oauth_states row in one statement (provider-bound: a state minted for one provider cannot complete another’s callback). No upstream call has happened yet, so the DB connection is released before any HTTP.
- Phase B: exchange code for an access token, then fetch the profile: GitHub /user + /user/emails (verified primary only), Google OIDC userinfo (email accepted only with email_verified). No DB connection is held.
- Phase C: a fresh transaction materialises the user + identity, links a new provider to an existing account on verified-email match (restoring a soft-deleted account if needed), auto-creates a signup org if this is a new sign-up, and commits. The user’s default org (oldest active membership) is resolved after commit for the session row.
After commit, the previous session cookie (if any) is destroyed for session-fixation defence, a fresh session row is INSERTed, the cookie is set, and the user is redirected. Failure modes:
- Invalid or expired state → 400 INVALID_STATE, logged to login_attempts.
- User denied consent / provider sent no code → redirect back to /login, logged with failure_reason = "oauth_denied" (or "missing_code").
- Upstream failure → 500, logged with failure_reason = "oauth_upstream_failed" (rows from before 2026-06 carry the old "github_upstream_failed").
- Disabled (enabled_methods) or incompletely configured provider → 404 AUTH_METHOD_UNAVAILABLE on both start and callback; the listed-but-misconfigured case logs a warning.
API token auth
Bearer tokens skip the cookie path entirely. The middleware checks the
Authorization: Bearer … header against the api_tokens table via the
indexed token_prefix (first 16 chars of the raw token), then
argon2-verifies the survivor. last_used_at is updated through the same
60-second debounce as session cookies.
CSRF protection does not apply: cross-origin browsers don’t auto-attach
the Authorization header, so there is no forgery surface.
To manage resources with a token as code, see Terraform. To let an LLM client query and act on an org with a token, see the MCP server.
Scopes
Every token carries a set of resource:action scopes. A request is rejected with 403 INSUFFICIENT_SCOPE unless the token holds the scope its endpoint requires. full_access is a superset that grants all of them; unknown scope strings are ignored (forward-compatible).
| Resource | read | write | delete | execute |
|---|---|---|---|---|
targets | list / get / results / uptime / latency / incident history | create / update / bulk | delete, bulk-delete | run a check now, test-probe a config |
channels | list / get | create / update | delete | send a test notification |
incidents | — (target incident history is under targets:read; the public timeline needs no token) | narrate / post update | — | — |
maintenance | list / get | create / update | delete | — |
status_page | read settings | update settings, upload logo | remove logo | — |
write implies read for the same resource. delete and execute are independent — they are not granted by write, so a config-management token (*:write) can change resources but cannot destroy them or trigger side effects. Grant delete/execute explicitly when you need them.
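The scope rules can be expressed as a single check. This is a sketch under stated assumptions, not the service's middleware: `token_allows` is a hypothetical name, scopes are plain `resource:action` strings, and the `*:write` shorthand used in the prose above is not treated as real syntax.

```rust
/// Return true when a token holding `granted` scopes may call an endpoint
/// that requires `required` (a `resource:action` string).
pub fn token_allows(granted: &[&str], required: &str) -> bool {
    let Some((resource, action)) = required.split_once(':') else {
        return false; // a malformed requirement never matches
    };
    granted.iter().any(|g| {
        // full_access is a superset of every scope.
        if *g == "full_access" {
            return true;
        }
        match g.split_once(':') {
            // Same resource: exact action match, or write implying read.
            Some((r, a)) if r == resource => a == action || (a == "write" && action == "read"),
            // Other resources and unknown scope strings are ignored.
            _ => false,
        }
    })
}
```

Because delete and execute only match exactly, a token holding `targets:write` passes `targets:read` but fails `targets:delete` and `targets:execute`, which is the behaviour the table describes.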
Org binding
A token is user-scoped, so each request names an org via the X-Uptimepage-Org: <slug> header. A token can additionally be bound to one org at creation:
- Bound: the header is optional; if sent it must name the bound org, else 403 ORG_HEADER_MISMATCH. The token can never act on the user’s other orgs.
- Unbound: the header is required (400 ORG_REQUIRED if absent). A malformed/unknown slug is 400 ORG_HEADER_INVALID on either kind.
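The binding rules form a small decision table over (bound org, header). The sketch below is illustrative: `resolve_org`, `OrgError`, and the `known` slice (the slugs the token's user belongs to) are hypothetical, slugs are compared exactly, and checking for an invalid slug before checking for a mismatch is an assumption about ordering.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum OrgError {
    Required,
    HeaderInvalid,
    HeaderMismatch,
}

impl OrgError {
    /// HTTP status and stable error code as listed in the docs.
    pub fn status_and_code(&self) -> (u16, &'static str) {
        match self {
            OrgError::Required => (400, "ORG_REQUIRED"),
            OrgError::HeaderInvalid => (400, "ORG_HEADER_INVALID"),
            OrgError::HeaderMismatch => (403, "ORG_HEADER_MISMATCH"),
        }
    }
}

/// Pick the org a request acts on. `bound` is the token's bound org slug (if any),
/// `header` is the X-Uptimepage-Org value (if sent).
pub fn resolve_org<'a>(
    bound: Option<&'a str>,
    header: Option<&'a str>,
    known: &[&str],
) -> Result<&'a str, OrgError> {
    match (bound, header) {
        // An unknown slug is rejected for both bound and unbound tokens.
        (_, Some(h)) if !known.iter().any(|k| *k == h) => Err(OrgError::HeaderInvalid),
        // A bound token may only name its own org.
        (Some(b), Some(h)) if b != h => Err(OrgError::HeaderMismatch),
        // Bound token, header absent or matching.
        (Some(b), _) => Ok(b),
        // Unbound token: the header decides.
        (None, Some(h)) => Ok(h),
        (None, None) => Err(OrgError::Required),
    }
}
```

A bound token never falls through to the header branch, so it cannot be steered to another of the user's orgs.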
Expiry
A token may carry an expiry (1–365 days); an expired token authenticates as invalid. Tokens without an expiry never lapse — prefer a bounded lifetime.
Managing tokens
Token management — create, list, rename, revoke — is browser-session only: these endpoints read the session cookie and reject bearer tokens, so a token can never mint another token (which would escape its own scopes) or reach account/org administration. Mint tokens in the UI at Settings → API tokens (a verified email is required).
Magic-link sign-in (gated)
Available only when auth.enabled_methods contains "magic_link":
- POST /auth/magic-link/request {email}: generates a 32-byte token, hashes it, INSERTs into magic_link_tokens with a 15-minute expiry, and emails the verify URL via the configured EmailSender. Anti-enumeration: the response is identical for known, unknown, and malformed emails: {"sent": true}.
- GET /auth/magic-link/verify?token=…: atomically marks the row used_at = now(), destroys any pre-login session, mints a new session (restoring a soft-deleted account; email ownership is the re-auth proof), auto-accepts a carried invitation, and redirects by priority: /?joined=<slug> → /?invite=missed (carried invitation failed to redeem) → /?restored=1 (welcome-back banner) → carried redirect_after → /. An invalid, used, or expired token renders an HTML “link expired” page with status 410 (one indistinguishable state, no JSON error envelope).
The schema and email template ship in v1 even when the flow is gated, so flipping the config doesn’t require a migration.
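The redirect priority after a successful magic-link verify is a first-match chain. A minimal sketch follows; the `VerifyOutcome` struct, its field names, and `redirect_target` are hypothetical, and only the ordering is taken from the flow description above.

```rust
/// What happened during verification (hypothetical shape).
#[derive(Debug, Default)]
pub struct VerifyOutcome {
    pub joined_slug: Option<String>,    // a carried invitation was accepted into this org
    pub invitation_missed: bool,        // a carried invitation failed to redeem
    pub restored: bool,                 // a soft-deleted account was restored
    pub redirect_after: Option<String>, // carried post-login destination
}

/// Choose the post-login redirect, highest priority first.
pub fn redirect_target(o: &VerifyOutcome) -> String {
    if let Some(slug) = &o.joined_slug {
        return format!("/?joined={slug}");
    }
    if o.invitation_missed {
        return "/?invite=missed".to_string();
    }
    if o.restored {
        return "/?restored=1".to_string();
    }
    if let Some(path) = &o.redirect_after {
        return path.clone();
    }
    "/".to_string()
}
```

Ordering the checks this way means a banner-bearing outcome always wins over a carried redirect_after, so the user does not miss the join, missed-invitation, or welcome-back notice.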
Invitations
Owners issue invitations to email addresses. The recipient gets emailed accept/decline links embedding the raw token (single-use, hashed at rest with the same argon2id parameters as API tokens).
- GET /invitations/accept?token=…: with a session, redeems right there (clicking the emailed link is the consent; email must match); without one, bounces to /login?invitation=… and every sign-in method carries the invitation through and auto-accepts after login. The session’s active org rotates to the joined org and the dashboard shows a “welcome to” banner (/?joined=<slug>). A carried invitation that can’t be redeemed (mismatched email, seat race, revoked) never breaks the login; the dashboard shows a generic “invitation couldn’t be applied” banner instead.
- GET /invitations/decline?token=…: render-only confirm page (mail scanners prefetch links, so the GET never mutates); its button POSTs the decline.
- A magic link requested for an unknown email that carries a valid invitation for that same address bootstraps the account at verify time: user created (verified, consent stamped, no personal org) and joined directly into the inviter’s org. Without a matching invitation, unknown emails still get the indistinguishable invalid-link page.
- A seat-race loser’s invitation is un-consumed (accepted_at reverted), so “try again once a seat frees up” stays true.
Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /login | none | Login page (HTML) |
| GET | /auth/github/login | none | Initiate GitHub OAuth |
| GET | /auth/github/callback | none | Handle OAuth callback |
| POST | /auth/logout | session | Destroy current session |
| POST | /auth/logout-all | session | Destroy all sessions for current user |
| POST | /auth/magic-link/request | none | Request magic link (gated) |
| GET | /auth/magic-link/verify | none | Verify magic-link token (gated) |
| GET | /auth/google/login | none | Initiate Google OAuth |
| GET | /auth/google/callback | none | Handle Google OAuth callback |
| GET | /invitations/accept | optional session | Emailed accept link (HTML; redeems with session, else login bounce) |
| GET | /invitations/decline | none | Emailed decline link (HTML confirm page; POST does the decline) |
| GET | /api/v1/me | session/token | Current user info |
| GET | /api/v1/me/sessions | session | List active sessions |
| DELETE | /api/v1/me/sessions/{id} | session | Revoke a session |
| GET | /api/v1/me/api-tokens | session | List tokens (prefix only) |
| POST | /api/v1/me/api-tokens | session | Create token (returned once) |
| PATCH | /api/v1/me/api-tokens/{id} | session | Rename token |
| DELETE | /api/v1/me/api-tokens/{id} | session | Revoke token |
| POST | /api/v1/orgs/{org_id}/invitations | session, owner | Issue invitation |
| GET | /api/v1/orgs/{org_id}/invitations | session, owner | List pending |
| DELETE | /api/v1/orgs/{org_id}/invitations/{id} | session, owner | Revoke |
| POST | /api/v1/invitations/accept | session | Accept (token in body) |
| POST | /api/v1/invitations/decline | none | Decline (token in body) |
Security model
- CSRF. State-changing cookie-authenticated requests must carry X-Requested-With: uptimepage. Bearer requests skip the check. The header value is compared in constant time via subtle::ConstantTimeEq.
- Session fixation. Both the OAuth callback and the magic-link verify endpoint destroy any pre-existing session bound to the browser before minting the new one.
- Hashed PII. IP addresses and User-Agent strings in sessions, login_attempts, and magic_link_tokens are stored as HMAC-SHA256(salt, value); the salt lives in auth.fingerprint_salt / auth_salt_history. After a salt rotation the binary refuses to boot unless an explicit override env var is set, which makes audit-log breakage loud.
- Argon2id parameters. Default parameters from the argon2 crate (Argon2::default()). Tokens carry 256 bits of entropy, so the factor of safety is in the token, not the params.
- Anti-enumeration. Magic-link request and invitation lookup return the same response whether the underlying row exists.
- Per-email send throttle. auth.magic_link.rate_limit_seconds (default 60) caps a single address to one outgoing email per window regardless of source IP. The check runs inside the spawned send task so it never branches the response path. Concurrent requests for the same address all still INSERT (preserving anti-enum work) but only the earliest row in the window, ordered by (created_at, id), actually mails the user. Set to 0 to disable.
Background workers
- oauth_state_cleanup: DELETE FROM oauth_states WHERE expires_at < now() every 10 minutes.
- invitations::purge_old: daily cleanup of accepted/declined/expired rows older than a configurable window.
- magic_link_cleanup: every 6 hours when magic_link is in auth.enabled_methods. Drops expired rows and used rows older than 7 days (the forensic window for “was this token redeemed?”). When the method is disabled the routes 404 and no rows are ever inserted, so the ticker stays asleep.
Sign-in audit
Every authentication attempt, success or failure, writes a row to login_attempts:
- method ∈ 'github_oauth' | 'api_token' | 'magic_link'
- success boolean
- failure_reason text ('invalid_state', 'token_expired', 'invalid_token', …)
- ip_hash, user_agent_hash for forensic correlation without storing raw PII
The “recent activity” panel on the user’s settings page reads from this table.
Deployment shape
Every authenticated request carries an active org id; data writes scope through repositories that enforce isolation. The cross-tenant test suite confirms a user can’t read or mutate another org’s rows via slug URL or session token. Single-tenant deployments work the same way — they just have one user and one org. See docs/multi-tenancy.md for the data model and isolation guarantees.
MCP server
uptimepage exposes a Model Context Protocol server so an LLM client — the claude.ai connector, Claude Desktop, an IDE, or MCP Inspector — can answer operational questions about one organization and take a few guarded actions, through typed, authorized, audited tools.
It is another authorized front door to the same stores the web app and /api/v1 use, not a bypass: tenant isolation, scopes, rate limits, and audit all apply. Every tool takes the org from the credential — never from a tool argument — so a connection can only ever see and touch its own org.
- Transport: Streamable HTTP at POST/GET /mcp, served on its own host (mcp.{DOMAIN} in production).
- Auth: an org-bound scoped API token (sm_live_…), minted either by hand (Settings → API tokens) or by the one-click OAuth 2.1 connector flow.
- Surface: 10 read tools (always) + 4 write tools (each scope-gated, confirmed per action, and audited).
The server only mounts when enabled (see Enabling); a deployment that leaves it off never exposes /mcp.
Tools
All tools return typed structuredContent. Customer free text (monitor names, group names, tags, error messages, incident text) is returned as labelled data, never as instructions to the model — the server’s instructions tell the client to treat it that way.
Read tools
Side-effect-free (readOnlyHint). Require targets:read, status_page:read, or incidents:read.
| Tool | Scope | Returns |
|---|---|---|
| get_org_health | targets:read | Per-state monitor totals + the worst currently-failing monitors, each with its open incident_id. The one-shot “what is broken right now?” answer; start here. |
| list_monitors | targets:read | Monitors with optional state / type / tag filters, cursor-paginated; each item carries current state + last-checked time. |
| get_monitor | targets:read | One monitor’s config, current state, last error, last HTTP status, and 24h / 30d uptime. |
| get_monitor_history | targets:read | One monitor’s history over a window (1h / 24h / 7d / 30d): uptime, latency series, failures with error text, incident windows. |
| list_incidents | incidents:read | Currently-open incidents on the org’s status pages: incident id, affected monitor, severity, latest update phase. Cursor-paginated. |
| get_incident | incidents:read | One incident: affected monitor, severity, open/resolved times, error sample, and the full operator-update timeline. |
| get_incident_metrics | incidents:read | Incident metrics over a trailing window (default 30 days): MTTA/MTTR, total, counts by severity and state, auto- vs human-resolved, and the noisiest monitors. |
| list_status_pages | status_page:read | The org’s status pages: slug, name, public URL, enabled. Cursor-paginated. |
| get_status_page | status_page:read | One status page with its components and each linked monitor’s current state. |
| get_org_usage | targets:read | Resource usage against plan limits (monitors, status pages, members, components) + key policy values. |
Typical flow when a status-page monitor is down: get_org_health gives the incident_id → get_incident shows the timeline → acknowledge_incident posts an update. Incidents (and the incident_id / ack workflow) exist only for monitors that are status-page components; a monitor not on any status page can be failing with incident_id: null, in which case the since field still reports how long it has been down. run_check_now and get_monitor return http_status for HTTP monitors so you can tell “wrong status code” from “no response”.
Write tools
Not read-only. Each requires its scope and an interactive confirmation before it runs, and writes exactly one audit row for every outcome (success, declined, denied, error).
| Tool | Scope | Effect |
|---|---|---|
| run_check_now | targets:execute | Probe a monitor immediately and record the result. A down result may fire the org’s normal alerts. |
| pause_monitor | targets:write | Stop a monitor’s checks until resumed. Idempotent. |
| resume_monitor | targets:write | Restart a paused monitor’s checks. Idempotent. |
| acknowledge_incident | incidents:write | Post an update to an incident; it appears on the public status page. Optional phase (investigating / identified / monitoring / resolved / postmortem, default investigating) and an explicit notify choice (no default). |
Write scopes are never granted unless explicitly requested — the OAuth connector defaults to read-only (see Scopes).
Authentication
The /mcp endpoint is an OAuth 2.1 protected resource. It accepts an Authorization: Bearer sm_live_… token that must be:
- a live scoped API token,
- bound to one org (an unbound token is rejected, because the connection has no org header to fall back on),
- held by a current member of that org,
- carrying the scope each tool requires (else 403 insufficient_scope), and
- when OAuth is configured, stamped with this endpoint as its audience (RFC 8707); a token minted for a different audience is refused.
A request with no/invalid token gets 401 with a WWW-Authenticate: Bearer … header pointing at the resource metadata, which kicks off discovery for OAuth clients.
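The conditions above can be read as an ordered series of checks. The sketch below is illustrative only: the `Token` shape, the check order, and every reason string except `insufficient_scope` are assumptions, and audience binding is modelled as applying whenever a resource URI is configured.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    """Hypothetical in-memory view of a scoped API token."""
    live: bool
    org_id: Optional[str]            # None means the token is not org-bound
    holder_is_member: bool
    scopes: set = field(default_factory=set)
    audience: Optional[str] = None   # set when minted through OAuth


def check(token: Optional[Token], required_scope: str,
          resource_uri: str = "") -> str:
    """Apply the listed conditions in order; return "ok" or a reason.

    Only `insufficient_scope` is a code named in the text; the other
    reason strings are illustrative.
    """
    if token is None or not token.live:
        return "invalid_token"
    if token.org_id is None:
        return "unbound_token"
    if not token.holder_is_member:
        return "not_a_member"
    if required_scope not in token.scopes:
        return "insufficient_scope"
    # Assumption: audience binding applies only when a resource URI is set.
    if resource_uri and token.audience != resource_uri:
        return "wrong_audience"
    return "ok"
```

Note that the org is read from the token itself; no argument of `check` lets a caller choose a different org.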
Two ways to get a token
1. By hand (manual connector). Mint an org-bound, read-only, expiring token in the UI (Settings → API tokens; a verified email is required) and paste it into the client. Grant the least scope you need: targets:read + status_page:read + incidents:read covers the read tools. This is the simplest path for Claude Desktop / Inspector and needs only UPTIMEPAGE_MCP__ENABLED.
2. One-click OAuth (claude.ai connector). With UPTIMEPAGE_MCP__OAUTH_ENABLED on, the client discovers the authorization server, you log in with your existing session and approve a consent screen, and the server mints the same org-bound expiring token behind the scenes, with no copy-paste. This is the only path that mints write scopes, and only when the consent screen’s opt-in boxes are checked.
Why OAuth at all?
The manual path works but pushes a long-lived bearer token through copy-paste and client config. OAuth replaces that with a browser consent: the user authenticates against the existing login, the connector receives a short-lived access token plus a rotating refresh token, and the connection lifetime (refresh-token lifetime) is the user’s explicit choice on the consent screen (default 90 days, max 365 — there is deliberately no “never”). Reused refresh tokens revoke the whole family. The connector never sees the user’s password and the access token is bound to this one resource.
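To make the "reused refresh tokens revoke the whole family" rule concrete, here is a small in-memory sketch. The class name, storage layout, and the use of `invalid_grant` as the error text are assumptions for illustration; the service's actual persistence and error handling are not described here.

```python
import secrets


class RefreshTokenStore:
    """In-memory sketch of rotating refresh tokens grouped into families."""

    def __init__(self) -> None:
        self._family_of = {}    # token -> family id (current and spent tokens)
        self._current = {}      # family id -> the one token still valid
        self._revoked = set()   # family ids that have been revoked

    def issue(self) -> str:
        """Start a new family (one per approved connection)."""
        return self._mint(secrets.token_hex(8))

    def _mint(self, family: str) -> str:
        token = secrets.token_urlsafe(32)
        self._family_of[token] = family
        self._current[family] = token
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a refresh token for a new one.

        Presenting a token that was already rotated revokes the family,
        so the newest token stops working as well.
        """
        family = self._family_of.get(presented)
        if family is None or family in self._revoked:
            raise PermissionError("invalid_grant")
        if self._current[family] != presented:
            self._revoked.add(family)  # reuse detected
            raise PermissionError("invalid_grant")
        return self._mint(family)
```

The design choice is that a replayed token is treated as evidence of theft: neither the attacker's copy nor the legitimate client's newest token survives, and the user has to approve a fresh connection.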
OAuth endpoints
Discovery + authorization-server endpoints live on the app host (where the session cookie lives); the protected resource is /mcp on its own host.
| Endpoint | Host | Purpose |
|---|---|---|
| /.well-known/oauth-protected-resource | resource (mcp.) | RFC 9728 resource metadata (resource id, authorization servers, scopes) |
| /.well-known/oauth-authorization-server | app | RFC 8414 AS metadata (PKCE S256 only, public clients, code + refresh grants) |
| /oauth/register | app | RFC 7591 Dynamic Client Registration |
| /oauth/authorize | app | Login + consent screen (PKCE S256, RFC 8707 resource) |
| /oauth/token | app | Issue / refresh the audience-bound token |
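Since the authorization server accepts PKCE with S256 only, a client has to derive its code challenge as defined in RFC 7636. The helper names below are hypothetical; the derivation itself is the standard one.

```python
import base64
import hashlib
import secrets


def make_verifier() -> str:
    """Random code verifier (43 characters from 32 random bytes)."""
    raw = base64.urlsafe_b64encode(secrets.token_bytes(32))
    return raw.rstrip(b"=").decode("ascii")


def s256_challenge(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) with padding removed, per RFC 7636."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with code_challenge_method=S256 on the authorize request and the original verifier on the token request.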
Redirect URIs are restricted to HTTPS hosts (web connectors) and loopback HTTP (local tooling); custom schemes, non-loopback cleartext, userinfo, and fragments are rejected at registration.
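The redirect URI rules can be sketched as a single predicate. This is an illustration under assumptions: the function names are hypothetical, and "loopback" is taken to mean the literal name localhost or an IP address in a loopback range.

```python
import ipaddress
from urllib.parse import urlsplit


def _is_loopback(host: str) -> bool:
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name other than localhost


def redirect_uri_allowed(uri: str) -> bool:
    """Accept HTTPS hosts and loopback HTTP; reject everything else."""
    if "#" in uri:
        return False  # fragments are rejected
    parts = urlsplit(uri)
    if parts.username is not None or parts.password is not None:
        return False  # userinfo is rejected
    host = parts.hostname
    if not host:
        return False
    if parts.scheme == "https":
        return True
    if parts.scheme == "http":
        return _is_loopback(host)  # cleartext only for local tooling
    return False  # custom schemes
```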
Consent screen
GET /oauth/authorize renders the consent screen — the one page the user sees during the OAuth flow. It appears after login, once the client + redirect URI are validated, and only when mcp.oauth_enabled is on. Approving here is what mints the token; nothing is granted until the user clicks Approve.
It shows:
- Who and what — the client name and the single org it’s connecting to. Access is always scoped to that one org.
- Granted abilities — one line per scope, in plain language (e.g. “Read your monitors and their current status”, “Pause and resume your monitors”). Write abilities are flagged with a ⚠ marker, and a warning banner appears at the top stating the connection can make changes — each of which still asks for per-action confirmation.
- Connection expires — a picker (30 / 60 / 90 / 365 days, default 90) that sets the refresh-token (connection) lifetime. There is no “never”.
- Approve / Deny — Deny aborts the flow; Approve mints the org-bound scoped token and returns the user to the client.
A read-only request shows “wants read-only access” with no warning banner; a request that includes any write scope switches to the “is requesting access” wording plus the banner and ⚠ markers.
Scopes
The connector advertises six grantable scopes. A request with no scope (or only unknown scopes) grants the read-only default; write scopes are opt-in.
| Scope | Grants | In default set? |
|---|---|---|
| targets:read | all read tools over monitors | ✅ |
| status_page:read | status-page read tools | ✅ |
| incidents:read | list_incidents, get_incident, get_incident_metrics | ✅ |
| targets:write | pause_monitor, resume_monitor | opt-in |
| targets:execute | run_check_now | opt-in |
| incidents:write | acknowledge_incident | opt-in |
A granted write scope is necessary but not sufficient — every write tool still asks the user to confirm the specific action at call time.
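A sketch of the scope-resolution rule, assuming the OAuth scope parameter arrives as a space-separated string. The function name is hypothetical, and the handling of a request that names known scopes (grant exactly those, subject to the consent screen) is an assumption.

```python
READ_DEFAULT = {"targets:read", "status_page:read", "incidents:read"}
WRITE_OPT_IN = {"targets:write", "targets:execute", "incidents:write"}
GRANTABLE = READ_DEFAULT | WRITE_OPT_IN


def resolve_scopes(requested: str) -> set:
    """Map an OAuth `scope` parameter (space-separated) to the granted set.

    Unknown scopes are dropped; if nothing known remains, the read-only
    default applies. Assumption: a request naming known scopes is granted
    exactly those, subject to the consent screen's opt-in for write scopes.
    """
    known = set(requested.split()) & GRANTABLE
    return known or set(READ_DEFAULT)
```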
Confirmations
Before any write tool acts, the server sends an MCP elicitation request describing the exact action (the monitor’s name, the effect, and — for acknowledge_incident — the message and notify choice). The tool proceeds only on an explicit approval; a decline, a dismissal, or a client that can’t elicit all fail closed with not_confirmed. There is no “remember my choice” — each action is confirmed on its own.
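The fail-closed behaviour can be shown in a few lines. The action values "accept", "decline", and "cancel" follow the MCP elicitation result; the function names and the returned dictionary shape are hypothetical.

```python
from typing import Callable, Optional


def confirmed(elicitation_result: Optional[dict]) -> bool:
    """Return True only for an explicit approval.

    `None` models a client that cannot elicit at all.
    """
    return bool(elicitation_result) and elicitation_result.get("action") == "accept"


def run_write_tool(elicitation_result: Optional[dict],
                   action: Callable[[], object]) -> dict:
    """Run `action` only after confirmation; otherwise fail closed."""
    if not confirmed(elicitation_result):
        return {"ok": False, "error": "not_confirmed"}
    return {"ok": True, "result": action()}
```

Anything other than an explicit accept, including a missing or malformed result, takes the `not_confirmed` branch, so the default outcome is that nothing happens.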
Audit
Every write-tool invocation writes one row to mcp_audit, on every path — success, user-declined, scope-denied, bad input, not-found, or server error — recording: actor_type = mcp, the token id, the acting user + org, the tool name, the arguments, the outcome (success / denied / error), and a short detail code. The same event is emitted to tracing. Reads are not audit-logged (they’re side-effect-free and already rate-limited).
Enabling
Off by default. Config keys (TOML under [mcp], or env with the UPTIMEPAGE_ prefix and __ nested separator):
| Key | Env | Default | Purpose |
|---|---|---|---|
| mcp.enabled | UPTIMEPAGE_MCP__ENABLED | false | Mount /mcp + the read/write tools. |
| mcp.oauth_enabled | UPTIMEPAGE_MCP__OAUTH_ENABLED | false | Add the OAuth 2.1 endpoints that back the one-click connector. |
| mcp.resource_uri | UPTIMEPAGE_MCP__RESOURCE_URI | empty | Canonical absolute URI of /mcp: the OAuth resource id + RFC 8707 audience, e.g. https://mcp.uptimepage.dev/mcp. Empty disables audience binding (static-token mode). |
| mcp.allowed_origins | UPTIMEPAGE_MCP__ALLOWED_ORIGINS | empty | RFC 6454 Origin allow-list (DNS-rebinding defense). Empty disables the check; a missing Origin header always passes (non-browser clients send none). |
| mcp.access_token_ttl_secs | UPTIMEPAGE_MCP__ACCESS_TOKEN_TTL_SECS | 3600 | Access-token lifetime (short; auto-renewed via the rotating refresh token). |
When OAuth is on, the app refuses to boot unless mcp.resource_uri and auth.public_base_url are real HTTPS origins — the issuer and audience must be well-formed. Migrations 016 (OAuth) + 017 (audit) must be applied.
Production (GitHub-managed)
The deploy pipeline upserts the two switches from repo variables (Settings → Secrets and variables → Actions → Variables):
- MCP_ENABLED=true
- MCP_OAUTH_ENABLED=true
deploy.yml writes the corresponding UPTIMEPAGE_MCP_* keys into the server .env on each deploy. The resource URI defaults to https://mcp.{UPTIMEPAGE_DOMAIN}/mcp; mcp.{DOMAIN} rides the existing *.{DOMAIN} wildcard cert + Caddy route (no new DNS). See deployment/.env.example and Deployment.
Connecting a client
claude.ai connector (OAuth)
Settings → Connectors → Add custom connector → URL https://mcp.{DOMAIN}/mcp → Connect. You’ll be sent to the login + consent screen; approve, and the tools appear. This exercises the full OAuth path and is the recommended end-user flow.
Claude Desktop / IDE (manual token via mcp-remote)
mcp-remote bridges a local stdio client to the remote Streamable HTTP endpoint. Add to your client config:
{
  "mcpServers": {
    "uptimepage": {
      "command": "npx",
      "args": [
        "-y", "mcp-remote",
        "https://mcp.uptimepage.dev/mcp",
        "--header", "Authorization: Bearer sm_live_YOUR_TOKEN"
      ]
    }
  }
}
For a local dev server over plain HTTP, add --allow-http to the args.
MCP Inspector (testing)
npx @modelcontextprotocol/inspector
Set transport Streamable HTTP, URL https://mcp.uptimepage.dev/mcp, and an Authorization: Bearer sm_live_… header. Inspector lists every tool with its schema and lets you exercise the elicitation approve/deny flow.
Examples
Raw protocol (curl)
The transport is JSON-RPC over Streamable HTTP. initialize returns a session id the client echoes on later calls.
# initialize → 200 + Mcp-Session-Id response header
curl -sD- https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize",
       "params":{"protocolVersion":"2025-11-25",
                 "capabilities":{},"clientInfo":{"name":"curl","version":"0"}}}'
# list tools (reuse the session id from the initialize response;
# non-initialize calls also send MCP-Protocol-Version, or the server answers 400)
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'
# call a tool: open incidents on your status pages
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call",
       "params":{"name":"list_incidents","arguments":{}}}'
# read one incident's timeline (id from list_incidents or get_org_health)
curl -s https://mcp.uptimepage.dev/mcp \
  -H 'Authorization: Bearer sm_live_YOUR_TOKEN' \
  -H 'Mcp-Session-Id: THE_SESSION_ID' \
  -H 'MCP-Protocol-Version: 2025-11-25' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":4,"method":"tools/call",
       "params":{"name":"get_incident","arguments":{"id":"INCIDENT_ID"}}}'
Write tools (acknowledge_incident, pause_monitor, …) follow the same tools/call shape but the client must support elicitation — curl can’t approve the confirmation, so they’re driven from a real MCP client.
A missing/invalid token returns 401 with WWW-Authenticate: Bearer …; a wrong Host returns 403; a missing MCP-Protocol-Version on a non-initialize call returns 400; notifications get 202.
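For clients that build requests programmatically, the header rules above can be captured in a small helper. This is a sketch that only constructs headers and body (it performs no network I/O); the function names are hypothetical, and the protocol version is the one used in the curl examples.

```python
import json
from typing import Optional

PROTOCOL_VERSION = "2025-11-25"  # same value as the curl examples


def build_request(token: str, req_id: int, method: str, params: dict,
                  session_id: Optional[str] = None):
    """Return (headers, body) for one JSON-RPC call to /mcp."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if method != "initialize":
        # Later calls state the protocol version and echo the session id.
        headers["MCP-Protocol-Version"] = PROTOCOL_VERSION
        if session_id:
            headers["Mcp-Session-Id"] = session_id
    body = json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params}).encode("utf-8")
    return headers, body


def tool_call(token: str, req_id: int, session_id: str,
              name: str, arguments: dict):
    """Convenience wrapper for tools/call."""
    return build_request(token, req_id, "tools/call",
                         {"name": name, "arguments": arguments}, session_id)
```

The returned pair can be handed to any HTTP client as a POST to the /mcp URL.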
Asking an LLM
Once connected, drive it in natural language — the client picks the tool:
- “What’s broken in my org right now?” → get_org_health
- “Show me every DNS monitor that’s degraded.” → list_monitors(type=dns, state=degraded)
- “How has the checkout API done over the last 7 days?” → get_monitor_history(window=7d)
- “What incidents are open, and what’s been posted on them?” → list_incidents → get_incident
- “Acknowledge the payments incident; we’re investigating.” → acknowledge_incident(phase=investigating) (asks you to confirm)
- “Am I near any plan limits?” → get_org_usage
- “Run a check on the payments monitor now.” → run_check_now (asks you to confirm; may alert)
- “Pause the staging monitor.” → pause_monitor (asks you to confirm)
Security model
- Org isolation. Org comes from the token, never an argument; the token must be org-bound and the holder a live member. The cross-tenant guarantees in Multi-tenancy apply unchanged.
- Least privilege. Read-only by default; write scopes are opt-in and each write is separately confirmed and audited.
- Audience binding. With OAuth on, tokens are pinned to this /mcp resource (RFC 8707), so a token leaked from elsewhere can’t be replayed here.
- DNS-rebinding defense. The transport enforces a Host allow-list (the configured resource host) and an optional Origin allow-list.
- Prompt-injection posture. Customer-supplied text is returned as labelled data and the server instructions tell the client not to treat it as commands — but the ultimate guard is that the dangerous tools are scope-gated and human-confirmed.
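The Host and Origin checks behind the DNS-rebinding defense can be sketched as two predicates. The Origin rule follows the description of mcp.allowed_origins (an empty list disables the check; a missing header passes); the function names and the exact-match Host comparison are assumptions.

```python
from typing import Optional, Sequence


def origin_allowed(origin: Optional[str], allow_list: Sequence[str]) -> bool:
    """Origin check as described for mcp.allowed_origins."""
    if not allow_list:
        return True   # empty list disables the check
    if origin is None:
        return True   # non-browser clients send no Origin header
    return origin in allow_list


def host_allowed(host_header: str, resource_host: str) -> bool:
    """Host check against the configured resource host.

    Assumption: a case-insensitive exact match, port included if present.
    """
    return host_header.lower() == resource_host.lower()
```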
Related
- Authentication: scoped API tokens, org binding, expiry.
- Multi-tenancy: the isolation model every tool inherits.
- Quotas & rate limits: the per-plan limiter /mcp shares.
- Configuration: full config reference.
Configuration
Defaults live in config/default.toml. Every key can be overridden by an environment variable using the prefix UPTIMEPAGE_ and __ as the nested separator.
Example: UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080
Override UPTIMEPAGE_CONFIG_PATH to point at an alternate base config file.
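The naming rule can be written as a one-line mapping from a dotted config key to its environment variable. The helper name `env_name` is hypothetical; the prefix and separator are the ones stated above.

```python
def env_name(key: str, prefix: str = "UPTIMEPAGE_") -> str:
    """Translate a dotted config key into its environment variable name.

    Each dot becomes the nested separator `__`; single underscores inside
    a key segment are kept as-is.
    """
    return prefix + "__".join(part.upper() for part in key.split("."))


if __name__ == "__main__":
    print(env_name("server.api_bind"))
```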
Sections
| Section | Key | Purpose |
|---|---|---|
server | api_bind, metrics_bind | bind addresses for REST API and Prometheus exporter |
runtime | worker_threads, max_blocking_threads | Tokio runtime sizing (0 = num_cpus) |
checker | max_concurrent_checks | global concurrency cap enforced by worker pool semaphore |
checker | default_timeout_ms, connect_timeout_ms | client-side timeouts applied to outbound checks |
checker | default_check_interval_secs | fallback interval when target spec omits it |
checker | per_host_max_inflight, rdap_max_inflight | per-(org, host, port) and per-TLD RDAP concurrency caps. Fail-fast bulkhead — over-cap checks return a degraded result instead of queueing |
http_client | tcp_keepalive_secs, user_agent | per-check connection keep-alive (one request’s lifetime — checks connect fresh, no pool) and the outbound User-Agent |
dns | cache_size, positive_ttl_secs, negative_ttl_secs, servers | hickory resolver — point at internal resolvers when needed |
security | allow_private_targets | SSRF guard: when false (default) any target resolving to loopback / private / link-local / reserved IPs is rejected |
security | credentials_kek_base64 | 32-byte base64 key encrypting basic_auth / bearer_token at rest. Empty (default) stores plaintext — dev only |
circuit_breaker | failure_threshold, success_threshold, open_duration_secs, half_open_max_calls | per-host breaker state machine |
storage.postgres | url, max_connections, min_connections, acquire_timeout_secs | target metadata store |
storage.clickhouse | url, database, user, password, batch_size, batch_timeout_ms, buffer_size | result sink and pipeline back-pressure |
scheduler | target_refresh_interval_secs, jitter_pct | how often the registry is reconciled against Postgres, and how much jitter is applied to each target’s tick |
scheduler | region, default_region | this control plane’s own region id (a normal region row, default "default") and the region new targets are assigned to (empty falls back to region). See Multi-region probes |
agent | enabled, control_plane_url, region, pull_interval_secs, flush_interval_secs, buffer_capacity | run this process as a stateless regional probe instead of a control plane. token is env-only (UPTIMEPAGE_AGENT__TOKEN). Off by default. See Multi-region probes |
operator | admin_token | static bearer secret for the instance-admin /operator/* surface (regions + agents). Env-only (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); empty disables the surface (404s) |
observability | log_level, log_format | tracing-subscriber filter + JSON vs pretty output |
observability | metrics_enabled, gauge_sample_interval_ms | Prometheus exporter toggle and sampler cadence |
observability | tracing_enabled | Master on/off for OTLP trace export. Export is active only when this and observability.grafana.enabled are true |
observability.grafana | enabled, otlp_endpoint, instance_id, api_key, trace_sample_ratio | OTLP/HTTP trace export to Grafana Cloud / any OTLP collector. api_key is env-only. See Trace export below |
api.rate_limit | enabled, per_second, burst | per-IP token-bucket rate limiter on /api/v1/*. Disabled by default |
api.cors | enabled, allowed_origins, allowed_methods, allow_any_origin | browser CORS for /api/v1/*. Disabled by default. Wildcard only via allow_any_origin = true |
| notification channels | — | Not a config block. Channels are per-org runtime resources managed via the /api/v1/notification-channels API; secrets are sealed at rest with the credentials KEK |
tenancy | path_based_public_routes, subdomain_public_routes, free_tier_owner_org_limit, deletion_grace_period_days | Public-status routing shape + org limits. See Public status routing below and docs/multi-tenancy.md for the full model |
retention | check_results_days, login_attempts_days, quota_events_days, audit_log_days | Long-horizon data-retention windows for the daily 03:00-UTC purge job. Every key is bound by the job — no decorative knobs |
public_status | base_domain, cache_max_orgs, cache_ttl_secs, last_good_ttl_secs, logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px, default_brand_color, default_show_powered_by, public_per_ip_rate_limit_per_min | Per-org public status pages at {slug}.{base_domain}. See Public status page below and Per-org status pages |
auth | enabled_methods, fingerprint_salt, public_base_url | Sign-in methods, HMAC salt for IP/UA hashes, base URL embedded in invitation + magic-link emails. See Auth configuration below |
auth.session | idle_timeout_days, absolute_timeout_days, cookie_name, cookie_secure, cookie_domain, renew_on_use | Session cookie shape + lifetime. cookie_secure = true in production |
auth.github | client_id, client_secret, redirect_url, scopes | GitHub OAuth client. The button renders on /login only when client_id, client_secret, and redirect_url are all set |
auth.google | client_id, client_secret, redirect_url, scopes | Google OAuth client, same gating as auth.github. Email is trusted only with Google’s email_verified attestation |
auth.api_tokens | max_per_user, prefix_visible_chars | Cap per user, indexed prefix length for token lookup |
auth.invitations | expiry_hours, max_pending_per_org | Invitation lifetime and per-org pending cap |
auth.magic_link | expiry_minutes, rate_limit_seconds | Magic-link token lifetime. Routes only mount when enabled_methods includes "magic_link" |
mcp | enabled, oauth_enabled, resource_uri, allowed_origins, access_token_ttl_secs | LLM connector (MCP) server at /mcp. Off by default; OAuth requires real HTTPS resource_uri + auth.public_base_url. See MCP server |
email | provider, from_name, from_address | Transactional email backend. provider ∈ "resend" | "log" | "memory" |
email.resend | api_key, webhook_secret | api_key required when email.provider = "resend". A set webhook_secret (the endpoint’s Svix whsec_… signing secret) mounts POST /hooks/resend: a permanently bounced or spam-complaining address gets every email channel pointed at it disabled, with the reason shown on the channel form |
whatsapp_app | enabled, access_token, phone_number_id, public_number, app_secret, verify_token, template_name, language_code | Operator WhatsApp number behind one-tap whatsapp_app channels (wa.me deep link + /hooks/whatsapp Meta webhook). enabled = true AND complete creds mount the surface — the flag is a deliberate spend gate, since alert sends are operator-paid Meta template messages. Inbound stop disables the sender’s channels |
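To illustrate how the four circuit_breaker keys in the table interact, here is a generic breaker state machine. It is a textbook sketch, not the service's implementation: the default values, the consecutive-failure counting, and the treatment of half_open_max_calls as a cap on in-flight trial calls are all assumptions.

```python
import time


class CircuitBreaker:
    """Closed -> open -> half-open -> closed, one instance per host."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 open_duration_secs=30.0, half_open_max_calls=1,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_duration_secs = open_duration_secs
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock
        self.state = "closed"
        self.failures = 0     # consecutive failures while closed
        self.successes = 0    # successes seen while half-open
        self.in_flight = 0    # trial calls currently admitted in half-open
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a check may run now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_duration_secs:
                return False
            self.state, self.successes, self.in_flight = "half_open", 0, 0
        if self.state == "half_open":
            if self.in_flight >= self.half_open_max_calls:
                return False
            self.in_flight += 1
        return True

    def record(self, ok: bool) -> None:
        """Feed back the outcome of a check that `allow` admitted."""
        if self.state == "half_open":
            self.in_flight -= 1
            if not ok:
                self._open()  # any trial failure re-opens the breaker
            else:
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state, self.failures = "closed", 0
        elif ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self) -> None:
        self.state, self.opened_at = "open", self.clock()
```

The injectable `clock` keeps the sketch testable without sleeping.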
Public status routing
uptimepage ships from one binary as a multi-tenant SaaS. The active org is always resolved from the authenticated session; there is no ambient “default org” and no compile-time self-host mode. A single-tenant deployment is just a SaaS deployment where you sign up as the first user (or seed users + organizations + memberships via a SQL one-shot).
The public status surface is gated by two independent flags because path-based and subdomain routing have opposite safety profiles:
- tenancy.path_based_public_routes: serve /status and /api/public/v1/* on the operator host, scoped to the single live org. Useful for a single-tenant deploy (one org, one page). Defaults to true. Must be set to false once you have more than one tenant; otherwise every visitor sees the lone org’s data regardless of which slug they expected.
- tenancy.subdomain_public_routes: serve one page per org at {slug}.{public_status.base_domain} (apex wildcard). Defaults to false; requires a well-formed base_domain.
| Shape | Recommended flags | Public surface |
|---|---|---|
| Single-tenant | path_based_public_routes = true (default) | /status on the operator host (one org) |
| Multi-tenant SaaS | subdomain_public_routes = true, path_based_public_routes = false | {slug}.{base_domain} per org |
The binary refuses to boot in the dangerous combinations: subdomain_public_routes with an empty or single-label public_status.base_domain; or an auth.session.cookie_domain that overlaps the status wildcard. Each is a loud panic at startup, not a silent runtime leak. See Per-org status pages for the full model.
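A rough sketch of the two boot-time checks, written as plain functions. The names are hypothetical, and "overlaps" is interpreted here as: the cookie domain equals the base domain, is a parent of it, or sits beneath it. The real check may define overlap differently.

```python
def base_domain_ok(base_domain: str) -> bool:
    """True for a multi-label domain such as "status.example"."""
    labels = base_domain.strip(".").split(".")
    return len(labels) >= 2 and all(labels)


def cookie_overlaps_wildcard(cookie_domain: str, base_domain: str) -> bool:
    """True if a session cookie could reach {slug}.{base_domain}.

    Assumption: "overlap" means the cookie domain equals the base domain,
    is a parent of it, or sits underneath it.
    """
    cookie = cookie_domain.strip(".").lower()
    base = base_domain.strip(".").lower()
    if not cookie:
        return False  # empty means a host-only cookie
    return (cookie == base
            or base.endswith("." + cookie)
            or cookie.endswith("." + base))


def validate(subdomain_public_routes: bool, base_domain: str,
             cookie_domain: str) -> None:
    """Raise ValueError for the combinations the text calls dangerous."""
    if not subdomain_public_routes:
        return
    if not base_domain_ok(base_domain):
        raise ValueError("public_status.base_domain must be multi-label")
    if cookie_overlaps_wildcard(cookie_domain, base_domain):
        raise ValueError("auth.session.cookie_domain overlaps the status wildcard")
```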
Org limits and the purge worker
- free_tier_owner_org_limit (default 3) caps how many orgs a single user can own. Soft-deleted orgs don’t count. Enforced inside the membership INSERT so concurrent creates can’t exceed the cap.
- deletion_grace_period_days (default 30) is how long a soft-deleted org’s slug is held and how long the original deleter has to restore it.
- The soft-delete purge now runs inside the daily retention job (src/jobs/retention.rs) at a fixed 03:00 UTC, not on a configurable interval. Each run cascades up to 10 past-grace orgs, drains any pending entries from clickhouse_purge_queue (the outbox between PG cascade and ClickHouse ALTER TABLE DELETE), hard-purges past-grace users, then enforces the [retention] windows. See Soft delete and the 30-day purge for the full implementation and failure-recovery guarantees.
The [retention] section sets the long-horizon windows. Defaults: login_attempts_days = 180, quota_events_days = 90, audit_log_days = 730. Check-result retention is not a config knob — the physical TTLs are baked into the ClickHouse tables at migration time (a value here would be silently ignored, since the TTL is never re-issued as an ALTER on boot): raw per-check rows in check_results keep 90 days, and the hourly rollup check_results_1h keeps 13 months. Those are the widest-tier ceilings; what a given plan actually sees is narrowed at read time by a per-plan window clamp (separate windows for raw forensics and chart history), so a plan change is an instant tag flip with no data rewrite. The public status page’s daily history strip still shows 90 days, and the Privacy Policy’s retention table pins these same physical windows. Session idle/absolute reaping uses [auth.session]; soft-deleted org/user grace uses tenancy.deletion_grace_period_days; OAuth-state and magic-link tokens are swept by their own short-cadence jobs.
See Multi-tenancy for the full model, slug rules, and the storage-layer isolation invariants the CI checks enforce.
Auth configuration
[auth]
enabled_methods = ["github_oauth", "google_oauth", "magic_link"]
fingerprint_salt = "" # HMAC salt for IP/UA hashes; rotate-aware
public_base_url = "https://status.example.test"
[auth.session]
idle_timeout_days = 30
absolute_timeout_days = 90
cookie_name = "_sm_session"
cookie_secure = true # set false only for plain-HTTP local dev
cookie_domain = "" # empty = host-only cookie
renew_on_use = true
[auth.github]
client_id = "" # from https://github.com/settings/developers
client_secret = ""
redirect_url = "https://status.example.test/auth/github/callback"
scopes = ["user:email", "read:user"]
[auth.google]
client_id = "" # Google Cloud Console OAuth web client
client_secret = ""
redirect_url = "https://status.example.test/auth/google/callback"
scopes = ["openid", "email", "profile"]
[auth.invitations]
expiry_hours = 168 # 7 days
max_pending_per_org = 50
[auth.api_tokens]
max_per_user = 25
prefix_visible_chars = 16 # floor; lower values fail boot
[auth.magic_link]
expiry_minutes = 15
rate_limit_seconds = 60 # per-email send throttle; 0 disables
[email]
provider = "log" # "resend" in prod, "log" in dev, "memory" in tests
from_name = "Uptimepage"
from_address = "no-reply@example.test"
[email.resend]
api_key = "" # required when provider = "resend"
webhook_secret = "" # whsec_… of the Resend webhook endpoint
[whatsapp_app] # operator WhatsApp number (one-tap linking)
enabled = false # deliberate spend gate — creds alone stay off
access_token = "" # Meta Cloud API token (env-only)
phone_number_id = "" # Cloud API sender id
public_number = "" # display number digits — the wa.me target
app_secret = "" # signs webhook deliveries (env-only)
verify_token = "" # echoed by Meta's GET subscribe handshake
template_name = "" # approved alert template, single body param
language_code = "en"
auth.enabled_methods is the policy switch per sign-in method: removing
an entry disables that method’s login start/callback (404) and hides its
button. OAuth providers additionally need client_id + client_secret +
redirect_url set — a listed but incompletely configured provider stays
hidden and logs a warning on probe. "magic_link" mounts the magic-link
request/verify endpoints and the login-page email form.
auth.fingerprint_salt is paired with the auth_salt_history table.
Rotating the value mid-deployment refuses to boot unless the override
env var documented in docs/troubleshooting.md is set — this is
deliberate so audit-trail breakage is loud.
Central Telegram bot
[telegram]
bot_token = "" # env UPTIMEPAGE_TELEGRAM__BOT_TOKEN; presence enables the feature
bot_username = "" # verified against the Bot API at boot; used for t.me deep links
webhook_secret = "" # random, 32+ chars; Telegram echoes it on every webhook delivery
Setting bot_token switches on one-tap Telegram channel linking: the
type card in the channel form, the link-code API, and the
/hooks/telegram receiver. Empty token (the default) leaves the
feature absent entirely — self-host deployments keep the
bring-your-own telegram transport, which needs no operator config.
When enabled, boot validates the trio: non-empty bot_username,
webhook_secret of 32+ characters, and an https://
auth.public_base_url (Telegram only delivers webhooks to public
https endpoints). The app then verifies the token against the Bot API
and registers the webhook on every boot; a Telegram outage logs a
warning and disables the bot for that boot instead of failing the
deploy.
All three values are operator secrets: env-only in production, never in a committed config file.
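The boot-time validation and the per-delivery secret check can be sketched as follows. Telegram sends the configured secret in the X-Telegram-Bot-Api-Secret-Token header; the function names, the plain-dict header lookup, and the wording of the problem strings are assumptions.

```python
import hmac

# Header Telegram sets on webhook deliveries when a secret is registered.
SECRET_HEADER = "X-Telegram-Bot-Api-Secret-Token"


def webhook_authentic(headers: dict, webhook_secret: str) -> bool:
    """Compare the echoed secret in constant time.

    Assumption: `headers` is a plain dict keyed by the exact header name.
    """
    presented = headers.get(SECRET_HEADER, "")
    return bool(webhook_secret) and hmac.compare_digest(
        presented.encode("utf-8"), webhook_secret.encode("utf-8"))


def telegram_config_problems(bot_username: str, webhook_secret: str,
                             public_base_url: str) -> list:
    """List the boot-time problems for the trio described above."""
    problems = []
    if not bot_username:
        problems.append("bot_username is empty")
    if len(webhook_secret) < 32:
        problems.append("webhook_secret is shorter than 32 characters")
    if not public_base_url.startswith("https://"):
        problems.append("auth.public_base_url is not https")
    return problems
```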
Provider OAuth connect (“Add to Slack” / “Add to Discord”)
[slack_oauth]
client_id = "" # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_ID
client_secret = "" # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_SECRET
[discord_oauth]
client_id = "" # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_ID
client_secret = "" # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_SECRET
Credentials of operator-owned OAuth apps — Slack with the
incoming-webhook scope, Discord with webhook.incoming. When a pair is
set, that provider’s panel in the channel form grows a connect button
(plus a QR variant): the provider’s consent screen picks the destination
channel and the callback stores the returned webhook as a regular
slack/discord channel — access tokens are discarded. The app’s
redirect URL must be <auth.public_base_url>/auth/slack/callback (or
…/auth/discord/callback). Empty credentials (the default) hide the
button; manual webhook paste always works. Env-only in production, never
in a committed config file.
Public status page
The [public_status] block configures the per-org public surface. It is
load-bearing only when tenancy.subdomain_public_routes = true; the
defaults are safe to leave untouched for self-host.
[public_status]
base_domain = "" # REQUIRED when subdomain_public_routes = true
cache_max_orgs = 1000 # hot + last-good cache bound
cache_ttl_secs = 10 # per-org rendered-page TTL
last_good_ttl_secs = 3600 # idle eviction for the stale-fallback layer
logo_dir = "/var/lib/uptimepage/logos"
max_logo_size_bytes = 1048576 # 1 MiB byte ceiling (pre-decode)
allowed_logo_mime_types = ["image/png", "image/jpeg", "image/webp"]
max_logo_dimension_px = 1200 # larger uploads are downscaled; decode
# is also allocation-bounded (bomb guard)
default_brand_color = "#3b82f6" # used when an org sets no colour
default_show_powered_by = true
public_per_ip_rate_limit_per_min = 60 # in-app limit behind the Caddy-side one
| Key | Purpose |
|---|---|
| base_domain | parent domain for {slug}.{base_domain}. Must be multi-label; boot fails on empty/single-label when subdomain routing is on |
| cache_max_orgs / cache_ttl_secs | per-org page cache size and freshness window |
| last_good_ttl_secs | how long an idle org’s last-known-good snapshot is retained before eviction |
| logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px | logo upload storage and limits |
| default_brand_color, default_show_powered_by | fallbacks when an org leaves branding unset |
| public_per_ip_rate_limit_per_min | second-layer, in-app rate limit that sits behind the reverse proxy’s own limit |
History-strip length (90 days) and the recent-incidents horizon (30 days)
remain hard-coded defaults in src/public_status/aggregator.rs. What a
page publishes is curated per-page — a monitor appears as a component
only while it’s bound to that page, and its presentation lives on the
binding:
| Per-page component field | Purpose |
|---|---|
| (binding exists) | the monitor is published as a component on that page |
| `public_name` | display name (falls back to operator-side monitor name) |
| `public_description` | optional one-liner |
| `public_group` | optional group label; ungrouped components render last |
| `sort_order` | ASC integer sort within a group |
See Public status page for the operator workflow and Per-org status pages for the SaaS subdomain model.
Trace export
OpenTelemetry spans are exported over OTLP/HTTP (protobuf) when both
observability.tracing_enabled and observability.grafana.enabled are
true. Disabled by default and zero-cost when off.
[observability]
tracing_enabled = false # master on/off for trace export
[observability.grafana]
enabled = false # second switch; both must be true
otlp_endpoint = "" # OTLP base, no /v1/traces suffix; e.g.
# https://otlp-gateway-<zone>.grafana.net/otlp
instance_id = "" # Grafana Cloud numeric instance / stack id
trace_sample_ratio = 0.05 # parent-based head sampling, [0.0, 1.0]
# api_key # NEVER in TOML — env var only (below)
| Key | Purpose |
|---|---|
| `tracing_enabled` | master switch; with `grafana.enabled` gates all export |
| `grafana.enabled` | second switch (kept separate so the block is inert until explicitly turned on) |
| `grafana.otlp_endpoint` | OTLP/HTTP base URL; the service appends `/v1/traces` (a value already ending in it is left as-is). Empty fails boot when export is on |
| `grafana.instance_id` | basic-auth username (Grafana Cloud instance id). Empty fails boot when export is on |
| `grafana.api_key` | basic-auth password. Env-only: `UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY`. Never read from a config file; redacted in any serialised config |
| `grafana.trace_sample_ratio` | head sampling ratio under a parent-based sampler. Must be in [0.0, 1.0] or boot fails |
Auth is Authorization: Basic base64(instance_id:api_key). Resource
attributes service.name = uptimepage and service.version are
attached. The batch exporter is flushed and stopped on graceful
shutdown. A transport build failure logs a warning and the service
continues without traces — telemetry never takes down monitoring.
Inconsistent settings (export on with a missing endpoint / instance /
key, or an out-of-range ratio) are a clean startup config error.
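The two-switch gate, the `/v1/traces` suffix rule, and the startup checks described above can be sketched as follows. This is an illustration under stated assumptions: the `TraceConfig` struct and function names are hypothetical, the ratio is assumed to be validated even when export is off, and stripping a trailing slash before appending the suffix is an assumption the source does not confirm.

```rust
/// Hypothetical mirror of the documented `[observability]` keys.
struct TraceConfig {
    tracing_enabled: bool,
    grafana_enabled: bool,
    otlp_endpoint: String,
    instance_id: String,
    api_key: Option<String>, // env-only in the real service
    trace_sample_ratio: f64,
}

/// Export is on only when both switches are true.
fn export_enabled(cfg: &TraceConfig) -> bool {
    cfg.tracing_enabled && cfg.grafana_enabled
}

/// Appends `/v1/traces` unless the value already ends in it.
/// Trimming a trailing slash first is an assumption of this sketch.
fn traces_url(otlp_endpoint: &str) -> String {
    let base = otlp_endpoint.trim_end_matches('/');
    if base.ends_with("/v1/traces") {
        base.to_string()
    } else {
        format!("{base}/v1/traces")
    }
}

/// Startup validation: a bad ratio or a missing endpoint / instance / key
/// while export is on is reported as a config error.
fn validate_trace_config(cfg: &TraceConfig) -> Result<(), String> {
    // `contains` is false for NaN, so NaN is rejected too.
    if !(0.0..=1.0).contains(&cfg.trace_sample_ratio) {
        return Err("grafana.trace_sample_ratio must be in [0.0, 1.0]".to_string());
    }
    if !export_enabled(cfg) {
        return Ok(());
    }
    if cfg.otlp_endpoint.is_empty() {
        return Err("grafana.otlp_endpoint is empty while export is on".to_string());
    }
    if cfg.instance_id.is_empty() {
        return Err("grafana.instance_id is empty while export is on".to_string());
    }
    if cfg.api_key.as_deref().map_or(true, str::is_empty) {
        return Err("grafana.api_key is missing while export is on".to_string());
    }
    Ok(())
}
```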
Tuning notes
- `max_concurrent_checks` caps simultaneous in-flight checks. Per-check memory is small (a tokio task plus an in-flight hyper request), so the practical ceiling is set by file descriptors and ephemeral ports rather than RAM.
- `per_host_max_inflight` (default `2`) is the per-tenant, per-`(host, port)` in-flight cap. One tenant fanning a burst of checks at the same upstream looks like a probe; this cap keeps that fingerprint flat. It is tenant-scoped: one customer’s burst never starves another customer’s monitor of the same host. It is fail-fast: a check that would exceed the cap is recorded as `degraded` with `error="throttled: host concurrency cap"` and skipped (no alert fires, because the upstream is fine and the back-pressure is operator-side). Counters: `uptimepage_host_throttle_waits_total{kind="host"}` (attempts) and `uptimepage_host_throttle_drops_total` (rejections).
- `rdap_max_inflight` (default `1`) is the process-wide per-TLD RDAP concurrency cap (across all tenants). Daily check cadence plus a per-TLD slot means deep queues drain quickly without bursting any registry. Same fail-fast behavior and counters as the per-host cap.
- `storage.clickhouse.buffer_size` is the mpsc capacity between worker pool and batcher. Sized for ~1 s of bursts at peak RPS. Drops increment `uptimepage_storage_dropped_results_total{reason="queue_full"}`; that metric is your back-pressure signal.
- `storage.clickhouse.batch_size` vs `batch_timeout_ms` trade tail latency for throughput. `1000 / 500ms` is a good starting point at ~20k rps.
- `scheduler.jitter_pct` prevents synchronized fleet-wide ticks. Default 10% is enough to spread N targets across an interval without making individual schedules unpredictable.
- `dns.servers` accepts either bare IPs (`"1.1.1.1"`) or `ip:port` form. Used as is, with no system resolver fallback.
- `security.allow_private_targets` is the SSRF guard. Default `false` blocks:
  - Loopback (`127.0.0.0/8`, `::1`)
  - RFC1918 private (`10/8`, `172.16/12`, `192.168/16`)
  - Link-local (`169.254/16`, `fe80::/10`), which covers AWS/GCP metadata `169.254.169.254`
  - Carrier-grade NAT (`100.64/10`)
  - IPv6 ULA (`fc00::/7`), discard, IPv4-mapped private, documentation ranges
  - Multicast, broadcast, unspecified, reserved-for-future-use
  - IPv6 transition mechanisms: `2002::/16` (6to4) and `64:ff9b::/96` (NAT64) are decoded to their embedded IPv4 and rejected when the inner IPv4 falls in any blocked range

  The guard runs both at API submission (rejects IP-literal URLs synchronously) and after DNS resolution at connect time (catches DNS rebinding). Flip to `true` for internal monitoring where private targets are the goal; operators are then responsible for network segmentation.
- `security.credentials_kek_base64` enables AES-256-GCM encryption of HTTP `basic_auth` and `bearer_token` values inside the `targets.check_spec` JSONB column. Generate with `openssl rand -base64 32`. Each write produces a fresh 12-byte random nonce; the on-disk shape is `{"$enc":"v1:<nonce>:<ciphertext>"}`. When the key is unset the service logs a startup warning and stores credentials plaintext (a dev-friendly upgrade path: existing plaintext rows continue to read after a key is provisioned). Rotation and KMS integration are out of scope for the current version; treat the KEK as long-lived and protect it via your secret-management of choice (env file with restricted mode, container secret, etc.). A malformed KEK fails the process at startup.
- `api.rate_limit` applies a per-peer-IP token bucket only to `/api/v1/*` routes (`/healthz` and `/readyz` are excluded so liveness probes never see `429`). `per_second` is the refill rate; `burst` is the bucket capacity. Excess requests get `429 Too Many Requests` with a `Retry-After` header. The bucket key is the TCP peer IP. When the service sits behind a reverse proxy, every client appears as the proxy IP, so prefer doing rate limiting at the proxy in that topology. Disabled by default; leave it off and let your reverse proxy enforce limits unless you bind the API directly to the internet.
- TLS cert checks (`type = "tls_cert"`) open a dedicated TCP+TLS handshake per probe, separate from the HTTP check path. Recommended: `interval >= 3600`, so probe traffic stays light. The check accepts any cert chain (the goal is to report expiry status, not enforce trust), so an expired or self-signed cert still produces a structured result rather than a generic handshake error.
- Domain expiry checks (`type = "domain_expiry"`) query RDAP via a process-shared outbound HTTPS client. The IANA bootstrap registry (`https://data.iana.org/rdap/dns.json`) is fetched lazily on first use and cached for process lifetime, so a missed registry update or a transient bootstrap failure persists until restart. RDAP servers rate-limit clients, so `interval >= 3600` is enforced server-side and daily is typical. The SSRF guard does not gate these requests because the destination is an IANA-published endpoint, not the user-supplied domain.
  - Sticky last-good. Each successful probe persists `(expiry_at, registrar, last_success_at)` to the `domain_expiry_state` table (PK `target_id`, denormalised `org_id`; every query filters on both). On a transient probe failure (throttle, timeout, registry 5xx, RDAP 404, network blip) the executor returns the cached verdict instead of flipping the monitor to Degraded/Down. For Up the customer-facing `error` field stays empty; Degraded/Down carry a `served_stale: …` annotation so operators can distinguish a stale serve from a fresh probe. Operators also see the staleness via the `uptimepage_domain_expiry_stale_served_total` counter.
  - Staleness ceiling: 7 days. A cached row older than 7d is treated as “registry unreachable for too long” and surfaces as a real `Error`, which is alert-eligible.
  - Cross-tenant singleflight. Concurrent probes for the same domain coalesce to one outbound RDAP request. Cache TTL on the singleflight slot is 60s: short enough that each scheduled cycle still fetches fresh, long enough to absorb scheduler-jitter waves at scale. Counter: `uptimepage_rdap_singleflight_total{outcome="hit"|"miss"}`.
- Notification channels are no longer global config. They are per-org runtime resources (Slack / Discord / Teams / Google Chat webhooks, generic HTTP webhook, Telegram bot, WhatsApp Cloud API) created via the `/api/v1/notification-channels` API; a target binds them by id in its `alerts` array. Transport secrets are sealed at rest with the credentials KEK and never echoed back. Slack POSTs `{ "text": "..." }`; the generic webhook POSTs the incident-notice JSON (plus any configured custom headers, optionally HMAC-signed; see docs/api.md). Notifications are driven by the incident engine and persisted per attempt, so delivery state survives a restart. The binding syntax and the monitor-level firing policy (confirmations, recovery, reminders, region quorum) are documented in docs/api.md.
- `api.cors` opens `/api/v1/*` to browser-origin access. Each entry in `allowed_origins` must be a full origin (`https://app.example.com`); wildcards are not parsed. Set `allow_any_origin = true` to send `Access-Control-Allow-Origin: *` explicitly. The two are mutually exclusive: combining them, or enabling CORS with an empty list, aborts startup. `allowed_methods` is echoed in the preflight response (`Access-Control-Allow-Methods`); `Access-Control-Allow-Headers` is fixed to `content-type`, which is what the JSON API needs. `/healthz` and `/readyz` are not wrapped, so liveness probes are unaffected.
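The address ranges listed for the SSRF guard above can be approximated with the standard library alone. The sketch below is not the service's implementation: the function names are hypothetical, and the exact set of "reserved" and "documentation" ranges the real guard covers may differ. It does show the documented idea of decoding 6to4 and NAT64 addresses to their embedded IPv4 before applying the IPv4 rules.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical IPv4 deny check covering the documented ranges.
fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    ip.is_loopback()                              // 127.0.0.0/8
        || ip.is_private()                        // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()                     // 169.254/16 (cloud metadata)
        || (o[0] == 100 && (o[1] & 0xC0) == 64)   // 100.64/10 carrier-grade NAT
        || ip.is_multicast()
        || ip.is_broadcast()
        || ip.is_unspecified()
        || ip.is_documentation()
        || o[0] >= 240                            // 240/4 reserved for future use
}

/// Rebuilds an IPv4 address from two 16-bit IPv6 segments.
fn embedded_v4(hi: u16, lo: u16) -> Ipv4Addr {
    Ipv4Addr::new((hi >> 8) as u8, hi as u8, (lo >> 8) as u8, lo as u8)
}

/// Hypothetical IPv6 deny check, including transition-mechanism decoding.
fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let s = ip.segments();
    if ip.is_loopback() || ip.is_unspecified() || ip.is_multicast() {
        return true;
    }
    if (s[0] & 0xfe00) == 0xfc00 { return true; }          // fc00::/7 ULA
    if (s[0] & 0xffc0) == 0xfe80 { return true; }          // fe80::/10 link-local
    if s[0] == 0x2001 && s[1] == 0x0db8 { return true; }   // 2001:db8::/32 documentation
    if s[0] == 0x0100 && s[1] == 0 && s[2] == 0 && s[3] == 0 {
        return true;                                       // 100::/64 discard
    }
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);                          // ::ffff:a.b.c.d
    }
    if s[0] == 0x2002 {
        return is_blocked_v4(embedded_v4(s[1], s[2]));     // 6to4: IPv4 in bits 16..48
    }
    if s[0] == 0x0064 && s[1] == 0xff9b && s[2..6].iter().all(|&x| x == 0) {
        return is_blocked_v4(embedded_v4(s[6], s[7]));     // NAT64: IPv4 in last 32 bits
    }
    false
}

/// Entry point: true when the resolved address must be rejected.
fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}
```

Running the same predicate on every address returned by DNS at connect time, not only on the literal in the submitted URL, is what closes the rebinding window described above.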
Quotas & rate limits
Every organization is bound to a plan. The plan is the single source of
truth for resource quotas and per-minute rate budgets — the number a request
is enforced at is the same number the API reports back. Adding a paid tier
later is one row in the plans table plus a UI page; nothing in the
enforcement path changes.
The free plan
Shipped and seeded on first migration. Generous for a small team, yet bounded enough that abuse stays cheap to absorb on a small VM.
| Quota | Free | Meaning |
|---|---|---|
| `max_targets` | 10 | Monitored targets in the org |
| `min_check_interval_secs` | 60 | Plan-side floor on a target’s check interval. The effective floor is max(this, kind_min); kind_min is 3600 for tls_cert / domain_expiry and 10 for http / tcp / dns. |
| `retention_days` | 90 | Informational: actual check-result retention is the flat ClickHouse table TTL (90d for every org), not this column |
| `max_members` | 5 | Active members in the org |
| `max_pending_invitations` | 10 | Outstanding (unaccepted) invitations |
| `max_api_tokens_per_user` | 5 | API tokens a single user may hold |
| `max_status_pages` | 1 | Public status pages the org can run |
| `max_public_components` | 10 | Distinct monitors published across all of the org’s pages (a monitor on several pages counts once) |
| `max_maintenance_windows` | 20 | Scheduled maintenance windows |
| `max_notification_channels` | 20 | Notification channels (Slack/webhook/Telegram/WhatsApp/SMS/…) in the org |
| `max_logo_size_bytes` | 1048576 | Status-page logo upload ceiling (1 MiB) |
| Rate budget (per minute) | Free | Category |
|---|---|---|
| `api_writes_per_minute` | 600 | POST/PATCH/DELETE on /api/v1/* |
| `api_reads_per_minute` | 6000 | GET/HEAD/OPTIONS on /api/v1/* |
| `bulk_ops_per_minute` | 30 | /api/v1/targets/bulk* |
| `test_now_per_minute` | 60 | POST /api/v1/targets/test + the notification-channel test endpoints |
| `check_now_per_minute` | 60 | POST /api/v1/targets/{id}/check-now |
How quotas are enforced
A resource quota is checked atomically at the write, not by a check-then-act in the handler. The friendly handler-side pre-check exists only to produce a clean error on the common, uncontended path; the race-safe guarantee is in the store:
- Targets: the count bound is inside the `INSERT` (single and bulk), handed the same `max_targets`. Concurrent creates at `limit - 1` settle at exactly `limit`, never more.
- Members: the membership insert runs under a per-org advisory lock, counts, and rolls itself back if it crossed `max_members`. Re-adding an existing member stays a no-op.
- Pending invitations: dedupe and the pending cap are enforced in one transaction under the same per-org lock; parallel duplicate-email invites yield exactly one row.
- Public components: flipping a target public is gated on `create`, `bulk`, and `PATCH` (so “create private, then edit public” cannot bypass the cap).
- API tokens: count-in-`INSERT`, scoped per user, handed `max_api_tokens_per_user`.
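The difference between a check-then-act in the handler and a bound enforced at the write can be illustrated in memory. The sketch below is an analogy only: the real guarantee lives in Postgres (count-in-`INSERT` or an advisory lock), whereas this uses an atomic counter as a stand-in for the store. The names `try_reserve` and `race_creates` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// In-memory stand-in for a count-bounded insert: the comparison against
/// the limit and the increment happen as one atomic step, so two racing
/// callers cannot both observe "one slot left" and both succeed.
fn try_reserve(count: &AtomicU32, limit: u32) -> bool {
    count
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |current| {
            if current < limit { Some(current + 1) } else { None }
        })
        .is_ok()
}

/// Fires `attempts` concurrent creates against a counter that starts at
/// `start`; returns (final count, number of accepted creates).
fn race_creates(start: u32, limit: u32, attempts: usize) -> (u32, usize) {
    let count = Arc::new(AtomicU32::new(start));
    let handles: Vec<_> = (0..attempts)
        .map(|_| {
            let count = Arc::clone(&count);
            thread::spawn(move || try_reserve(&count, limit))
        })
        .collect();
    let accepted = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|ok| *ok)
        .count();
    (count.load(Ordering::SeqCst), accepted)
}
```

Starting at `limit - 1` with many concurrent attempts, exactly one create is accepted and the count settles at `limit`, which is the property the bullet list above describes for targets and API tokens.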
Exceeding a resource quota returns 422:
{
"error": {
"code": "QUOTA_EXCEEDED",
"message": "max_targets limit reached: 10 of 10 used on the free plan.",
"field": null,
"details": { "quota": "max_targets", "current": 10, "limit": 10, "plan": "free" },
"trace_id": null
}
}
The pending-invitation cap is the one exception to the code: it predates the
unified envelope and returns 409 INVITATIONS_LIMIT. The cap itself is
enforced identically (atomic, never overshoot).
A sub-minimum check interval is its own 422, MIN_CHECK_INTERVAL, enforced
on create and PATCH, single and bulk — a target created at the floor cannot
be edited below it. The floor is max(plan.min_check_interval_secs, kind_min):
the per-kind value (3600 for tls_cert / domain_expiry, 10 for the rest)
applies regardless of plan tier — polling an expiry probe faster than once an
hour yields no signal.
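The floor computation above, `max(plan.min_check_interval_secs, kind_min)`, is small enough to sketch directly. The `Kind` enum, the function names, and the error string are hypothetical; only the numeric values (3600 and 10) and the `MIN_CHECK_INTERVAL` code come from the text.

```rust
/// Hypothetical enumeration of the check kinds named in this section.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Http,
    Tcp,
    Dns,
    TlsCert,
    DomainExpiry,
}

/// Per-kind minimum interval in seconds, independent of plan tier.
fn kind_min_secs(kind: Kind) -> u64 {
    match kind {
        Kind::TlsCert | Kind::DomainExpiry => 3600,
        Kind::Http | Kind::Tcp | Kind::Dns => 10,
    }
}

/// Effective floor: the larger of the plan floor and the per-kind floor.
fn effective_floor_secs(plan_min_secs: u64, kind: Kind) -> u64 {
    plan_min_secs.max(kind_min_secs(kind))
}

/// Accepts the requested interval or reports a MIN_CHECK_INTERVAL violation.
fn check_interval(requested_secs: u64, plan_min_secs: u64, kind: Kind) -> Result<u64, String> {
    let floor = effective_floor_secs(plan_min_secs, kind);
    if requested_secs < floor {
        Err(format!("MIN_CHECK_INTERVAL: {requested_secs}s is below the {floor}s floor"))
    } else {
        Ok(requested_secs)
    }
}
```

On the free plan (floor 60), an `http` target at 30 s is rejected, and a `tls_cert` target is held to 3600 s even though the plan floor is lower.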
Rate limiting
Two app-side tiers, both keyed on the authenticated subject (never the
TCP peer): (org, category) and (user, category). Both are checked; the
org tier fires first because it protects shared resources. The per-minute
budget comes from the org’s plan. The request category is derived from the
path and method:
- path contains `/bulk` → `bulk_ops`
- path ends `/test` → `test_now`
- path ends `/check-now` → `check_now`
- otherwise `GET`/`HEAD`/`OPTIONS` → `api_reads`, else → `api_writes`
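The category rules above can be sketched as a single function. This assumes the rules are evaluated top to bottom in the order listed; the `Category` enum and the helper names are hypothetical, while the scope string shape (`per_org_api_writes`, `per_user_bulk_ops`) follows the examples given elsewhere in this document.

```rust
/// Hypothetical request categories matching the documented budget names.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Category {
    BulkOps,
    TestNow,
    CheckNow,
    ApiReads,
    ApiWrites,
}

/// Derives the category from method and path, first match wins.
fn categorize(method: &str, path: &str) -> Category {
    if path.contains("/bulk") {
        Category::BulkOps
    } else if path.ends_with("/test") {
        Category::TestNow
    } else if path.ends_with("/check-now") {
        Category::CheckNow
    } else if matches!(method, "GET" | "HEAD" | "OPTIONS") {
        Category::ApiReads
    } else {
        Category::ApiWrites
    }
}

/// Budget name as used in plan columns (without the `_per_minute` suffix).
fn category_name(category: Category) -> &'static str {
    match category {
        Category::BulkOps => "bulk_ops",
        Category::TestNow => "test_now",
        Category::CheckNow => "check_now",
        Category::ApiReads => "api_reads",
        Category::ApiWrites => "api_writes",
    }
}

/// Scope label for a tier ("org" or "user"), e.g. `per_org_api_writes`.
fn scope_label(tier: &str, category: Category) -> String {
    format!("per_{tier}_{}", category_name(category))
}
```

Because the path rules come first, a `POST` to a bulk endpoint is charged to `bulk_ops` rather than `api_writes`.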
Exceeding a budget returns 429 with a Retry-After header:
{
"error": {
"code": "RATE_LIMITED",
"message": "Too many requests.",
"field": null,
"details": { "scope": "per_org_api_writes", "retry_after_secs": 30 },
"trace_id": null
}
}
The limiter is a governor cell per (scope, category) key in a DashMap.
A janitor evicts entries idle past the threshold so the map stays bounded by
the number of active tenants, not by request volume; its lifetime is bound
to the limiter so a refactor cannot silently drop the sweep and leak the
map. Unauthenticated requests fall through untouched — per-IP limiting for
those (auth endpoints, org creation, the public status surface) is the
reverse proxy’s job; see Deployment.
Checks themselves are not rate-limited — the scheduler path never enters this middleware, so monitoring throughput is unaffected.
Every quota / rate-limit / abuse rejection is recorded to the append-only
quota_events table (event, quota_name, details, hashed IP) as
fire-and-forget — it never blocks the response. It is the data source for
abuse review.
Usage transparency
| Endpoint | Returns |
|---|---|
| `GET /api/v1/orgs/{id}/usage` | Plan + current vs limit for every org-scoped quota, policy values, rate budgets, feature flags. Member-gated (a non-member gets the same 404 as GET /orgs/{id}). |
| `GET /api/v1/me/usage` | The caller’s api_tokens and owned_orgs current/limit. |
The operator UI surfaces the same numbers at /settings/usage as progress
bars (an unlimited self-host limit renders as ∞). Reported limit == enforced
limit by construction: both read the same plan and the same count query.
Anti-abuse
Two deny-lists, applied when a target is created, bulk-created, updated, or
test-run. A block is a 400, audited to quota_events with
event = abuse_blocked.
- URL patterns: a case-insensitive regex set of attack-recon paths (exposed VCS dirs, `.env`, credential paths, admin panels, WordPress `xmlrpc` pingback, Spring actuator, backup/dump extensions, …). A match is `400 URL_PATTERN_BLOCKED` / `ABUSE_BLOCKED`. The shipped patterns and the compiled fallback are kept byte-identical by a drift guard.
- Domains: a YAML deny-list (`config/abuse_denylist.yaml`) matched hierarchically: listing `example.com` also blocks `eu.status.example.com`. It carries the operator’s own domain (don’t monitor yourself) and competing uptime/status providers (monitoring another monitor forms a load-amplification chain). A match is `400 DOMAIN_DENYLISTED`. Dedicated monitoring SaaS are listed at the apex; multi-tenant status-page hosts are listed narrowly so legitimate vendor-status checks are not over-blocked.
The list loads once at startup; changes need a restart in this release. A bad regex or malformed YAML is a clean startup config error, never a crash loop.
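Hierarchical matching of the domain deny-list, where an entry blocks itself and every subdomain beneath it, can be sketched as a suffix match on a label boundary. The function name is hypothetical, and case-folding plus trailing-dot trimming are assumptions of this sketch rather than documented behaviour.

```rust
/// Returns true when `host` equals a deny-list entry or is a subdomain of
/// one. Matching on ".{entry}" keeps "notexample.com" from matching
/// "example.com".
fn is_denylisted(host: &str, denylist: &[&str]) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    denylist.iter().any(|entry| {
        let entry = entry.trim_end_matches('.').to_ascii_lowercase();
        host == entry || host.ends_with(&format!(".{entry}"))
    })
}
```

Listing a provider at the apex therefore blocks all of its hosts, while listing only a narrow subdomain leaves the rest of that provider's domain monitorable.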
Configuration
[quotas]
plan_cache_ttl_secs = 300 # org→plan cache; a plans-table edit takes
usage_cache_ttl_secs = 10 # effect within this window
A plans-table change is invisible until the plan cache’s TTL elapses (a cache hit is zero DB round-trips on the hot path), then the next lookup refetches.
Single-tenant deploys raise limits the same way SaaS does: edit (or
INSERT) the plans row the org is assigned to, or attach a
plan_overrides row with the cap fields you want to raise. There is no
config-side override knob — every quota lives in Postgres so the
audit-trail covers both modes.
Every numeric quota / rate / interval is validated at config load —
< 1 is rejected with the offending field named, never a panic in
router or limiter construction.
The reverse-proxy per-IP tiers (auth endpoints, org creation, public surface) are documented in Deployment.
Metrics
Prometheus exposition on metrics_bind (default 127.0.0.1:9090/metrics).
Series
Names below are the on-wire names exactly as registered in
src/observability/metrics.rs (observability::metrics::names) and
sampled in src/observability/sampler.rs. Dashboard queries must use
these names verbatim.
| Name | Type | Purpose |
|---|---|---|
| `uptimepage_checks_total{status}` | counter | checks completed, partitioned by terminal status (up/down/degraded/error) |
| `uptimepage_checks_errors_total{kind}` | counter | error breakdown by kind; currently only circuit_open is emitted (a check skipped because its host breaker was open) |
| `uptimepage_check_redirects_total{outcome}` | counter | HTTP redirect hops (followed / limit_exceeded / invalid_location / blocked_scheme) |
| `uptimepage_circuit_breaker_state_changes_total{from,to}` | counter | breaker state transitions |
| `uptimepage_storage_writes_total{store,result}` | counter | batcher flush outcomes |
| `uptimepage_storage_dropped_results_total{reason}` | counter | results dropped before reaching the sink (queue full, etc.) |
| `uptimepage_notifications_total{channel,kind}` | counter | alert notifications dispatched |
| `uptimepage_notifications_failures_total{channel}` | counter | notification dispatches that returned an error |
| `uptimepage_alerts_dropped_total{reason}` | counter | incident paging signals dropped before reaching the escalation engine, by NotificationReason (opened/escalated/resolved/reopened/no_data/data_resumed). A lifecycle change never blocks on paging throughput, so a saturated signal channel drops here; the incident row stays in Postgres for the reconcile sweep |
| `uptimepage_notifications_dead_lettered_total{transport}` | counter | incident pages that exhausted all retries without delivering, by transport |
| `uptimepage_telegram_send_deferred_total` | counter | Telegram sends held back by the per-bot/per-chat send budget rather than sent immediately. Sustained growth means the central bot is rate-limit bound |
| `uptimepage_host_throttle_waits_total{kind}` | counter | per-(org,host,port) (kind=host) or per-TLD RDAP (kind=rdap) throttle acquire attempts |
| `uptimepage_host_throttle_drops_total` | counter | host-bulkhead rejections: kind=host over-cap checks recorded as degraded without firing alerts. RDAP drops do NOT increment this counter; they fall through to the sticky last-good path (see domain_expiry_stale_served_total) |
| `uptimepage_rdap_singleflight_total{outcome}` | counter | RDAP singleflight outcome per domain: hit (cached, no outbound request) or miss (fetcher invoked) |
| `uptimepage_domain_expiry_stale_served_total{kind}` | counter | times the domain-expiry executor served a cached last-good answer instead of a fresh probe. kind distinguishes the cause: throttled, timeout, lookup_error, or fresh_error (no usable last-good, emitted as a real Error instead) |
| `uptimepage_domain_expiry_state_write_failed_total` | counter | failures writing the last-good cache row after a successful probe. Sustained values mean the sticky cache is going cold even though probes succeed; the typical cause is Postgres write degradation |
| `uptimepage_scheduler_refresh_failed_total` | counter | registry refresh ticks that returned an error from Postgres. Alert on a sustained rate above your normal noise floor; persistent failures put the scheduler into exponential backoff (capped at 10× the configured refresh interval) and keep workers running with cached ScheduledTarget snapshots |
| `uptimepage_rdap_singleflight_slots` | gauge | live entries in the in-process RDAP singleflight cache. Bounded under normal load by the set of monitored domains; sudden growth signals a code path feeding non-target domains into the cache |
| `uptimepage_scheduler_consecutive_refresh_failures` | gauge | consecutive registry refresh failures since the last success. Primary alarm signal for a stuck scheduler: page when the value stays above 5 for more than a few minutes. Resets to 0 on the first successful refresh |
| `uptimepage_scheduler_refresh_duration_ms` | histogram | wall-clock duration of one registry refresh tick (Postgres query + decode + DashMap diff). p99 climbing into the hundreds of ms means the current full-scan refresh is starting to strain at scale, which is the trigger for the deferred incremental-sync work |
| `uptimepage_build_info{version}` | counter | set to 1 once at startup so the endpoint is never empty |
| `uptimepage_check_duration_ms` | histogram | per-check wall time. The uptimepage_check_*_ms family is exposed as histogram buckets (not summary quantiles) so percentiles aggregate correctly across regions; query with histogram_quantile() |
| `uptimepage_check_dns_ms` | histogram | DNS resolution latency (recorded in the hickory wrapper) |
| `uptimepage_check_connect_ms` | histogram | TCP connect latency (every HTTP check connects fresh) |
| `uptimepage_check_tls_ms` | histogram | TLS handshake latency (per HTTPS check) |
| `uptimepage_check_ttfb_ms` | histogram | time-to-first-byte: request sent to response headers |
| `uptimepage_storage_batch_size` | histogram | flush batch sizes |
| `uptimepage_storage_write_duration_ms` | histogram | flush durations |
| `uptimepage_telegram_send_wait_ms` | histogram | wait imposed on a Telegram send by the send budget before its slot opened |
| `uptimepage_targets_total` | gauge | targets in this process’s scheduler registry (sampled). Non-zero only where in-process probing runs; a brain doing agent-only probing reports 0 by design, so use uptimepage_targets_enabled for the configured-monitor count |
| `uptimepage_targets_enabled{kind}` | gauge | configured enabled monitors counted from Postgres, by kind. Slow-cadence inventory gauge, scrape-cached so request load never reaches Postgres; correct on a brain regardless of where probing runs |
| `uptimepage_users_active` | gauge | non-deleted user accounts counted from Postgres. Slow-cadence inventory gauge, scrape-cached |
| `uptimepage_workers_in_flight` | gauge | current worker-pool semaphore depth (sampled). Emitted by every probing process, so on a brain doing agent-only probing the real value is on the agent’s role=probe series, not the brain’s near-zero one |
| `uptimepage_result_queue_depth` | gauge | depth of the result channel buffer (sampled). Present on both the agent (egress to the control plane) and the brain (ingest to storage); separate them by role |
| `uptimepage_circuit_breakers_open` | gauge | currently-open breakers (sampled). Probe-side: read the role=probe series |
| `uptimepage_monitors_unmonitored` | gauge | monitors whose covering probes have all gone silent (no fresh results), from the silence sweep. Distinct from down: these have no data at all |
| `uptimepage_agent_up{region,agent}` | gauge | 1 if a regional agent checked in within the staleness window, else 0. Emitted by the control plane from agents.last_seen_at, so it covers remote agents that Alloy can’t scrape. Per-agent series can freeze on agent removal, so alerts use uptimepage_agents_enabled_down |
| `uptimepage_agent_last_seen_age_seconds{region,agent}` | gauge | seconds since a regional agent last checked in. Climbs unbounded when an agent goes dark |
| `uptimepage_agents_enabled_down` | gauge | count of enabled regional agents currently past the staleness window. Recomputed every sweep so it never latches. The dead-man signal for a probe region going dark |
| `uptimepage_region_agents_total{region}` | gauge | enabled agents configured for a region (the quorum denominator). Brain-side from the agents table |
| `uptimepage_region_agents_up{region}` | gauge | enabled agents in a region fresh within the staleness window (the quorum numerator). up / total is the region’s health fraction; up == 0 means the region’s agents have all gone stale. Recomputed each sweep; like the per-agent gauges it can freeze if a region’s last agent is removed. Covers agents Alloy can’t scrape |
| `uptimepage_region_checks_window{region}` | gauge | checks completed in a region over the recent sampling window. Brain-side count from ClickHouse, so it covers remote agents Alloy can’t scrape. Only regions with results in the window appear |
| `uptimepage_region_checks_up_window{region}` | gauge | checks that returned up in a region over the recent window. Divide by uptimepage_region_checks_window for the success ratio |
| `uptimepage_region_check_latency_p95_ms{region}` | gauge | approximate p95 check latency in a region over the recent window, in ms. Goes stale for a dark region (no new rows), so gate panels on uptimepage_region_agents_up |
| `uptimepage_pg_pool_size` | gauge | total connections held in the sqlx Postgres pool (idle + in-use). Bounded above by storage.postgres.max_connections |
| `uptimepage_pg_pool_idle` | gauge | connections sitting idle in the Postgres pool. A persistent idle = 0 alongside in_use at the max is the saturation signal |
| `uptimepage_pg_pool_in_use` | gauge | connections checked out of the Postgres pool right now (size − idle). Alert on a sustained high in_use / size ratio |
| `uptimepage_process_resident_bytes` | gauge | resident set size of the process (VmRSS) in bytes. Linux only; absent on non-Linux dev runs. Early-warning signal for slow leaks ahead of the OOM killer |
| `uptimepage_clickhouse_max_part_count_for_partition` | gauge | ClickHouse MaxPartCountForPartition (sampled from system.asynchronous_metrics). Partition-explosion early warning: climbs toward parts_to_throw_insert (default 3000) if a high-cardinality column is added to PARTITION BY |
| `uptimepage_http_requests_total{method,route,status}` | counter | inbound HTTP requests handled. route is MatchedPath (the path-pattern with placeholders), so cardinality is bounded by the static router table, never by per-tenant ids. status is bucketed 2xx/3xx/4xx/5xx/other; query sum by (status) (rate(...[5m])) for the SLO ratio |
| `uptimepage_http_request_duration_ms{method,route}` | histogram | inbound HTTP request latency, exposed as summary quantiles (single web instance, no cross-instance merge). Query name{quantile="0.99"} for tail latency per route |
| `uptimepage_http_responses_inflight` | gauge | inbound HTTP requests currently being served. Climbing alongside flat throughput points at handler back-pressure on a downstream pool |
| `uptimepage_ratelimit_drops_total{scope}` | counter | HTTP 429s from the per-org / per-user rate-limit middleware. scope is the same string carried in the error body (per_org_api_writes, per_user_bulk_ops, …) so dashboards can join with record_quota_event audit rows. Abuse signal: a tenant hammering the API spikes one scope before shared resources notice |
Scrape interval of 15 s is plenty — counters are written from hot tokio tasks; histograms aggregate per bucket without lock contention.
Histogram exposition. Two forms. The uptimepage_check_*_ms family is
configured with explicit buckets and exported as a Prometheus histogram
(name_bucket{le="..."} plus name_sum / name_count); query it with
histogram_quantile(0.99, sum(rate(name_bucket[5m])) by (le)) so percentiles
pool correctly across regional agents. Every other *_ms / *_size histogram
keeps the default exposition, a Prometheus summary with precomputed
quantile series (name{quantile="0.5|0.9|0.95|0.99|0.999"}) plus name_sum
and name_count; query those as name{quantile="0.99"} directly. Gauges
carry no org_id label: these are single-instance operator metrics, not
per-tenant ones.
Scrape labels. The collector stamps two labels the app does not set: role (control-plane on the brain, probe on a regional agent) and, on probe series, region. The brain and a probe both emit the prober and pipeline metrics (check_*, workers_in_flight, circuit_breakers_open, result_queue_depth, storage_*, process_resident_bytes), so filter by role to read the one you mean rather than summing two processes. The Ops dashboard pins probe panels to role=probe and filters them by a $region variable; the Business dashboard reads the control-plane-only inventory gauges.
The uptimepage_region_* gauges are different: the brain emits them with a region label it sets itself (from the agents table and from ClickHouse), not a collector-stamped scrape label. They are the per-region surface on a SaaS control plane, where the regional agents are not scraped at all: liveness and quorum from the agents table (region_agents_up / _total), throughput and latency from ClickHouse (region_checks_window / _up_window / region_check_latency_p95_ms). One scrape point, cost scales with regions, not tenants or fleet size.
OpenTelemetry tracing
Spans are exported over OTLP/HTTP (protobuf) when both
observability.tracing_enabled and observability.grafana.enabled are
true. The exporter targets observability.grafana.otlp_endpoint
(the OTLP base; /v1/traces is appended) and authenticates with
Authorization: Basic base64(instance_id:api_key). The destination is
any OTLP/HTTP collector — Grafana Cloud Tempo, Jaeger, an OpenTelemetry
Collector, etc.
- `api_key` is read only from `UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY`, never from a file.
- Sampling is parent-based over a head ratio (`grafana.trace_sample_ratio`, default `0.05`); a sampled parent keeps its children.
- Resource attributes: `service.name = uptimepage`, `service.version` = the build version.
- Disabled by default and zero-cost when off: no exporter is built, no network egress, no per-check overhead.
- A batch exporter ships spans in the background; it is flushed and stopped on graceful shutdown so the final spans are not lost. A transport build failure logs a warning and the service continues without traces, so telemetry never takes down monitoring.
Inconsistent settings (export on but endpoint/instance/key missing, or
a sample ratio outside [0.0, 1.0]) fail fast at startup as a config
error, not a runtime surprise. See
Configuration for the keys and env overrides.
HTTP connection phase timings
Every HTTP check opens a fresh connection. There is no pool: a monitor probes each target once per interval, so a pool would rarely reuse a socket, and a fresh connect is what lets the probe attribute time to each phase. check_dns_ms, check_connect_ms, and check_tls_ms are timed during that establishment, and check_ttfb_ms runs from request send to response headers. The same four values are written per check into ClickHouse, which powers the detail-page latency-breakdown chart.
Deployment
Production deployment with Caddy + basic auth
For real-world operation, use the production stack under deployment/ in the repo. It puts a Caddy reverse proxy in front of the Rust service with:
- Automatic TLS via Let’s Encrypt (HTTP/2 and HTTP/3 on by default)
- Basic auth on the UI and API
- Postgres and ClickHouse on the internal docker network — no published ports
- ClickHouse memory-capped at ~2 GB (see deployment/clickhouse-config.xml)
Setup:
cd deployment
cp .env.example .env
$EDITOR .env # set domain, ACME email, bcrypt hash, DB passwords, KEK
docker compose up -d
deployment/README.md is the authoritative source for setup, user management, password rotation, backups, and troubleshooting.
Authentication boundary
The Rust service ships an in-binary auth stack (GitHub OAuth + opaque API tokens; magic-link sign-in is gated by config). The native auth is the boundary; a basic-auth layer in front of Caddy would double-prompt. Single-tenant deploys behave the same way — sign up as the first user and the operator surface is yours.
/healthz and /readyz are intentionally exposed without auth so
uptime probes, load balancers, and orchestrators can hit them.
/metrics on the public domain returns 404 — scrape it on the internal
docker network instead.
The public status page (/status, /status/*, /api/public/*,
/static/*, /robots.txt, /favicon.ico) is also unauthenticated by
design — see Public status surface below.
See Authentication for the in-binary flow.
Email provider (Resend)
Transactional email (invitations, magic-link sign-in) goes through the
EmailSender trait. Production uses Resend; dev
and test default to the log provider, which writes the action URL to
the tracing log so you can copy-paste it into a browser.
Setup:
- Create a Resend account and verify your sending domain. Resend will give you DKIM and DMARC records to add to DNS.
- Generate an API key with emails.send permission only.
- Configure the service:
  [email]
  provider = "resend"
  from_name = "Acme Status"
  from_address = "no-reply@status.acme.test"
  [email.resend]
  api_key = "re_…"
  Or via env: UPTIMEPAGE_EMAIL__PROVIDER=resend, UPTIMEPAGE_EMAIL__RESEND__API_KEY=re_….
- auth.public_base_url must be set to the externally reachable origin (e.g. https://status.acme.test); the value is embedded in the links the recipient receives.
The factory rejects boot when provider = "resend" is set without a
non-empty API key — fail-fast over send-time surprise.
Public status surface
The Caddyfile carries an @public matcher that short-circuits basic_auth for the public status paths and adds a per-IP rate limit (60 req/min) via the caddy-ratelimit plugin. The stock caddy:2-alpine image doesn’t include that plugin, so the production deployment uses a custom custom-caddy:2 image built with xcaddy:
docker build -t custom-caddy:2 - <<'EOF'
FROM caddy:2-builder AS builder
RUN xcaddy build --with github.com/mholt/caddy-ratelimit
FROM caddy:2-alpine
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
EOF
Then point the caddy service in deployment/docker-compose.yml at custom-caddy:2. Full procedure (including the opt-out path that drops the rate-limit block) is in deployment/README.md.
The same custom image carries two more per-IP zones: auth_endpoints (10/min on /auth/*, /api/v1/me, invitation accept) and org_creation (3 per 24 h on POST /api/v1/orgs). These are the edge tier; the per-org / per-user budgets the service enforces from each org’s plan are the Quotas & rate limits tier — complementary, since behind the proxy the app sees only the proxy as the peer.
Per-org subdomains (SaaS)
When tenancy.subdomain_public_routes = true, each org’s page is served at {slug}.{public_status.base_domain} (apex-wildcard shape). That needs:
- a wildcard DNS record *.{domain} pointing at the host (plus explicit A/AAAA records for any operator subdomain such as app or mail, which take precedence over the wildcard);
- a wildcard TLS cert for *.{domain}. HTTP-01 can’t validate a wildcard, so the custom Caddy image also bundles caddy-dns/hetzner and solves the ACME DNS-01 challenge using a HETZNER_DNS_API_TOKEN (zone-edit scope) from .env. The operator host (app.{domain}) is kept on its own per-host HTTP-01 cert in a separate Caddyfile block so a wildcard-key compromise does not reach the operator surface.
The wildcard means a new org’s page works the moment its owner enables it — no per-org DNS or cert step. The end-to-end runbook (Hetzner zone setup, token scope, building the image, verifying the wildcard cert) is in deployment/README.md. The model — host routing, branding, opt-in gating, cookie scoping — is in Per-org status pages.
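As a rough illustration of the {slug}.{base_domain} shape, the sketch below pulls an org slug out of a Host header. The function name, the reserved-label list, and the assumption that the host arrives already lower-cased are all hypothetical; the real routing rules live in Per-org status pages.

```rust
/// Returns the org slug for hosts shaped like `{slug}.{base_domain}`.
/// Operator subdomains (for example "app") are passed in as `reserved`
/// and never treated as org slugs. Assumes `host` is already lower-case.
pub fn org_slug_from_host<'a>(
    host: &'a str,
    base_domain: &str,
    reserved: &[&str],
) -> Option<&'a str> {
    // Drop an optional ":port" suffix, as seen in local development.
    let host = host.rsplit_once(':').map_or(host, |(h, _)| h);
    // Require exactly "<label>." in front of the base domain.
    let label = host.strip_suffix(base_domain)?.strip_suffix('.')?;
    if label.is_empty() || label.contains('.') {
        return None; // apex request or a nested subdomain
    }
    if reserved.iter().any(|r| *r == label) {
        return None; // operator surface, not an org page
    }
    Some(label)
}
```

Requiring the literal dot before the base domain matters: a host such as evil-lvh.me ends with the base domain string but is not a subdomain of it.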
For the operator workflow (enabling components, narrating incidents, scheduling maintenance) see Public status page.
Docker
docker compose up -d brings up Postgres 17, ClickHouse 26.3, and the monitor on the same network. Compose env vars wire the monitor to the stack:
UPTIMEPAGE_STORAGE__POSTGRES__URL: postgres://monitor:monitor@postgres:5432/monitor
UPTIMEPAGE_STORAGE__CLICKHOUSE__URL: http://clickhouse:8123
UPTIMEPAGE_STORAGE__CLICKHOUSE__USER: monitor
UPTIMEPAGE_STORAGE__CLICKHOUSE__PASSWORD: monitor
UPTIMEPAGE_OBSERVABILITY__LOG_FORMAT: json
The runtime image is gcr.io/distroless/static-debian12:nonroot for a minimal attack surface, no shell, and no glibc. Built from a static musl binary via rust:1-alpine. Final image is 16 MB — both uptimepage and loadtest binaries fit in the same image.
Bind addresses
Defaults are loopback (127.0.0.1:8080 API, 127.0.0.1:9090 metrics). Override via env for non-loopback exposure:
UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080 \
UPTIMEPAGE_SERVER__METRICS_BIND=0.0.0.0:9090 \
./uptimepage
There is no built-in auth on the API port. Front it with a proxy or keep it on a private network. The ready-made Caddy stack under deployment/ does this for you.
Metrics shipping (Grafana Cloud)
The Prometheus /metrics endpoint can be shipped to Grafana Cloud by a
Grafana Alloy sidecar. It is opt-in: the compose stack only starts it
under the metrics profile (docker compose --profile metrics up -d),
so the default deployment is unchanged. Credentials are read from .env
(gitignored) and never written into deployment/config.alloy.
deployment/README.md (“Metrics”) is the authoritative setup, including
how to obtain the Grafana Cloud URL/token, the internal-network bind, the
ready-made dashboard, and how to verify ingestion.
Migrations
- Postgres: migrations/postgres/*.sql, applied at startup via sqlx::migrate! (tracked in _sqlx_migrations)
- ClickHouse: migrations/clickhouse/*.sql, applied idempotently via CREATE … IF NOT EXISTS at startup
No external migrator. The app owns its schema lifecycle symmetrically.
Resource sizing
- checker.max_concurrent_checks caps simultaneous in-flight checks
- Per-check memory: small (a tokio task + an in-flight hyper request + bookkeeping)
- The practical ceiling is set by file descriptors and ephemeral ports, not RAM
- At 50k concurrent checks against external targets, RSS sits around 200-400 MB depending on response sizes
- The optional metrics profile adds a Grafana Alloy container (~100 MB RSS plus a small bounded remote-write WAL volume); account for it when sizing the host if you enable it
Graceful shutdown
The binary listens for SIGINT and SIGTERM, cancels the scheduler and batcher via a shared CancellationToken, awaits both background tasks, and exits within 10 s. The batcher’s cancel branch drains any pending results before returning. A warning is logged if the deadline is exceeded.
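The drain-on-cancel behaviour can be sketched with std threads and channels. The service itself uses Tokio and a CancellationToken; here an AtomicBool stands in for the token, u64 stands in for a check result, and run_batcher is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::time::Duration;

fn push_and_maybe_flush(
    batch: &mut Vec<u64>,
    flushed: &mut Vec<Vec<u64>>,
    item: u64,
    batch_size: usize,
) {
    batch.push(item);
    if batch.len() >= batch_size {
        // Stand-in for the storage write: move the full batch out.
        flushed.push(std::mem::take(batch));
    }
}

/// Collects results into batches until cancelled, then drains whatever is
/// still queued so that no in-flight result is lost on shutdown.
pub fn run_batcher(
    rx: Receiver<u64>,
    cancel: Arc<AtomicBool>,
    batch_size: usize,
) -> Vec<Vec<u64>> {
    let mut flushed: Vec<Vec<u64>> = Vec::new();
    let mut batch: Vec<u64> = Vec::new();
    loop {
        if cancel.load(Ordering::Acquire) {
            // Cancel branch: take everything already in the channel.
            while let Ok(item) = rx.try_recv() {
                push_and_maybe_flush(&mut batch, &mut flushed, item, batch_size);
            }
            break;
        }
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(item) => push_and_maybe_flush(&mut batch, &mut flushed, item, batch_size),
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    if !batch.is_empty() {
        flushed.push(batch); // final partial batch
    }
    flushed
}
```

The caller would wrap the join on this task in the 10 s deadline and log a warning if the deadline passes first.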
Development
Local setup for iterating on the service. For production deployment see deployment.md.
Prerequisites
- Rust 1.95+ (edition 2024) via rustup
- Docker + Docker Compose (for Postgres + ClickHouse)
- Optional: just (brew install just). Every workflow below has a one-word just recipe equivalent. Run just to list them.
Two workflows
| | First build | Incremental | Notes |
|---|---|---|---|
| Host workflow | ~2 min | ~3 s | cargo run natively; only deps in Docker. Best for iteration. |
| Docker dev (cargo-watch) | ~3 min | ~3 s | Source bind-mounted, rebuilds happen inside the container with a cached target/. Live reload. |
| Docker prod-shape | ~5 min | ~30 s | Rebuilds image. Matches the prod build. Use for CI-shaped smoke tests. |
Host workflow (recommended for day-to-day)
Bring up just Postgres + ClickHouse:
docker compose -f compose.dev.yml up -d
Run the binary natively:
cargo run --bin uptimepage
config/default.toml already points at localhost:5432 and localhost:8123,
so no env overrides are needed. Edit code → Ctrl-C → cargo run again.
Tear down (keeps DB volumes):
docker compose -f compose.dev.yml down
Wipe data too:
docker compose -f compose.dev.yml down -v
Docker dev workflow (live reload inside a container)
Runs the binary inside a container that bind-mounts the repo and re-runs
cargo run via cargo-watch on every
source change. The compiled target/ and the linux Tailwind CLI live in named
volumes, so they persist across restarts and don’t clash with the host build.
docker compose -f compose.dev.yml --profile dev-app up -d --build
docker compose -f compose.dev.yml logs -f uptimepage
First run takes ~3 min (toolchain + cargo-watch install + cold build + Tailwind
fetch). After that, edits to src/, templates/, or static/css/input.css
trigger an incremental rebuild + restart inside the container, typically
under 5 s.
Don’t combine this with cargo run on the host — both bind 8080.
Stop just the app (keep pg + ch up):
docker compose -f compose.dev.yml stop uptimepage
Docker prod-shape workflow (full stack via Dockerfile)
docker compose up -d --build uptimepage
The Dockerfile uses cargo-chef
to split dependency compile from app compile. The first build is slow; later
src-only edits skip the dep cook layer and finish in ~30 s.
If you have the host workflow running and want to switch to docker, stop the native binary first to free port 8080 (or stop the docker service first to free the host port).
Verify it’s up
curl http://localhost:8080/healthz # liveness
curl http://localhost:8080/readyz # readiness (DBs reachable)
Browse:
- http://localhost:8080/ (operator dashboard)
- http://localhost:8080/status (public status page)
- http://localhost:8080/docs (Swagger UI)
Operator UI locally
The dev-app container runs the same SaaS code path as production. The
host workflow (cargo run against config/default.toml) does too — the
binary is multi-tenant SaaS in every environment; a single-tenant deploy
is just a SaaS deploy with one signed-up user.
Get an authenticated owner session without GitHub OAuth:
just up-app # SaaS-mode stack; wait for "api listening"
just dev-login # seeds user+org+owner+session, prints the cookie
Then, in the browser devtools Console at http://localhost:8080:
document.cookie = "_sm_session=devsession-localtest-0000000000; path=/";
Reload — you’re the owner of “Dev Org”. The public page is at
http://devorg.lvh.me:8080/status (*.lvh.me resolves to
127.0.0.1, no /etc/hosts edit). just dev-login also prints a curl
snippet that passes the cookie directly, for API-only checks.
After editing a migration in place (pre-launch policy), the dev DB trips
sqlx’s “migration N modified” checksum guard — just db-reset drops and
recreates it (ClickHouse and the warm build cache are kept). down -v wipes
the seeded session; re-run just dev-login.
Seed a target
curl -sS -X POST http://localhost:8080/api/v1/targets \
-H 'content-type: application/json' \
-d '{
"name": "example",
"check": {"type":"http","url":"https://example.com/","method":"GET",
"timeout":5000,"follow_redirects":false,"max_redirects":0,
"expected_status":{"kind":"exact","value":200},
"headers":{},"verify_tls":true},
"interval": 60, "enabled": true, "tags": [],
"public_status": true
}'
public_status: true makes the target appear on /status and addressable via
/api/public/v1/badge.svg?component=<id>.
Seed UI fixtures
For end-to-end UI smoke (every public-page render path, varied check_spec
kinds, notification channels, alert bindings, maintenance binding, adversarial
title) use the bulk fixture script after just dev-login:
just seed-fixtures
What it seeds (under the seed-fixtures tag, idempotent):
- 14 monitors: 8 public (covering all 5 component states, Operational / Degraded / Partial outage / Major outage / Maintenance, plus the disabled-target and ungrouped render paths) and 6 internal exercising every check_spec kind (http / tcp / dns / tls_cert / domain_expiry).
- 161 incidents: 150 resolved across 87 days (enough to clear the 50-incident cap so the “Older incidents →” archive link renders), 10 active in mixed phases (investigating / identified / monitoring), 1 adversarial-title incident covering the day-popover JSON-escape path.
- 90-day ClickHouse history: per-target divergent shape via cityHash64(tid) (each component has a distinct uptime% and outage pattern), an explicit 87-89d “ancient outage” cluster on the first three targets, and a 6-day NoData gap on fix-email.
- 9 notification channels: one per ChannelConfig variant (slack, webhook, whatsapp, discord, msteams, google_chat enabled; email enabled but unverified; telegram and telegram_app disabled), with alert bindings on fix-api / fix-db / fix-auth mixing notify_recovery on/off and single/multi-channel bindings.
- 4 maintenance windows: 1 active (bound to fix-db), 2 upcoming, 1 past.
The script ends with a post-seed verification block that prints Postgres row counts, per-component last-5-min counters with an expected-vs-actual state matrix, an HTTP smoke against the public page, the adversarial-title escape check, and a 90-day ASCII day-strip per component. Exits non-zero on any mismatch — safe to chain in CI.
Env overrides: SLUG=<org> (default devorg), RESET_CH=0 to skip
ClickHouse purge if you want to layer additional rows on top of a prior seed
(default 1).
Then visit:
- Public status page: http://devorg.lvh.me:8080/
- Operator dashboard: http://app.lvh.me:8080/
Logging
docker-compose.yml sets the default level to:
uptimepage=debug,sqlx=warn,hyper=warn,tower_http=info,info
For the host workflow, pass it directly:
RUST_LOG="uptimepage=debug,sqlx=warn" cargo run --bin uptimepage
RUST_LOG always wins over the config file. Anyhow errors are printed with
{:#} from the public-status cache, so the full context chain shows up
without re-running with backtraces.
Stream container logs:
docker compose logs -f uptimepage
Faster builds
just setup # once: sccache + cargo-nextest, and the linker
# (mold on Linux; macOS prints an lld opt-in snippet)
just check # primes test-profile artifacts so `just test` skips
# the rebuild a `cargo check` -> `cargo test` profile
# switch would otherwise force
- Toolchain: rust-toolchain.toml pins 1.95 for every entrypoint (bare cargo, just, rust-analyzer, CI), so no more ad-hoc cargo +1.95.
- Linker: .cargo/config.toml selects mold for Linux targets, so just, bare cargo, and rust-analyzer share one build fingerprint (an env RUSTFLAGS that differed between them would double-build target/). A Linux build needs mold installed (just setup). macOS is opt-in (Apple clang needs lld’s machine-specific absolute path; just setup prints the ~/.cargo/config.toml snippet).
- sccache: compile cache for local dev (just sets RUSTC_WRAPPER only when present) and CI (mozilla-actions/sccache-action, with Swatinem/rust-cache reduced to cache-targets: false so they don’t double-store). Not in the release Dockerfile: cargo-chef already layer-caches deps there and the sccache mount wouldn’t survive CI.
- CI installs the linker via rui314/setup-mold; the dev-app container via apk add mold + a persistent sccache volume.
Tests
cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
cargo test --release
cargo bench
Postgres-backed tests (e.g. bulk_create_with_ragged_tags) are #[ignore]’d
by default and no-op when DATABASE_URL is unset. Bring up the stack and opt
in. Validate schema/migration changes against a throwaway DB, not the stale
monitor one (the harness auto-applies migrations on first connect):
docker compose -f compose.dev.yml up -d
docker compose -f compose.dev.yml exec -T postgres createdb -U monitor ci_verify
# Whole ignored suite (slow — builds every test binary):
DATABASE_URL=postgres://monitor:monitor@127.0.0.1:5432/ci_verify \
cargo test -- --ignored
# One suite (fast — scope to a binary; bare `nextest run` rebuilds +
# enumerates all ~48 test binaries and looks frozen for minutes):
DATABASE_URL=postgres://monitor:monitor@127.0.0.1:5432/ci_verify \
cargo test --test status_page_settings_test -- --ignored --nocapture
Database access
docker compose exec postgres psql -U monitor -d monitor
docker compose exec clickhouse clickhouse-client -u monitor --password monitor -d monitor
Same commands work against compose.dev.yml; the service names are identical.
Web UI
The single binary serves both the /api/v1/* JSON surface and a
server-rendered HTML UI at /. Stack:
- askama 0.16 + askama_web 0.16: compile-time HTML templates under templates/. Type mismatches fail cargo build.
- HTMX 2.0.9 + json-enc: bundled under static/js/. Powers partial swaps (filter, paginate, delete) and JSON form submission. No SPA framework.
- Tailwind CSS 4: CSS-first config in static/css/input.css (@source, @theme, @layer components). No tailwind.config.js.
- ECharts 6: lazy-loaded from page-level <script> tags, only where charts exist (dashboard, target detail).
build.rs runs ./bin/tailwindcss --minify before each cargo build. First
build fetches the standalone CLI (~30 MB) via scripts/fetch-tailwind.sh;
subsequent builds reuse it. After cargo build --release you have one
self-contained executable with every template, CSS byte, and vendored JS file
embedded via rust-embed.
Routes
| Path | Owner |
|---|---|
| GET / | dashboard (auto-refreshes via HTMX every 5 s) |
| GET /targets | targets list + filters |
| GET /targets/{id} | target detail with charts and time-range nav |
| GET /targets/new, /targets/{id}/edit | forms posting JSON to /api/v1/targets |
| GET /web/targets/list | tbody fragment for filter/paginate swaps |
| GET /web/partials/dashboard | chrome-free fragment for the 5 s refresh region |
| GET /docs | Swagger UI generated from /api/openapi.json |
| GET /static/* | embedded assets (css/, js/, img/) |
Every UI mutation hits an existing /api/v1/* endpoint — there are no
/web/* write routes, which keeps the API the single source of truth and
makes a future SvelteKit port a templates-only rewrite.
Adding a new page
- Add a template under templates/ (extend base.html).
- Add a #[derive(Template, WebTemplate)] struct and handler in src/web/views/.
- Register the route in src/web/routes.rs.
- Tailwind picks up new utility classes automatically via the @source "../../templates/**/*.html" directive.
UI tests
- Unit (render): every view in src/web/views/ ships a #[test] that renders the template with a fixtures struct and asserts on the output (presence of the HTMX hooks, redaction sentinels, table scaffolding).
- End-to-end: tests/web_e2e_test.rs drives the merged API+web router via tower::ServiceExt::oneshot, covering dashboard / list / detail / forms / 404 paths and verifying credential redaction never leaks real values into HTML.
cargo test --lib web:: # unit render tests
cargo test --test web_e2e_test # e2e
Troubleshooting
| Symptom | Likely cause |
|---|---|
| 503 STATUS_DATA_UNAVAILABLE | Aggregator’s first compute failed. Check uptimepage::public_status::cache ERROR log for the actual SQL/CH error. |
| docker compose up --build takes 5 min on every change | You’re on the pre-cargo-chef Dockerfile. Pull latest. |
| Native cargo run fails with Connection refused | compose.dev.yml isn’t up, or you forgot to release port 8080 from a running container. |
Load test
End-to-end harness. Spawns workers driving the production check executor against in-process mock servers. Different from the micro-benchmarks, which measure single-call cost via Criterion.
cargo run --release --bin loadtest
Linux verification (Docker)
50k concurrent runs need Linux kernel knobs that macOS doesn’t expose. The compose stack ships a loadtest profile that runs the binary inside a Linux container with the required sysctls and ulimits:
docker compose --profile loadtest build loadtest
docker compose --profile loadtest run --rm loadtest
# override on the fly
docker compose --profile loadtest run --rm \
-e CONCURRENCY=100000 -e DURATION_SECS=60 loadtest
The container sets the sysctls net.core.somaxconn=8192, net.ipv4.tcp_tw_reuse=1, and net.ipv4.ip_local_port_range=10000 65535, plus the ulimit nofile=1048576. None of these require --privileged, since the sysctls are namespaced.
Env
| Env | Default | Purpose |
|---|---|---|
| CONCURRENCY | 50000 | concurrent virtual workers |
| DURATION_SECS | 30 | how long to drive load |
| TIMEOUT_MS | 5000 | per-check request timeout |
| MOCK_PORTS | 16 | parallel in-process mock listeners; spreads 4-tuple load to avoid loopback ephemeral-port exhaustion |
| RAMP_SECS | 2 | worker start stagger window; avoids thundering-herd SYN bursts at listen() backlog |
| HTTP2 | 0 | when 1, client speaks HTTP/2 with prior knowledge (RFC 7540 §3.4). Single TCP connection multiplexes many streams; necessary to drive 50k workers on macOS where ephemeral src ports cap at ~16k |
What it does
Spawns MOCK_PORTS axum servers returning 200 ok, then drives workers in a tight loop using the same build_clients + check executor the production binary uses. Prints rolling RPS during the run and total / success / rps / p50 / p95 / p99 / error-kind histogram at the end.
macOS notes
- kern.ipc.somaxconn caps listener backlog at 128 per socket (hard kernel limit)
- Ephemeral src port range: 49152–65535 = 16,384 ports
- TIME_WAIT lingers 30 s, holding closed ports
For 50k-concurrency runs use HTTP2=1 to fold many streams onto a few TCP connections. Linux defaults (ephemeral range 32768–60999, tunable somaxconn) handle 50k HTTP/1 natively.
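The macOS figures above translate into a rough ceiling on fresh HTTP/1 connections. The arithmetic below is a sketch under simplifying assumptions: the client is the side that closes, every closed connection holds its source port for the full TIME_WAIT, and all traffic goes to a single destination address and port. Both function names are illustrative.

```rust
/// Number of ports in an inclusive ephemeral range.
pub fn ephemeral_port_count(lo: u16, hi: u16) -> u32 {
    u32::from(hi) - u32::from(lo) + 1
}

/// Sustainable new connections per second toward one destination when each
/// closed connection parks its source port for `time_wait_secs`.
pub fn max_fresh_connects_per_sec(ports: u32, time_wait_secs: u32) -> f64 {
    f64::from(ports) / f64::from(time_wait_secs)
}
```

With the macOS numbers (49152 to 65535, 30 s) this gives 16,384 ports and about 546 new connections per second per destination, which is why multiplexing over HTTP/2 or spreading load across several mock listeners is needed well before 50k concurrent workers.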
Reference numbers
Substrate caveat. Every number below was captured on a developer laptop (Apple M1 Pro, 10 cores, 16 GB). Useful for regression detection (“did this change hurt the hot path?”) and for relative comparisons between commits — not for production capacity planning. Treat them as floors, not ceilings: a real Linux host on server hardware will outperform; a constrained VM will underperform. When sizing for production, re-run on the target topology.
macOS host (M1 Pro, 10 cores, loopback)
| Date | Config | Result |
|---|---|---|
| 2026-05-14 | CONCURRENCY=50000 MOCK_PORTS=8 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 252,114 rps · 100% success · 75.7M checks · p50 181 ms · p95 283 ms · p99 393 ms |
| earlier | CONCURRENCY=50000 MOCK_PORTS=8 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 151,614 rps · 100% success · 45.5M checks · p99 579 ms |
| earlier | CONCURRENCY=12000 MOCK_PORTS=24 RAMP_SECS=10 DURATION_SECS=300 (HTTP/1) | 27,894 rps · 99.79% success · p99 2.7 s |
The 2026-05-14 run is the current headline: 252 k rps sustained, p99 393 ms, zero errors over 5 minutes. Captures the hot path with the multi-tenancy work merged. Native macOS loopback on Darwin 25.4 reaches 50 k concurrent HTTP/2 without the docker crutch — the older “macOS can’t do 50 k loopback” note in earlier docs is stale.
Linux container (Docker Desktop VM on Mac)
| Date | Config | Result |
|---|---|---|
| 2026-05-14 | CONCURRENCY=50000 MOCK_PORTS=16 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 (10 vCPU allocated) | 17,391 rps · 100% success · 5.25 M checks · p99 4.2 s · 26 timeouts |
| 2026-05-12 | CONCURRENCY=50000 MOCK_PORTS=16 RAMP_SECS=10 HTTP2=1 DURATION_SECS=300 | 93,350 rps · 100% success · 28.1 M checks · p99 1.8 s · 933 MiB RSS peak |
The 2026-05-14 docker run regressed sharply versus the 2026-05-12 reference
on the same hardware. CPU was not the bottleneck (10 vCPU allocated and not
pegged); the regression sits in the Docker Desktop networking layer —
likely the DOCKER_INSECURE_NO_IPTABLES_RAW flag and iptables-rule changes
between DD versions. Same checkout’s native run on the same box hit 252 k rps,
so the binary is fine; the VM substrate isn’t.
Docker is no longer the right way to validate this binary’s perf on macOS. Prefer the native run above; reach for a real Linux host (CI runner, staging VM) when you actually need a Linux number.
HTTP/1 vs h2c trade-off
HTTP/1 exercises connect / pool churn — closer to “monitor checks N legacy endpoints” reality. h2c stresses HTTP/2 framing and flow control — closer to “monitor checks N gRPC / modern HTTPS endpoints with ALPN”. Production monitors hit both. Default is HTTP/1; flip HTTP2=1 when ephemeral exhaustion masks signal you actually care about.
Benchmarks
Criterion micro-benchmarks under benches/. Measure execute_http_check end-to-end through the same hyper-util client path the service uses in production.
cargo bench --bench http_client
cargo bench --bench public_status_ttfb # requires `just up` (PG + CH)
Substrate caveat. Every number on this page was captured on a developer laptop (Apple M1 Pro, 10 cores, 16 GB). Useful for regression detection across commits — not for production capacity planning. A real Linux server will outperform; a constrained VM will underperform. When sizing for production, re-run on the target topology.
What the bench measures
| Bench | Unit |
|---|---|
| http_check_single | one execute_http_check call against in-process axum mock, h2c prior-knowledge |
| http_check_throughput | c concurrent calls via join_all, varying c ∈ {100, 1000, 10000, 50000} |
Each variant runs under two pinned topologies:
- 1c: server + client share one OS thread (current_thread runtime). Single-core ceiling.
- 2c: server on its own thread, client on the bench thread. Two-core ceiling.
Pinning makes results reproducible across machines: no num_cpus() drift.
Single-core results (hyper-util, 2026-05-14)
M1 Pro, loopback h2c, mock returns 200 ok:
| Bench | Latency (median) | Throughput | Δ vs reqwest baseline |
|---|---|---|---|
| http_check_single/1c | 37 µs | 26.8 K rps | −21% latency · +17% rps |
| http_check_throughput/1c/c_100 | 778 µs | 128 K rps | −35% latency · +54% rps |
| http_check_throughput/1c/c_1000 | 7.45 ms | 134 K rps | −36% latency · +56% rps |
| http_check_throughput/1c/c_10000 | 80.6 ms | 124 K rps | −30% latency · +44% rps |
| http_check_throughput/1c/c_50000 | 422 ms | 118 K rps | −31% latency · +44% rps |
One CPU sustains ~130 K checks/sec. Per-check overhead at saturation = 1/130000 ≈ 7.7 µs.
Saturation reached by c=1000. Larger concurrency = more wall time, same rps — bottleneck shifts to in-thread cooperative scheduling, not parallelism.
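The throughput column follows from the latency column: a batch of c concurrent calls that completes in a median time t implies c / t checks per second, and the per-check cost at saturation is the reciprocal. A small sketch of that arithmetic (function names are illustrative):

```rust
/// Checks per second implied by `concurrency` calls finishing in
/// `median_latency_secs` of wall time.
pub fn throughput_rps(concurrency: u32, median_latency_secs: f64) -> f64 {
    f64::from(concurrency) / median_latency_secs
}

/// Amortized cost of one check, in microseconds, at a given throughput.
pub fn per_check_overhead_us(rps: f64) -> f64 {
    1_000_000.0 / rps
}
```

For the c_1000 row, 1000 calls in 7.45 ms gives roughly 134 K rps, and at ~130 K rps the amortized cost is about 7.7 µs per check.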
Two-core results (hyper-util, 2026-05-14)
For comparison only — production CPU budget should be sized off 1c.
| Bench | Latency (median) | Throughput |
|---|---|---|
| http_check_single/2c | 47.7 µs | 21 K rps |
| http_check_throughput/2c/c_1000 | 6.52 ms | 153 K rps |
| http_check_throughput/2c/c_10000 | 76.7 ms | 130 K rps |
| http_check_throughput/2c/c_50000 | 440 ms | 114 K rps |
Second core gains ~14% over 1c at saturation. Single-check latency is slower on 2c (48 µs vs 37 µs) — OS context-switch cost dominates when there’s no parallelism to amortize.
Public status page TTFB (50 orgs × 50 components)
benches/public_status_ttfb.rs provisions a 50-org × 50-component × 60-result fixture in PG + CH then times LiveAggregator::build() for one tenant.
| Metric | Value |
|---|---|
| Median | 14.0 ms |
| 95% CI | 13.1–15.1 ms |
| Outliers | 6/40 (15%) — 3 high severe |
| Spec target (p99) | < 200 ms |
Comfortably under target — the (org_id, target_id, ts) ORDER BY on ClickHouse keeps single-tenant lookups bounded; no full-scan regression. Measures the aggregator only — full HTTP TTFB to the client adds template render + serialize + compression (~5–15 ms).
Where the cycles go (historical — reqwest path)
Snapshot kept for context. samply, 15 s sample at 2c/c_10000 on the previous reqwest stack. The largest reqwest-specific cost — 7.5% on url::parse inside reqwest::redirect::TowerRedirectPolicy — disappeared with the hyper-util migration and explains a big chunk of the +44–56% throughput gain documented above.
| % of client thread | Cost | Notes |
|---|---|---|
| 7.5% | url::parse via reqwest::redirect::TowerRedirectPolicy | URL re-parsed per request even with redirect::Policy::none() — removed post-migration |
| 6.5% | kevent syscall | tokio io driver poll — inherent |
| 6.3% | _platform_memmove | h2 frame buffer copies — inherent |
| 5.0% | mach_absolute_time | tokio timer + criterion clock |
| 2.4% | hyper_util::Client::send_request | request dispatch |
| 1.5% | h2::HeaderBlock::into_encoding | HPACK encode |
| 1.5% | pthread_mutex_lock | hyper pool mutex |
| ~10% combined | h2 stream bookkeeping (pop/unlink/clone) | inherent to multiplexing |
Methodology notes
- target_id is hoisted out of the iter: production uses fixed-per-target UUIDs, so paying Uuid::now_v7’s getentropy syscall per call would add ~10 µs of bench-only noise.
- Mock returns &'static str: no JSON, no allocation, no body parsing. Isolates client-side cost.
- No TLS: verify_tls: false, plain http://. TLS handshake amortizes over h2 connection reuse; not in this bench.
- HTTP/2 prior-knowledge (RFC 7540 §3.4): single TCP connection multiplexes streams. Without it the bench would exhaust loopback ephemeral ports past c≈10000 on macOS.
- Loopback only. Real network adds RTT (dominates everything here) plus DNS + TCP connect + TLS on first request per host.
Reproducibility caveats
- macOS: no CPU isolation; Spotlight / Time Machine / runaway processes show as 5–10% outliers
- Linux: taskset -c 0 pins the bench process to a single core for cleaner 1c numbers
- Apple Silicon: P-core vs E-core scheduling is opaque; results can shift ~5% run-to-run
For a first-order production estimate, take the single-core throughput above and multiply by your CPU budget, then re-measure on the target hardware. Empirical scaling stays sub-linear past ~4c due to shared h2 connection state and pool mutex contention.
Troubleshooting
/readyz returns 503
The target store can’t be reached. Check storage.postgres.url and that Postgres is up. The readiness probe pings the store; liveness (/healthz) does not.
No metrics on /metrics
- Confirm observability.metrics_enabled = true
- Confirm metrics_bind isn’t blocked by a local firewall
- uptimepage_build_info is emitted at startup so the endpoint is never truly empty; if it’s also missing, the metrics exporter never bound
Many storage_dropped_total{reason="queue_full"}
The result channel between worker pool and batcher is back-pressured.
- Raise storage.clickhouse.buffer_size (mpsc capacity)
- Raise storage.clickhouse.batch_size (more rows per round-trip, so fewer round-trips overall)
- Lower storage.clickhouse.batch_timeout_ms (more frequent flushes)
- Or lower check frequency for the busiest targets (interval per target)
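To make the queue_full counter concrete, the sketch below models a bounded result channel that drops and counts when the consumer falls behind. It uses std::sync::mpsc::sync_channel as a stand-in for the service's channel; ResultSink and its fields are hypothetical names, and u64 stands in for a check result.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Producer side of a bounded result queue that never blocks the worker.
pub struct ResultSink {
    tx: SyncSender<u64>,
    /// Analogue of storage_dropped_total{reason="queue_full"}.
    pub dropped_queue_full: u64,
}

impl ResultSink {
    /// `buffer_size` plays the role of storage.clickhouse.buffer_size.
    pub fn new(buffer_size: usize) -> (Self, Receiver<u64>) {
        let (tx, rx) = sync_channel(buffer_size);
        (ResultSink { tx, dropped_queue_full: 0 }, rx)
    }

    /// Returns true if the result was queued, false if it was dropped.
    pub fn offer(&mut self, result: u64) -> bool {
        match self.tx.try_send(result) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                // Back-pressure: the batcher is not draining fast enough.
                self.dropped_queue_full += 1;
                false
            }
            Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

A larger buffer absorbs longer stalls in the batcher, while larger batches and shorter flush timeouts shorten the stalls themselves; the counter only moves when both are insufficient for the incoming rate.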
Circuit breaker stuck open
Look at uptimepage_checks_errors_total{kind} filtered by host to find the failure mode, then wait circuit_breaker.open_duration_secs for the breaker to enter half-open and probe.
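The closed, open, half-open cycle referred to above can be sketched as a small state machine. This is a generic circuit breaker, not the service's implementation: failure_threshold is a hypothetical parameter, open_duration corresponds to circuit_breaker.open_duration_secs, and time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

pub struct Breaker {
    state: State,
    failure_threshold: u32,
    open_duration: Duration,
}

impl Breaker {
    pub fn new(failure_threshold: u32, open_duration: Duration) -> Self {
        Breaker { state: State::Closed { failures: 0 }, failure_threshold, open_duration }
    }

    /// Whether a check may run now. An open breaker turns half-open once
    /// `open_duration` has elapsed and lets exactly one probe through.
    pub fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed { .. } => true,
            State::HalfOpen => false, // a probe is already in flight
            State::Open { since } => {
                if now.duration_since(since) >= self.open_duration {
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// Records the outcome of a check that `allow` admitted.
    pub fn record(&mut self, success: bool, now: Instant) {
        self.state = match (self.state, success) {
            (State::HalfOpen, true) | (State::Closed { .. }, true) => {
                State::Closed { failures: 0 }
            }
            (State::HalfOpen, false) => State::Open { since: now },
            (State::Closed { failures }, false) if failures + 1 >= self.failure_threshold => {
                State::Open { since: now }
            }
            (State::Closed { failures }, false) => State::Closed { failures: failures + 1 },
            (open @ State::Open { .. }, _) => open,
        };
    }
}
```

A breaker that looks stuck open is usually one whose half-open probe keeps failing, which re-opens it for another full open_duration each time.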
Targets reporting degraded with throttled: host concurrency cap
One tenant has more concurrent monitors at the same (host, port) than checker.per_host_max_inflight allows (default 2). Over-cap checks are recorded degraded instead of running. No alert fires — the upstream is fine. Either spread the targets across more hosts, raise the cap, or rely on jitter to thin the burst. Watch uptimepage_host_throttle_drops_total to size the cap against real traffic.
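A minimal sketch of a per-(host, port) in-flight cap follows. It leaves out the per-tenant dimension and any locking, and HostLimiter with its methods is a hypothetical name; cap corresponds to checker.per_host_max_inflight and throttle_drops to uptimepage_host_throttle_drops_total.

```rust
use std::collections::HashMap;

pub struct HostLimiter {
    cap: usize,
    inflight: HashMap<(String, u16), usize>,
    pub throttle_drops: u64,
}

impl HostLimiter {
    pub fn new(cap: usize) -> Self {
        HostLimiter { cap, inflight: HashMap::new(), throttle_drops: 0 }
    }

    /// Reserves a slot for a check against (host, port). Returns false when
    /// the cap is reached; the caller would then record the check as
    /// degraded with a throttled reason instead of running it.
    pub fn try_acquire(&mut self, host: &str, port: u16) -> bool {
        let slot = self.inflight.entry((host.to_owned(), port)).or_insert(0);
        if *slot >= self.cap {
            self.throttle_drops += 1;
            return false;
        }
        *slot += 1;
        true
    }

    /// Frees the slot once the check has finished.
    pub fn release(&mut self, host: &str, port: u16) {
        if let Some(slot) = self.inflight.get_mut(&(host.to_owned(), port)) {
            *slot = slot.saturating_sub(1);
        }
    }
}
```

Because the key includes the port, ten monitors on api.example.test:443 compete for the same slots while a monitor on port 8443 of the same host does not.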
domain_expiry results show served_stale: …
The fresh RDAP probe failed (throttle, timeout, registry 5xx, network blip) but the executor served the most recent successful answer from domain_expiry_state instead of flipping the monitor red. The status reflects the cached expiry_at. For Up the error field stays empty (the customer-facing surface shows nothing unusual); for Degraded/Down it carries served_stale: last_verified_age_secs=…; refresh_failed=<kind> plus the cached details so operators can distinguish a stale serve from a fresh probe.
Inspect the failure kind via uptimepage_domain_expiry_stale_served_total{kind}:
- kind="throttled": per-TLD RDAP bulkhead rejected this probe. Raise checker.rdap_max_inflight if rampant, but the cap is also the IANA-friendliness lever.
- kind="timeout": the registry took longer than check.timeout (per-target). Either bump the per-check timeout or wait; most registries recover in minutes.
- kind="lookup_error": registry returned a non-2xx (often 404 or 5xx). If a specific TLD is stuck on 5xx, the registry is having an incident; rows keep streaming as served_stale until 7 days have passed.
- kind="fresh_error": no usable last-good (first probe, or the cached row is older than 7d). A real CheckStatus::Error is emitted and is alert-eligible.
domain_expiry results have flipped to real Error after days of served_stale
The cached row in domain_expiry_state is older than the 7-day staleness ceiling, so the executor stopped masking the registry outage. Either the registry has been down for that long (act on it), or this target’s interval is so long that probes haven’t run in a week. Check last_success_at in domain_expiry_state for the target.
TLS errors against internal hosts
Set verify_tls: false on the offending target. The check executor picks between a verifying and a non-verifying hyper-util client based on the flag — both share the same DNS cache and connection-pool sizing.
400 Bad Request on POST /targets — target address ... is in a blocked range
SSRF guard rejected the target. The URL or TCP host resolves to a private / loopback / link-local / reserved IP. Verify the resolved address is what you expect. To monitor private infrastructure deliberately, set security.allow_private_targets = true and ensure network segmentation prevents abuse.
Check fails with all resolved addresses for 'host' are in blocked ranges
DNS returned only private IPs for a target the API previously accepted (hostname literal). Either the record changed or DNS rebinding is in play. The connect-time guard refuses to continue. Either fix DNS or, deliberately, enable security.allow_private_targets.
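Both the API-time and the connect-time guard come down to classifying resolved addresses. The following is a rough sketch using only the standard library; is_blocked and all_blocked are hypothetical names, and the exact set of ranges the service blocks is an assumption (it may cover more reserved blocks than shown here).

```rust
use std::net::IpAddr;

/// Returns true for private, loopback, link-local and a few reserved ranges.
/// The precise range list is an assumption, not the service's actual policy.
fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
                || v4.is_documentation()
        }
        IpAddr::V6(v6) => {
            // Treat IPv4-mapped addresses (::ffff:a.b.c.d) like their IPv4 form.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // fc00::/7 unique local
                || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
        }
    }
}

/// Connect-time rule: refuse only when every resolved address is blocked.
fn all_blocked(resolved: &[IpAddr]) -> bool {
    !resolved.is_empty() && resolved.iter().all(|ip| is_blocked(*ip))
}
```

Checking again at connect time matters because a hostname that resolved to a public address when the target was created can later resolve to a private one.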
credential decryption failed errors in logs
The KEK loaded at startup cannot decrypt rows that were written with a different KEK. Either security.credentials_kek_base64 was rotated without re-encrypting existing rows, or the wrong key was supplied. Compare the configured KEK against the one used to write the affected targets; there is no automatic rotation. Recovery options:
- Restore the original KEK.
- Delete and re-create the affected targets (the row decrypts cleanly when overwritten via PATCH or POST under the new key).
Startup fails with invalid credentials_kek_base64
The supplied key is not 32 bytes after base64 decode, or the string is not valid base64. Generate a fresh key with openssl rand -base64 32. URL-safe and standard base64 both decode.
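The startup check amounts to "decode, then require exactly 32 bytes". The sketch below uses a small hand-rolled, lenient decoder so that it stays dependency-free; the service itself presumably relies on a base64 crate, and parse_kek, decode_base64 and sextet are hypothetical names.

```rust
/// Maps one base64 character (standard or URL-safe alphabet) to its 6-bit value.
fn sextet(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' | b'-' => Some(62),
        b'/' | b'_' => Some(63),
        _ => None,
    }
}

/// Lenient decoder for illustration: accepts both alphabets, with or without
/// trailing '=' padding, and does not reject every malformed length.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let trimmed = input.trim().trim_end_matches('=');
    let mut out = Vec::with_capacity(trimmed.len() * 3 / 4);
    let (mut acc, mut bits) = (0u32, 0u32);
    for &c in trimmed.as_bytes() {
        acc = (acc << 6) | sextet(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Mirrors the two startup failure modes: invalid base64, or wrong length.
fn parse_kek(input: &str) -> Result<[u8; 32], String> {
    let bytes = decode_base64(input).ok_or_else(|| "not valid base64".to_string())?;
    <[u8; 32]>::try_from(bytes).map_err(|b| format!("expected 32 bytes, got {}", b.len()))
}
```

A key produced by openssl rand -base64 32 is 44 characters long including one trailing '=', which decodes to exactly 32 bytes.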
400 Bad Request on PATCH /targets/{id} — basic_auth contains redaction sentinel
A client read the target back (where credentials are returned as "***") and PATCHed the full check body without re-supplying the real credential. Either send the real value, or omit check entirely from the PATCH body if only other fields are changing.
429 Too Many Requests on /api/v1/*
Per-IP rate limiter is active and the bucket is empty. Read the Retry-After header for the wait time, or raise api.rate_limit.{per_second, burst}. If every client appears to share one bucket, the service is sitting behind a reverse proxy and the peer IP is the proxy — disable the in-app limiter (api.rate_limit.enabled = false) and let the proxy enforce per-client limits instead.
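For sizing per_second and burst, it can help to picture the limiter as a token bucket kept per client IP. The sketch below is a generic token bucket, not the service's limiter (which may be built on a library); TokenBucket and try_take are hypothetical names, and per_second is assumed to be greater than zero.

```rust
use std::time::{Duration, Instant};

/// Generic token bucket; one instance would exist per client IP.
struct TokenBucket {
    per_second: f64, // refill rate, stands in for api.rate_limit.per_second (must be > 0)
    burst: f64,      // bucket capacity, stands in for api.rate_limit.burst
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(per_second: f64, burst: f64, now: Instant) -> Self {
        Self { per_second, burst, tokens: burst, last_refill: now }
    }

    /// Ok(()) admits the request. Err(wait) is how long until one token is
    /// available, which is the kind of value a Retry-After header would carry.
    fn try_take(&mut self, now: Instant) -> Result<(), Duration> {
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.per_second).min(self.burst);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(Duration::from_secs_f64((1.0 - self.tokens) / self.per_second))
        }
    }
}
```

If every request arrives from the proxy's address, all clients drain the same bucket, which is the shared-bucket symptom described above.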
ClickHouse insert fails with SchemaMismatch
Almost always a Row-derive mismatch on UUID, Enum8, or DateTime64 column types:
- UUID columns require #[serde(with = "clickhouse::serde::uuid")] on the field
- Enum8 columns require an i8 field, not &str
- DateTime64 filter binds in WHERE clauses need wrapping in fromUnixTimestamp64Milli(?); raw i64 won't coerce to DateTime64 in CH expressions
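For the DateTime64 case, the bind value is a plain i64 of Unix milliseconds and the conversion happens on the ClickHouse side. A minimal, dependency-free sketch follows; the table name check_results, the column name ts, and the helper unix_millis are all hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical query: the table and column names are placeholders.
/// The `?` is bound to an i64 and converted to DateTime64 by ClickHouse.
const RESULTS_SINCE_SQL: &str =
    "SELECT count() FROM check_results WHERE ts >= fromUnixTimestamp64Milli(?)";

/// Converts a SystemTime into the i64 Unix-millisecond value to bind.
fn unix_millis(t: SystemTime) -> i64 {
    match t.duration_since(UNIX_EPOCH) {
        Ok(d) => d.as_millis() as i64,
        // Times before the epoch come back as an error carrying the distance.
        Err(e) => -(e.duration().as_millis() as i64),
    }
}
```

With the clickhouse crate, the value from unix_millis would typically be supplied through the query's bind call; that wiring is omitted here to keep the sketch free of external dependencies.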
Loadtest reports connect errors at high concurrency
Loopback ephemeral port exhaustion or kernel SYN backlog overflow. See loadtest.md — set MOCK_PORTS=64, RAMP_SECS=30, or enable HTTP2=1.