Configuration
Defaults live in config/default.toml. Every key can be overridden by an environment variable using the prefix UPTIMEPAGE_ and __ as the nested separator.
Example: UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080
Override UPTIMEPAGE_CONFIG_PATH to point at an alternate base config file.
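The naming rule above can be sketched as a small helper. This is an illustration only: it assumes keys are addressed by dotted path (for example server.api_bind), and env_var_name is a hypothetical function, not part of uptimepage.

```python
PREFIX = "UPTIMEPAGE_"
NESTED_SEPARATOR = "__"


def env_var_name(dotted_key: str) -> str:
    """Map a dotted config path to the environment variable that overrides it."""
    parts = dotted_key.split(".")
    if not all(parts):
        raise ValueError(f"malformed config path: {dotted_key!r}")
    # Upper-case each path segment and join nesting levels with "__".
    return PREFIX + NESTED_SEPARATOR.join(part.upper() for part in parts)


if __name__ == "__main__":
    print(env_var_name("server.api_bind"))
    print(env_var_name("observability.grafana.api_key"))
```

Single underscores inside a key name are kept as-is; only the nesting level is marked by the double underscore.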
Sections
| Section | Key | Purpose |
|---|---|---|
| server | api_bind, metrics_bind | bind addresses for REST API and Prometheus exporter |
| runtime | worker_threads, max_blocking_threads | Tokio runtime sizing (0 = num_cpus) |
| checker | max_concurrent_checks | global concurrency cap enforced by worker pool semaphore |
| checker | default_timeout_ms, connect_timeout_ms | client-side timeouts applied to outbound checks |
| checker | default_check_interval_secs | fallback interval when target spec omits it |
| checker | per_host_max_inflight, rdap_max_inflight | per-(org, host, port) and per-TLD RDAP concurrency caps. Fail-fast bulkhead: over-cap checks return a degraded result instead of queueing |
| http_client | tcp_keepalive_secs, user_agent | per-check connection keep-alive (one request’s lifetime; checks connect fresh, no pool) and the outbound User-Agent |
| dns | cache_size, positive_ttl_secs, negative_ttl_secs, servers | hickory resolver; point at internal resolvers when needed |
| security | allow_private_targets | SSRF guard: when false (default) any target resolving to loopback / private / link-local / reserved IPs is rejected |
| security | credentials_kek_base64 | 32-byte base64 key encrypting basic_auth / bearer_token at rest. Empty (default) stores plaintext (dev only) |
| circuit_breaker | failure_threshold, success_threshold, open_duration_secs, half_open_max_calls | per-host breaker state machine |
| storage.postgres | url, max_connections, min_connections, acquire_timeout_secs | target metadata store |
| storage.clickhouse | url, database, user, password, batch_size, batch_timeout_ms, buffer_size | result sink and pipeline back-pressure |
| scheduler | target_refresh_interval_secs, jitter_pct | how often the registry is reconciled against Postgres, and how much jitter is applied to each target’s tick |
| scheduler | region, default_region | this control plane’s own region id (a normal region row, default "default") and the region new targets are assigned to (empty falls back to region). See Multi-region probes |
| agent | enabled, control_plane_url, region, pull_interval_secs, flush_interval_secs, buffer_capacity | run this process as a stateless regional probe instead of a control plane. token is env-only (UPTIMEPAGE_AGENT__TOKEN). Off by default. See Multi-region probes |
| operator | admin_token | static bearer secret for the instance-admin /operator/* surface (regions + agents). Env-only (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); empty disables the surface (404s) |
| observability | log_level, log_format | tracing-subscriber filter + JSON vs pretty output |
| observability | metrics_enabled, gauge_sample_interval_ms | Prometheus exporter toggle and sampler cadence |
| observability | tracing_enabled | Master on/off for OTLP trace export. Export is active only when this and observability.grafana.enabled are true |
| observability.grafana | enabled, otlp_endpoint, instance_id, api_key, trace_sample_ratio | OTLP/HTTP trace export to Grafana Cloud / any OTLP collector. api_key is env-only. See Trace export below |
| api.rate_limit | enabled, per_second, burst | per-IP token-bucket rate limiter on /api/v1/*. Disabled by default |
| api.cors | enabled, allowed_origins, allowed_methods, allow_any_origin | browser CORS for /api/v1/*. Disabled by default. Wildcard only via allow_any_origin = true |
| notification channels | — | Not a config block. Channels are per-org runtime resources managed via the /api/v1/notification-channels API; secrets are sealed at rest with the credentials KEK |
| tenancy | path_based_public_routes, subdomain_public_routes, free_tier_owner_org_limit, deletion_grace_period_days | Public-status routing shape + org limits. See Public status routing below and docs/multi-tenancy.md for the full model |
| retention | check_results_days, login_attempts_days, quota_events_days, audit_log_days | Long-horizon data-retention windows for the daily 03:00-UTC purge job. Every key is bound by the job; there are no decorative knobs |
| public_status | base_domain, cache_max_orgs, cache_ttl_secs, last_good_ttl_secs, logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px, default_brand_color, default_show_powered_by, public_per_ip_rate_limit_per_min | Per-org public status pages at {slug}.{base_domain}. See Public status page below and Per-org status pages |
| auth | enabled_methods, fingerprint_salt, public_base_url | Sign-in methods, HMAC salt for IP/UA hashes, base URL embedded in invitation + magic-link emails. See Auth configuration below |
| auth.session | idle_timeout_days, absolute_timeout_days, cookie_name, cookie_secure, cookie_domain, renew_on_use | Session cookie shape + lifetime. cookie_secure = true in production |
| auth.github | client_id, client_secret, redirect_url, scopes | GitHub OAuth client. The button renders on /login only when client_id, client_secret, and redirect_url are all set |
| auth.google | client_id, client_secret, redirect_url, scopes | Google OAuth client, same gating as auth.github. Email is trusted only with Google’s email_verified attestation |
| auth.api_tokens | max_per_user, prefix_visible_chars | Cap per user, indexed prefix length for token lookup |
| auth.invitations | expiry_hours, max_pending_per_org | Invitation lifetime and per-org pending cap |
| auth.magic_link | expiry_minutes, rate_limit_seconds | Magic-link token lifetime. Routes only mount when enabled_methods includes "magic_link" |
| mcp | enabled, oauth_enabled, resource_uri, allowed_origins, access_token_ttl_secs | LLM connector (MCP) server at /mcp. Off by default; OAuth requires real HTTPS resource_uri + auth.public_base_url. See MCP server |
| email | provider, from_name, from_address | Transactional email backend. provider is one of "resend", "log", "memory" |
| email.resend | api_key, webhook_secret | api_key required when email.provider = "resend". A set webhook_secret (the endpoint’s Svix whsec_… signing secret) mounts POST /hooks/resend: a permanently bounced or spam-complaining address gets every email channel pointed at it disabled, with the reason shown on the channel form |
| whatsapp_app | enabled, access_token, phone_number_id, public_number, app_secret, verify_token, template_name, language_code | Operator WhatsApp number behind one-tap whatsapp_app channels (wa.me deep link + /hooks/whatsapp Meta webhook). enabled = true AND complete creds mount the surface; the flag is a deliberate spend gate, since alert sends are operator-paid Meta template messages. Inbound stop disables the sender’s channels |
Public status routing
uptimepage ships from one binary as a multi-tenant SaaS. The active org is always resolved from the authenticated session; there is no ambient “default org” and no compile-time self-host mode. A single-tenant deployment is just a SaaS deployment where you sign up as the first user (or seed users + organizations + memberships via a SQL one-shot).
The public status surface is gated by two independent flags because path-based and subdomain routing have opposite safety profiles:
- tenancy.path_based_public_routes: serve /status and /api/public/v1/* on the operator host, scoped to the single live org. Useful for a single-tenant deploy (one org, one page). Defaults to true. Must be set to false once you have more than one tenant; otherwise every visitor sees the lone org’s data regardless of which slug they expected.
- tenancy.subdomain_public_routes: serve one page per org at {slug}.{public_status.base_domain} (apex wildcard). Defaults to false; requires a well-formed base_domain.
| Shape | Recommended flags | Public surface |
|---|---|---|
| Single-tenant | path_based_public_routes = true (default) | /status on the operator host (one org) |
| Multi-tenant SaaS | subdomain_public_routes = true, path_based_public_routes = false | {slug}.{base_domain} per org |
The binary refuses to boot in the dangerous combinations: subdomain_public_routes with an empty or single-label public_status.base_domain; or an auth.session.cookie_domain that overlaps the status wildcard. Each is a loud panic at startup, not a silent runtime leak. See Per-org status pages for the full model.
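The boot-time refusals can be pictured with the sketch below. It is not the service's validation code: public_routing_errors is a hypothetical helper, and the overlap rule (the cookie domain equals, contains, or sits under the status base domain) is an assumption about what "overlaps the status wildcard" means.

```python
def public_routing_errors(subdomain_public_routes: bool,
                          base_domain: str,
                          cookie_domain: str) -> list[str]:
    """Return the reasons a config would be refused; an empty list means OK."""
    errors: list[str] = []
    if not subdomain_public_routes:
        return errors  # both checks only matter for subdomain routing

    base = base_domain.strip().strip(".").lower()
    labels = [label for label in base.split(".") if label]
    if len(labels) < 2:
        errors.append("public_status.base_domain must be a multi-label domain")
        return errors

    cookie = cookie_domain.strip().lstrip(".").lower()
    # Assumed overlap rule: a session cookie scoped to the base domain, to a
    # parent of it, or to a name under it could be sent to a status host.
    if cookie and (cookie == base
                   or cookie.endswith("." + base)
                   or base.endswith("." + cookie)):
        errors.append("auth.session.cookie_domain overlaps the status wildcard")
    return errors
```

An empty cookie_domain (a host-only cookie) passes the overlap check under this reading, because the cookie never leaves the host that set it.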
Org limits and the purge worker
- free_tier_owner_org_limit (default 3) caps how many orgs a single user can own. Soft-deleted orgs don’t count. Enforced inside the membership INSERT so concurrent creates can’t exceed the cap.
- deletion_grace_period_days (default 30) is how long a soft-deleted org’s slug is held and how long the original deleter has to restore it.
- The soft-delete purge now runs inside the daily retention job (src/jobs/retention.rs) at a fixed 03:00 UTC, not on a configurable interval. Each run cascades up to 10 past-grace orgs, drains any pending entries from clickhouse_purge_queue (the outbox between PG cascade and ClickHouse ALTER TABLE DELETE), hard-purges past-grace users, then enforces the [retention] windows. See Soft delete and the 30-day purge for the full implementation and failure-recovery guarantees.
The [retention] section sets the long-horizon windows. Defaults: login_attempts_days = 180, quota_events_days = 90, audit_log_days = 730. Check-result retention is not a config knob — the physical TTLs are baked into the ClickHouse tables at migration time (a value here would be silently ignored, since the TTL is never re-issued as an ALTER on boot): raw per-check rows in check_results keep 90 days, and the hourly rollup check_results_1h keeps 13 months. Those are the widest-tier ceilings; what a given plan actually sees is narrowed at read time by a per-plan window clamp (separate windows for raw forensics and chart history), so a plan change is an instant tag flip with no data rewrite. The public status page’s daily history strip still shows 90 days, and the Privacy Policy’s retention table pins these same physical windows. Session idle/absolute reaping uses [auth.session]; soft-deleted org/user grace uses tenancy.deletion_grace_period_days; OAuth-state and magic-link tokens are swept by their own short-cadence jobs.
See Multi-tenancy for the full model, slug rules, and the storage-layer isolation invariants the CI checks enforce.
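The read-time clamp described for check results can be illustrated as follows. This is a minimal sketch, assuming the clamp is a plain minimum over the requested window, the plan's window, and the 90-day physical TTL of check_results; the plan window values and the visible_since helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RAW_PHYSICAL_DAYS = 90  # physical TTL of check_results, as stated above


def visible_since(now: datetime, requested_days: int, plan_days: int,
                  physical_days: int = RAW_PHYSICAL_DAYS) -> datetime:
    """Earliest timestamp a read may return once the window is clamped."""
    if min(requested_days, plan_days, physical_days) < 0:
        raise ValueError("windows must be non-negative")
    # The narrowest of the three windows wins; stored data is never rewritten.
    effective_days = min(requested_days, plan_days, physical_days)
    return now - timedelta(days=effective_days)


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    print(visible_since(now, requested_days=365, plan_days=30))
```

Because the clamp is applied per query, moving an org to a wider plan only changes plan_days; rows already inside the physical TTL become visible immediately.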
Auth configuration
```toml
[auth]
enabled_methods = ["github_oauth", "google_oauth", "magic_link"]
fingerprint_salt = "" # HMAC salt for IP/UA hashes; rotate-aware
public_base_url = "https://status.example.test"

[auth.session]
idle_timeout_days = 30
absolute_timeout_days = 90
cookie_name = "_sm_session"
cookie_secure = true # set false only for plain-HTTP local dev
cookie_domain = "" # empty = host-only cookie
renew_on_use = true

[auth.github]
client_id = "" # from https://github.com/settings/developers
client_secret = ""
redirect_url = "https://status.example.test/auth/github/callback"
scopes = ["user:email", "read:user"]

[auth.google]
client_id = "" # Google Cloud Console OAuth web client
client_secret = ""
redirect_url = "https://status.example.test/auth/google/callback"
scopes = ["openid", "email", "profile"]

[auth.invitations]
expiry_hours = 168 # 7 days
max_pending_per_org = 50

[auth.api_tokens]
max_per_user = 25
prefix_visible_chars = 16 # floor; lower values fail boot

[auth.magic_link]
expiry_minutes = 15
rate_limit_seconds = 60 # per-email send throttle; 0 disables

[email]
provider = "log" # "resend" in prod, "log" in dev, "memory" in tests
from_name = "Uptimepage"
from_address = "no-reply@example.test"

[email.resend]
api_key = "" # required when provider = "resend"
webhook_secret = "" # whsec_… of the Resend webhook endpoint

[whatsapp_app] # operator WhatsApp number (one-tap linking)
enabled = false # deliberate spend gate; creds alone stay off
access_token = "" # Meta Cloud API token (env-only)
phone_number_id = "" # Cloud API sender id
public_number = "" # display number digits; the wa.me target
app_secret = "" # signs webhook deliveries (env-only)
verify_token = "" # echoed by Meta's GET subscribe handshake
template_name = "" # approved alert template, single body param
language_code = "en"
```
auth.enabled_methods is the policy switch per sign-in method: removing
an entry disables that method’s login start/callback (404) and hides its
button. OAuth providers additionally need client_id + client_secret +
redirect_url set — a listed but incompletely configured provider stays
hidden and logs a warning on probe. "magic_link" mounts the magic-link
request/verify endpoints and the login-page email form.
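The gating rule for OAuth providers amounts to two conditions, sketched below. The helper name and the dict shape are hypothetical; only the method names and the three required keys come from the text above.

```python
OAUTH_REQUIRED_KEYS = ("client_id", "client_secret", "redirect_url")


def oauth_button_visible(method: str,
                         enabled_methods: list[str],
                         provider_cfg: dict[str, str]) -> bool:
    """True when the provider is both allowed by policy and fully configured."""
    if method not in enabled_methods:
        return False  # policy switch: method disabled, routes answer 404
    # A listed provider with any empty credential stays hidden.
    return all(provider_cfg.get(key) for key in OAUTH_REQUIRED_KEYS)


if __name__ == "__main__":
    github = {"client_id": "abc", "client_secret": "", "redirect_url": "https://status.example.test/auth/github/callback"}
    print(oauth_button_visible("github_oauth", ["github_oauth"], github))
```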
auth.fingerprint_salt is paired with the auth_salt_history table.
Rotating the value mid-deployment refuses to boot unless the override
env var documented in docs/troubleshooting.md is set — this is
deliberate so audit-trail breakage is loud.
Central Telegram bot
```toml
[telegram]
bot_token = "" # env UPTIMEPAGE_TELEGRAM__BOT_TOKEN; presence enables the feature
bot_username = "" # verified against the Bot API at boot; used for t.me deep links
webhook_secret = "" # random, 32+ chars; Telegram echoes it on every webhook delivery
```
Setting bot_token switches on one-tap Telegram channel linking: the
type card in the channel form, the link-code API, and the
/hooks/telegram receiver. Empty token (the default) leaves the
feature absent entirely — self-host deployments keep the
bring-your-own telegram transport, which needs no operator config.
When enabled, boot validates the trio: non-empty bot_username,
webhook_secret of 32+ characters, and an https://
auth.public_base_url (Telegram only delivers webhooks to public
https endpoints). The app then verifies the token against the Bot API
and registers the webhook on every boot; a Telegram outage logs a
warning and disables the bot for that boot instead of failing the
deploy.
All three values are operator secrets: env-only in production, never in a committed config file.
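A rough pre-flight check for the trio, plus one way to produce a suitable webhook_secret, might look like this. Both helpers are hypothetical and only mirror the rules stated above; the real boot validation may differ in wording and order.

```python
import secrets
from urllib.parse import urlsplit


def telegram_config_errors(bot_token: str, bot_username: str,
                           webhook_secret: str, public_base_url: str) -> list[str]:
    """Mirror the boot-time rules; an empty list means the config is accepted."""
    if not bot_token:
        return []  # empty token: the feature is absent, nothing to validate
    errors = []
    if not bot_username:
        errors.append("telegram.bot_username must be non-empty")
    if len(webhook_secret) < 32:
        errors.append("telegram.webhook_secret must be at least 32 characters")
    if urlsplit(public_base_url).scheme != "https":
        errors.append("auth.public_base_url must be an https:// URL")
    return errors


def new_webhook_secret() -> str:
    # 32 random bytes encode to 43 characters drawn from A-Z, a-z, 0-9, "-", "_".
    return secrets.token_urlsafe(32)


if __name__ == "__main__":
    print(new_webhook_secret())
```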
Provider OAuth connect (“Add to Slack” / “Add to Discord”)
```toml
[slack_oauth]
client_id = "" # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_ID
client_secret = "" # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_SECRET

[discord_oauth]
client_id = "" # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_ID
client_secret = "" # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_SECRET
```
Credentials of operator-owned OAuth apps — Slack with the
incoming-webhook scope, Discord with webhook.incoming. When a pair is
set, that provider’s panel in the channel form grows a connect button
(plus a QR variant): the provider’s consent screen picks the destination
channel and the callback stores the returned webhook as a regular
slack/discord channel — access tokens are discarded. The app’s
redirect URL must be <auth.public_base_url>/auth/slack/callback (or
…/auth/discord/callback). Empty credentials (the default) hide the
button; manual webhook paste always works. Env-only in production, never
in a committed config file.
Public status page
The [public_status] block configures the per-org public surface. It is
load-bearing only when tenancy.subdomain_public_routes = true; the
defaults are safe to leave untouched for self-host.
```toml
[public_status]
base_domain = "" # REQUIRED when subdomain_public_routes = true
cache_max_orgs = 1000 # hot + last-good cache bound
cache_ttl_secs = 10 # per-org rendered-page TTL
last_good_ttl_secs = 3600 # idle eviction for the stale-fallback layer
logo_dir = "/var/lib/uptimepage/logos"
max_logo_size_bytes = 1048576 # 1 MiB byte ceiling (pre-decode)
allowed_logo_mime_types = ["image/png", "image/jpeg", "image/webp"]
max_logo_dimension_px = 1200 # larger uploads are downscaled; decode
                             # is also allocation-bounded (bomb guard)
default_brand_color = "#3b82f6" # used when an org sets no colour
default_show_powered_by = true
public_per_ip_rate_limit_per_min = 60 # in-app limit behind the Caddy-side one
```
| Key | Purpose |
|---|---|
| base_domain | parent domain for {slug}.{base_domain}. Must be multi-label; boot fails on empty/single-label when subdomain routing is on |
| cache_max_orgs / cache_ttl_secs | per-org page cache size and freshness window |
| last_good_ttl_secs | how long an idle org’s last-known-good snapshot is retained before eviction |
| logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px | logo upload storage and limits |
| default_brand_color, default_show_powered_by | fallbacks when an org leaves branding unset |
| public_per_ip_rate_limit_per_min | second-layer, in-app rate limit that sits behind the reverse proxy’s own limit |
History-strip length (90 days) and the recent-incidents horizon (30 days)
remain hard-coded defaults in src/public_status/aggregator.rs. What a
page publishes is curated per-page — a monitor appears as a component
only while it’s bound to that page, and its presentation lives on the
binding:
| Per-page component field | Purpose |
|---|---|
| (binding exists) | the monitor is published as a component on that page |
| public_name | display name (falls back to operator-side monitor name) |
| public_description | optional one-liner |
| public_group | optional group label; ungrouped components render last |
| sort_order | ASC integer sort within a group |
See Public status page for the operator workflow and Per-org status pages for the SaaS subdomain model.
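The per-page component fields imply a render order, sketched here. The Component class and render_order are hypothetical; ordering groups alphabetically and breaking sort_order ties by display name are assumptions, since the text only fixes "ungrouped last" and "ascending sort_order within a group".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    monitor_name: str                    # operator-side monitor name
    sort_order: int = 0
    public_name: Optional[str] = None
    public_group: Optional[str] = None

    @property
    def display_name(self) -> str:
        # public_name falls back to the operator-side monitor name.
        return self.public_name or self.monitor_name


def render_order(components: list[Component]) -> list[Component]:
    """Grouped components first, ungrouped last, sort_order ascending within a group."""
    return sorted(
        components,
        key=lambda c: (c.public_group is None, c.public_group or "",
                       c.sort_order, c.display_name),
    )


if __name__ == "__main__":
    page = [
        Component("api-prod", sort_order=2, public_name="API", public_group="Core"),
        Component("cron-box"),
        Component("web-prod", sort_order=1, public_name="Website", public_group="Core"),
    ]
    print([c.display_name for c in render_order(page)])
```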
Trace export
OpenTelemetry spans are exported over OTLP/HTTP (protobuf) when both
observability.tracing_enabled and observability.grafana.enabled are
true. Disabled by default and zero-cost when off.
```toml
[observability]
tracing_enabled = false # master on/off for trace export

[observability.grafana]
enabled = false # second switch; both must be true
otlp_endpoint = "" # OTLP base, no /v1/traces suffix; e.g.
                   # https://otlp-gateway-<zone>.grafana.net/otlp
instance_id = "" # Grafana Cloud numeric instance / stack id
trace_sample_ratio = 0.05 # parent-based head sampling, [0.0, 1.0]
# api_key # NEVER in TOML; env var only (below)
```
| Key | Purpose |
|---|---|
| tracing_enabled | master switch; with grafana.enabled gates all export |
| grafana.enabled | second switch (kept separate so the block is inert until explicitly turned on) |
| grafana.otlp_endpoint | OTLP/HTTP base URL; the service appends /v1/traces (a value already ending in it is left as-is). Empty fails boot when export is on |
| grafana.instance_id | basic-auth username (Grafana Cloud instance id). Empty fails boot when export is on |
| grafana.api_key | basic-auth password. Env-only: UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY. Never read from a config file; redacted in any serialised config |
| grafana.trace_sample_ratio | head sampling ratio under a parent-based sampler. Must be in [0.0, 1.0] or boot fails |
Auth is Authorization: Basic base64(instance_id:api_key). Resource
attributes service.name = uptimepage and service.version are
attached. The batch exporter is flushed and stopped on graceful
shutdown. A transport build failure logs a warning and the service
continues without traces — telemetry never takes down monitoring.
Inconsistent settings (export on with a missing endpoint / instance /
key, or an out-of-range ratio) are a clean startup config error.
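To check an endpoint and credentials by hand before enabling export, the URL and header described above can be reproduced with a few lines. This is a sketch under stated assumptions: traces_url and basic_auth_header are hypothetical helpers, and stripping a trailing slash before appending the suffix is a guess about edge-case handling.

```python
import base64


def traces_url(otlp_endpoint: str) -> str:
    """Append /v1/traces unless the configured value already ends with it."""
    base = otlp_endpoint.rstrip("/")  # assumption: trailing slashes are ignored
    if base.endswith("/v1/traces"):
        return base
    return base + "/v1/traces"


def basic_auth_header(instance_id: str, api_key: str) -> str:
    """Build the value of the Authorization header: Basic base64(id:key)."""
    raw = f"{instance_id}:{api_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def valid_sample_ratio(ratio: float) -> bool:
    """The ratio must lie in the closed interval [0.0, 1.0]."""
    return 0.0 <= ratio <= 1.0


if __name__ == "__main__":
    print(traces_url("https://otlp.example.test/otlp"))
```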
Tuning notes
- max_concurrent_checks caps simultaneous in-flight checks. Per-check memory is small (a tokio task plus an in-flight hyper request), so the practical ceiling is set by file descriptors and ephemeral ports rather than RAM.
- per_host_max_inflight (default 2) is the per-tenant per-(host, port) in-flight cap. One tenant fanning a burst of checks at the same upstream looks like a probe; this cap keeps that fingerprint flat. Tenant-scoped: one customer’s burst never starves another customer’s monitor of the same host. Fail-fast: a check that would exceed the cap is recorded as degraded with error="throttled: host concurrency cap" and skipped (no alert fired, because the upstream is fine and the back-pressure is operator-side). Counters: uptimepage_host_throttle_waits_total{kind="host"} (attempts) and uptimepage_host_throttle_drops_total (rejections).
- rdap_max_inflight (default 1) is the process-wide per-TLD RDAP concurrency cap (across all tenants). Daily check cadence + per-TLD slot means deep queues drain quickly without bursting any registry. Same fail-fast behavior + counters as the per-host cap.
- storage.clickhouse.buffer_size is the mpsc capacity between worker pool and batcher. Sized for ~1 s of bursts at peak RPS. Drops increment storage_dropped_total{reason="queue_full"}; that metric is your back-pressure signal.
- storage.clickhouse.batch_size vs batch_timeout_ms trade tail latency for throughput. 1000 / 500ms is a good starting point at ~20k rps.
- scheduler.jitter_pct prevents synchronized fleet-wide ticks. Default 10% is enough to spread N targets across an interval without making individual schedules unpredictable.
- dns.servers accepts either bare IPs ("1.1.1.1") or ip:port form. Used as is, with no system resolver fallback.
- security.allow_private_targets is the SSRF guard. Default false blocks:
  - Loopback (127.0.0.0/8, ::1)
  - RFC1918 private (10/8, 172.16/12, 192.168/16)
  - Link-local (169.254/16, fe80::/10), which covers AWS/GCP metadata 169.254.169.254
  - Carrier-grade NAT (100.64/10)
  - IPv6 ULA (fc00::/7), discard, IPv4-mapped private, documentation ranges
  - Multicast, broadcast, unspecified, reserved-for-future-use
  - IPv6 transition mechanisms: 2002::/16 (6to4) and 64:ff9b::/96 (NAT64) are decoded to their embedded IPv4 and rejected when the inner IPv4 falls in any blocked range

  The guard runs both at API submission (rejects IP-literal URLs synchronously) and after DNS resolution at connect time (catches DNS rebinding). Flip to true for internal monitoring where private targets are the goal; operators are then responsible for network segmentation.
- security.credentials_kek_base64 enables AES-256-GCM encryption of HTTP basic_auth and bearer_token values inside the targets.check_spec JSONB column. Generate with openssl rand -base64 32. Each write produces a fresh 12-byte random nonce; the on-disk shape is {"$enc":"v1:<nonce>:<ciphertext>"}. When the key is unset the service logs a startup warning and stores credentials plaintext (a dev-friendly upgrade path: existing plaintext rows continue to read after a key is provisioned). Rotation and KMS integration are out of scope for the current version; treat the KEK as long-lived and protect it via your secret-management of choice (env file with restricted mode, container secret, etc.). A malformed KEK fails the process at startup.
- api.rate_limit applies a per-peer-IP token bucket only to /api/v1/* routes (/healthz and /readyz are excluded so liveness probes never see 429). per_second is the refill rate; burst is the bucket capacity. Excess requests get 429 Too Many Requests with a Retry-After header. The bucket key is the TCP peer IP: when the service sits behind a reverse proxy, every client appears as the proxy IP, so prefer doing rate limiting at the proxy in that topology. Disabled by default; leave it off and let your reverse proxy enforce limits unless you bind the API directly to the internet.
- TLS cert checks (type = "tls_cert") open a dedicated TCP+TLS handshake per probe, separate from the HTTP check path. Recommended interval >= 3600 so probe traffic stays light. The check accepts any cert chain (the goal is to report expiry status, not enforce trust), so an expired or self-signed cert still produces a structured result rather than a generic handshake error.
- Domain expiry checks (type = "domain_expiry") query RDAP via a process-shared outbound HTTPS client. The IANA bootstrap registry (https://data.iana.org/rdap/dns.json) is fetched lazily on first use and cached for process lifetime, so a registry update or a transient bootstrap failure persists until restart. RDAP servers rate-limit clients, so interval >= 3600 is enforced server-side and daily is typical. SSRF guard does not gate these requests because the destination is an IANA-published endpoint, not the user-supplied domain.
  - Sticky last-good. Each successful probe persists (expiry_at, registrar, last_success_at) to the domain_expiry_state table (PK target_id, denormalised org_id; every query filters on both). On a transient probe failure (throttle, timeout, registry 5xx, RDAP 404, network blip) the executor returns the cached verdict instead of flipping the monitor to Degraded/Down. For Up the customer-facing error field stays empty; Degraded/Down carry a served_stale: … annotation so operators can distinguish a stale serve from a fresh probe. Operators also see the staleness via the uptimepage_domain_expiry_stale_served_total counter.
  - Staleness ceiling: 7 days. A cached row older than 7d is treated as “registry unreachable for too long” and surfaces as a real Error, which is alert-eligible.
  - Cross-tenant singleflight. Concurrent probes for the same domain coalesce to one outbound RDAP request. Cache TTL on the singleflight slot is 60s: short enough that each scheduled cycle still fetches fresh, long enough to absorb scheduler-jitter waves at scale. Counter: uptimepage_rdap_singleflight_total{outcome="hit"|"miss"}.
- Notification channels are no longer global config. They are per-org runtime resources (Slack / Discord / Teams / Google Chat webhooks, generic HTTP webhook, Telegram bot, WhatsApp Cloud API) created via the /api/v1/notification-channels API; a target binds them by id in its alerts array. Transport secrets are sealed at rest with the credentials KEK and never echoed back. Slack POSTs { "text": "..." }; the generic webhook POSTs the incident-notice JSON (plus any configured custom headers, optionally HMAC-signed; see docs/api.md). Notifications are driven by the incident engine and persisted per attempt, so delivery state survives a restart. The binding syntax and the monitor-level firing policy (confirmations, recovery, reminders, region quorum) are documented in docs/api.md.
- api.cors opens /api/v1/* to browser-origin access. Each entry in allowed_origins must be a full origin (https://app.example.com); wildcards are not parsed, so set allow_any_origin = true to send Access-Control-Allow-Origin: * explicitly. The two are mutually exclusive: combining them or enabling CORS with an empty list aborts startup. allowed_methods is echoed in the preflight response (Access-Control-Allow-Methods); Access-Control-Allow-Headers is fixed to content-type, which is what the JSON API needs. /healthz and /readyz are not wrapped, so liveness probes are unaffected.
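To sanity-check whether a given address would fall under the SSRF guard's blocked ranges, the classification can be approximated with Python's ipaddress module. This is a sketch, not the service's implementation: the CIDR list is transcribed from the ranges named in the notes (the exact documentation and reserved prefixes are assumptions), is_blocked is a hypothetical helper, and the two enforcement points (API submission and post-DNS connect time) are not modelled.

```python
import ipaddress

BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "0.0.0.0/8",                                          # unspecified / "this network"
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",      # RFC 1918 private
    "100.64.0.0/10",                                      # carrier-grade NAT
    "127.0.0.0/8",                                        # loopback
    "169.254.0.0/16",                                     # link-local, cloud metadata
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",  # documentation
    "224.0.0.0/4",                                        # multicast
    "240.0.0.0/4",                                        # reserved, includes broadcast
    "::/128", "::1/128",                                  # unspecified, loopback
    "fe80::/10",                                          # link-local
    "fc00::/7",                                           # unique local addresses
    "100::/64",                                           # discard-only
    "2001:db8::/32",                                      # documentation
    "ff00::/8",                                           # multicast
)]
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def embedded_ipv4(addr):
    """Return the IPv4 address carried inside an IPv6 address, if any."""
    if addr.version != 6:
        return None
    if addr.ipv4_mapped is not None:      # ::ffff:a.b.c.d
        return addr.ipv4_mapped
    if addr.sixtofour is not None:        # 2002::/16 (6to4)
        return addr.sixtofour
    if addr in NAT64_PREFIX:              # 64:ff9b::/96, IPv4 in the low 32 bits
        return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
    return None


def is_blocked(ip: str) -> bool:
    """True when the address, or the IPv4 embedded in it, is in a blocked range."""
    addr = ipaddress.ip_address(ip)
    inner = embedded_ipv4(addr)
    if inner is not None:
        addr = inner
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("169.254.169.254", "8.8.8.8", "2002:7f00:1::"):
        print(candidate, is_blocked(candidate))
```

Checking only the hostname's literal form is not enough for the rebinding case; the same classification has to be applied to the address actually connected to, which is why the notes describe a second check after DNS resolution.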