Configuration

Defaults live in config/default.toml. Every key can be overridden by an environment variable using the prefix UPTIMEPAGE_ and __ as the nested separator.

Example: UPTIMEPAGE_SERVER__API_BIND=0.0.0.0:8080

Override UPTIMEPAGE_CONFIG_PATH to point at an alternate base config file.
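
The naming rule above can be illustrated with a short sketch. It is not the service's loader: the function names are hypothetical, lower-casing the key segments is an assumption, and UPTIMEPAGE_CONFIG_PATH is assumed to be handled separately because it selects the base file rather than overriding a key.

```python
import copy
from typing import Dict, List, Optional

PREFIX = "UPTIMEPAGE_"
SEPARATOR = "__"
# Assumption: this variable selects the base file and is not itself a config key.
RESERVED = {"UPTIMEPAGE_CONFIG_PATH"}


def env_to_key_path(name: str) -> Optional[List[str]]:
    """Map an environment variable name to a nested key path, or None if it does not apply."""
    if name in RESERVED or not name.startswith(PREFIX):
        return None
    remainder = name[len(PREFIX):]
    # Double underscores separate nesting levels; single underscores stay inside a key.
    return [part.lower() for part in remainder.split(SEPARATOR)]


def apply_overrides(config: dict, environ: Dict[str, str]) -> dict:
    """Return a copy of config with prefixed environment values layered on top."""
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        path = env_to_key_path(name)
        if path is None:
            continue
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value  # kept as a string; type coercion is out of scope here
    return merged
```

Under these assumptions, UPTIMEPAGE_SERVER__API_BIND resolves to the path server → api_bind, and UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY to observability → grafana → api_key.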

Sections

| Section | Key | Purpose |
|---|---|---|
| server | api_bind, metrics_bind | bind addresses for REST API and Prometheus exporter |
| runtime | worker_threads, max_blocking_threads | Tokio runtime sizing (0 = num_cpus) |
| checker | max_concurrent_checks | global concurrency cap enforced by worker pool semaphore |
| checker | default_timeout_ms, connect_timeout_ms | client-side timeouts applied to outbound checks |
| checker | default_check_interval_secs | fallback interval when target spec omits it |
| checker | per_host_max_inflight, rdap_max_inflight | per-(org, host, port) and per-TLD RDAP concurrency caps. Fail-fast bulkhead: over-cap checks return a degraded result instead of queueing |
| http_client | tcp_keepalive_secs, user_agent | per-check connection keep-alive (one request’s lifetime; checks connect fresh, no pool) and the outbound User-Agent |
| dns | cache_size, positive_ttl_secs, negative_ttl_secs, servers | hickory resolver; point at internal resolvers when needed |
| security | allow_private_targets | SSRF guard: when false (default) any target resolving to loopback / private / link-local / reserved IPs is rejected |
| security | credentials_kek_base64 | 32-byte base64 key encrypting basic_auth / bearer_token at rest. Empty (default) stores plaintext (dev only) |
| circuit_breaker | failure_threshold, success_threshold, open_duration_secs, half_open_max_calls | per-host breaker state machine |
| storage.postgres | url, max_connections, min_connections, acquire_timeout_secs | target metadata store |
| storage.clickhouse | url, database, user, password, batch_size, batch_timeout_ms, buffer_size | result sink and pipeline back-pressure |
| scheduler | target_refresh_interval_secs, jitter_pct | how often the registry is reconciled against Postgres, and how much jitter is applied to each target’s tick |
| scheduler | region, default_region | this control plane’s own region id (a normal region row, default "default") and the region new targets are assigned to (empty falls back to region). See Multi-region probes |
| agent | enabled, control_plane_url, region, pull_interval_secs, flush_interval_secs, buffer_capacity | run this process as a stateless regional probe instead of a control plane. token is env-only (UPTIMEPAGE_AGENT__TOKEN). Off by default. See Multi-region probes |
| operator | admin_token | static bearer secret for the instance-admin /operator/* surface (regions + agents). Env-only (UPTIMEPAGE_OPERATOR__ADMIN_TOKEN); empty disables the surface (404s) |
| observability | log_level, log_format | tracing-subscriber filter + JSON vs pretty output |
| observability | metrics_enabled, gauge_sample_interval_ms | Prometheus exporter toggle and sampler cadence |
| observability | tracing_enabled | Master on/off for OTLP trace export. Export is active only when this and observability.grafana.enabled are true |
| observability.grafana | enabled, otlp_endpoint, instance_id, api_key, trace_sample_ratio | OTLP/HTTP trace export to Grafana Cloud / any OTLP collector. api_key is env-only. See Trace export below |
| api.rate_limit | enabled, per_second, burst | per-IP token-bucket rate limiter on /api/v1/*. Disabled by default |
| api.cors | enabled, allowed_origins, allowed_methods, allow_any_origin | browser CORS for /api/v1/*. Disabled by default. Wildcard only via allow_any_origin = true |
| notification channels | (none) | Not a config block. Channels are per-org runtime resources managed via the /api/v1/notification-channels API; secrets are sealed at rest with the credentials KEK |
| tenancy | path_based_public_routes, subdomain_public_routes, free_tier_owner_org_limit, deletion_grace_period_days | Public-status routing shape + org limits. See Public status routing below and docs/multi-tenancy.md for the full model |
| retention | check_results_days, login_attempts_days, quota_events_days, audit_log_days | Long-horizon data-retention windows for the daily 03:00-UTC purge job. Every key is bound by the job; there are no decorative knobs |
| public_status | base_domain, cache_max_orgs, cache_ttl_secs, last_good_ttl_secs, logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px, default_brand_color, default_show_powered_by, public_per_ip_rate_limit_per_min | Per-org public status pages at {slug}.{base_domain}. See Public status page below and Per-org status pages |
| auth | enabled_methods, fingerprint_salt, public_base_url | Sign-in methods, HMAC salt for IP/UA hashes, base URL embedded in invitation + magic-link emails. See Auth configuration below |
| auth.session | idle_timeout_days, absolute_timeout_days, cookie_name, cookie_secure, cookie_domain, renew_on_use | Session cookie shape + lifetime. cookie_secure = true in production |
| auth.github | client_id, client_secret, redirect_url, scopes | GitHub OAuth client. The button renders on /login only when client_id, client_secret, and redirect_url are all set |
| auth.google | client_id, client_secret, redirect_url, scopes | Google OAuth client, same gating as auth.github. Email is trusted only with Google’s email_verified attestation |
| auth.api_tokens | max_per_user, prefix_visible_chars | Cap per user, indexed prefix length for token lookup |
| auth.invitations | expiry_hours, max_pending_per_org | Invitation lifetime and per-org pending cap |
| auth.magic_link | expiry_minutes, rate_limit_seconds | Magic-link token lifetime. Routes only mount when enabled_methods includes "magic_link" |
| mcp | enabled, oauth_enabled, resource_uri, allowed_origins, access_token_ttl_secs | LLM connector (MCP) server at /mcp. Off by default; OAuth requires real HTTPS resource_uri + auth.public_base_url. See MCP server |
| email | provider, from_name, from_address | Transactional email backend. provider is one of "resend", "log", "memory" |
| email.resend | api_key, webhook_secret | api_key required when email.provider = "resend". A set webhook_secret (the endpoint’s Svix whsec_… signing secret) mounts POST /hooks/resend: a permanently bounced or spam-complaining address gets every email channel pointed at it disabled, with the reason shown on the channel form |
| whatsapp_app | enabled, access_token, phone_number_id, public_number, app_secret, verify_token, template_name, language_code | Operator WhatsApp number behind one-tap whatsapp_app channels (wa.me deep link + /hooks/whatsapp Meta webhook). enabled = true AND complete creds mount the surface; the flag is a deliberate spend gate, since alert sends are operator-paid Meta template messages. Inbound stop disables the sender’s channels |

Public status routing

uptimepage ships from one binary as a multi-tenant SaaS. The active org is always resolved from the authenticated session; there is no ambient “default org” and no compile-time self-host mode. A single-tenant deployment is just a SaaS deployment where you sign up as the first user (or seed users + organizations + memberships via a SQL one-shot).

The public status surface is gated by two independent flags because path-based and subdomain routing have opposite safety profiles:

  • tenancy.path_based_public_routes — serve /status and /api/public/v1/* on the operator host, scoped to the single live org. Useful for a single-tenant deploy (one org, one page). Defaults to true. Must be set to false once you have more than one tenant — otherwise every visitor sees the lone org’s data regardless of which slug they expected.
  • tenancy.subdomain_public_routes — serve one page per org at {slug}.{public_status.base_domain} (apex wildcard). Defaults to false; requires a well-formed base_domain.

| Shape | Recommended flags | Public surface |
|---|---|---|
| Single-tenant | path_based_public_routes = true (default) | /status on the operator host (one org) |
| Multi-tenant SaaS | subdomain_public_routes = true, path_based_public_routes = false | {slug}.{base_domain} per org |

The binary refuses to boot in the dangerous combinations: subdomain_public_routes with an empty or single-label public_status.base_domain; or an auth.session.cookie_domain that overlaps the status wildcard. Each is a loud panic at startup, not a silent runtime leak. See Per-org status pages for the full model.
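
As a rough illustration of the startup checks described above, the sketch below validates the two conditions in plain Python. The function name is hypothetical, and the definition of "overlaps" (the cookie domain equals the base domain, or is a parent or child of it) is an assumption; the source does not spell out the exact rule, nor whether the cookie check also applies when subdomain routing is off.

```python
def check_public_routing(subdomain_routes: bool, base_domain: str, cookie_domain: str) -> None:
    """Raise ValueError for the combinations described as refused at boot."""
    if not subdomain_routes:
        return  # assumption: no status wildcard exists, so nothing to check

    labels = [label for label in base_domain.strip(".").lower().split(".") if label]
    if len(labels) < 2:
        raise ValueError("public_status.base_domain must be a non-empty, multi-label domain")

    base = ".".join(labels)
    cookie = cookie_domain.strip(".").lower()
    if not cookie:
        return  # host-only cookie never reaches the status subdomains

    # Assumed overlap rule: same domain, a parent of the base domain,
    # or a name underneath the wildcard itself.
    overlaps = (
        cookie == base
        or base.endswith("." + cookie)
        or cookie.endswith("." + base)
    )
    if overlaps:
        raise ValueError("auth.session.cookie_domain overlaps the status wildcard")
```

With base_domain = "status.example.test", a cookie_domain of "example.test" would be rejected under this rule because a session cookie scoped there would also be sent to every {slug}.status.example.test host.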

Org limits and the purge worker

  • free_tier_owner_org_limit (default 3) caps how many orgs a single user can own. Soft-deleted orgs don’t count. Enforced inside the membership INSERT so concurrent creates can’t exceed the cap.
  • deletion_grace_period_days (default 30) is how long a soft-deleted org’s slug is held and how long the original deleter has to restore it.
  • The soft-delete purge now runs inside the daily retention job (src/jobs/retention.rs) at a fixed 03:00 UTC, not on a configurable interval. Each run cascades up to 10 past-grace orgs, drains any pending entries from clickhouse_purge_queue (the outbox between PG cascade and ClickHouse ALTER TABLE DELETE), hard-purges past-grace users, then enforces the [retention] windows. See Soft delete and the 30-day purge for the full implementation and failure-recovery guarantees.

The [retention] section sets the long-horizon windows. Defaults: login_attempts_days = 180, quota_events_days = 90, audit_log_days = 730. Check-result retention is not a config knob — the physical TTLs are baked into the ClickHouse tables at migration time (a value here would be silently ignored, since the TTL is never re-issued as an ALTER on boot): raw per-check rows in check_results keep 90 days, and the hourly rollup check_results_1h keeps 13 months. Those are the widest-tier ceilings; what a given plan actually sees is narrowed at read time by a per-plan window clamp (separate windows for raw forensics and chart history), so a plan change is an instant tag flip with no data rewrite. The public status page’s daily history strip still shows 90 days, and the Privacy Policy’s retention table pins these same physical windows. Session idle/absolute reaping uses [auth.session]; soft-deleted org/user grace uses tenancy.deletion_grace_period_days; OAuth-state and magic-link tokens are swept by their own short-cadence jobs.

See Multi-tenancy for the full model, slug rules, and the storage-layer isolation invariants the CI checks enforce.

Auth configuration

[auth]
enabled_methods = ["github_oauth", "google_oauth", "magic_link"]
fingerprint_salt = ""                # HMAC salt for IP/UA hashes; rotate-aware
public_base_url = "https://status.example.test"

[auth.session]
idle_timeout_days = 30
absolute_timeout_days = 90
cookie_name = "_sm_session"
cookie_secure = true                 # set false only for plain-HTTP local dev
cookie_domain = ""                   # empty = host-only cookie
renew_on_use = true

[auth.github]
client_id = ""                       # from https://github.com/settings/developers
client_secret = ""
redirect_url = "https://status.example.test/auth/github/callback"
scopes = ["user:email", "read:user"]

[auth.google]
client_id = ""                       # Google Cloud Console OAuth web client
client_secret = ""
redirect_url = "https://status.example.test/auth/google/callback"
scopes = ["openid", "email", "profile"]

[auth.invitations]
expiry_hours = 168                   # 7 days
max_pending_per_org = 50

[auth.api_tokens]
max_per_user = 25
prefix_visible_chars = 16            # floor; lower values fail boot

[auth.magic_link]
expiry_minutes = 15
rate_limit_seconds = 60                # per-email send throttle; 0 disables

[email]
provider = "log"                     # "resend" in prod, "log" in dev, "memory" in tests
from_name = "Uptimepage"
from_address = "no-reply@example.test"

[email.resend]
api_key = ""                         # required when provider = "resend"
webhook_secret = ""                  # whsec_… of the Resend webhook endpoint

[whatsapp_app]                       # operator WhatsApp number (one-tap linking)
enabled = false                      # deliberate spend gate — creds alone stay off
access_token = ""                    # Meta Cloud API token (env-only)
phone_number_id = ""                 # Cloud API sender id
public_number = ""                   # display number digits — the wa.me target
app_secret = ""                      # signs webhook deliveries (env-only)
verify_token = ""                    # echoed by Meta's GET subscribe handshake
template_name = ""                   # approved alert template, single body param
language_code = "en"

auth.enabled_methods is the policy switch per sign-in method: removing an entry disables that method’s login start/callback (404) and hides its button. OAuth providers additionally need client_id + client_secret + redirect_url set — a listed but incompletely configured provider stays hidden and logs a warning on probe. "magic_link" mounts the magic-link request/verify endpoints and the login-page email form.
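
The gating described above can be sketched as a small function that decides which sign-in methods end up visible. This is an illustration only: the function name and the dict-shaped config are hypothetical, while the method names and section names come from the sample above.

```python
from typing import Dict, List

# Maps an enabled_methods entry to the config section holding its OAuth client.
OAUTH_SECTIONS: Dict[str, str] = {"github_oauth": "github", "google_oauth": "google"}
REQUIRED_OAUTH_KEYS = ("client_id", "client_secret", "redirect_url")


def visible_methods(auth: dict) -> List[str]:
    """Return the sign-in methods that are both listed and fully configured."""
    visible = []
    for method in auth.get("enabled_methods", []):
        section = OAUTH_SECTIONS.get(method)
        if section is None:
            # Non-OAuth methods such as "magic_link" need no client credentials.
            visible.append(method)
            continue
        provider = auth.get(section, {})
        if all(provider.get(key) for key in REQUIRED_OAUTH_KEYS):
            visible.append(method)
        # A listed but incomplete provider stays hidden (the service also logs a warning).
    return visible
```

With the sample config above, where both OAuth clients have an empty client_id, only "magic_link" would be returned.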

auth.fingerprint_salt is paired with the auth_salt_history table. If the value changes mid-deployment, the service refuses to boot unless the override env var documented in docs/troubleshooting.md is set. This is deliberate, so that audit-trail breakage is loud.

Central Telegram bot

[telegram]
bot_token = ""            # env UPTIMEPAGE_TELEGRAM__BOT_TOKEN; presence enables the feature
bot_username = ""         # verified against the Bot API at boot; used for t.me deep links
webhook_secret = ""       # random, 32+ chars; Telegram echoes it on every webhook delivery

Setting bot_token switches on one-tap Telegram channel linking: the type card in the channel form, the link-code API, and the /hooks/telegram receiver. Empty token (the default) leaves the feature absent entirely — self-host deployments keep the bring-your-own telegram transport, which needs no operator config.

When enabled, boot validates the trio: non-empty bot_username, webhook_secret of 32+ characters, and an https:// auth.public_base_url (Telegram only delivers webhooks to public https endpoints). The app then verifies the token against the Bot API and registers the webhook on every boot; a Telegram outage logs a warning and disables the bot for that boot instead of failing the deploy.
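
A minimal sketch of the boot-time trio check, assuming a dict-shaped [telegram] block and a hypothetical function name. It covers only the static validation; the Bot API verification and webhook registration are not modelled.

```python
from typing import List


def telegram_boot_errors(telegram: dict, public_base_url: str) -> List[str]:
    """Return config errors for the central Telegram bot; empty when valid or disabled."""
    if not telegram.get("bot_token"):
        return []  # empty token: the feature is absent, nothing to validate

    errors = []
    if not telegram.get("bot_username"):
        errors.append("telegram.bot_username must be non-empty")
    if len(telegram.get("webhook_secret", "")) < 32:
        errors.append("telegram.webhook_secret must be at least 32 characters")
    if not public_base_url.startswith("https://"):
        errors.append("auth.public_base_url must be an https:// URL")
    return errors
```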

All three values are operator secrets: env-only in production, never in a committed config file.

Provider OAuth connect (“Add to Slack” / “Add to Discord”)

[slack_oauth]
client_id = ""            # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_ID
client_secret = ""        # env UPTIMEPAGE_SLACK_OAUTH__CLIENT_SECRET

[discord_oauth]
client_id = ""            # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_ID
client_secret = ""        # env UPTIMEPAGE_DISCORD_OAUTH__CLIENT_SECRET

Credentials of operator-owned OAuth apps — Slack with the incoming-webhook scope, Discord with webhook.incoming. When a pair is set, that provider’s panel in the channel form grows a connect button (plus a QR variant): the provider’s consent screen picks the destination channel and the callback stores the returned webhook as a regular slack/discord channel — access tokens are discarded. The app’s redirect URL must be <auth.public_base_url>/auth/slack/callback (or …/auth/discord/callback). Empty credentials (the default) hide the button; manual webhook paste always works. Env-only in production, never in a committed config file.

Public status page

The [public_status] block configures the per-org public surface. It is load-bearing only when tenancy.subdomain_public_routes = true; the defaults are safe to leave untouched for self-host.

[public_status]
base_domain = ""                       # REQUIRED when subdomain_public_routes = true
cache_max_orgs = 1000                  # hot + last-good cache bound
cache_ttl_secs = 10                    # per-org rendered-page TTL
last_good_ttl_secs = 3600              # idle eviction for the stale-fallback layer
logo_dir = "/var/lib/uptimepage/logos"
max_logo_size_bytes = 1048576          # 1 MiB byte ceiling (pre-decode)
allowed_logo_mime_types = ["image/png", "image/jpeg", "image/webp"]
max_logo_dimension_px = 1200           # larger uploads are downscaled; decode
                                       # is also allocation-bounded (bomb guard)
default_brand_color = "#3b82f6"        # used when an org sets no colour
default_show_powered_by = true
public_per_ip_rate_limit_per_min = 60  # in-app limit behind the Caddy-side one

| Key | Purpose |
|---|---|
| base_domain | parent domain for {slug}.{base_domain}. Must be multi-label; boot fails on empty/single-label when subdomain routing is on |
| cache_max_orgs / cache_ttl_secs | per-org page cache size and freshness window |
| last_good_ttl_secs | how long an idle org’s last-known-good snapshot is retained before eviction |
| logo_dir, max_logo_size_bytes, allowed_logo_mime_types, max_logo_dimension_px | logo upload storage and limits |
| default_brand_color, default_show_powered_by | fallbacks when an org leaves branding unset |
| public_per_ip_rate_limit_per_min | second-layer rate limit behind the reverse proxy’s |

History-strip length (90 days) and the recent-incidents horizon (30 days) remain hard-coded defaults in src/public_status/aggregator.rs. What a page publishes is curated per-page — a monitor appears as a component only while it’s bound to that page, and its presentation lives on the binding:

| Per-page component field | Purpose |
|---|---|
| (binding exists) | the monitor is published as a component on that page |
| public_name | display name (falls back to operator-side monitor name) |
| public_description | optional one-liner |
| public_group | optional group label; ungrouped components render last |
| sort_order | ASC integer sort within a group |

See Public status page for the operator workflow and Per-org status pages for the SaaS subdomain model.
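
The per-page component fields above imply an ordering, which the following sketch makes concrete. The Binding class and function names are hypothetical. Two details are assumptions not stated in the text: groups are ordered alphabetically, and ties within a group are broken by display name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Binding:
    monitor_name: str                    # operator-side monitor name
    sort_order: int = 0
    public_name: Optional[str] = None
    public_group: Optional[str] = None


def display_name(binding: Binding) -> str:
    """public_name, falling back to the operator-side monitor name."""
    return binding.public_name or binding.monitor_name


def render_order(bindings: List[Binding]) -> List[Binding]:
    """Grouped components first, ungrouped last, sort_order ascending within a group."""
    def key(binding: Binding):
        ungrouped = binding.public_group is None
        return (ungrouped, binding.public_group or "", binding.sort_order, display_name(binding))
    return sorted(bindings, key=key)
```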

Trace export

OpenTelemetry spans are exported over OTLP/HTTP (protobuf) when both observability.tracing_enabled and observability.grafana.enabled are true. Disabled by default and zero-cost when off.

[observability]
tracing_enabled = false                # master on/off for trace export

[observability.grafana]
enabled = false                        # second switch; both must be true
otlp_endpoint = ""                     # OTLP base, no /v1/traces suffix; e.g.
                                       # https://otlp-gateway-<zone>.grafana.net/otlp
instance_id = ""                       # Grafana Cloud numeric instance / stack id
trace_sample_ratio = 0.05              # parent-based head sampling, [0.0, 1.0]
# api_key                              # NEVER in TOML — env var only (below)

| Key | Purpose |
|---|---|
| tracing_enabled | master switch; with grafana.enabled gates all export |
| grafana.enabled | second switch (kept separate so the block is inert until explicitly turned on) |
| grafana.otlp_endpoint | OTLP/HTTP base URL; the service appends /v1/traces (a value already ending in it is left as-is). Empty fails boot when export is on |
| grafana.instance_id | basic-auth username (Grafana Cloud instance id). Empty fails boot when export is on |
| grafana.api_key | basic-auth password. Env-only: UPTIMEPAGE_OBSERVABILITY__GRAFANA__API_KEY. Never read from a config file; redacted in any serialised config |
| grafana.trace_sample_ratio | head sampling ratio under a parent-based sampler. Must be in [0.0, 1.0] or boot fails |

Auth is Authorization: Basic base64(instance_id:api_key). Resource attributes service.name = uptimepage and service.version are attached. The batch exporter is flushed and stopped on graceful shutdown. A transport build failure logs a warning and the service continues without traces — telemetry never takes down monitoring. Inconsistent settings (export on with a missing endpoint / instance / key, or an out-of-range ratio) are a clean startup config error.
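
To make the endpoint and header rules concrete, here is a small sketch of how the traces URL and the Authorization value could be derived. The function names are hypothetical, and stripping a trailing slash before appending the suffix is an assumption.

```python
import base64

TRACES_SUFFIX = "/v1/traces"


def traces_url(otlp_endpoint: str) -> str:
    """Append /v1/traces to the OTLP base unless the value already ends with it."""
    base = otlp_endpoint.rstrip("/")  # assumption: a trailing slash is tolerated
    if base.endswith(TRACES_SUFFIX):
        return base
    return base + TRACES_SUFFIX


def basic_auth_header(instance_id: str, api_key: str) -> str:
    """Build the Authorization value: Basic base64(instance_id:api_key)."""
    raw = f"{instance_id}:{api_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

This is handy for reproducing the request with a generic HTTP client when debugging a rejected export, for example to confirm that the instance id and key pair is accepted by the collector.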

Tuning notes

  • max_concurrent_checks caps simultaneous in-flight checks. Per-check memory is small (a tokio task plus an in-flight hyper request), so the practical ceiling is set by file descriptors and ephemeral ports rather than RAM.
  • per_host_max_inflight (default 2) is the per-tenant per-(host, port) in-flight cap. One tenant fanning a burst of checks at the same upstream looks like a probe; this cap keeps that fingerprint flat. Tenant-scoped — one customer’s burst never starves another customer’s monitor of the same host. Fail-fast: a check that would exceed the cap is recorded as degraded with error="throttled: host concurrency cap" and skipped (no alert fired — the upstream is fine, the back-pressure is operator-side). Counters: uptimepage_host_throttle_waits_total{kind="host"} (attempts) and uptimepage_host_throttle_drops_total (rejections).
  • rdap_max_inflight (default 1) is the process-wide per-TLD RDAP concurrency cap (across all tenants). Daily check cadence + per-TLD slot means deep queues drain quickly without bursting any registry. Same fail-fast behavior + counters as the per-host cap.
  • storage.clickhouse.buffer_size is the mpsc capacity between worker pool and batcher. Sized for ~1 s of bursts at peak RPS. Drops increment storage_dropped_total{reason="queue_full"} — that metric is your back-pressure signal.
  • storage.clickhouse.batch_size vs batch_timeout_ms trade tail latency for throughput. 1000 / 500ms is a good starting point at ~20k rps.
  • scheduler.jitter_pct prevents synchronized fleet-wide ticks. Default 10% is enough to spread N targets across an interval without making individual schedules unpredictable.
  • dns.servers accepts either bare IPs ("1.1.1.1") or ip:port form. Used as is — no system resolver fallback.
  • security.allow_private_targets is the SSRF guard. Default false blocks:
    • Loopback (127.0.0.0/8, ::1)
    • RFC1918 private (10/8, 172.16/12, 192.168/16)
    • Link-local (169.254/16, fe80::/10) — covers AWS/GCP metadata 169.254.169.254
    • Carrier-grade NAT (100.64/10)
    • IPv6 ULA (fc00::/7), discard, IPv4-mapped private, documentation ranges
    • Multicast, broadcast, unspecified, reserved-for-future-use
    • IPv6 transition mechanisms: 2002::/16 (6to4) and 64:ff9b::/96 (NAT64) are decoded to their embedded IPv4 and rejected when the inner IPv4 falls in any blocked range

    The guard runs both at API submission (rejects IP-literal URLs synchronously) and after DNS resolution at connect time (catches DNS rebinding). Flip to true for internal monitoring where private targets are the goal; operators are then responsible for network segmentation.
  • security.credentials_kek_base64 enables AES-256-GCM encryption of HTTP basic_auth and bearer_token values inside the targets.check_spec JSONB column. Generate with openssl rand -base64 32. Each write produces a fresh 12-byte random nonce; the on-disk shape is {"$enc":"v1:<nonce>:<ciphertext>"}. When the key is unset the service logs a startup warning and stores credentials plaintext (dev-friendly upgrade path — existing plaintext rows continue to read after a key is provisioned). Rotation and KMS integration are out of scope for the current version; treat the KEK as long-lived and protect it via your secret-management of choice (env file with restricted mode, container secret, etc.). A malformed KEK fails the process at startup.
  • api.rate_limit applies a per-peer-IP token bucket only to /api/v1/* routes (/healthz and /readyz are excluded so liveness probes never see 429). per_second is the refill rate; burst is the bucket capacity. Excess requests get 429 Too Many Requests with a Retry-After header. The bucket key is the TCP peer IP — when the service sits behind a reverse proxy, every client appears as the proxy IP, so prefer doing rate limiting at the proxy in that topology. Disabled by default; leave it off and let your reverse proxy enforce limits unless you bind the API directly to the internet.
  • TLS cert checks (type = "tls_cert") open a dedicated TCP+TLS handshake per probe — separate from the HTTP check path. Recommended interval >= 3600 so probe traffic stays light. The check accepts any cert chain (the goal is to report expiry status, not enforce trust), so an expired or self-signed cert still produces a structured result rather than a generic handshake error.
  • Domain expiry checks (type = "domain_expiry") query RDAP via a process-shared outbound HTTPS client. The IANA bootstrap registry (https://data.iana.org/rdap/dns.json) is fetched lazily on first use and cached for process lifetime — a registry update or a transient bootstrap failure persists until restart. RDAP servers rate-limit clients, so interval >= 3600 is enforced server-side and daily is typical. SSRF guard does not gate these requests because the destination is an IANA-published endpoint, not the user-supplied domain.
    • Sticky last-good. Each successful probe persists (expiry_at, registrar, last_success_at) to the domain_expiry_state table (PK target_id, denormalised org_id; every query filters on both). On a transient probe failure — throttle, timeout, registry 5xx, RDAP 404, network blip — the executor returns the cached verdict instead of flipping the monitor to Degraded/Down. For Up the customer-facing error field stays empty; Degraded/Down carry a served_stale: … annotation so operators can distinguish a stale serve from a fresh probe. Operators also see the staleness via the uptimepage_domain_expiry_stale_served_total counter.
    • Staleness ceiling: 7 days. A cached row older than 7d is treated as “registry unreachable for too long” and surfaces as a real Error, which is alert-eligible.
    • Cross-tenant singleflight. Concurrent probes for the same domain coalesce to one outbound RDAP request. Cache TTL on the singleflight slot is 60s — short enough that each scheduled cycle still fetches fresh, long enough to absorb scheduler-jitter waves at scale. Counter: uptimepage_rdap_singleflight_total{outcome="hit"|"miss"}.
  • Notification channels are no longer global config. They are per-org runtime resources (Slack / Discord / Teams / Google Chat webhooks, generic HTTP webhook, Telegram bot, WhatsApp Cloud API) created via the /api/v1/notification-channels API; a target binds them by id in its alerts array. Transport secrets are sealed at rest with the credentials KEK and never echoed back. Slack POSTs { "text": "..." }; the generic webhook POSTs the incident-notice JSON (plus any configured custom headers, optionally HMAC-signed — see docs/api.md). Notifications are driven by the incident engine and persisted per attempt, so delivery state survives a restart. The binding syntax and the monitor-level firing policy (confirmations, recovery, reminders, region quorum) are documented in docs/api.md.
  • api.cors opens /api/v1/* to browser-origin access. Each entry in allowed_origins must be a full origin (https://app.example.com) — wildcards are not parsed; set allow_any_origin = true to send Access-Control-Allow-Origin: * explicitly. The two are mutually exclusive — combining them or enabling CORS with an empty list aborts startup. allowed_methods is echoed in the preflight response (Access-Control-Allow-Methods); Access-Control-Allow-Headers is fixed to content-type, which is what the JSON API needs. /healthz and /readyz are not wrapped, so liveness probes are unaffected.
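
The blocked ranges listed under security.allow_private_targets can be approximated with Python's standard ipaddress module. This is a sketch of the per-address classification only, not the service's guard: the function names are hypothetical, the network list is transcribed from the bullets above plus the well-known documentation prefixes, and it does not model the two enforcement points (API submission and connect time).

```python
import ipaddress
from typing import Optional

BLOCKED_NETWORKS = [ipaddress.ip_network(net) for net in (
    "127.0.0.0/8", "::1/128",                            # loopback
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",     # RFC1918 private
    "169.254.0.0/16", "fe80::/10",                       # link-local (cloud metadata)
    "100.64.0.0/10",                                     # carrier-grade NAT
    "fc00::/7",                                          # IPv6 ULA
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "2001:db8::/32",  # documentation
)]
NAT64 = ipaddress.ip_network("64:ff9b::/96")


def embedded_ipv4(addr: ipaddress.IPv6Address) -> Optional[ipaddress.IPv4Address]:
    """Return the IPv4 carried by an IPv4-mapped, 6to4, or NAT64 address, else None."""
    if addr.ipv4_mapped is not None:
        return addr.ipv4_mapped
    if addr.sixtofour is not None:
        return addr.sixtofour
    if addr in NAT64:
        return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
    return None


def is_blocked(ip: str) -> bool:
    """True when the address falls in a range the guard is described as rejecting."""
    addr = ipaddress.ip_address(ip)
    if isinstance(addr, ipaddress.IPv6Address):
        inner = embedded_ipv4(addr)
        if inner is not None:
            addr = inner  # judge transition addresses by their inner IPv4
    if addr.is_multicast or addr.is_unspecified or addr.is_reserved:
        return True
    return any(addr.version == net.version and addr in net for net in BLOCKED_NETWORKS)
```

Because DNS rebinding is caught only by the connect-time check, a classifier like this would need to run on the resolved address of every connection attempt, not just on the URL as submitted.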