Quotas & rate limits
Every organization is bound to a plan. The plan is the single source of
truth for resource quotas and per-minute rate budgets: the limit enforced on
a request is the same number the API reports back. Adding a paid tier
later is one row in the plans table plus a UI page; nothing in the
enforcement path changes.
The free plan
Shipped and seeded on first migration. Generous for a small team, yet bounded tightly enough that abuse stays cheap to absorb on a small VM.
| Quota | Free | Meaning |
|---|---|---|
| max_targets | 10 | Monitored targets in the org |
| min_check_interval_secs | 60 | Plan-side floor on a target’s check interval. The effective floor is max(this, kind_min), where kind_min is 3600 for tls_cert / domain_expiry and 10 for http / tcp / dns. |
| retention_days | 90 | Informational only: actual check-result retention is the flat ClickHouse table TTL (90d for every org), not this column |
| max_members | 5 | Active members in the org |
| max_pending_invitations | 10 | Outstanding (unaccepted) invitations |
| max_api_tokens_per_user | 5 | API tokens a single user may hold |
| max_status_pages | 1 | Public status pages the org can run |
| max_public_components | 10 | Distinct monitors published across all of the org’s pages (a monitor on several pages counts once) |
| max_maintenance_windows | 20 | Scheduled maintenance windows |
| max_notification_channels | 20 | Notification channels (Slack/webhook/Telegram/WhatsApp/SMS/…) in the org |
| max_logo_size_bytes | 1048576 | Status-page logo upload ceiling (1 MiB) |
| Rate budget (per minute) | Free | Category |
|---|---|---|
| api_writes_per_minute | 600 | POST/PATCH/DELETE on /api/v1/* |
| api_reads_per_minute | 6000 | GET/HEAD/OPTIONS on /api/v1/* |
| bulk_ops_per_minute | 30 | /api/v1/targets/bulk* |
| test_now_per_minute | 60 | POST /api/v1/targets/test + the notification-channel test endpoints |
| check_now_per_minute | 60 | POST /api/v1/targets/{id}/check-now |
How quotas are enforced
A resource quota is checked atomically at the write, not by a check-then-act in the handler. The friendly handler-side pre-check exists only to produce a clean error on the common, uncontended path; the race-safe guarantee is in the store:
- Targets: the count bound is inside the INSERT (single and bulk), handed the same max_targets. Concurrent creates at limit - 1 settle at exactly limit, never more.
- Members: the membership insert runs under a per-org advisory lock, counts, and rolls itself back if it crossed max_members. Re-adding an existing member stays a no-op.
- Pending invitations: dedupe and the pending cap are enforced in one transaction under the same per-org lock; parallel duplicate-email invites yield exactly one row.
- Public components: flipping a target public is gated on create, bulk, and PATCH (so “create private, then edit public” cannot bypass the cap).
- API tokens: count-in-INSERT, scoped per user, handed max_api_tokens_per_user.
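The count-in-INSERT idea for targets can be sketched as a single statement that inserts only while the org is under its limit. This is an illustration under stated assumptions, not the project's store code: it uses an in-memory SQLite database instead of Postgres, and the table layout and the `create_target` helper are hypothetical. SQLite serializes writers, so the example shows the shape of the statement rather than the real store's concurrency behavior.

```python
import sqlite3


def create_target(conn: sqlite3.Connection, org_id: int, name: str, max_targets: int) -> bool:
    """Insert a target only if the org is still under max_targets.

    The count check and the insert are one statement, so the handler never
    does a separate "check" followed by an "act".
    """
    cur = conn.execute(
        """
        INSERT INTO targets (org_id, name)
        SELECT ?, ?
        WHERE (SELECT COUNT(*) FROM targets WHERE org_id = ?) < ?
        """,
        (org_id, name, org_id, max_targets),
    )
    conn.commit()
    # Zero inserted rows means the quota was already reached.
    return cur.rowcount == 1


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE targets (id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT)")

results = [create_target(conn, org_id=1, name=f"t{i}", max_targets=3) for i in range(5)]
count = conn.execute("SELECT COUNT(*) FROM targets WHERE org_id = 1").fetchone()[0]
```

With a limit of 3, the first three creates succeed and the remaining two are refused, leaving exactly three rows for the org.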
Exceeding a resource quota returns 422:
{
"error": {
"code": "QUOTA_EXCEEDED",
"message": "max_targets limit reached: 10 of 10 used on the free plan.",
"field": null,
"details": { "quota": "max_targets", "current": 10, "limit": 10, "plan": "free" },
"trace_id": null
}
}
The pending-invitation cap is the one exception to the code: it predates the
unified envelope and returns 409 INVITATIONS_LIMIT. The cap itself is
enforced identically (atomic, never overshoot).
A sub-minimum check interval is its own 422, MIN_CHECK_INTERVAL, enforced
on create and PATCH, single and bulk — a target created at the floor cannot
be edited below it. The floor is max(plan.min_check_interval_secs, kind_min):
the per-kind value (3600 for tls_cert / domain_expiry, 10 for the rest)
applies regardless of plan tier — polling an expiry probe faster than once an
hour yields no signal.
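The floor rule above is a single max over two numbers. A minimal sketch, using the per-kind values quoted in the text; the dictionary and function names are hypothetical:

```python
# Per-kind minimums as stated in the text; every other kind falls back to 10.
KIND_MIN_SECS = {"tls_cert": 3600, "domain_expiry": 3600}
DEFAULT_KIND_MIN_SECS = 10


def effective_min_interval(kind: str, plan_min_secs: int) -> int:
    """Effective floor = max(plan.min_check_interval_secs, kind_min)."""
    kind_min = KIND_MIN_SECS.get(kind, DEFAULT_KIND_MIN_SECS)
    return max(plan_min_secs, kind_min)


def interval_allowed(kind: str, requested_secs: int, plan_min_secs: int) -> bool:
    """True when the requested interval would not trip MIN_CHECK_INTERVAL."""
    return requested_secs >= effective_min_interval(kind, plan_min_secs)
```

On the free plan (plan floor 60), an http target may run every 60 seconds but not every 30, while a tls_cert target is held to 3600 seconds whatever the plan says.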
Rate limiting
Two app-side tiers, both keyed on the authenticated subject (never the
TCP peer): (org, category) and (user, category). Both are checked; the
org tier fires first because it protects shared resources. The per-minute
budget comes from the org’s plan. The request category is derived from the
path and method:
- path contains /bulk → bulk_ops
- path ends /test → test_now
- path ends /check-now → check_now
- otherwise GET/HEAD/OPTIONS → api_reads, else → api_writes
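The category rules above can be read as a short decision chain. The sketch below assumes the rules are evaluated in the order listed (the text does not state the precedence explicitly), and `request_category` is a hypothetical name:

```python
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def request_category(method: str, path: str) -> str:
    """Map a request to a rate-limit category from its path and method."""
    if "/bulk" in path:
        return "bulk_ops"
    if path.endswith("/test"):
        return "test_now"
    if path.endswith("/check-now"):
        return "check_now"
    # Everything else is split by method into reads and writes.
    return "api_reads" if method.upper() in READ_METHODS else "api_writes"
```

For example, `POST /api/v1/targets/42/check-now` lands in check_now, while `DELETE /api/v1/targets/42` falls through to api_writes.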
Exceeding a budget returns 429 with a Retry-After header:
{
"error": {
"code": "RATE_LIMITED",
"message": "Too many requests.",
"field": null,
"details": { "scope": "per_org_api_writes", "retry_after_secs": 30 },
"trace_id": null
}
}
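A client can use either the Retry-After header or details.retry_after_secs from the body above to decide how long to back off. The helper below is a hypothetical client-side sketch, not part of the API: it assumes the header carries a whole number of seconds (Retry-After may also be an HTTP date, which this sketch ignores) and that headers arrive as a plain dict keyed by the canonical header name.

```python
import json
from typing import Optional


def retry_delay_secs(status: int, headers: dict, body: str) -> Optional[int]:
    """Return how many seconds to wait before retrying, or None if not rate-limited."""
    if status != 429:
        return None
    header = headers.get("Retry-After")
    if header is not None and header.strip().isdigit():
        return int(header.strip())
    # Fall back to the value carried in the error envelope.
    details = json.loads(body).get("error", {}).get("details", {})
    return details.get("retry_after_secs")
```

Preferring the header keeps the helper usable with generic HTTP tooling, while the body fallback covers clients that only see the JSON envelope.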
The limiter is a governor cell per (scope, category) key in a DashMap.
A janitor evicts entries idle past the threshold so the map stays bounded by
the number of active tenants, not by request volume; its lifetime is bound
to the limiter so a refactor cannot silently drop the sweep and leak the
map. Unauthenticated requests fall through untouched — per-IP limiting for
those (auth endpoints, org creation, the public status surface) is the
reverse proxy’s job; see Deployment.
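The shape of a keyed limiter with an idle sweep can be sketched as follows. This is a stand-in, not the governor-based implementation: it uses a plain token bucket per key and an ordinary dict instead of a DashMap, takes the clock as an argument so the behavior is deterministic, and uses one fixed budget for all keys (the real budget comes from each org's plan). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float
    last_seen: float


class KeyedLimiter:
    """One token bucket per (scope, category) key, plus an idle-entry sweep."""

    def __init__(self, per_minute: int, idle_secs: float = 300.0):
        self.per_minute = per_minute
        self.idle_secs = idle_secs
        self.buckets = {}  # (scope, category) -> Bucket

    def allow(self, key, now: float) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = Bucket(float(self.per_minute), now)
        else:
            # Refill in proportion to elapsed time, capped at the full budget.
            refill = (now - bucket.last_seen) * self.per_minute / 60.0
            bucket.tokens = min(float(self.per_minute), bucket.tokens + refill)
            bucket.last_seen = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False

    def sweep(self, now: float) -> int:
        """Janitor pass: drop keys idle past the threshold; returns how many were evicted."""
        idle = [k for k, b in self.buckets.items() if now - b.last_seen > self.idle_secs]
        for key in idle:
            del self.buckets[key]
        return len(idle)
```

Because the sweep is a method on the limiter itself, whatever owns the limiter also owns the eviction, which mirrors the point in the text about tying the janitor's lifetime to the limiter.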
Checks themselves are not rate-limited — the scheduler path never enters this middleware, so monitoring throughput is unaffected.
Every quota / rate-limit / abuse rejection is recorded to the append-only
quota_events table (event, quota_name, details, hashed IP) as
fire-and-forget — it never blocks the response. It is the data source for
abuse review.
Usage transparency
| Endpoint | Returns |
|---|---|
| GET /api/v1/orgs/{id}/usage | Plan + current vs limit for every org-scoped quota, policy values, rate budgets, feature flags. Member-gated (a non-member gets the same 404 as GET /orgs/{id}). |
| GET /api/v1/me/usage | The caller’s api_tokens and owned_orgs current/limit. |
The operator UI surfaces the same numbers at /settings/usage as progress
bars (an unlimited self-host limit renders as ∞). Reported limit == enforced
limit by construction: both read the same plan and the same count query.
Anti-abuse
Two deny-lists, applied when a target is created, bulk-created, updated, or
test-run. A block is a 400, audited to quota_events with
event = abuse_blocked.
- URL patterns: a case-insensitive regex set of attack-recon paths (exposed VCS dirs, .env, credential paths, admin panels, WordPress xmlrpc pingback, Spring actuator, backup/dump extensions, …). A match is 400 URL_PATTERN_BLOCKED / ABUSE_BLOCKED. The shipped patterns and the compiled fallback are kept byte-identical by a drift guard.
- Domains: a YAML deny-list (config/abuse_denylist.yaml) matched hierarchically: listing example.com also blocks eu.status.example.com. It carries the operator’s own domain (don’t monitor yourself) and competing uptime/status providers (monitoring another monitor forms a load-amplification chain). A match is 400 DOMAIN_DENYLISTED. Dedicated monitoring SaaS are listed at the apex; multi-tenant status-page hosts are listed narrowly so legitimate vendor-status checks are not over-blocked.
The list loads once at startup; changes need a restart in this release. A bad regex or malformed YAML is a clean startup config error, never a crash loop.
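Hierarchical matching on the domain deny-list means a host is blocked if it, or any parent domain, is listed. A minimal sketch, assuming the list has already been loaded into a set of lowercase domains; `is_denylisted` is a hypothetical name:

```python
def is_denylisted(host: str, denylist: set) -> bool:
    """True if the host or any of its parent domains appears in the deny-list."""
    labels = host.lower().rstrip(".").split(".")
    # Check the host itself, then each parent: a.b.c -> a.b.c, b.c, c
    return any(".".join(labels[i:]) in denylist for i in range(len(labels)))
```

Matching on whole labels rather than string suffixes keeps notexample.com from being caught by an example.com entry.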
Configuration
[quotas]
# org→plan cache; a plans-table edit takes effect within this window
plan_cache_ttl_secs = 300
usage_cache_ttl_secs = 10
For an org whose plan is already cached, a plans-table change stays invisible until the cached entry’s TTL (plan_cache_ttl_secs) elapses; the next lookup then refetches. A cache hit costs zero DB round-trips on the hot path.
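The TTL behavior can be illustrated with a small read-through cache. This is a sketch under assumptions, not the project's cache: the class name, the `fetch` callback standing in for the plans-table lookup, and the injected clock are all hypothetical.

```python
class PlanCache:
    """Read-through org→plan cache with a fixed TTL."""

    def __init__(self, fetch, ttl_secs: int = 300):
        self.fetch = fetch          # stand-in for the plans-table lookup
        self.ttl_secs = ttl_secs
        self.entries = {}           # org_id -> (plan, fetched_at)

    def get(self, org_id, now: float):
        hit = self.entries.get(org_id)
        if hit is not None and now - hit[1] < self.ttl_secs:
            return hit[0]           # cache hit: no lookup at all
        plan = self.fetch(org_id)   # miss or expired: refetch and restamp
        self.entries[org_id] = (plan, now)
        return plan
```

An edit made just after a fetch is therefore served stale for up to ttl_secs, which is the trade the text describes: no database round-trip on a hit in exchange for bounded staleness.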
Single-tenant deploys raise limits the same way SaaS does: edit (or
INSERT) the plans row the org is assigned to, or attach a
plan_overrides row with the cap fields you want to raise. There is no
config-side override knob: every quota lives in Postgres, so the
audit trail covers both modes.
Every numeric quota / rate / interval is validated at config load —
< 1 is rejected with the offending field named, never a panic in
router or limiter construction.
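Load-time validation of this kind amounts to checking each numeric field and naming the first offender in the error. A minimal sketch; the `ConfigError` type, the `validate_positive` helper, and the message format are hypothetical, and the field names in the usage line are only illustrative:

```python
class ConfigError(ValueError):
    """Raised at config load so a bad value fails startup cleanly."""


def validate_positive(section: str, values: dict) -> None:
    """Reject any field that is not an integer >= 1, naming the field."""
    for field, value in values.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ConfigError(f"{section}.{field} must be an integer >= 1, got {value!r}")


validate_positive("quotas", {"plan_cache_ttl_secs": 300, "usage_cache_ttl_secs": 10})
```

Failing here, before the router or limiter is built, is what turns a zero or negative value into a readable startup error rather than a panic later on.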
The reverse-proxy per-IP tiers (auth endpoints, org creation, public surface) are documented in Deployment.