Autonomous Runtime Guardrails
Summary
AtlasBurn's guardrail engine is a server-side, multi-layer defense system that detects and kills runaway agents autonomously — without manual review. It replaces the legacy SDK-level Auto Kill, which only enforced simple financial caps.
Detection runs in the ingestion server. Enforcement runs at the edge via Cloudflare KV with <5ms latency impact.
System maturity: Stable
Cloudflare KV Architecture
Detection runs server-side in the ingestion pipeline. When any layer trips, the engine writes a global SUSPENDED flag for the organization to the GUARDRAIL_FLAGS Cloudflare KV namespace. Every edge worker reads the flag on every request, so subsequent calls are blocked at the edge in under 5ms — before the upstream provider is ever contacted.
Law 4 — Never crash the proxy (fail open): if the KV flag lookup errors or times out, the worker forwards the request instead of refusing it. The guardrail is a cost control, not an availability dependency, so a KV outage temporarily disables enforcement rather than taking traffic down; that trade-off is intentional.
The 5-Layer Defense Engine
Layer 1 — Financial Limits
Hard and soft caps on daily and monthly spend per organization. Soft caps emit a warning to the dashboard and notification channels; hard caps write a SUSPENDED flag to KV and reject subsequent calls at the edge with 429 Budget Exceeded.
Layer 2 — Dumb Loop Detector (Identical Token Fingerprinting)
Computes a deterministic fingerprint of the prompt tokens for every call, derived from the request body (model + messages + tools). If 4 consecutive requests from the same agent share an identical fingerprint, the agent is killed. This catches the classic failure mode of an agent stuck in a while-true loop re-sending exactly the same context.
OpenRouter caveat: OpenRouter load-balances each request across backends, so the same prompt can be tokenized differently from one call to the next and the fingerprints no longer match exactly. Layer 2 is therefore best-effort on OpenRouter; Layer 4 is the primary safety net there.
Layer 3 — Smart Loop Detector (Context Inflation)
Tracks chat-history token counts per agent session. If counts grow monotonically and linearly — the classic signature of an agent re-appending a repeating JSON-parse error to its context window before each retry — the engine triggers the kill switch before the context window (and the bill) explodes.
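The two loop detectors side by side, as an added example (not AtlasBurn's engine code): a per-session guard that fingerprints every request (Layer 2) and checks the running token history for monotonic, linear growth (Layer 3). The canonical-JSON serialization, the FNV-1a stand-in hash (a real engine would use SHA-256), the `SessionGuard` class, the crude token estimate and the 10% growth tolerance are assumptions; only the "4 identical fingerprints" threshold and the monotonic-linear rule come from the page.

```javascript
// Added example, not AtlasBurn's code. Thresholds marked "page" come from the docs; the rest is assumed.
const DUMB_LOOP_THRESHOLD = 4;    // page: 4 identical consecutive fingerprints kills the agent
const INFLATION_MIN_POINTS = 4;   // assumption: samples needed before judging growth
const INFLATION_TOLERANCE = 0.1;  // assumption: allowed jitter in the per-step token increase

// Canonical JSON: sort object keys recursively so key order cannot change the fingerprint.
function canonicalize(value) {
  if (Array.isArray(value)) return '[' + value.map(canonicalize).join(',') + ']';
  if (value && typeof value === 'object') {
    return '{' + Object.keys(value).sort()
      .map(k => JSON.stringify(k) + ':' + canonicalize(value[k])).join(',') + '}';
  }
  return JSON.stringify(value);
}

// FNV-1a 32-bit: a dependency-free stand-in for SHA-256 so the example runs without imports.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// Layer 2 fingerprint over exactly the fields the page names: model + messages + tools.
function fingerprint(body) {
  return fnv1a(canonicalize({ model: body.model, messages: body.messages, tools: body.tools ?? [] }));
}

// Crude token estimate (~4 chars per token) for the demo; a real engine uses provider-reported prompt_tokens.
function estimateTokens(body) {
  return Math.ceil(canonicalize(body.messages).length / 4);
}

// Layer 3 rule: every step strictly increases, and all increases are within tolerance of their mean.
function isLinearInflation(counts) {
  if (counts.length < INFLATION_MIN_POINTS) return false;
  const deltas = [];
  for (let i = 1; i < counts.length; i++) {
    const d = counts[i] - counts[i - 1];
    if (d <= 0) return false;  // a shrink or plateau is not monotonic growth
    deltas.push(d);
  }
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  return deltas.every(d => Math.abs(d - mean) <= INFLATION_TOLERANCE * mean);
}

// One guard per agent session. observe() returns null to allow, or { layer, reason } to kill.
class SessionGuard {
  constructor() { this.lastFp = null; this.repeat = 0; this.tokenHistory = []; }
  observe(body) {
    const fp = fingerprint(body);
    this.repeat = fp === this.lastFp ? this.repeat + 1 : 1;
    this.lastFp = fp;
    if (this.repeat >= DUMB_LOOP_THRESHOLD) {
      return { layer: 2, reason: `identical fingerprint ${fp} seen ${this.repeat}x` };
    }
    this.tokenHistory.push(estimateTokens(body));
    if (isLinearInflation(this.tokenHistory)) {
      return { layer: 3, reason: `context inflating linearly: ${this.tokenHistory.join(',')}` };
    }
    return null;
  }
}

// Constructed traffic: an agent re-sending the exact same request four times.
function dumbLoopDemo() {
  const guard = new SessionGuard();
  const body = { model: 'gpt-4o', messages: [{ role: 'user', content: 'retry' }], tools: [] };
  let verdict = null;
  for (let i = 0; i < 4; i++) verdict = guard.observe({ ...body, messages: [...body.messages] });
  return verdict;
}

// Constructed traffic: an agent appending the same JSON-parse error to its context before each retry.
function inflationDemo() {
  const guard = new SessionGuard();
  const messages = [{ role: 'user', content: 'parse this json' }];
  let verdict = null;
  for (let i = 0; i < 6 && !verdict; i++) {
    messages.push({ role: 'tool', content: 'SyntaxError: Unexpected token } in JSON at position 14' });
    verdict = guard.observe({ model: 'gpt-4o', messages: [...messages] });
  }
  return verdict;
}
```

Note that the inflation case never trips Layer 2: every retry carries one more message, so the fingerprint changes each time, which is exactly why the second detector exists.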
Layer 4 — RPS Burst Protection (Frequency Limits)
Per-key request rate limiter. If any API key sustains more than 10 requests per minute, the kill flag is set. The limiter uses a sliding one-minute window rather than fixed buckets, so a burst that straddles a bucket boundary is counted in full and there is no per-bucket counter reset to race on. This is the safety net of last resort when fingerprinting cannot help, for example on OpenRouter, where load balancing defeats exact-match detection.
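Sliding window versus fixed bucket, as an added example (not AtlasBurn's limiter): `SlidingWindowLimiter` keeps per-key timestamps and counts those inside the trailing minute, and `fixedWindowCount` is the bucketed counter it is contrasted with. The class and function names, the in-memory `Map` and the demo traffic are assumptions; the 10-per-minute threshold is the page's.

```javascript
// Added example, not AtlasBurn's code. The 10/min limit is from the page; the rest is assumed.
const RPM_LIMIT = 10;       // page: more than 10 requests per minute sustained sets the kill flag
const WINDOW_MS = 60_000;

class SlidingWindowLimiter {
  constructor(limit = RPM_LIMIT, windowMs = WINDOW_MS) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map();  // api key -> ascending request timestamps (ms) inside the window
  }

  // Record one request for `key` at time `now`. Returns true when the key now exceeds the limit.
  hit(key, now) {
    const ts = this.hits.get(key) ?? [];
    // Timestamps arrive in order, so expired entries are a prefix: trim from the front.
    let start = 0;
    while (start < ts.length && ts[start] <= now - this.windowMs) start++;
    const live = ts.slice(start);
    live.push(now);
    this.hits.set(key, live);
    return live.length > this.limit;
  }
}

// Fixed-window counter for comparison: requests are attributed to the bucket floor(t / window).
function fixedWindowCount(timestamps, now, windowMs = WINDOW_MS) {
  const bucket = Math.floor(now / windowMs);
  return timestamps.filter(t => Math.floor(t / windowMs) === bucket).length;
}

// Constructed burst: 10 hits just before the 60s boundary and 10 just after it.
// A fixed window sees two buckets of 10 (never over the limit); the sliding window sees 20 in 2.5s.
function boundaryBurstDemo() {
  const limiter = new SlidingWindowLimiter();
  const times = [];
  for (let i = 0; i < 10; i++) times.push(59_000 + i * 50);
  for (let i = 0; i < 10; i++) times.push(61_000 + i * 50);
  let slidingTripped = false;
  for (const t of times) slidingTripped = limiter.hit('sk-demo', t) || slidingTripped;
  const fixedMax = Math.max(fixedWindowCount(times, 59_000), fixedWindowCount(times, 61_000));
  return { slidingTripped, fixedMax };
}

// Constructed steady traffic: one request every 7s for ten minutes (at most 9 in any minute) never trips.
function steadyTrafficDemo() {
  const limiter = new SlidingWindowLimiter();
  let tripped = false;
  for (let t = 0; t <= 600_000; t += 7_000) tripped = limiter.hit('sk-steady', t) || tripped;
  return tripped;
}
```

The counters here live in one process; on a distributed edge they must sit in a shared store (a Durable Object or similar) or each isolate will only see its own slice of the burst. Cap the per-key array at the limit plus one as well, or a sustained flood grows it without bound inside the window.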
Layer 5 — Premium Model Concentration
Detects sudden spikes in which more than 90% of spend concentrates on the most expensive models (e.g. Claude Opus, GPT-5). This pattern is highly indicative of prompt injection, a hijacked routing layer, or a misconfigured model selector falling back to the most expensive option.
Cloudflare KV Kill Switch Architecture
- Detection — the ingestion server (/api/ingest) processes every telemetry event in real time. When a layer triggers, it calls /api/guardrails/enforce.
- Flag Push — the enforcement endpoint writes a SUSPENDED flag for the organization to the global GUARDRAIL_FLAGS KV namespace.
- Edge Enforcement — every proxy worker reads the flag from KV on every request. Because KV is replicated at the edge, the latency impact of the read is <5ms. KV is eventually consistent, though: a freshly written flag can take some seconds to become visible at every edge location, so a few already in-flight calls may still reach the provider.
- Auto-Recovery (TTL) — flags carry a TTL (default 2 hours). After expiry the system heals itself and the org resumes automatically.
- Manual Override — users can clear the flag instantly from the AtlasBurn dashboard.
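The enforce/read/expire/clear cycle above end to end, as an added example (not AtlasBurn's worker or enforcement code). `FakeKV` is an in-memory stand-in that mimics the Cloudflare KV methods used here (`get`, `put` with `expirationTtl`, `delete`) so the example runs offline; the flag key format, the `x-atlasburn-org` header, the function names and the controllable clock are assumptions. The 2-hour TTL, the 429 response and the fail-open rule are the page's.

```javascript
// Added example, not AtlasBurn's code. KV method shapes match Cloudflare KV; the rest is assumed.
const FLAG_TTL_SECONDS = 2 * 60 * 60;  // page: default 2 hours, after which the org auto-recovers

// In-memory stand-in for the KV namespace with the same get/put/delete surface and TTL semantics.
class FakeKV {
  constructor(clock = () => Date.now()) { this.clock = clock; this.store = new Map(); }
  async get(key) {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (entry.expiresAt !== null && entry.expiresAt <= this.clock()) { this.store.delete(key); return null; }
    return entry.value;
  }
  async put(key, value, opts = {}) {
    const expiresAt = opts.expirationTtl ? this.clock() + opts.expirationTtl * 1000 : null;
    this.store.set(key, { value, expiresAt });
  }
  async delete(key) { this.store.delete(key); }
}

const flagKey = orgId => `org:${orgId}:SUSPENDED`;  // assumption: key layout

// /api/guardrails/enforce side: push the flag with a TTL so the system heals itself.
async function enforce(env, orgId, reason) {
  await env.GUARDRAIL_FLAGS.put(flagKey(orgId), JSON.stringify({ reason, at: Date.now() }),
    { expirationTtl: FLAG_TTL_SECONDS });
}

// Dashboard "manual override": clear the flag immediately instead of waiting for the TTL.
async function clearFlag(env, orgId) { await env.GUARDRAIL_FLAGS.delete(flagKey(orgId)); }

// Edge side: decide before the upstream provider is contacted. Returns { allow, status, reason }.
async function decide(env, orgId) {
  // Assumption: a request that cannot be attributed to an org is refused, since no cap can apply to it.
  if (!orgId) return { allow: false, status: 400, reason: 'missing org id' };
  let flag;
  try {
    flag = await env.GUARDRAIL_FLAGS.get(flagKey(orgId));
  } catch (err) {
    // Law 4: a KV outage must never take the proxy down. Fail open, but surface the gap in the reason.
    return { allow: true, status: 200, reason: `kv-error: ${err.message}` };
  }
  if (flag !== null) return { allow: false, status: 429, reason: 'Budget Exceeded' };
  return { allow: true, status: 200, reason: null };
}

// Worker-style handler: the KV check runs on every request; `upstream` stands in for the provider call.
async function handleProxyRequest(request, env, upstream = async () => new Response('ok', { status: 200 })) {
  const orgId = request.headers.get('x-atlasburn-org');  // assumption: how the org is identified
  const d = await decide(env, orgId);
  if (!d.allow) {
    return new Response(JSON.stringify({ error: d.reason }),
      { status: d.status, headers: { 'content-type': 'application/json' } });
  }
  return upstream(request);
}

// Walks the whole lifecycle against a controllable clock: allow, suspend, TTL expiry, manual clear, fail open.
async function runScenario() {
  let now = 1_000_000;
  const env = { GUARDRAIL_FLAGS: new FakeKV(() => now) };
  const org = 'org_demo';
  const before = await decide(env, org);
  await enforce(env, org, 'layer-2 dumb loop');
  const during = await decide(env, org);
  now += FLAG_TTL_SECONDS * 1000 - 1;
  const nearExpiry = await decide(env, org);
  now += 1;
  const afterExpiry = await decide(env, org);             // auto-recovery
  await enforce(env, org, 'layer-4 burst');
  await clearFlag(env, org);                             // manual override
  const afterClear = await decide(env, org);
  const broken = { GUARDRAIL_FLAGS: { get: async () => { throw new Error('kv unavailable'); } } };
  const failOpen = await decide(broken, org);            // Law 4
  return {
    before: before.allow, during: during.status, nearExpiry: nearExpiry.allow,
    afterExpiry: afterExpiry.allow, afterClear: afterClear.allow,
    failOpen: failOpen.allow, failOpenReason: failOpen.reason,
  };
}
```

Against real Cloudflare KV the `expirationTtl` is enforced by the service, so the TTL write is the entire auto-recovery mechanism; nothing needs to clean up expired flags. The fail-open branch should also increment a metric, because a silent KV outage is exactly the window in which a runaway agent runs unchecked.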
Provider Coverage
| Provider | Layer 2 (Exact Match) | Layer 4 (RPS Burst) |
|---|---|---|
| OpenAI | Reliable | Reliable |
| Anthropic | Reliable | Reliable |
| OpenRouter | Best-effort (tokenizer drift) | Primary safety net |
Ingestion API Endpoints
- POST /api/ingest — telemetry collection (SDK and Edge Proxy both post here)
- POST /api/guardrails/enforce — internal webhook that writes kill flags to Cloudflare KV. A call to it suspends an entire organization, so it must be reachable only by the ingestion server, never by external clients.
Next Steps
- AtlasBurn Edge Proxy — the enforcement layer
- Advanced Concepts — Monte Carlo forecasting and retry cascades