Telemetry Architecture
Summary
This page documents the internal architecture of AtlasBurn's telemetry pipeline — from SDK-level interception to the Forensic Ledger. Understanding this pipeline helps debug missing data, explain latency characteristics, and reason about reliability guarantees.
System maturity: Stable
Pipeline Overview
Telemetry reaches the ledger via one of two paths:
Path A — SDK Integration (client-side batching)
- SDK Interception — globalThis.fetch is monkey-patched in your runtime to capture AI API calls.
- Background Batch Flush — events are queued and posted asynchronously to /api/ingest.
- Normalization & Ledger — fields are normalized and written to the Forensic Ledger.
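The interception step of Path A can be sketched as a wrapper around a fetch-like function. This is a minimal illustration under stated assumptions, not the SDK's source: `instrumentFetch`, `FetchLike`, `ResponseLike`, `CapturedEvent`, and the `isAiCall` predicate are hypothetical names, and the small structural types stand in for the runtime's real fetch types.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface ResponseLike {
  status: number;
  clone(): ResponseLike;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init?: { method?: string; body?: string },
) => Promise<ResponseLike>;

// Hypothetical event shape; the real SDK's event fields may differ.
interface CapturedEvent {
  url: string;
  status: number;
  usage: unknown;
}

// Wrap a fetch implementation so AI API responses are observed after the
// original call completes. Telemetry failures are swallowed (fail-silent).
function instrumentFetch(
  original: FetchLike,
  isAiCall: (url: string) => boolean,
  enqueue: (event: CapturedEvent) => void,
): FetchLike {
  return async (url, init) => {
    const response = await original(url, init); // host call proceeds untouched
    try {
      if (isAiCall(url)) {
        // Read usage from a clone so the caller can still consume the body.
        response
          .clone()
          .json()
          .then((body) => {
            const usage = (body as { usage?: unknown } | null)?.usage ?? null;
            enqueue({ url, status: response.status, usage });
          })
          .catch(() => {
            // Fail-silent: the event is dropped, the host never sees an error.
          });
      }
    } catch {
      // Fail-silent for synchronous telemetry errors as well.
    }
    return response;
  };
}
```

In a real runtime the wrapper would be installed by reassigning `globalThis.fetch` to the wrapped function. The key design point is that the original response is returned unchanged and telemetry work happens off the caller's critical path.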
Path B — Edge Proxy Integration (server-side interception)
- Edge Interception — the Cloudflare Worker sits between your app and the provider, capturing requests and responses inline.
- Streaming Support — SSE streams pass through unmodified; usage metadata is extracted from the final frame.
- Synchronous Enforcement — guardrail kill flags are read from KV on the same request before forwarding upstream.
- Normalization & Ledger — same downstream pipeline as Path A.
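The synchronous enforcement step of Path B might look roughly like the sketch below. The `KillFlagStore` interface mirrors the shape of a Cloudflare KV text read (`get(key)` resolving to a string or `null`); the key layout `kill:{orgId}`, the flag value `"1"`, and the 429 rejection are assumptions made for illustration only.

```typescript
// Minimal shape of a KV text read: resolves to the stored string or null.
interface KillFlagStore {
  get(key: string): Promise<string | null>;
}

interface ProxyResult {
  status: number;
  body: string;
}

// Read the guardrail kill flag first; forward upstream only when it is not set.
async function enforceThenForward(
  store: KillFlagStore,
  orgId: string,
  forward: () => Promise<ProxyResult>,
): Promise<ProxyResult> {
  const flag = await store.get(`kill:${orgId}`); // hypothetical key layout
  if (flag === "1") {
    // Hypothetical rejection shape; the provider is never contacted.
    return { status: 429, body: "blocked by guardrail" };
  }
  return forward();
}
```

Because the flag is read on the same request, a blocked call never reaches the provider and therefore incurs no upstream cost.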
Event Lifecycle
| Stage | What Happens | Failure Behavior |
|---|---|---|
| Ephemeral ID Generation | Unique event ID created | Fail-silent, event dropped |
| Host Execution | Original LLM call proceeds normally | N/A — not affected by SDK |
| Post-Execution Extraction | Usage metadata pulled from response | Fail-silent, event dropped |
| Background Batch Flush | Events queued and sent asynchronously | 3 retries, then dropped |
Normalization Engine
Providers use different field names for the same data. The normalization engine resolves these into a single deterministic schema:
| Provider | Raw Field | Normalized To |
|---|---|---|
| OpenAI | prompt_tokens | inputTokens |
| Anthropic | input_tokens | inputTokens |
| Google Gemini | usageMetadata.promptTokenCount / usageMetadata.candidatesTokenCount | inputTokens / outputTokens |
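A mapping of this kind can be expressed as a small lookup table. The sketch below is illustrative rather than the engine's actual code: `normalizeUsage`, `FIELD_MAP`, and `NormalizedUsage` are hypothetical names, and the raw field names are the ones found in each provider's public response format (`usage.prompt_tokens` / `usage.completion_tokens`, `usage.input_tokens` / `usage.output_tokens`, and `usageMetadata.promptTokenCount` / `usageMetadata.candidatesTokenCount`).

```typescript
type Provider = "openai" | "anthropic" | "gemini";

interface NormalizedUsage {
  inputTokens: number;
  outputTokens: number;
}

// [input field, output field] inside each provider's usage object.
const FIELD_MAP: Record<Provider, readonly [string, string]> = {
  openai: ["prompt_tokens", "completion_tokens"],
  anthropic: ["input_tokens", "output_tokens"],
  gemini: ["promptTokenCount", "candidatesTokenCount"],
};

function asCount(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

// Returns null when the response carries no usable usage object.
function normalizeUsage(
  provider: Provider,
  body: Record<string, unknown>,
): NormalizedUsage | null {
  const container = provider === "gemini" ? body["usageMetadata"] : body["usage"];
  const usage = container as Record<string, unknown> | null | undefined;
  if (!usage) return null;
  const [inputField, outputField] = FIELD_MAP[provider];
  const inputTokens = asCount(usage[inputField]);
  const outputTokens = asCount(usage[outputField]);
  if (inputTokens === null || outputTokens === null) return null;
  return { inputTokens, outputTokens };
}
```

Keeping the mapping in data rather than in branching code makes the output deterministic and makes adding a provider a one-line change.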
Streaming Usage
Streamed responses historically did not include token counts. As of SDK v1.6.1 and the current Edge Proxy build, AtlasBurn injects stream_options.include_usage: true on OpenAI streaming chat-completion requests. The provider then emits a final SSE frame containing the real usage object, which is parsed into one canonical event.
When no usage frame is emitted — legacy streams, malformed responses, or providers that do not support the option — token counts are estimated from request size and the ledger entry is flagged est or ~est. Estimated events are real and useful for guardrails, but should be excluded from precise invoice reconciliation.
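The two cases above (a real usage frame versus an estimate) can be sketched as a single function over the raw SSE text. This is an assumption-laden illustration: `usageFromSse` and `StreamUsage` are hypothetical names, and the estimator is passed in as a parameter because the actual estimation formula is not documented here.

```typescript
interface StreamUsage {
  inputTokens: number;
  outputTokens: number;
  estimated: boolean; // true corresponds to the ledger's est / ~est flag
}

type Estimator = (requestBody: string) => { inputTokens: number; outputTokens: number };

// Scan every `data:` frame and keep the last one that carries a usage object.
// With stream_options.include_usage, that is the final frame before [DONE].
function usageFromSse(sseText: string, requestBody: string, estimate: Estimator): StreamUsage {
  let found: { input: number; output: number } | null = null;
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const chunk = JSON.parse(payload) as {
        usage?: { prompt_tokens?: unknown; completion_tokens?: unknown } | null;
      };
      const usage = chunk.usage;
      if (
        usage &&
        typeof usage.prompt_tokens === "number" &&
        typeof usage.completion_tokens === "number"
      ) {
        found = { input: usage.prompt_tokens, output: usage.completion_tokens };
      }
    } catch {
      // Malformed frame: ignore it and keep scanning.
    }
  }
  if (found) {
    return { inputTokens: found.input, outputTokens: found.output, estimated: false };
  }
  // No usage frame: fall back to the caller-supplied estimate and flag it.
  return { ...estimate(requestBody), estimated: true };
}
```

Downstream consumers can then filter on `estimated` when reconciling invoices while still feeding every event to guardrails.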
Storage Model
- API keys — never stored in plaintext; HMAC-SHA-256 hashed (O(1) lookup)
- Usage records — attributed to /organizations/{orgId}/usageRecords/
- Isolation — strict owner-based multi-tenancy enforced at the Firestore Security Rule level
- Ledger — append-only, server-authored. Clients write only via the ingest API; no entry is ever updated client-side.
- What we store — model, provider, input/output token counts, cost, timestamp, featureId. Never prompts, completions, end-user PII, or raw provider keys.
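The hashed-key lookup can be illustrated with Node's built-in `crypto` module. This is a sketch under assumptions: HMAC requires a secret, so a server-side `serverSecret` is assumed here, and the in-memory `Map` stands in for whatever datastore index is actually used. `hashApiKey`, `registerKey`, and `resolveOrg` are hypothetical names.

```typescript
import { createHmac } from "node:crypto";

// Deterministic keyed hash: the same API key always maps to the same digest,
// so the digest can serve as a lookup key and the plaintext is never stored.
function hashApiKey(apiKey: string, serverSecret: string): string {
  return createHmac("sha256", serverSecret).update(apiKey).digest("hex");
}

// Stand-in for the datastore: digest -> orgId. A Map lookup is O(1) on average.
const orgByKeyHash = new Map<string, string>();

function registerKey(apiKey: string, orgId: string, serverSecret: string): void {
  orgByKeyHash.set(hashApiKey(apiKey, serverSecret), orgId);
}

function resolveOrg(apiKey: string, serverSecret: string): string | undefined {
  return orgByKeyHash.get(hashApiKey(apiKey, serverSecret));
}
```

Because the digest is deterministic, authentication is a single indexed read rather than a scan over stored keys, and a leaked table of digests does not reveal the keys without the secret.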
Reliability Guarantees
- Telemetry is best-effort — the SDK prioritizes host application stability over telemetry completeness
- Background queue is capped at 200 events to prevent unbounded memory growth
- Retry budget: 3 attempts with exponential backoff
- Default batch flush interval: ~5 seconds
- If the queue overflows or retries are exhausted, events are dropped silently
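The guarantees above can be sketched as a bounded queue plus a retry loop. The cap of 200 and the budget of 3 attempts come from the list; everything else is an assumption for illustration, including the names `BoundedQueue`, `backoffDelayMs`, and `sendWithRetry`, the 500 ms base delay, and the choice to drop the newest event on overflow (the source does not say which event is dropped).

```typescript
const MAX_QUEUE = 200; // queue cap from the list above
const MAX_ATTEMPTS = 3; // retry budget from the list above

class BoundedQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number = MAX_QUEUE) {
    this.capacity = capacity;
  }

  // Returns false when the queue is full; the event is dropped silently.
  push(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Hand the current batch to the flusher and start a fresh queue.
  drain(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }

  get size(): number {
    return this.items.length;
  }
}

// Exponential backoff: the delay doubles after each failed attempt.
function backoffDelayMs(attempt: number, baseMs: number = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

// Try to send a batch; resolve false once the retry budget is exhausted.
async function sendWithRetry<T>(
  batch: T[],
  send: (batch: T[]) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await send(batch);
      return true;
    } catch {
      if (attempt < MAX_ATTEMPTS) await sleep(backoffDelayMs(attempt));
    }
  }
  return false; // budget exhausted: the batch is dropped, never rethrown
}
```

Neither `push` nor `sendWithRetry` throws to the caller, which is what keeps telemetry failures from reaching the host application.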
What This Means in Practice
The telemetry pipeline is designed for zero-impact integration. Your LLM calls always complete normally, even if the telemetry backend is entirely unreachable. The trade-off is best-effort delivery — events surface in the Forensic Ledger within roughly one batch interval (~5–10s) under normal network conditions.
Next Steps
- Cost Engine — how observed spend is calculated
- Security — key handling and isolation model
- Troubleshooting — diagnosing missing telemetry