Telemetry Architecture

Summary

This page documents the internal architecture of AtlasBurn's telemetry pipeline — from SDK-level interception to the Forensic Ledger. Understanding this pipeline helps debug missing data, explain latency characteristics, and reason about reliability guarantees.

System maturity: Stable

The telemetry pipeline is production-ready and has been audited against provider invoices.

Pipeline Overview

Telemetry reaches the ledger via one of two paths:

Path A — SDK Integration (client-side batching)

SDK Interception — globalThis.fetch is monkey-patched in your runtime to capture AI API calls.
Background Batch Flush — events are queued and posted asynchronously to /api/ingest.
Normalization & Ledger — fields are normalized and written to the Forensic Ledger.
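
The Path A steps above can be sketched roughly as follows. This is a minimal illustration, not AtlasBurn's SDK source: wrapFetch, isAiCall, TelemetryEvent, and the host list are hypothetical names introduced here. The only details taken from this page are that fetch is wrapped and that captured events go into a background queue.

```typescript
// Hypothetical sketch of SDK-style interception; all names are illustrative.
interface TelemetryEvent {
  url: string;
  status: number;
  startedAt: number;
  durationMs: number;
}

// Minimal structural types so the sketch does not depend on DOM or Node typings.
interface ResponseLike {
  status: number;
}
type FetchLike = (url: string, init?: unknown) => Promise<ResponseLike>;

// Assumed host list; the SDK's real matching rules are not documented here.
const AI_HOSTS = ["api.openai.com", "api.anthropic.com"];

function isAiCall(url: string): boolean {
  try {
    return AI_HOSTS.includes(new URL(url).hostname);
  } catch {
    return false; // relative or malformed URL: not treated as an AI call
  }
}

function wrapFetch(
  original: FetchLike,
  queue: TelemetryEvent[],
  now: () => number = Date.now,
): FetchLike {
  return async (url, init) => {
    const startedAt = now();
    // The host call always runs first and its result is returned untouched.
    const response = await original(url, init);
    try {
      if (isAiCall(url)) {
        queue.push({
          url,
          status: response.status,
          startedAt,
          durationMs: now() - startedAt,
        });
      }
    } catch {
      // Fail-silent: a telemetry error must never surface to the caller.
    }
    return response;
  };
}
```

In a real runtime the wrapper would be installed by assigning it over globalThis.fetch, and a separate timer would drain the queue to /api/ingest. Passing the original function and the queue in as arguments keeps the sketch testable without touching globals.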

Path B — Edge Proxy Integration (server-side interception)

Edge Interception — the Cloudflare Worker sits between your app and the provider, capturing requests and responses inline.
Streaming Support — SSE streams pass through unmodified; usage metadata is extracted from the final frame.
Synchronous Enforcement — guardrail kill flags are read from KV on the same request before forwarding upstream.
Normalization & Ledger — same downstream pipeline as Path A.
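
The synchronous enforcement step in Path B can be pictured as a single KV read that decides whether the request is forwarded. The sketch below is an assumption-heavy illustration: the key layout (kill:orgId:featureId), the flag value "1", and the 429 status are all hypothetical, and KvLike is a minimal stand-in for a Workers KV binding rather than the real type.

```typescript
// Minimal structural stand-in for a Workers KV binding (get returns text or null).
interface KvLike {
  get(key: string): Promise<string | null>;
}

type Decision =
  | { action: "forward" }
  | { action: "block"; status: number; reason: string };

// Hypothetical key layout; this page does not document how kill flags are keyed.
function killFlagKey(orgId: string, featureId: string): string {
  return `kill:${orgId}:${featureId}`;
}

// Reads the kill flag on the request path, before anything is sent upstream.
async function checkGuardrail(
  kv: KvLike,
  orgId: string,
  featureId: string,
): Promise<Decision> {
  const flag = await kv.get(killFlagKey(orgId, featureId));
  if (flag === "1") {
    // Assumed response shape for a blocked request.
    return { action: "block", status: 429, reason: "guardrail kill flag set" };
  }
  return { action: "forward" };
}
```

Because the read happens inline, its latency is added to every proxied request. What should happen when the KV read itself fails (fail open or fail closed) is a design choice this sketch deliberately leaves out.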

Event Lifecycle

Stage	What Happens	Failure Behavior
Ephemeral ID Generation	Unique event ID created	Fail-silent, event dropped
Host Execution	Original LLM call proceeds normally	N/A — not affected by SDK
Post-Execution Extraction	Usage metadata pulled from response	Fail-silent, event dropped
Background Batch Flush	Events queued and sent asynchronously	3 retries, then dropped

Normalization Engine

Providers use different field names for the same data. The normalization engine resolves these into a single deterministic schema:

Provider	Raw Field	Normalized To
OpenAI	prompt_tokens	inputTokens
Anthropic	input_tokens	inputTokens
Google Gemini	usageMetadata.promptTokenCount / usageMetadata.candidatesTokenCount	inputTokens / outputTokens
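
A table-driven mapping is one straightforward way to implement this kind of normalization. The sketch below is illustrative only: normalizeUsage, FIELD_MAP, and the NormalizedUsage type are hypothetical names, and the raw field names follow each provider's public response format rather than any AtlasBurn source.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

interface NormalizedUsage {
  inputTokens: number;
  outputTokens: number;
}

// Where each provider reports usage in its response body.
const FIELD_MAP: Record<Provider, { container: string; input: string; output: string }> = {
  openai: { container: "usage", input: "prompt_tokens", output: "completion_tokens" },
  anthropic: { container: "usage", input: "input_tokens", output: "output_tokens" },
  gemini: {
    container: "usageMetadata",
    input: "promptTokenCount",
    output: "candidatesTokenCount",
  },
};

// Returns null when usage is missing or malformed, so the caller can drop the
// event silently instead of throwing into the host application.
function normalizeUsage(provider: Provider, body: unknown): NormalizedUsage | null {
  if (typeof body !== "object" || body === null) return null;
  const fields = FIELD_MAP[provider];
  const usage = (body as Record<string, unknown>)[fields.container];
  if (typeof usage !== "object" || usage === null) return null;
  const input = (usage as Record<string, unknown>)[fields.input];
  const output = (usage as Record<string, unknown>)[fields.output];
  if (typeof input !== "number" || typeof output !== "number") return null;
  return { inputTokens: input, outputTokens: output };
}
```

For example, normalizeUsage("openai", { usage: { prompt_tokens: 12, completion_tokens: 5 } }) yields { inputTokens: 12, outputTokens: 5 }. Keeping the mapping in data rather than in branches means a new provider is a one-line addition.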

Streaming Usage

Streamed responses historically did not include token counts. As of SDK v1.6.1 and the current Edge Proxy build, AtlasBurn injects stream_options.include_usage: true on OpenAI streaming chat-completion requests. The provider then emits a final SSE frame containing the real usage object, which is parsed into one canonical event.

When no usage frame is emitted — legacy streams, malformed responses, or providers that do not support the option — token counts are estimated from request size and the ledger entry is flagged est or ~est. Estimated events are real and useful for guardrails, but should be excluded from precise invoice reconciliation.
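
The two cases above (a real usage frame versus an estimate) can be sketched as one function over the raw SSE text. This is a hedged illustration: usageFromStream and StreamUsage are hypothetical names, and the fallback of roughly four characters per token with zero output tokens is an assumed heuristic, since this page does not specify the estimation formula.

```typescript
interface StreamUsage {
  inputTokens: number;
  outputTokens: number;
  estimated: boolean; // true corresponds to an est-flagged ledger entry
}

// Assumed heuristic only; the real estimator is not documented on this page.
const ASSUMED_CHARS_PER_TOKEN = 4;

function usageFromStream(sseText: string, requestChars: number): StreamUsage {
  let found: StreamUsage | null = null;
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const usage = JSON.parse(payload)?.usage;
      if (
        usage &&
        typeof usage.prompt_tokens === "number" &&
        typeof usage.completion_tokens === "number"
      ) {
        // Later frames win, so the final usage frame is the one that is kept.
        found = {
          inputTokens: usage.prompt_tokens,
          outputTokens: usage.completion_tokens,
          estimated: false,
        };
      }
    } catch {
      // Malformed frame: skip it and keep scanning.
    }
  }
  if (found) return found;
  // No usage frame seen: fall back to a size-based estimate of the input only.
  return {
    inputTokens: Math.ceil(requestChars / ASSUMED_CHARS_PER_TOKEN),
    outputTokens: 0,
    estimated: true,
  };
}
```

A production proxy would parse frames incrementally while passing bytes through, rather than buffering the whole stream as this sketch does.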

Storage Model

API keys — never stored in plaintext; HMAC-SHA-256 hashed (O(1) lookup)
Usage records — attributed to /organizations/{orgId}/usageRecords/
Isolation — strict owner-based multi-tenancy enforced at the Firestore Security Rule level
Ledger — append-only, server-authored. Clients write only via the ingest API; no entry is ever updated client-side.
What we store — model, provider, input/output token counts, cost, timestamp, featureId. Never prompts, completions, end-user PII, or raw provider keys.

Reliability Guarantees

Telemetry is best-effort — the SDK prioritizes host application stability over telemetry completeness
The background queue is capped at 200 events to prevent unbounded memory growth
Retry budget: 3 attempts with exponential backoff
Default batch flush interval: ~5 seconds
If the queue overflows or retries are exhausted, events are dropped silently
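
The guarantees above can be sketched as a bounded queue plus a retrying flush. The cap of 200 and the budget of 3 attempts come from this page; everything else is an assumption made for illustration, including the names EventQueue, flush, and backoffDelayMs, the 500 ms base delay, and the choice to drop the newest event on overflow.

```typescript
const MAX_QUEUE = 200; // queue cap stated above
const MAX_ATTEMPTS = 3; // retry budget stated above
const BASE_DELAY_MS = 500; // assumed; the base delay is not documented here

class EventQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number = MAX_QUEUE) {
    this.capacity = capacity;
  }

  // Returns false when the event is dropped because the queue is full.
  push(event: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(event);
    return true;
  }

  // Removes and returns everything currently queued.
  drain(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }

  get size(): number {
    return this.items.length;
  }
}

// Exponential backoff: 500 ms, 1000 ms, 2000 ms for attempts 0, 1, 2.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** attempt;
}

// Resolves true if the batch was delivered, false if it was dropped.
async function flush<T>(
  queue: EventQueue<T>,
  send: (batch: T[]) => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  const batch = queue.drain();
  if (batch.length === 0) return true;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      await send(batch);
      return true;
    } catch {
      // Wait before the next attempt, but not after the final one.
      if (attempt < MAX_ATTEMPTS - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  return false; // retries exhausted: the batch is dropped silently
}
```

Note that flush never throws: a failed delivery is reported only through its return value, which matches the principle that telemetry problems must not reach the host application.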

What This Means in Practice

The telemetry pipeline is designed so that instrumentation never blocks or fails your LLM calls: they complete normally even if the telemetry backend is entirely unreachable. The trade-off is best-effort delivery. Under normal network conditions, events surface in the Forensic Ledger within roughly one to two batch intervals (~5–10s).

Next Steps

Cost Engine — how observed spend is calculated
Security — key handling and isolation model
Troubleshooting — diagnosing missing telemetry