Telemetry Architecture

Summary

This page documents the internal architecture of AtlasBurn's telemetry pipeline — from SDK-level interception to the Forensic Ledger. Understanding this pipeline helps debug missing data, explain latency characteristics, and reason about reliability guarantees.

System maturity: Stable

The telemetry pipeline is production-ready and has been audited against provider invoices.

Pipeline Overview

Telemetry reaches the ledger via one of two paths:

Path A — SDK Integration (client-side batching)

  1. SDK Interception — globalThis.fetch is monkey-patched in your runtime to capture AI API calls.
  2. Background Batch Flush — events are queued and posted asynchronously to /api/ingest.
  3. Normalization & Ledger — fields are normalized and written to the Forensic Ledger.
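
The interception step in Path A can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: the event shape (CapturedEvent), the host list (AI_HOSTS), and the helper names are all hypothetical, and the fetch types are reduced to structural stand-ins so the example does not depend on DOM or Node typings.

```typescript
// Hypothetical event shape; the real SDK's fields are not documented on this page.
type CapturedEvent = { id: string; host: string; status: number; durationMs: number };

// Minimal structural types standing in for the platform fetch signature.
type ResponseLike = { status: number };
type FetchLike = (url: string, init?: unknown) => Promise<ResponseLike>;

// Assumed list of provider hosts worth capturing.
const AI_HOSTS = ["api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com"];

// Extract the hostname from an absolute http(s) URL; returns null for anything else.
function hostOf(url: string): string | null {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : null;
}

function isAiApiCall(url: string): boolean {
  const host = hostOf(url);
  return host !== null && AI_HOSTS.indexOf(host) !== -1;
}

// Wrap a fetch implementation. The host call always runs first and its result
// (or error) is passed through untouched; failures while capturing are swallowed.
function wrapFetch(original: FetchLike, queue: CapturedEvent[]): FetchLike {
  return async (url, init) => {
    const startedAt = Date.now();
    const response = await original(url, init);
    try {
      const host = hostOf(url);
      if (host !== null && isAiApiCall(url)) {
        queue.push({
          id: Math.random().toString(36).slice(2), // ephemeral, not globally unique
          host,
          status: response.status,
          durationMs: Date.now() - startedAt,
        });
      }
    } catch {
      // Fail-silent: a telemetry bug must never break the host application.
    }
    return response;
  };
}
```

Installing a wrapper like this would amount to assigning the wrapped function back to globalThis.fetch once at startup; the queue is then drained by the background flush.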

Path B — Edge Proxy Integration (server-side interception)

  1. Edge Interception — the Cloudflare Worker sits between your app and the provider, capturing requests and responses inline.
  2. Streaming Support — SSE streams pass through unmodified; usage metadata is extracted from the final frame.
  3. Synchronous Enforcement — guardrail kill flags are read from KV on the same request before forwarding upstream.
  4. Normalization & Ledger — same downstream pipeline as Path A.
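
The synchronous enforcement step in Path B might look something like the sketch below. It assumes a key layout of one flag per organization ("kill:" plus the org ID), a flag value of "1", and a 429 response for blocked requests; none of these details are specified on this page. KvLike models only the get(key) call that Workers KV provides, so the example can run against an in-memory stand-in.

```typescript
// Minimal structural view of a KV namespace: get(key) resolves to a string or null.
type KvLike = { get(key: string): Promise<string | null> };
type UpstreamResult = { status: number; body: string };

// Hypothetical key layout: one kill flag per organization.
function killFlagKey(orgId: string): string {
  return "kill:" + orgId;
}

// Read the kill flag first; only call the provider when the flag is not set.
async function enforceThenForward(
  kv: KvLike,
  orgId: string,
  forward: () => Promise<UpstreamResult>,
): Promise<UpstreamResult> {
  const flag = await kv.get(killFlagKey(orgId));
  if (flag === "1") {
    // Blocked before any upstream call is made, so no provider cost is incurred.
    return { status: 429, body: "blocked by guardrail" };
  }
  return forward();
}
```

Because the check happens on the same request, a flag set in KV takes effect on the next call that reads it, without waiting for a batch flush.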

Event Lifecycle

| Stage | What Happens | Failure Behavior |
|---|---|---|
| Ephemeral ID Generation | Unique event ID created | Fail-silent, event dropped |
| Host Execution | Original LLM call proceeds normally | N/A (not affected by SDK) |
| Post-Execution Extraction | Usage metadata pulled from response | Fail-silent, event dropped |
| Background Batch Flush | Events queued and sent asynchronously | 3 retries, then dropped |

Normalization Engine

Providers use different field names for the same data. The normalization engine resolves these into a single deterministic schema:

| Provider | Raw Field | Normalized To |
|---|---|---|
| OpenAI | prompt_tokens | inputTokens |
| Anthropic | input_tokens | inputTokens |
| Google Gemini | usageMetadata | inputTokens / outputTokens |
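
A normalization step of this kind could be sketched as below. The function and type names are hypothetical. The raw field names follow the providers' public response formats (usage.prompt_tokens and usage.completion_tokens for OpenAI, usage.input_tokens and usage.output_tokens for Anthropic, usageMetadata.promptTokenCount and usageMetadata.candidatesTokenCount for Gemini); whether the engine reads exactly these fields is an assumption.

```typescript
type Provider = "openai" | "anthropic" | "gemini";
type NormalizedUsage = { provider: Provider; inputTokens: number; outputTokens: number };

// Read a numeric field defensively; missing or non-numeric values become 0.
function num(value: unknown): number {
  return typeof value === "number" && isFinite(value) ? value : 0;
}

// Map a provider-specific response body onto one schema.
function normalizeUsage(provider: Provider, body: any): NormalizedUsage {
  switch (provider) {
    case "openai": {
      const u = body?.usage ?? {};
      return { provider, inputTokens: num(u.prompt_tokens), outputTokens: num(u.completion_tokens) };
    }
    case "anthropic": {
      const u = body?.usage ?? {};
      return { provider, inputTokens: num(u.input_tokens), outputTokens: num(u.output_tokens) };
    }
    case "gemini": {
      const u = body?.usageMetadata ?? {};
      return { provider, inputTokens: num(u.promptTokenCount), outputTokens: num(u.candidatesTokenCount) };
    }
  }
}
```

The same input always yields the same output, which is what makes the schema deterministic: no field depends on arrival order or wall-clock time.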

Streaming Usage

Streamed responses historically did not include token counts. As of SDK v1.6.1 and the current Edge Proxy build, AtlasBurn injects stream_options.include_usage: true on OpenAI streaming chat-completion requests. The provider then emits a final SSE frame containing the real usage object, which is parsed into one canonical event.
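
As a rough sketch of the two halves of this mechanism: one helper adds stream_options.include_usage to a streaming request body, and another scans the SSE text for the frame that carries a non-null usage object. The helper names and the StreamUsage type are hypothetical, and the parser is simplified to work on a fully buffered string rather than an incremental stream.

```typescript
type StreamUsage = { inputTokens: number; outputTokens: number };

// Return a copy of the request body with usage reporting enabled for streaming calls.
function withUsageOption(body: Record<string, unknown>): Record<string, unknown> {
  if (body.stream !== true) return body; // non-streaming responses already carry usage
  const existing = (body.stream_options ?? {}) as Record<string, unknown>;
  return { ...body, stream_options: { ...existing, include_usage: true } };
}

// Scan SSE text for the last "data:" frame that carries a non-null usage object.
function extractStreamUsage(sseText: string): StreamUsage | null {
  let found: StreamUsage | null = null;
  for (const line of sseText.split(/\r?\n/)) {
    if (line.indexOf("data:") !== 0) continue;
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const usage = JSON.parse(payload).usage;
      if (usage && typeof usage.prompt_tokens === "number") {
        found = { inputTokens: usage.prompt_tokens, outputTokens: usage.completion_tokens ?? 0 };
      }
    } catch {
      // Malformed frame: skip it and keep scanning.
    }
  }
  return found;
}
```

A null result signals that no usage frame was seen, which is the case the estimation fallback covers.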

When no usage frame is emitted — legacy streams, malformed responses, or providers that do not support the option — token counts are estimated from request size and the ledger entry is flagged est or ~est. Estimated events are real and useful for guardrails, but should be excluded from precise invoice reconciliation.
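
The fallback can be illustrated with a small sketch. The four-characters-per-token ratio is a common rule of thumb and purely an assumption here; the actual estimator is not described on this page, nor is how output tokens are estimated, so this example leaves them at 0. The type and function names are hypothetical.

```typescript
type LedgerUsage = { inputTokens: number; outputTokens: number; estimated: boolean };

// Assumed heuristic: roughly four characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Prefer provider-reported usage; otherwise estimate from request size and flag the entry.
function resolveUsage(
  reported: { inputTokens: number; outputTokens: number } | null,
  requestText: string,
): LedgerUsage {
  if (reported !== null) return { ...reported, estimated: false };
  // Output estimation is not described in the source; left at 0 in this sketch.
  return { inputTokens: estimateTokens(requestText), outputTokens: 0, estimated: true };
}

// Invoice reconciliation should only sum entries backed by a real usage frame.
function reconcilableInputTokens(entries: LedgerUsage[]): number {
  return entries.filter((e) => !e.estimated).reduce((sum, e) => sum + e.inputTokens, 0);
}
```

Keeping the flag on the entry itself lets guardrails use every event while reconciliation filters on a single boolean.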

Storage Model

  • API keys — never stored in plaintext; HMAC-SHA-256 hashed (O(1) lookup)
  • Usage records — stored per organization under /organizations/{orgId}/usageRecords/
  • Isolation — strict owner-based multi-tenancy enforced at the Firestore Security Rule level
  • Ledger — append-only, server-authored. Clients write only via the ingest API; no entry is ever updated client-side.
  • What we store — model, provider, input/output token counts, cost, timestamp, featureId. Never prompts, completions, end-user PII, or raw provider keys.
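
To illustrate the API key bullet above: hashing a key with HMAC-SHA-256 and indexing by the digest gives constant-time lookup without keeping the plaintext. The sketch below uses Node's built-in crypto module; the server secret, the in-memory Map standing in for the datastore, and all function names are assumptions made for the example.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical server-side secret; in practice it would come from a secret store.
function hashApiKey(apiKey: string, serverSecret: string): string {
  return createHmac("sha256", serverSecret).update(apiKey).digest("hex");
}

// In-memory stand-in for a document store keyed by digest.
const orgByKeyDigest = new Map<string, string>();

function registerKey(apiKey: string, orgId: string, secret: string): void {
  orgByKeyDigest.set(hashApiKey(apiKey, secret), orgId); // the plaintext key is never stored
}

function resolveOrg(apiKey: string, secret: string): string | undefined {
  return orgByKeyDigest.get(hashApiKey(apiKey, secret)); // single keyed lookup
}
```

Because the digest is deterministic for a given secret, an incoming key can be hashed once and used directly as the lookup key, with no scan over stored records.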

Reliability Guarantees

  • Telemetry is best-effort — the SDK prioritizes host application stability over telemetry completeness
  • Background queue is capped at 200 events so that memory use stays bounded
  • Retry budget: 3 attempts with exponential backoff
  • Default batch flush interval: ~5 seconds
  • If the queue overflows or retries are exhausted, events are dropped silently
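
The guarantees in the list above can be sketched as a capped queue plus a retrying flush. The cap of 200 and the budget of 3 attempts come from the list; the 500 ms base delay, the choice to drop the newest event on overflow, and all names are assumptions, since this page does not specify them.

```typescript
const MAX_QUEUE = 200;     // queue cap from the list above
const MAX_ATTEMPTS = 3;    // retry budget from the list above
const BASE_DELAY_MS = 500; // assumed; the real base delay is not documented

class BoundedQueue<T> {
  private items: T[] = [];
  dropped = 0;

  // Returns false (and counts a drop) when the queue is full.
  enqueue(item: T): boolean {
    if (this.items.length >= MAX_QUEUE) {
      this.dropped++;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Remove and return everything currently queued.
  drain(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }

  get size(): number {
    return this.items.length;
  }
}

// Exponential backoff: 500 ms, 1000 ms, 2000 ms for attempts 1, 2, 3.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(2, attempt - 1);
}

// Try to send a batch up to MAX_ATTEMPTS times; resolves false if every attempt fails.
async function flushWithRetry<T>(
  batch: T[],
  send: (batch: T[]) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await send(batch);
      return true;
    } catch {
      if (attempt < MAX_ATTEMPTS) await sleep(backoffDelayMs(attempt));
    }
  }
  return false; // retries exhausted: the batch is dropped silently
}
```

Passing sleep in as a parameter keeps the retry loop testable without real timers; in a runtime it would typically wrap setTimeout.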

What This Means in Practice

The telemetry pipeline is designed for zero-impact integration. Your LLM calls always complete normally, even if the telemetry backend is entirely unreachable. The trade-off is best-effort delivery — events surface in the Forensic Ledger within roughly one batch interval (~5–10s) under normal network conditions.

Next Steps