feat(otel): `ax otel` — OTLP receiver coverage + freshness view by Necmttn · Pull Request #609 · Necmttn/ax

Necmttn · 2026-06-25T02:46:20Z

Closes #608.

Why

Comparing ax to latitude-llm's OTLP story surfaced a gap: ax's OTLP receiver was write-only. Telemetry lands in otel_metric_point / otel_log_event / otel_span and only enriches existing insights via telemetry_of — there was no CLI, no MCP, no way to answer "is telemetry even flowing, and is it being correlated to my sessions?"

(latitude scrapes full prompt/response/tool I/O from OTLP request bodies → cloud. ax deliberately keeps OTLP content-stripped — that content is already in turn.text / tool_call.input_json|output_json from transcript parsing, so capturing bodies would only duplicate + re-leak it. No change there.)

What

ax otel [--days=N] [--json]:

Signal health — per (harness, signal) all-time row count + freshness → verdict (✓ flowing <6h / ⚠ stale <48h / ✗ cold / · none).
Correlation coverage — share of windowed sessions carrying a telemetry_of edge. A live 0% loudly flags telemetry arriving but the correlation pass drawing no edges.
Cost cross-check — OTLP claude_code.cost.usage vs transcript cost over the window (per-event log token sums NOT surfaced — they double-count).
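
The verdict thresholds above can be sketched as a small pure helper. This is an illustration only, not the code in this PR: the names `Verdict`, `classifyFreshness`, and `GLYPH` are hypothetical, and the only facts taken from the description are the 6h / 48h cut-offs and the four glyphs.

```typescript
// Hypothetical sketch of the signal-health verdict; names are illustrative.
type Verdict = "flowing" | "stale" | "cold" | "none";

const HOUR_MS = 60 * 60 * 1000;

// rows: all-time row count for one (harness, signal) pair.
// lastSeenMs: timestamp of the newest row in epoch ms, or null if there is none.
function classifyFreshness(
  rows: number,
  lastSeenMs: number | null,
  nowMs: number,
): Verdict {
  if (rows === 0 || lastSeenMs === null) return "none";
  const ageMs = nowMs - lastSeenMs;
  if (ageMs < 6 * HOUR_MS) return "flowing"; // fresher than 6h
  if (ageMs < 48 * HOUR_MS) return "stale"; // fresher than 48h
  return "cold"; // 48h or older
}

// Glyphs as shown in the legend of the CLI output.
const GLYPH: Record<Verdict, string> = {
  flowing: "✓",
  stale: "⚠",
  cold: "✗",
  none: "·",
};
```

Under this reading a signal last seen exactly 2d ago is `cold` rather than `stale`, which matches the ✗ rows in the live output below.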

Live output (this machine — receiver currently stale, correlation broken)

signal                        rows    last
✗ claude/metric             16,700      2d
✗ codex/log              1,546,147      2d
  ✓ flowing (<6h)   ⚠ stale (<48h)   ✗ cold   · none

correlation: 0/1,501 sessions linked (0%)  [14d]
  ⚠ telemetry is arriving but 0% is correlated to sessions - the telemetry_of pass is drawing no edges (check session.id matching)

cost [14d]: otlp $2542.99 (claude cost metric)   ·   transcript $20674.40 (all sources)

The view immediately earns its keep — telemetry_of has 0 edges across 5,355 sessions. Filed as follow-up.

ship-checklist

B on-demand: ax otel CLI + --json. ✓
C proactive: MCP tool otel (3 roster pins updated). ✓ · ax improve generator / dojo item — not wired (health is a check-on-demand, not a recurring proposal). Skipped intentionally.
D docs: CLAUDE.md, llms.txt, cli-reference, VISIBLE_COMMANDS. ✓
F verify: 21 unit assertions (pure health/coverage helpers + render warn paths) + MCP roster tests; live-verified above; tsc 0 errors.
Deferred: ax doctor OTLP-freshness line (doctor is runtime:"none", no DB — health lives in the command); studio trace waterfall.

🤖 Generated with Claude Code

The OTLP receiver was write-only: harness telemetry landed in `otel_metric_point` / `otel_log_event` / `otel_span` and only enriched existing insights via `telemetry_of`, with no surface to inspect whether telemetry was even flowing or being correlated.

`ax otel [--days=N] [--json]` adds the read path:
- per (harness, signal) all-time row count + freshness → health verdict (✓ flowing <6h / ⚠ stale <48h / ✗ cold / · none);
- session correlation coverage (share of windowed sessions carrying a telemetry_of edge): a live 0% loudly flags telemetry arriving but the correlation pass drawing no edges;
- OTLP claude cost metric vs transcript cost as an independent cross-check (per-event log token sums intentionally not surfaced: they double-count).

Read-only `db` query (deref-free; signals all-time, coverage+cost windowed), MCP tool `otel`, and the 4 doc gates (CLAUDE.md, llms.txt, cli-reference, VISIBLE_COMMANDS).

OTLP stays content-stripped on purpose: the prompts/tool I/O another tool would scrape from request bodies are already in turn/tool_call from transcript parsing.

Deferred: `ax doctor` OTLP nudge (doctor is runtime:"none", no DB) and a studio trace waterfall.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-06-25T02:47:49Z

Deploying ax with Cloudflare Pages

Latest commit:	`b45db53`
Status:	✅ Deploy successful!
Preview URL:	https://1af94773.ax-62d.pages.dev
Branch Preview URL:	https://feat-608-feat-otel-ax-otel-o.ax-62d.pages.dev

… edge

The first `ax otel` cut counted `telemetry_of` edges for coverage, which read a hard 0% - the edge is empty (separate bug, #610). But the edge is also the wrong thing to measure: no enrichment query reads it (telemetry-rollup.ts joins `session_id` directly), so coverage should too.

Coverage is now: share of windowed TOP-LEVEL sessions whose uuid matches an otel `session_id`. otel stores a bare uuid; `session.id` is `session:⟨uuid⟩`, so a `bareUuid` helper compares uuids in JS. Subagents (`*-subagent` ids, no uuid) are excluded - OTLP is emitted at the top-level session, never per-subagent.

Live: 153/279 top-level sessions (54.8%), a true number, replacing the edge-dependent 0%. Docs (CLAUDE.md / llms.txt / cli-reference) updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
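
As a rough illustration of the uuid-to-uuid coverage computation described in this commit, here is a minimal sketch. The commit names a `bareUuid` helper, but its real signature and body are not shown; the regex-based version below, the `coverage` function, and its return shape are assumptions made for the example.

```typescript
// Assumed implementation: pull the first uuid-shaped substring out of an id.
// Works for a bare uuid and for a `session:⟨uuid⟩` style record id alike.
const UUID_RE =
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i;

function bareUuid(id: string): string | null {
  const match = UUID_RE.exec(id);
  return match ? match[0].toLowerCase() : null;
}

// sessionIds: ids of windowed sessions (top-level and subagent).
// otelSessionIds: `session_id` values seen in otel rows (bare uuids).
function coverage(
  sessionIds: readonly string[],
  otelSessionIds: readonly string[],
): { linked: number; total: number; pct: number } {
  const withTelemetry = new Set<string>();
  for (const id of otelSessionIds) {
    const uuid = bareUuid(id);
    if (uuid !== null) withTelemetry.add(uuid);
  }

  let total = 0;
  let linked = 0;
  for (const id of sessionIds) {
    const uuid = bareUuid(id);
    if (uuid === null) continue; // ids without a uuid (subagents) are excluded
    total += 1;
    if (withTelemetry.has(uuid)) linked += 1;
  }
  const pct = total === 0 ? 0 : (linked / total) * 100;
  return { linked, total, pct };
}
```

With the live figures above, 153 linked out of 279 top-level sessions gives `(153 / 279) * 100`, which rounds to the reported 54.8%.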

…emental (#610)

The pass drew ZERO edges: `type::record("session:" + session_id)` evaluated the concat as arithmetic for hyphenated uuids -> `session:019fbf3f` (dropped everything after the first hyphen), so the IN-check never matched. otel `session_id` is a bare uuid while `session.id` is the escaped `session:⟨uuid⟩` record, so we now match uuid-to-uuid in JS instead of trusting type::record.

Also reshaped to be cheap on the ingest hot path (this runs after EVERY ingest, including the watcher's `--since=1`):
- SESSION-GRAIN: one edge per top-level session that has telemetry (the edge means "session has telemetry"; no data query reads it - enrichment joins session_id directly), not one per otel row. Codex emits ~1.5M log rows for a few hundred sessions; row-grain would write millions of edges nobody consumes.
- INCREMENTAL: candidates = existing-but-unlinked sessions, probed with `session_id IN [...]` over the schema's session_id index, chunked at 500 (the telemetry-rollup.ts pattern). The old full `GROUP BY session_id` enumerated all 1.5M log rows (~8s) on every ingest; this scales with new sessions only.
- idempotency is the in-memory `linked` set (drives candidates), replacing the per-row `count(<-telemetry_of)=0` graph traversal.

Live (session-grain): 155 edges, idempotent re-run adds 0. Coverage itself reads session_id directly (PR #609) and never depended on this edge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
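
The incremental candidate selection and 500-wide chunking from this commit could look roughly like the sketch below. It is not the code from the repository: `chunk`, `candidates`, and `planProbes` are hypothetical names, and the database call that would run each `session_id IN [...]` probe is deliberately left out because its API is not shown here.

```typescript
// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Candidates = sessions that exist but have no telemetry_of edge yet.
// `linked` plays the role of the in-memory idempotency set.
function candidates(
  existing: readonly string[],
  linked: ReadonlySet<string>,
): string[] {
  return existing.filter((uuid) => !linked.has(uuid));
}

// Each inner array would become one `session_id IN [...]` probe.
// 500 is the chunk size quoted in the commit message.
function planProbes(
  existing: readonly string[],
  linked: ReadonlySet<string>,
  chunkSize = 500,
): string[][] {
  return chunk(candidates(existing, linked), chunkSize);
}
```

Because already-linked sessions never re-enter the candidate list, running the plan a second time with an updated `linked` set yields no probes for them, which is the idempotent behaviour the commit reports ("re-run adds 0").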

The verify gate requires every visible subcommand to appear in README.md or docs/cli.md. Add the OTLP receiver health section + the `otel` MCP tool entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…emental (#611)

* fix(otel): correlate telemetry_of by uuid match, session-grain + incremental (#610)

The pass drew ZERO edges: `type::record("session:" + session_id)` evaluated the concat as arithmetic for hyphenated uuids -> `session:019fbf3f` (dropped everything after the first hyphen), so the IN-check never matched. otel `session_id` is a bare uuid while `session.id` is the escaped `session:⟨uuid⟩` record, so we now match uuid-to-uuid in JS instead of trusting type::record.

Also reshaped to be cheap on the ingest hot path (this runs after EVERY ingest, including the watcher's `--since=1`):
- SESSION-GRAIN: one edge per top-level session that has telemetry (the edge means "session has telemetry"; no data query reads it - enrichment joins session_id directly), not one per otel row. Codex emits ~1.5M log rows for a few hundred sessions; row-grain would write millions of edges nobody consumes.
- INCREMENTAL: candidates = existing-but-unlinked sessions, probed with `session_id IN [...]` over the schema's session_id index, chunked at 500 (the telemetry-rollup.ts pattern). The old full `GROUP BY session_id` enumerated all 1.5M log rows (~8s) on every ingest; this scales with new sessions only.
- idempotency is the in-memory `linked` set (drives candidates), replacing the per-row `count(<-telemetry_of)=0` graph traversal.

Live (session-grain): 155 edges, idempotent re-run adds 0. Coverage itself reads session_id directly (PR #609) and never depended on this edge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(otel): window correlation by observed_at + CONCURRENTLY index builds

Two follow-ups after live-verifying the first cut of this PR:

1. The "candidates = existing - linked" scan was actually slower (32s cold / 3.4s warm): transcript-only sessions never get telemetry, so they stay candidates and get re-probed every ingest forever - O(all sessions), not O(new). Replaced with a windowed scan: only telemetry observed in the last 2 days, over the `observed_at` index (range scan, ~30ms), then filtered to existing + unlinked sessions in JS. OTLP arrives with its transcript, so recent telemetry is exactly what a fresh ingest needs to link. (Keep the WHERE observed_at-only - a leading `session_id != NONE` defeated the index and full-scanned, 8s.)
2. Root-caused every DB wedge in this work: a plain `DEFINE INDEX` takes a TABLE LOCK while building, so re-applying the schema (`ax install`) onto an already-large otel_log_event (~1.5M codex log rows) wedges the daemon. All otel index builds are now `CONCURRENTLY` (background build, no lock). Added matching `observed_at` indexes for the windowed scan above.

Live: 2d window = ~30ms (index used); a 5d replay over real data created 33 correct edges (proper escaping), confirming the relate path. 7 unit tests + 83 otel/schema tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
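
The JS-side half of the windowed scan (filtering recent telemetry down to existing, not-yet-linked sessions) might be sketched as follows. Everything here is an assumption for illustration: the `OtelRow` shape, the epoch-millisecond `observed_at`, and the function names `windowStart` and `sessionsToLink` do not come from the repository. Only the 2-day window and the "filter by `observed_at` in the database, filter by session in JS" split come from the commit message.

```typescript
// Hypothetical row shape: just the two fields the commit message mentions.
interface OtelRow {
  session_id: string | null;
  observed_at: number; // epoch ms (assumed; the real column type is not shown)
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Lower bound for the `observed_at` range scan; 2 days per the commit message.
function windowStart(nowMs: number, windowDays = 2): number {
  return nowMs - windowDays * DAY_MS;
}

// `recentRows` is assumed to be already restricted by the database using
// observed_at only, so the index is used. Session filtering happens here.
function sessionsToLink(
  recentRows: readonly OtelRow[],
  existing: ReadonlySet<string>,
  linked: ReadonlySet<string>,
): string[] {
  const out = new Set<string>();
  for (const row of recentRows) {
    const id = row.session_id;
    if (id === null) continue; // telemetry without a session id
    if (!existing.has(id)) continue; // no matching session ingested
    if (linked.has(id)) continue; // edge already drawn, keeps re-runs idempotent
    out.add(id); // the Set collapses many rows to one entry per session
  }
  return Array.from(out);
}
```

Doing the null check in JS rather than in the query mirrors the note above about keeping the WHERE clause `observed_at`-only.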

Necmttn mentioned this pull request Jun 25, 2026

bug(otel): telemetry_of correlation draws 0 edges (0% coverage despite 1.5M+ OTLP rows) #610

Closed

Necmttn mentioned this pull request Jun 25, 2026

fix(otel): correlate telemetry_of by uuid match, session-grain + incremental #611

Merged

Necmttn merged commit dc4267b into main Jun 25, 2026
3 checks passed

Necmttn mentioned this pull request Jun 25, 2026

chore(main): release 0.36.0 #586

Open

Necmttn merged 3 commits into main from feat/608-feat-otel-ax-otel-otlp-coverage
