feat(otel): ax otel — OTLP receiver coverage + freshness view#609

Merged
Necmttn merged 3 commits into main from feat/608-feat-otel-ax-otel-otlp-coverage
Jun 25, 2026

@Necmttn Necmttn commented Jun 25, 2026

Owner

Closes #608.

Why

Comparing ax to latitude-llm's OTLP story surfaced a gap: ax's OTLP receiver was write-only. Telemetry lands in otel_metric_point / otel_log_event / otel_span and only enriches existing insights via telemetry_of — there was no CLI, no MCP, no way to answer "is telemetry even flowing, and is it being correlated to my sessions?"

(latitude scrapes full prompt/response/tool I/O from OTLP request bodies → cloud. ax deliberately keeps OTLP content-stripped — that content is already in turn.text / tool_call.input_json|output_json from transcript parsing, so capturing bodies would only duplicate + re-leak it. No change there.)

What

ax otel [--days=N] [--json]:

  • Signal health — per (harness, signal) all-time row count + freshness → verdict (✓ flowing <6h / ⚠ stale <48h / ✗ cold / · none).
  • Correlation coverage — share of windowed sessions carrying a telemetry_of edge. A live 0% loudly flags telemetry arriving but the correlation pass drawing no edges.
  • Cost cross-check — OTLP claude_code.cost.usage vs transcript cost over the window (per-event log token sums NOT surfaced — they double-count).
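The freshness verdict in the first bullet can be sketched as a small pure function. This is a minimal illustration, not the PR's actual helper: the names `signalVerdict`, `Verdict`, and `GLYPH` are hypothetical, and only the thresholds (<6h, <48h) and the four glyphs come from the description above.

```typescript
// Hypothetical names; only the thresholds and glyphs are taken from the PR text.
type Verdict = "flowing" | "stale" | "cold" | "none";

const HOUR_MS = 60 * 60 * 1000;

// Classify one (harness, signal) pair from its all-time row count and
// the timestamp of its newest row (null when the table has no rows).
function signalVerdict(rows: number, lastSeenMs: number | null, nowMs: number): Verdict {
  if (rows === 0 || lastSeenMs === null) return "none";
  const ageMs = nowMs - lastSeenMs;
  if (ageMs < 6 * HOUR_MS) return "flowing";
  if (ageMs < 48 * HOUR_MS) return "stale";
  return "cold";
}

// Glyphs as printed in the legend of the CLI output.
const GLYPH: Record<Verdict, string> = {
  flowing: "✓",
  stale: "⚠",
  cold: "✗",
  none: "·",
};
```

Under these assumptions a signal last seen exactly two days ago falls outside the 48h bound and reads as cold.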

Live output (this machine: receiver currently cold, correlation broken)

signal                        rows    last
✗ claude/metric             16,700      2d
✗ codex/log              1,546,147      2d
  ✓ flowing (<6h)   ⚠ stale (<48h)   ✗ cold   · none

correlation: 0/1,501 sessions linked (0%)  [14d]
  ⚠ telemetry is arriving but 0% is correlated to sessions - the telemetry_of pass is drawing no edges (check session.id matching)

cost [14d]: otlp $2542.99 (claude cost metric)   ·   transcript $20674.40 (all sources)

The view immediately earns its keep — telemetry_of has 0 edges across 5,355 sessions. Filed as follow-up.

ship-checklist

  • B on-demand: ax otel CLI + --json. ✓
  • C proactive: MCP tool otel (3 roster pins updated). ✓ · ax improve generator / dojo item — not wired (health is a check-on-demand, not a recurring proposal). Skipped intentionally.
  • D docs: CLAUDE.md, llms.txt, cli-reference, VISIBLE_COMMANDS. ✓
  • F verify: 21 unit assertions (pure health/coverage helpers + render warn paths) + MCP roster tests; live-verified above; tsc 0 errors.
  • Deferred: ax doctor OTLP-freshness line (doctor is runtime:"none", no DB — health lives in the command); studio trace waterfall.

🤖 Generated with Claude Code

The OTLP receiver was write-only: harness telemetry landed in
`otel_metric_point` / `otel_log_event` / `otel_span` and only enriched
existing insights via `telemetry_of`, with no surface to inspect whether
telemetry was even flowing or being correlated.

`ax otel [--days=N] [--json]` adds the read path:
- per (harness, signal) all-time row count + freshness → health verdict
  (✓ flowing <6h / ⚠ stale <48h / ✗ cold / · none);
- session correlation coverage (share of windowed sessions carrying a
  telemetry_of edge) — a live 0% loudly flags telemetry arriving but the
  correlation pass drawing no edges;
- OTLP claude cost metric vs transcript cost as an independent cross-check
  (per-event log token sums intentionally not surfaced — they double-count).

Read-only `db` query (deref-free; signals all-time, coverage+cost windowed),
MCP tool `otel`, and the 4 doc gates (CLAUDE.md, llms.txt, cli-reference,
VISIBLE_COMMANDS). OTLP stays content-stripped on purpose: the prompts/tool I/O
another tool would scrape from request bodies are already in turn/tool_call
from transcript parsing.

Deferred: `ax doctor` OTLP nudge (doctor is runtime:"none", no DB) and a studio
trace waterfall.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Jun 25, 2026

Deploying ax with Cloudflare Pages

Latest commit: b45db53
Status: ✅  Deploy successful!
Preview URL: https://1af94773.ax-62d.pages.dev
Branch Preview URL: https://feat-608-feat-otel-ax-otel-o.ax-62d.pages.dev

… edge

The first `ax otel` cut counted `telemetry_of` edges for coverage, which read a
hard 0% - the edge is empty (separate bug, #610). But the edge is also the wrong
thing to measure: no enrichment query reads it (telemetry-rollup.ts joins
`session_id` directly), so coverage should too.

Coverage is now: share of windowed TOP-LEVEL sessions whose uuid matches an otel
`session_id`. otel stores a bare uuid; `session.id` is `session:⟨uuid⟩`, so a
`bareUuid` helper compares uuids in JS. Subagents (`*-subagent` ids, no uuid)
are excluded - OTLP is emitted at the top-level session, never per-subagent.
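The uuid-to-uuid comparison described above might look roughly like the following. This is a sketch under assumptions: `bareUuid` is named in the commit message, but its real signature is not shown, so the regex-based body here is a guess, and `coverage` is a hypothetical helper. It also assumes, as the text states, that subagent ids contain no uuid.

```typescript
// Assumed shape: a canonical 8-4-4-4-12 hex uuid somewhere inside the record id.
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i;

// Extract the bare uuid from a record id such as "session:⟨<uuid>⟩".
// Returns null for ids without a uuid (e.g. "*-subagent" ids).
function bareUuid(recordId: string): string | null {
  const match = UUID_RE.exec(recordId);
  return match ? match[0].toLowerCase() : null;
}

// Share of top-level sessions whose uuid also appears as an otel session_id.
// `coverage` is a hypothetical name, not the PR's function.
function coverage(sessionIds: readonly string[], otelSessionIds: readonly string[]) {
  const withTelemetry = new Set(otelSessionIds.map((id) => id.toLowerCase()));
  let topLevel = 0;
  let linked = 0;
  for (const id of sessionIds) {
    const uuid = bareUuid(id);
    if (uuid === null) continue; // no uuid: treated as a subagent, excluded
    topLevel++;
    if (withTelemetry.has(uuid)) linked++;
  }
  const pct = topLevel === 0 ? 0 : (linked / topLevel) * 100;
  return { linked, topLevel, pct };
}
```

Comparing in application code sidesteps any question of how the database parses or escapes a hyphenated record id.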

Live: 153/279 top-level sessions (54.8%), a true number, replacing the
edge-dependent 0%. Docs (CLAUDE.md / llms.txt / cli-reference) updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Necmttn added a commit that referenced this pull request Jun 25, 2026
…emental (#610)

The pass drew ZERO edges: `type::record("session:" + session_id)` evaluated the
concat as arithmetic for hyphenated uuids -> `session:019fbf3f` (dropped
everything after the first hyphen), so the IN-check never matched. otel
`session_id` is a bare uuid while `session.id` is the escaped `session:⟨uuid⟩`
record, so we now match uuid-to-uuid in JS instead of trusting type::record.

Also reshaped to be cheap on the ingest hot path (this runs after EVERY ingest,
including the watcher's `--since=1`):
- SESSION-GRAIN: one edge per top-level session that has telemetry (the edge
  means "session has telemetry"; no data query reads it - enrichment joins
  session_id directly), not one per otel row. Codex emits ~1.5M log rows for a
  few hundred sessions; row-grain would write millions of edges nobody consumes.
- INCREMENTAL: candidates = existing-but-unlinked sessions, probed with
  `session_id IN [...]` over the schema's session_id index, chunked at 500 (the
  telemetry-rollup.ts pattern). The old full `GROUP BY session_id` enumerated all
  1.5M log rows (~8s) on every ingest; this scales with new sessions only.
- idempotency is the in-memory `linked` set (drives candidates), replacing the
  per-row `count(<-telemetry_of)=0` graph traversal.
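The incremental, chunked probe and the in-memory `linked` set can be illustrated as follows. This is a sketch only: `chunk`, `linkNewSessions`, and the `probe` callback are hypothetical stand-ins (the real pass issues a `session_id IN [...]` query against the database, which would be asynchronous); the chunk size of 500 is the one quoted above.

```typescript
// Split ids into fixed-size batches so each IN-list probe stays bounded.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// `probe` is a hypothetical stand-in for the indexed query: given a batch of
// session uuids, it returns the subset that has telemetry. It is synchronous
// here purely to keep the sketch short.
function linkNewSessions(
  existing: readonly string[],
  linked: Set<string>,
  probe: (uuids: string[]) => string[],
  chunkSize = 500,
): string[] {
  // Candidates are sessions that exist but have no edge yet.
  const candidates = existing.filter((uuid) => !linked.has(uuid));
  const created: string[] = [];
  for (const batch of chunk(candidates, chunkSize)) {
    for (const uuid of probe(batch)) {
      if (linked.has(uuid)) continue; // already linked: keeps re-runs idempotent
      linked.add(uuid);
      created.push(uuid); // one edge per session, not per otel row
    }
  }
  return created;
}
```

Running it twice over the same inputs creates edges on the first call and none on the second, which is the idempotency property the commit message reports.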

Live (session-grain): 155 edges, idempotent re-run adds 0. Coverage itself reads
session_id directly (PR #609) and never depended on this edge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The verify gate requires every visible subcommand to appear in README.md or
docs/cli.md. Add the OTLP receiver health section + the `otel` MCP tool entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Necmttn Necmttn merged commit dc4267b into main Jun 25, 2026
3 checks passed
Necmttn added a commit that referenced this pull request Jun 25, 2026
…emental (#611)

* fix(otel): correlate telemetry_of by uuid match, session-grain + incremental (#610)

* fix(otel): window correlation by observed_at + CONCURRENTLY index builds

Two follow-ups after live-verifying the first cut of this PR:

1. The "candidates = existing - linked" scan was actually slower (32s cold / 3.4s
   warm): transcript-only sessions never get telemetry, so they stay candidates
   and get re-probed every ingest forever - O(all sessions), not O(new).
   Replaced with a windowed scan: only telemetry observed in the last 2 days,
   over the `observed_at` index (range scan, ~30ms), then filtered to existing +
   unlinked sessions in JS. OTLP arrives with its transcript, so recent telemetry
   is exactly what a fresh ingest needs to link. (Keep the WHERE clause on
   `observed_at` only: a leading `session_id != NONE` predicate defeated the
   index and forced a full scan, 8s.)

2. Root-caused every DB wedge in this work: a plain `DEFINE INDEX` takes a TABLE
   LOCK while building, so re-applying the schema (`ax install`) onto an already-
   large otel_log_event (~1.5M codex log rows) wedges the daemon. All otel index
   builds are now `CONCURRENTLY` (background build, no lock). Added matching
   `observed_at` indexes for the windowed scan above.
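The windowed scan from item 1 can be sketched in plain TypeScript. The names `TelemetryRow`, `WINDOW_MS`, and `sessionsToLink` are hypothetical; in the real pass the time filter is the database's range scan over the `observed_at` index, and only the "existing + unlinked" filtering happens in JS.

```typescript
// Hypothetical row shape: a bare session uuid plus an observation timestamp.
interface TelemetryRow {
  sessionId: string;
  observedAtMs: number;
}

// Two-day window, as described in the commit message.
const WINDOW_MS = 2 * 24 * 60 * 60 * 1000;

// Return the distinct sessions that have recent telemetry, exist in the
// session table, and do not yet carry an edge.
function sessionsToLink(
  rows: readonly TelemetryRow[],
  existing: Set<string>,
  linked: Set<string>,
  nowMs: number,
): string[] {
  const cutoff = nowMs - WINDOW_MS;
  const out = new Set<string>();
  for (const row of rows) {
    // Stand-in for the indexed range scan; filtering on time alone mirrors
    // the "observed_at only" WHERE clause.
    if (row.observedAtMs < cutoff) continue;
    if (existing.has(row.sessionId) && !linked.has(row.sessionId)) {
      out.add(row.sessionId);
    }
  }
  return Array.from(out);
}
```

Because the candidate set is derived from recent telemetry rather than from all unlinked sessions, transcript-only sessions never enter it, which is what removes the O(all sessions) re-probe.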

Live: 2d window = ~30ms (index used); a 5d replay over real data created 33
correct edges (proper escaping), confirming the relate path. 7 unit tests + 83
otel/schema tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

feat(otel): ax otel — OTLP coverage + freshness view