telemetry: add last-operation byte gauges to exporters #1824

Open
e-eygin wants to merge 2 commits into ai-dynamo:main from e-eygin:telemetry-last-op-byte-gauges
@e-eygin e-eygin commented Jun 24, 2026

What?

Adds two purely-additive last-operation gauges to both the Prometheus and DOCA
telemetry exporters:

  • agent_tx_bytes_last — byte size of the most recent TX request
  • agent_rx_bytes_last — byte size of the most recent RX request

These sit alongside the existing cumulative *_total byte counters
(agent_tx_bytes / agent_rx_bytes): the counter keeps reporting the running
total, while the _last gauge reports only the latest operation.

Implementation:

  • Prometheus (prometheus_exporter.{h,cpp}): registerGauge now separates
    the lookup key (event name) from the exposed metric name, registering
    agent_tx_bytes → agent_tx_bytes_last and agent_rx_bytes → agent_rx_bytes_last.
    exportEvent() is unchanged (it already sets the gauge keyed by event name).
  • DOCA (doca_exporter.cpp): byte events now emit BOTH a cumulative counter
    and a last-operation gauge (dispatch was previously counter-xor-gauge);
    appendGaugeSample takes an explicit metric-name argument so the gauge can be
    named *_last while keyed off the byte event.
  • Tests + docs (Prometheus/DOCA READMEs, docs/telemetry.md) updated in the
    same PR.
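
A minimal sketch of the key/name split described for the Prometheus exporter, using plain standard-library containers rather than the real Prometheus client. The `Gauge` struct, the `GaugeRegistry` class, and the `scrape()` helper are hypothetical stand-ins; only the idea (look up by event name, expose under a separate metric name) comes from the description above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a Prometheus gauge: it only remembers the last value set.
struct Gauge {
    std::string metric_name; // name exposed to the scraper
    double value;
};

// Simplified registry: gauges are found by event name but exposed under metric_name.
class GaugeRegistry {
public:
    void registerGauge(const std::string &event_name, const std::string &metric_name) {
        Gauge gauge;
        gauge.metric_name = metric_name;
        gauge.value = 0.0;
        gauges_[event_name] = gauge;
    }

    // Sets the gauge keyed by event name; returns false if none is registered.
    bool exportEvent(const std::string &event_name, double value) {
        std::map<std::string, Gauge>::iterator it = gauges_.find(event_name);
        if (it == gauges_.end()) {
            return false;
        }
        it->second.value = value; // a gauge overwrites, it never accumulates
        return true;
    }

    // Returns the value exposed under metric_name, or -1.0 if no such series exists.
    double scrape(const std::string &metric_name) const {
        std::map<std::string, Gauge>::const_iterator it;
        for (it = gauges_.begin(); it != gauges_.end(); ++it) {
            if (it->second.metric_name == metric_name) {
                return it->second.value;
            }
        }
        return -1.0;
    }

private:
    std::map<std::string, Gauge> gauges_; // keyed by event name
};
```

With this shape, the export path can stay keyed on `agent_tx_bytes` while a scraper sees the series as `agent_tx_bytes_last`.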

Why?

The byte events were routed through the counter path only, so there was no way
to see the size of the latest transfer, only the lifetime total. A PromQL
idelta(*_total[...]) derivation answers a different question (bytes added
between the last two scraped samples) and is wrong whenever the op-rate exceeds
the scrape-rate: NIXL batches/flushes events (~100 ms), so a single scrape
interval can cover several operations. An explicit _last gauge is the correct
primitive.
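
To make the difference concrete, here is a small illustrative model (not exporter code) of what happens when several operations land between two scrapes. `ByteMetrics`, `applyOps`, and `counterDelta` are hypothetical names; the byte sizes reuse the values from the Prometheus test below.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the two series kept for one direction (TX or RX).
struct ByteMetrics {
    std::uint64_t total; // cumulative counter
    std::uint64_t last;  // last-operation gauge
    ByteMetrics() : total(0), last(0) {}
};

// Apply every per-operation delta that arrives between two scrapes.
inline ByteMetrics applyOps(ByteMetrics metrics, const std::vector<std::uint64_t> &op_bytes) {
    for (std::size_t i = 0; i < op_bytes.size(); ++i) {
        metrics.total += op_bytes[i]; // counter sums every delta
        metrics.last = op_bytes[i];   // gauge keeps only the newest delta
    }
    return metrics;
}

// What a scrape-to-scrape difference of the counter would report.
inline std::uint64_t counterDelta(const ByteMetrics &before, const ByteMetrics &after) {
    return after.total - before.total;
}
```

If operations of 1000, 2000, and 3500 bytes all arrive between two scrapes, the counter difference is 6500 while the gauge reads 3500, so the two values only agree when exactly one operation falls inside the interval.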

This introduces no breaking changes: no new telemetry event type, no
TELEMETRY_VERSION bump, no core change. "Last" is derived exporter-side from
the per-op value NIXL already carries (event.value_). NIXL is a delta-only
producer, so that value is already the last operation's size and the gauge
needs no extra state. The existing memory gauges are intentionally left under
their current names (agent_memory_registered / agent_memory_deregistered);
the *_last rename for those is deferred to a follow-up to avoid a metric-name
break.

How?

End-to-end tested through the real scrape endpoints (not internal state):

  • Prometheus (test/gtest/telemetry_prometheus_test.cpp): injects TX
    1000, 2000, 3500 and RX 500, 1500, scrapes the live HTTP /metrics, and
    asserts agent_tx_bytes_total == 6500 / agent_tx_bytes_last == 3500 and
    agent_rx_bytes_total == 2000 / agent_rx_bytes_last == 1500.
  • DOCA (test/doca-telemetry/telemetry_doca_nixl_test.cpp): injects TX
    10, 20, 35 and RX 5, 15, flushes, scrapes the live DOCA Prometheus
    endpoint, and asserts the counters (65, 20) and _last gauges (35, 15).
    Distinct TX/RX values also guard against cross-wiring.
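
A sketch of the kind of check such a scrape-based test can make once it has the response body as a string. `findSample` is a hypothetical helper, not taken from the PR's tests, and it assumes unlabelled samples of the form `name value` in the Prometheus text exposition format; fetching the body over HTTP is left out.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Find the value of an unlabelled sample named `metric` in Prometheus text
// exposition output. Returns true and writes `out` when the sample exists.
inline bool findSample(const std::string &body, const std::string &metric, double &out) {
    std::istringstream lines(body);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.empty() || line[0] == '#') {
            continue; // skip blank lines and # HELP / # TYPE comments
        }
        std::istringstream fields(line);
        std::string name;
        double value = 0.0;
        // Exact name match, so "agent_tx_bytes_last" never matches "agent_tx_bytes_total".
        if ((fields >> name >> value) && name == metric) {
            out = value;
            return true;
        }
    }
    return false;
}
```

Matching on the full series name is what lets one scrape assert the counter and the gauge independently, and it also makes a cross-wired TX/RX value show up as a failed comparison.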

Local validation: 27 telemetry gtests + 4 DOCA exporter tests pass;
diff-scoped clang-format clean.

Summary by CodeRabbit

  • New Features

    • Telemetry byte metrics now expose both cumulative counters and “last operation” gauges.
    • Prometheus and DOCA exports now publish separate _last series for TX/RX byte values.
  • Bug Fixes

    • Fixed gauge reporting so byte metrics no longer overwrite or replace counter output.
    • Clarified memory gauges to reflect the most recent registration or deregistration value.
  • Documentation

    • Updated telemetry docs and metric tables to explain counter vs. gauge behavior and the new _last naming.

Expose agent_tx_bytes_last and agent_rx_bytes_last gauges from both the
Prometheus and DOCA exporters so dashboards can read the byte size of the
latest TX/RX request alongside the existing cumulative *_total counters.

"Last" is derived exporter-side from the per-operation value already carried
by AGENT_TX_BYTES / AGENT_RX_BYTES: NIXL is a delta-only producer, so the
event value is itself the last operation's value. No new event type, no
TELEMETRY_VERSION bump, and no core change are required; the gauge is
stateless.

- Prometheus: registerGauge now separates the lookup key (event name) from
  the exposed metric name, registering agent_tx_bytes -> agent_tx_bytes_last
  and agent_rx_bytes -> agent_rx_bytes_last. Memory gauge names are left
  unchanged (the *_last rename is deferred to keep this change non-breaking).
- DOCA: byte events now emit both a cumulative counter and a last-operation
  gauge (previously counter-xor-gauge); appendGaugeSample takes an explicit
  metric-name argument so the gauge can be named *_last while keyed off the
  byte event.

Adds Prometheus and DOCA tests that inject distinct TX/RX values and assert,
at the live scrape endpoint, that each counter sums every delta while each
*_last gauge tracks only the final operation. Documents the *_last vs *_total
semantics in the exporter READMEs and docs/telemetry.md.

Signed-off-by: Efraim Eygin <eeygin@nvidia.com>
@e-eygin e-eygin self-assigned this Jun 24, 2026
@e-eygin e-eygin requested a review from a team as a code owner June 24, 2026 11:01
copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

coderabbitai Bot commented Jun 24, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

Both the DOCA and Prometheus telemetry exporters are updated to emit a _last gauge series alongside existing cumulative counters for agent_tx_bytes and agent_rx_bytes. A new gaugeMetricName() function replaces the boolean isGaugeEvent(), and gauge registration is split into separate event-name and metric-name parameters. Tests and documentation are updated throughout.
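
The walkthrough's two ideas (a name lookup that returns `nullptr` for events without a gauge, and counter and gauge emission that no longer exclude each other) can be sketched as follows. The `Event` enum, `isCounterEvent`, and the string tags pushed by `exportEvent` are hypothetical simplifications; the real exporters key off NIXL telemetry events and call into their respective backends.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event kinds standing in for the real telemetry events.
enum class Event { TxBytes, RxBytes, MemoryRegistered, Other };

// Returns the gauge series name for an event, or nullptr when it has no gauge.
inline const char *gaugeMetricName(Event event) {
    switch (event) {
    case Event::TxBytes:
        return "agent_tx_bytes_last";
    case Event::RxBytes:
        return "agent_rx_bytes_last";
    case Event::MemoryRegistered:
        return "agent_memory_registered"; // memory gauges keep their current names
    default:
        return nullptr;
    }
}

// Assumption for this sketch: only byte events feed a cumulative counter.
inline bool isCounterEvent(Event event) {
    return event == Event::TxBytes || event == Event::RxBytes;
}

// Records which series one event would emit.
inline std::vector<std::string> exportEvent(Event event) {
    std::vector<std::string> emitted;
    if (isCounterEvent(event)) {
        emitted.push_back("counter");
    }
    // Independent `if`, not `else if`: a byte event emits both series.
    if (const char *name = gaugeMetricName(event)) {
        emitted.push_back(name);
    }
    return emitted;
}
```

Returning a `const char *` lets the lookup double as the "is this a gauge event" test, which is why a separate boolean predicate is no longer needed in this shape.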

Changes

Last-operation gauge metrics for byte events

| Layer / File(s) | Summary |
| --- | --- |
| Gauge metric naming contracts: src/plugins/telemetry/doca/doca_exporter.h, src/plugins/telemetry/prometheus/prometheus_exporter.h, src/plugins/telemetry/doca/doca_exporter.cpp | gaugeMetricName() replaces isGaugeEvent() and returns the specific gauge series name (including *_last for byte events) or nullptr. Both headers update appendGaugeSample/registerGauge declarations to accept an explicit metric_name parameter. |
| DOCA exporter, independent counter and gauge emission: src/plugins/telemetry/doca/doca_exporter.cpp | appendGaugeSample passes an explicit metric_name to doca_telemetry_exporter_metrics_add_gauge. exportEvent removes the else if coupling so byte events emit both a counter and a _last gauge in the same call. |
| Prometheus exporter, event_name / metric_name split: src/plugins/telemetry/prometheus/prometheus_exporter.cpp | registerGauge stores the family in gauges_ under event_name but constructs the Prometheus family under metric_name. initializeMetrics registers agent_tx_bytes_last, agent_rx_bytes_last, and memory gauges with explicit names. |
| Tests and documentation: test/doca-telemetry/telemetry_doca_nixl_test.cpp, test/gtest/telemetry_prometheus_test.cpp, docs/telemetry.md, src/plugins/telemetry/doca/README.md, src/plugins/telemetry/prometheus/README.md | New DOCA and Prometheus tests verify cumulative counters equal the sum of all deltas while _last gauges equal only the final delta. AgentMetricsAppearInScrape is extended to assert _last series presence. All three documentation files are updated to describe the *_last gauge behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop! The bytes now wear two hats,
A counter that sums, a gauge that tracks the last!
No else if shall tangle our telemetry flow,
_last suffix blooms wherever byte events go.
From DOCA to Prometheus, the metrics align—
This bunny's quite proud of the gaugeMetricName design! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding last-operation byte gauges to telemetry exporters. |
| Description check | ✅ Passed | The description follows the template and includes What, Why, and How with implementation and test details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

e-eygin commented Jun 24, 2026

@ColinNV @ovidiusm please review

Comment thread src/plugins/telemetry/doca/doca_exporter.h Outdated
Comment thread src/plugins/telemetry/doca/doca_exporter.h Outdated
Comment thread src/plugins/telemetry/doca/doca_exporter.cpp Outdated
- appendGaugeSample takes the gauge metric name as const char* instead of
  const std::string&, avoiding a per-sample std::string allocation on the
  export hot path (gaugeMetricName already returns a const char*).
- Restore the original one-line appendGaugeSample header comment and drop the
  trivial counter/gauge comment in exportEvent, per review.

Signed-off-by: Efraim Eygin <eeygin@nvidia.com>

e-eygin commented Jun 24, 2026

/build

e-eygin commented Jun 24, 2026

/ok to test 796f5df
