telemetry: add last-operation byte gauges to exporters#1824
Conversation
Expose agent_tx_bytes_last and agent_rx_bytes_last gauges from both the Prometheus and DOCA exporters so dashboards can read the byte size of the latest TX/RX request alongside the existing cumulative *_total counters. "Last" is derived exporter-side from the per-operation value already carried by AGENT_TX_BYTES / AGENT_RX_BYTES: NIXL is a delta-only producer, so the event value is itself the last operation's value. No new event type, no TELEMETRY_VERSION bump, and no core change are required; the gauge is stateless. - Prometheus: registerGauge now separates the lookup key (event name) from the exposed metric name, registering agent_tx_bytes -> agent_tx_bytes_last and agent_rx_bytes -> agent_rx_bytes_last. Memory gauge names are left unchanged (the *_last rename is deferred to keep this change non-breaking). - DOCA: byte events now emit both a cumulative counter and a last-operation gauge (previously counter-xor-gauge); appendGaugeSample takes an explicit metric-name argument so the gauge can be named *_last while keyed off the byte event. Adds Prometheus and DOCA tests that inject distinct TX/RX values and assert, at the live scrape endpoint, that each counter sums every delta while each *_last gauge tracks only the final operation. Documents the *_last vs *_total semantics in the exporter READMEs and docs/telemetry.md. Signed-off-by: Efraim Eygin <eeygin@nvidia.com>
|
👋 Hi e-eygin! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
Caution Review failedAn error occurred during the review process. Please try again later. 📝 WalkthroughWalkthroughBoth the DOCA and Prometheus telemetry exporters are updated to emit a ChangesLast-operation gauge metrics for byte events
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
- appendGaugeSample takes the gauge metric name as const char* instead of const std::string&, avoiding a per-sample std::string allocation on the export hot path (gaugeMetricName already returns a const char*). - Restore the original one-line appendGaugeSample header comment and drop the trivial counter/gauge comment in exportEvent, per review. Signed-off-by: Efraim Eygin <eeygin@nvidia.com>
|
/build |
|
/ok to test 796f5df |
What?
Adds two purely-additive last-operation gauges to both the Prometheus and DOCA
telemetry exporters:
agent_tx_bytes_last— byte size of the most recent TX requestagent_rx_bytes_last— byte size of the most recent RX requestThese sit alongside the existing cumulative
*_totalbyte counters(
agent_tx_bytes/agent_rx_bytes), so a counter now reports the runningtotal while the
_lastgauge reports only the latest operation.Implementation:
prometheus_exporter.{h,cpp}):registerGaugenow separatesthe lookup key (event name) from the exposed metric name, registering
agent_tx_bytes → agent_tx_bytes_lastandagent_rx_bytes → agent_rx_bytes_last.exportEvent()is unchanged (it already sets the gauge keyed by event name).doca_exporter.cpp): byte events now emit BOTH a cumulative counterand a last-operation gauge (dispatch was previously counter-xor-gauge);
appendGaugeSampletakes an explicit metric-name argument so the gauge can benamed
*_lastwhile keyed off the byte event.docs/telemetry.md) updated in thesame PR.
Why?
The byte events were routed through the counter path only, so there was no way
to see the size of the latest transfer — only the lifetime total. A PromQL
idelta(*_total[...])derivation answers a different question (bytes per scrapeinterval) and is wrong when the op-rate exceeds the scrape-rate, since NIXL
batches/flushes events (~100 ms). An explicit
_lastgauge is the correctprimitive.
This is zero breaking changes: no new telemetry event type, no
TELEMETRY_VERSIONbump, no core change. "Last" is derived exporter-side fromthe per-op value NIXL already carries (
event.value_), which is delta-only, sothe gauge is stateless. The existing memory gauges are intentionally left under
their current names (
agent_memory_registered/agent_memory_deregistered);the
*_lastrename for those is deferred to a follow-up to avoid a metric-namebreak.
How?
End-to-end tested through the real scrape endpoints (not internal state):
test/gtest/telemetry_prometheus_test.cpp): injects TX1000, 2000, 3500and RX500, 1500, scrapes the live HTTP/metrics, andasserts
agent_tx_bytes_total == 6500/agent_tx_bytes_last == 3500andagent_rx_bytes_total == 2000/agent_rx_bytes_last == 1500.test/doca-telemetry/telemetry_doca_nixl_test.cpp): injects TX10, 20, 35and RX5, 15, flushes, scrapes the live DOCA Prometheusendpoint, and asserts the counters (
65,20) and_lastgauges (35,15).Distinct TX/RX values also guard against cross-wiring.
Local validation: 27 telemetry gtests + 4 DOCA exporter tests pass;
diff-scoped clang-format clean.
Summary by CodeRabbit
New Features
_lastseries for TX/RX byte values.Bug Fixes
Documentation
_lastnaming.