Create a metric for enqueued object times for state controllers #758
kensimon wants to merge 1 commit into NVIDIA:main
Conversation
This creates a metric that samples how long state controller objects (e.g. machines) are currently waiting to be processed after being enqueued. It is gathered immediately before enqueuing metrics, as part of periodic_enqueuer. This builds the duration stats manually rather than using a native prometheus histogram, because the queue age is essentially "state", not an event. We could instead emit an event for how long an object was waiting every time we enqueue it (and let prometheus make a histogram over those events), but that would only affect the metric once an object actually gets picked up... we want metrics to show a long delay if for some reason the objects are stuck and not processing at all.
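The "state, not event" distinction above can be illustrated with a plain-Rust sampling function (a sketch with assumed names, not code from this PR): each sample reads the *current* wait of every still-enqueued object, so stuck objects keep driving the gauge up even if nothing is ever dequeued.

```rust
use std::time::{Duration, Instant};

/// Hypothetical summary of queue-age stats, mirroring the gauge fields
/// discussed in this PR (field names here are assumptions).
#[derive(Debug, Default, PartialEq)]
struct WaitStats {
    num_waiting_objects: usize,
    oldest_wait_seconds: f64,
    mean_wait_seconds: f64,
}

/// Sample the current wait time of every still-enqueued object.
/// Because this reads state rather than recording dequeue events,
/// an object that is never picked up still pushes the metric up.
fn sample_wait_stats(enqueued_at: &[Instant], now: Instant) -> WaitStats {
    let waits: Vec<Duration> = enqueued_at
        .iter()
        .map(|t| now.saturating_duration_since(*t))
        .collect();
    if waits.is_empty() {
        return WaitStats::default();
    }
    let total: Duration = waits.iter().sum();
    WaitStats {
        num_waiting_objects: waits.len(),
        oldest_wait_seconds: waits.iter().max().unwrap().as_secs_f64(),
        mean_wait_seconds: total.as_secs_f64() / waits.len() as f64,
    }
}
```

A dequeue-time histogram, by contrast, would only observe a wait once the object is actually processed.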
@Matthias247 Interested in your feedback here... particularly whether the enqueuer is the right place to be gathering this metric.
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found. Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 (Last updated: 2026-03-31 15:55:57 UTC | Commit: d839823)
Matthias247
left a comment
My first thought was to use a histogram metric and emit when we dequeue. But then again we wouldn't get times for things we never dequeue. With this approach we do, so I like that part.
However in exchange we don't get accurate minimal wait times. Objects might get queued and dequeued before the next "enqueuer" iteration.
2 compromise solutions that I can think of:
- Emit 2 metrics: one which catches the long queue times (maybe just emit a gauge with the number of objects that have been enqueued for > 1 min or so) and one that emits a native histogram based on when objects are dequeued
- Do the collection in the processor instead of the enqueuer to catch the smaller queue times.
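The first compromise could look something like this minimal sketch (the threshold and function name are assumptions, not anything from the PR): a gauge that only counts objects waiting beyond a cutoff, leaving short waits to a dequeue-time histogram.

```rust
use std::time::{Duration, Instant};

/// Count objects that have been enqueued longer than `threshold`.
/// Exporting this as a gauge catches "stuck" objects without needing
/// accurate sub-iteration timing for the fast path.
fn count_long_waiters(enqueued_at: &[Instant], now: Instant, threshold: Duration) -> usize {
    enqueued_at
        .iter()
        .filter(|t| now.saturating_duration_since(**t) > threshold)
        .count()
}
```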
```rust
    p99_wait_seconds: f64,
}

let query = format!(
```
processing_started_at is updated at two times:
- when the object is inserted into the table
- when processing actually starts
The query will only capture whichever of those operations happened last.
It might still be good enough to spot problems, but it could also be confusing for operators.
Should we add another queued_at field in the database which is only set on insertion?
processing_started_at only represents when the processing starts if processed_by IS NULL, so we only catch case 1. That's intentional... the metric here is enqueued time, not processing time (which we have another metric for already), the whole point is to show how long a given object has been waiting to even start getting processed.
Oh I missed the processed_by is null part - that makes sense!
There's still one scenario that we don't catch - and that is the one where the original processor disappears and the object should be taken up by something else. We could extend the condition to
(processed_by IS NULL OR processing_started_at + $1::interval < now())
to also handle that case.
If we want to do that, it might be worth putting that snippet into a `const UNPROCESSED_OBJECTS_COND: &str = "(processed_by IS NULL OR processing_started_at + $1::interval < now())"` and reusing it between queries.
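Sharing the condition as a constant might look like this sketch (the table name and query-builder function are hypothetical; only the condition string comes from the thread):

```rust
/// Shared condition for "not currently being processed": either never
/// claimed, or claimed by a processor whose lease has expired.
const UNPROCESSED_OBJECTS_COND: &str =
    "(processed_by IS NULL OR processing_started_at + $1::interval < now())";

/// Build the wait-stats query against a hypothetical table name,
/// reusing the shared condition so the two call sites can't drift apart.
fn wait_stats_query(table: &str) -> String {
    format!(
        "SELECT COUNT(*) AS num_waiting_objects FROM {} WHERE {}",
        table, UNPROCESSED_OBJECTS_COND
    )
}
```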
```rust
}

let query = format!(
    "SELECT
```
I'm not SQL enough to review this :)
Especially not to tell whether this is more efficient than pulling all rows and doing the aggregation client side
Hmm, I just did a benchmark of this against a simple in-Rust version of the same stats, looking at 100 runs each:
- For 100 rows, postgres takes 0.1 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 0.5 milliseconds.
- For 2,000 rows, postgres takes 0.5 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 3 milliseconds.
- For 100,000 rows, postgres takes ~28 milliseconds with some light variability, where rust takes ~150 milliseconds.
So I guess it comes down to the overhead of copying the timestamps to/from the database, which should be negligible when the number of rows is low. Either way postgres doesn't seem to have any trouble at all, and the advantage only grows as the number of rows increases (we'll likely never have that many rows in production).
thanks for the testing! Sounds good to me
```sql
SELECT
    COUNT(*) AS num_waiting_objects,
    COALESCE(EXTRACT(EPOCH FROM MAX(now() - processing_started_at)), 0)::double precision AS oldest_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM AVG(now() - processing_started_at)), 0)::double precision AS mean_wait_seconds,
```
maybe avg_wait_seconds, since the Prometheus functions are also called avg
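With that rename, the full aggregate might look roughly like the sketch below. This is an illustration, not the PR's actual query: the table name and the p99 expression (PostgreSQL's ordered-set aggregate `PERCENTILE_CONT`) are assumptions, and the `WHERE` clause is the simpler of the two variants discussed above.

```sql
SELECT
    COUNT(*) AS num_waiting_objects,
    COALESCE(EXTRACT(EPOCH FROM MAX(now() - processing_started_at)), 0)::double precision AS oldest_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM AVG(now() - processing_started_at)), 0)::double precision AS avg_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM PERCENTILE_CONT(0.99)
        WITHIN GROUP (ORDER BY now() - processing_started_at)), 0)::double precision AS p99_wait_seconds
FROM machines
WHERE processed_by IS NULL
```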
Related Issues (Optional)
Closes #625