Create a metric for enqueued object times for state controllers #758
kensimon wants to merge 1 commit into NVIDIA:main
Conversation
This creates a metric that samples how long state controller objects (e.g. machines) are currently waiting to be processed after being enqueued. It is gathered immediately before enqueuing metrics, as part of periodic_enqueuer. This builds the duration stats manually rather than using a native prometheus histogram, because the queue age is essentially "state", not an event. We could instead emit an event for how long an object was waiting every time we enqueue it (and let prometheus make a histogram over those events), but that would only affect the metric once an object actually gets picked up... we want metrics to show a long delay if for some reason the objects are stuck and not processing at all.
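The "state, not event" distinction above can be illustrated with a plain-Rust sampling function (a sketch with assumed names, not code from this PR): each sample reads the *current* wait of every still-enqueued object, so stuck objects keep driving the gauge up even if nothing is ever dequeued.

```rust
use std::time::{Duration, Instant};

/// Hypothetical summary of queue-age stats, mirroring the gauge fields
/// discussed in this PR (field names here are assumptions).
#[derive(Debug, Default, PartialEq)]
struct WaitStats {
    num_waiting_objects: usize,
    oldest_wait_seconds: f64,
    mean_wait_seconds: f64,
}

/// Sample the current wait time of every still-enqueued object.
/// Because this reads state rather than recording dequeue events,
/// an object that is never picked up still pushes the metric up.
fn sample_wait_stats(enqueued_at: &[Instant], now: Instant) -> WaitStats {
    let waits: Vec<Duration> = enqueued_at
        .iter()
        .map(|t| now.saturating_duration_since(*t))
        .collect();
    if waits.is_empty() {
        return WaitStats::default();
    }
    let total: Duration = waits.iter().sum();
    WaitStats {
        num_waiting_objects: waits.len(),
        oldest_wait_seconds: waits.iter().max().unwrap().as_secs_f64(),
        mean_wait_seconds: total.as_secs_f64() / waits.len() as f64,
    }
}
```

A dequeue-time histogram, by contrast, would only observe a wait once the object is actually processed.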
@Matthias247 Interested in your feedback here... particularly whether the enqueuer is the right place to be gathering this metric.
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found. Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 (Last updated: 2026-03-31 15:55:57 UTC | Commit: d839823)
Matthias247
left a comment
My first thought was to use a histogram metric and emit when we dequeue. But then again we wouldn't get times for things we never dequeue. With this approach we do, so I like that part.
However in exchange we don't get accurate minimal wait times. Objects might get queued and dequeued before the next "enqueuer" iteration.
2 compromise solutions that I can think of:
- Emit 2 metrics: one which catches the long queue times (maybe just emit a gauge with the number of objects that have been enqueued for > 1 min or so) and one that emits a native histogram based on when objects are dequeued
- Do the collection in the processor instead of the enqueuer to catch the smaller queue times.
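The first compromise could look something like this minimal sketch (the threshold and function name are assumptions, not anything from the PR): a gauge that only counts objects waiting beyond a cutoff, leaving short waits to a dequeue-time histogram.

```rust
use std::time::{Duration, Instant};

/// Count objects that have been enqueued longer than `threshold`.
/// Exporting this as a gauge catches "stuck" objects without needing
/// accurate sub-iteration timing for the fast path.
fn count_long_waiters(enqueued_at: &[Instant], now: Instant, threshold: Duration) -> usize {
    enqueued_at
        .iter()
        .filter(|t| now.saturating_duration_since(**t) > threshold)
        .count()
}
```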
```rust
    p99_wait_seconds: f64,
}

let query = format!(
```
processing_started_at is updated at two times:
- when the object is inserted into the table
- when processing actually starts
The query will only capture whichever of those operations happened last.
It might still be good enough to spot problems, but it could also be confusing for operators.
Should we add another queued_at field in the database which is only set on insertion?
processing_started_at only represents when the processing starts if processed_by IS NULL, so we only catch case 1. That's intentional... the metric here is enqueued time, not processing time (which we have another metric for already), the whole point is to show how long a given object has been waiting to even start getting processed.
Oh I missed the processed_by is null part - that makes sense!
There's still one scenario that we don't catch - and that is the one where the original processor disappears and the object should be taken up by something else. We could extend the condition to
(processed_by IS NULL OR processing_started_at + $1::interval < now())
to also handle that case.
If we want to do that, it might be worth putting that snippet into a `const UNPROCESSED_OBJECTS_COND: &str = "(processed_by IS NULL OR processing_started_at + $1::interval < now())"` and reusing it between queries.
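Sharing the condition as a constant might look like this sketch (the table name and query-builder function are hypothetical; only the condition string comes from the thread):

```rust
/// Shared condition for "not currently being processed": either never
/// claimed, or claimed by a processor whose lease has expired.
const UNPROCESSED_OBJECTS_COND: &str =
    "(processed_by IS NULL OR processing_started_at + $1::interval < now())";

/// Build the wait-stats query against a hypothetical table name,
/// reusing the shared condition so the two call sites can't drift apart.
fn wait_stats_query(table: &str) -> String {
    format!(
        "SELECT COUNT(*) AS num_waiting_objects FROM {} WHERE {}",
        table, UNPROCESSED_OBJECTS_COND
    )
}
```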
```rust
}

let query = format!(
    "SELECT
```
I'm not SQL enough to review this :)
Especially not to tell whether this is more efficient than pulling all rows and doing the aggregation client side
Hmm, I just did a benchmark of this against a simple in-Rust version of the same stats, looking at 100 runs each:
- For 100 rows, postgres takes 0.1 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 0.5 milliseconds.
- For 2,000 rows, postgres takes 0.5 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 3 milliseconds.
- For 100,000 rows, postgres takes ~28 milliseconds with some light variability, where rust takes ~150 milliseconds.
So I guess it comes down to the overhead of copying the timestamps to/from the database, which should be negligible when the number of rows is low. Either way postgres doesn't seem to have any trouble at all, and the advantage only grows as the number of rows increases (we'll likely never have that many rows in production).
thanks for the testing! Sounds good to me
```sql
SELECT
    COUNT(*) AS num_waiting_objects,
    COALESCE(EXTRACT(EPOCH FROM MAX(now() - processing_started_at)), 0)::double precision AS oldest_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM AVG(now() - processing_started_at)), 0)::double precision AS mean_wait_seconds,
```
maybe avg_wait_seconds, since the Prometheus functions are also called avg
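With that rename, the full aggregate might look roughly like the sketch below. This is an illustration, not the PR's actual query: the table name and the p99 expression (PostgreSQL's ordered-set aggregate `PERCENTILE_CONT`) are assumptions, and the `WHERE` clause is the simpler of the two variants discussed above.

```sql
SELECT
    COUNT(*) AS num_waiting_objects,
    COALESCE(EXTRACT(EPOCH FROM MAX(now() - processing_started_at)), 0)::double precision AS oldest_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM AVG(now() - processing_started_at)), 0)::double precision AS avg_wait_seconds,
    COALESCE(EXTRACT(EPOCH FROM PERCENTILE_CONT(0.99)
        WITHIN GROUP (ORDER BY now() - processing_started_at)), 0)::double precision AS p99_wait_seconds
FROM machines
WHERE processed_by IS NULL
```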
Related Issues (Optional)
Closes #625