
Create a metric for enqueued object times for state controllers #758

Open
kensimon wants to merge 1 commit into NVIDIA:main from kensimon:enqueued-object-time-metric

Conversation

@kensimon
Contributor

Description

This creates a metric that samples how long state controller objects (e.g. machines) are currently waiting to be processed after being enqueued. It is gathered immediately before enqueuing metrics, as part of periodic_enqueuer.

This builds the duration stats manually rather than using a native prometheus histogram, because the queue age is essentially "state", not an event. We could instead emit an event with how long an object was waiting each time we dequeue it (and let prometheus build a histogram over those events), but that would only update the metric once an object actually gets picked up; we want the metrics to show a long delay if, for some reason, the objects are stuck and not processing at all.
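To illustrate the "state, not event" distinction: a dequeue-time histogram only observes a wait once an object is picked up, while a periodically sampled gauge keeps reflecting stuck objects. A minimal std-only sketch (names are hypothetical, not the PR's actual code):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: objects waiting in a queue, represented only by
// the instant they were enqueued.
struct Queue {
    enqueued_at: Vec<Instant>,
}

impl Queue {
    // Sampled periodically (as the enqueuer would): age of the oldest
    // still-waiting object. Keeps growing even if nothing is dequeued.
    fn oldest_wait(&self, now: Instant) -> Duration {
        self.enqueued_at
            .iter()
            .map(|t| now.saturating_duration_since(*t))
            .max()
            .unwrap_or(Duration::ZERO)
    }
}

fn main() {
    let start = Instant::now();
    let q = Queue { enqueued_at: vec![start] };
    // Nothing is ever dequeued, so an event-based histogram would never
    // move, but the sampled gauge keeps growing:
    let later = start + Duration::from_secs(90);
    assert!(q.oldest_wait(later) >= Duration::from_secs(90));
}
```
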

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Closes #625

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

@kensimon kensimon requested a review from a team as a code owner March 31, 2026 15:52
@kensimon kensimon requested a review from Matthias247 March 31, 2026 15:52
@kensimon
Contributor Author

@Matthias247 Interested in your feedback here... particularly whether the enqueuer is the right place to be gathering this metric.

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉


🕐 Last updated: 2026-03-31 15:55:57 UTC | Commit: d839823

@Matthias247 (Contributor) left a comment


My first thought was to use a histogram metric and emit when we dequeue. But then again we wouldn't get times for things we never dequeue. With this approach we do, so I like that part.

However in exchange we don't get accurate minimal wait times. Objects might get queued and dequeued before the next "enqueuer" iteration.

2 compromise solutions that I can think of:

  1. Emit 2 metrics: one which catches the long queue times (maybe just emit a gauge with the number of objects that have been enqueued for > 1 min or so) and one that emits a native histogram based on when objects are dequeued
  2. Do the collection in the processor instead of the enqueuer to catch the smaller queue times.
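Compromise (1)'s gauge could be as simple as a threshold count. A std-only sketch of that idea (the threshold and all names are illustrative, not from the PR):

```rust
use std::time::{Duration, Instant};

// Sketch of compromise (1): a gauge counting how many objects have been
// waiting longer than some threshold (e.g. 1 min). Names hypothetical.
fn num_long_waiting(enqueued_at: &[Instant], now: Instant, threshold: Duration) -> usize {
    enqueued_at
        .iter()
        .filter(|t| now.saturating_duration_since(**t) > threshold)
        .count()
}

fn main() {
    let now = Instant::now();
    let items = [
        now - Duration::from_secs(120), // stuck for 2 min
        now - Duration::from_secs(5),   // freshly enqueued
    ];
    // Only the stuck object exceeds the 1-minute threshold:
    assert_eq!(num_long_waiting(&items, now, Duration::from_secs(60)), 1);
}
```
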

p99_wait_seconds: f64,
}

let query = format!(
Contributor


processing_started_at is updated at two points:

  1. when the object is inserted into the table
  2. when processing actually starts

The query will only capture whichever of those operations happened last.
It might still be good enough to spot problems, but it could also be confusing for operators.

Should we add another queued_at field in the database which is only created on insertion?

Contributor Author


The query filters on processed_by IS NULL, so we only catch case 1. That's intentional: the metric here is enqueued time, not processing time (which we already have another metric for); the whole point is to show how long a given object has been waiting to even start getting processed.

Contributor


Oh I missed the processed_by is null part - that makes sense!

There's still one scenario that we don't catch - and that is the one where the original processor disappears and the object should be taken up by something else. We could extend the condition to

(processed_by IS NULL OR processing_started_at + $1::interval < now())

to also handle that case.

If we want to do that, it might be worth putting that snippet into a const UNPROCESSED_OBJECTS_COND: &str = "(processed_by IS NULL OR processing_started_at + $1::interval < now())" and reusing it between queries.
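The suggested const-and-reuse might look like the following (only the condition itself comes from the review; the surrounding query, table, and column selection are invented for illustration):

```rust
// Condition copied from the review comment; everything around it is a
// hypothetical example of reusing it across queries.
const UNPROCESSED_OBJECTS_COND: &str =
    "(processed_by IS NULL OR processing_started_at + $1::interval < now())";

fn main() {
    let query = format!(
        "SELECT COUNT(*) FROM objects WHERE {UNPROCESSED_OBJECTS_COND}"
    );
    assert!(query.contains("processed_by IS NULL"));
    println!("{query}");
}
```
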

}

let query = format!(
"SELECT
Contributor


I'm not SQL enough to review this :)
Especially not to tell whether this is more efficient than pulling all rows and doing the aggregation client side

Contributor Author


Hmm, I just did a benchmark of this against a simple in-rust version of the same stats, and looking at 100 runs each:

  • For 100 rows, postgres takes 0.1 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 0.5 milliseconds.
  • For 2,000 rows, postgres takes 0.5 milliseconds (with occasional random spikes to 20ms, twice in 100 runs), where rust takes 3 milliseconds.
  • For 100,000 rows, postgres takes ~28 milliseconds with some light variability, where rust takes ~150 milliseconds.

So I guess it comes down to the overhead of copying the timestamps to/from the database, which should be negligible when the number of rows is low. Either way, postgres doesn't seem to have any trouble at all, and its advantage only grows as the number of rows increases (we'll likely never have that many rows in production).
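The in-Rust baseline used in this benchmark isn't shown; a minimal sketch of such client-side aggregation might look like this (nearest-rank p99 here, whereas postgres' PERCENTILE_CONT interpolates):

```rust
// Sketch of a client-side equivalent of the SQL aggregation:
// count, oldest, mean, and p99 over wait times in seconds.
fn wait_stats(mut waits: Vec<f64>) -> (usize, f64, f64, f64) {
    if waits.is_empty() {
        return (0, 0.0, 0.0, 0.0);
    }
    waits.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = waits.len();
    let oldest = waits[n - 1];
    let mean = waits.iter().sum::<f64>() / n as f64;
    // Nearest-rank percentile; PERCENTILE_CONT in postgres interpolates.
    let idx = ((n as f64 * 0.99).ceil() as usize).clamp(1, n) - 1;
    let p99 = waits[idx];
    (n, oldest, mean, p99)
}

fn main() {
    let (n, oldest, mean, p99) = wait_stats(vec![3.0, 1.0, 4.0, 2.0]);
    assert_eq!((n, oldest, mean, p99), (4, 4.0, 2.5, 4.0));
}
```
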

Contributor


Thanks for the testing! Sounds good to me.

"SELECT
COUNT(*) AS num_waiting_objects,
COALESCE(EXTRACT(EPOCH FROM MAX(now() - processing_started_at)), 0)::double precision AS oldest_wait_seconds,
COALESCE(EXTRACT(EPOCH FROM AVG(now() - processing_started_at)), 0)::double precision AS mean_wait_seconds,
Contributor


maybe avg_wait_seconds - the prometheus functions are also called avg
