[scheduler] Optimization pass by DiegoTavares · Pull Request #2411 · AcademySoftwareFoundation/OpenCue

DiegoTavares · 2026-06-11T20:58:40Z

Apply a series of optimizations suggested by Claude Fable. All have been reviewed line by line and tested on a dev environment.

[scheduler] Fix layer dispatch order lost when grouping query rows
[scheduler] Fix tag priority sort that was not a total order
[scheduler] Only fetch host groups from DB when absent or expired
[scheduler] Fix ghost entries and stale re-index in the host cache index
[scheduler] Don't return stale hosts as successful checkout candidates
[scheduler] Fix GPU memory unit mismatch in host resource updates
[scheduler] Use a transaction-scoped advisory lock for host dispatch
[scheduler] Don't stop the whole feed on a transient job query error
[scheduler] Back off clusters when a full pass dispatches zero frames
[scheduler] Refresh host reservations on re-reserve
[scheduler] Apply rustfmt to recently touched lines

Add an opt-in `Epvm` host-booking strategy alongside the existing `Saturation` strategy (which stays the default). When `Epvm` is enabled, the host cache scores candidate hosts using E-PVM stranding and picks the lowest-scoring host within a configurable scan cap, replacing today's first-fit-by-saturation iteration for the same call.

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

coderabbitai · 2026-06-11T20:58:48Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78322490-8646-48b4-8631-61e6c31dd9cb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

QUERY_LAYERS_WITH_FRAMES orders rows by layer.int_dispatch_order, but group_layers_and_frames collected them into a HashMap and returned into_values(), which iterates in arbitrary order. Layers were therefore processed in random order instead of dispatch order, starving high-priority layers behind later ones within the same job. Group into a Vec indexed by first-seen order instead, preserving the SQL ordering for both layers and their frames. This also stops building a full DispatchLayerModel (a dozen string clones plus job/layer env map clones) for every frame row when only the first row per layer was kept. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The gen_cache_keys comparator returned Ordering::Less for both (HostName, Manual) and (Manual, HostName) — the match arms only encoded "X before everything" without ever yielding Greater for the right-hand side. The intended MANUAL > HOSTNAME > HARDWARE > ALLOC checkout priority silently depended on input order, and an inconsistent comparator is allowed to panic in std's sort. Replace it with a rank-based sorted_by_key, which is total by construction and matches the documented priority. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

HostCacheService::check_out could not distinguish "group cached and fresh but no host fits" from "group not cached / data expired" — both came back as None, and both fell through to fetch_group_data, which runs the full host/host_stat/subscription/host_tag join for the whole tag group. On a busy farm the common case is precisely "no candidate right now", so every failed checkout attempt (up to host_candidate_attempts_per_layer per layer, per tag) was hammering Postgres with full group fetches that the 10s refresh loop already performs. Probe the group with a three-way outcome (Found / NoCandidate / Expired). NoCandidate now moves on to the next tag and leaves the refresh loop responsible for data freshness; only absent or expired groups trigger a database fetch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two related index-consistency bugs: 1. HostCache::check_in inserted hosts at the bucket matching their current (idle_cores, idle_memory) without removing the entry at their previous bucket. The periodic refresh re-checks-in every host of every group each monitoring interval, so any host whose idle resources changed left a duplicate entry behind: the index grew without bound, scans iterated dead ids, and hosts could be found through outdated buckets, degrading the bin-packing scan order. Entries for hosts removed by staleness cleanup also lingered forever. 2. HostStore::insert returned `upsert_sync(...).unwrap_or(host)`, but scc's upsert_sync returns the PREVIOUS value when overwriting. Every accepted update therefore handed the OLD host version back to check_in, which indexed the host at its previous resources (caught by the new regression test). Wrap the B-tree in a HostIndex that pairs it with a reverse HostId -> (core_key, memory_key) map: check_in now moves the entry, removals go through one helper that prunes empty buckets, and a periodic sweep (piggy-backing on the stale-host cleanup interval) drops index entries whose hosts are gone from the store. HostStore::insert returns the value actually stored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

atomic_remove_if_valid treated staleness as an unconditional success: it evicted the stale host and returned Ok(Some(host)), which the cache checkout paths interpret as "candidate committed" — handing a host whose reports stopped arriving (likely down or partitioned) straight to the dispatcher, bypassing both the CAS witness and the validation gate. Keep the eviction but return Err(()) so callers treat the host like any other failed candidate and move on to the next one. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

update_resources / restore_resources (and their host_stat variants) bound gpu_memory_reserved in raw bytes against int_gpu_mem_idle / int_gpu_mem_free, which are KB-denominated columns — the same columns the host fetch query reads back via ByteSize::kb(). Main memory on the adjacent bind is correctly divided by KB, as are the GPU memory binds in frame_dao and proc_dao. Booking a GPU frame therefore decremented host GPU memory 1000x too much: the `int_gpu_mem_idle >= $4` guard failed for any subsequent GPU booking on the host (spurious HostResourcesExhausted), and the restore path's `int_gpu_mem_idle + $4 <= int_gpu_mem` guard silently no-opped the compensation. Divide by KB like every other memory bind. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The host dispatch lock used pg_try_advisory_lock, which is session-scoped, while the unlock comment assumed "ending the transaction will automatically unlock" — that is only true for xact-scoped advisory locks. If the explicit pg_advisory_unlock ever failed (connection error, panic path losing the await), the lock stayed attached to the pooled session indefinitely: every later dispatch attempt for that host (or any host whose id hashes to the same 32-bit key) would fail with HostLock until the connection happened to be closed. Switch to pg_try_advisory_xact_lock so the lock is released by the commit/rollback that already ends every dispatch, and drop the manual unlock path entirely. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The cluster loop sent FeedMessage::Stop() whenever the pending-jobs query failed. Under load a single pool acquire timeout or a network blip was enough to permanently shut down the entire scheduler feed — the stream ends, run() returns, and the process exits while the database is perfectly healthy seconds later. Log the error and put the failing cluster to sleep for the same back-off used for empty passes; the next wake retries naturally. Persistent outages still surface through the all-clusters-sleeping back-off (and empty_job_cycles_before_quiting in test setups). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A cluster only slept when its job query returned nothing. On a saturated farm — pending jobs everywhere but no host candidate fits — every pass still found jobs, so the loop re-ran the pending-jobs query plus the heavy per-job layers/frames query for every cluster continuously, with only the 10ms round pause between rounds. With 10 shows and hundreds of thousands of waiting frames this is a large, permanent database load that accomplishes no dispatches. Propagate the dispatched-frame count up from process_layer through MatchingService::process, and put the cluster to sleep for the new queue.cluster_saturated_sleep (default 5s) when a pass processed jobs but placed zero frames. The default stays below the host cache refresh interval's order of magnitude, so freed hosts are still picked up promptly while no-op query load drops by orders of magnitude. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

reserve_host used scc insert_sync, which is a no-op when the key already exists. If a previous dispatch left an expired reservation behind (e.g. the dispatch task died before check-in), reserving the host again kept the OLD timestamp: the reservation was already expired the moment it was "taken", leaving the in-flight dispatch unprotected against a concurrent checkout re-introduced by the periodic cache refresh. Upsert instead so every reservation starts a fresh window. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

DiegoTavares added 4 commits June 9, 2026 15:28

Minor fixes after first review pass

ed9ac44

Fix conflict left over mark

2557c62

Merge branch 'master' into sched_e-pvm

15250c3

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

DiegoTavares and others added 12 commits June 11, 2026 14:42

Fix unit test broken by upstream sync

3147111

[scheduler] Apply rustfmt to recently touched lines

658248d

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

DiegoTavares force-pushed the optimize branch from 3e54401 to 658248d Compare June 11, 2026 22:03

DiegoTavares added 3 commits June 11, 2026 15:25

Fix jr.int_max_cores casting error

92d24cc

[scheduler] Improve logging to stop spamming every dispatch

73b81e9

Log clean up pass

2e4c73a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[scheduler] Optimization pass#2411

[scheduler] Optimization pass#2411
DiegoTavares wants to merge 19 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:optimize

DiegoTavares commented Jun 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DiegoTavares commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DiegoTavares commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading