[scheduler] Optimization pass#2411
Draft
DiegoTavares wants to merge 19 commits into
Draft
Conversation
Add an opt-in `Epvm` host-booking strategy alongside the existing `Saturation` strategy (which stays the default). When `Epvm` is enabled, the host cache scores candidate hosts using E-PVM stranding and picks the lowest-scoring host within a configurable scan cap, replacing today's first-fit-by-saturation iteration for the same call.
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
QUERY_LAYERS_WITH_FRAMES orders rows by layer.int_dispatch_order, but group_layers_and_frames collected them into a HashMap and returned into_values(), which iterates in arbitrary order. Layers were therefore processed in random order instead of dispatch order, starving high-priority layers behind later ones within the same job. Group into a Vec indexed by first-seen order instead, preserving the SQL ordering for both layers and their frames. This also stops building a full DispatchLayerModel (a dozen string clones plus job/layer env map clones) for every frame row when only the first row per layer was kept. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The gen_cache_keys comparator returned Ordering::Less for both (HostName, Manual) and (Manual, HostName) — the match arms only encoded "X before everything" without ever yielding Greater for the right-hand side. The intended MANUAL > HOSTNAME > HARDWARE > ALLOC checkout priority silently depended on input order, and an inconsistent comparator is allowed to panic in std's sort. Replace it with a rank-based sorted_by_key, which is total by construction and matches the documented priority. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
HostCacheService::check_out could not distinguish "group cached and fresh but no host fits" from "group not cached / data expired" — both came back as None, and both fell through to fetch_group_data, which runs the full host/host_stat/subscription/host_tag join for the whole tag group. On a busy farm the common case is precisely "no candidate right now", so every failed checkout attempt (up to host_candidate_attempts_per_layer per layer, per tag) was hammering Postgres with full group fetches that the 10s refresh loop already performs. Probe the group with a three-way outcome (Found / NoCandidate / Expired). NoCandidate now moves on to the next tag and leaves the refresh loop responsible for data freshness; only absent or expired groups trigger a database fetch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two related index-consistency bugs: 1. HostCache::check_in inserted hosts at the bucket matching their current (idle_cores, idle_memory) without removing the entry at their previous bucket. The periodic refresh re-checks-in every host of every group each monitoring interval, so any host whose idle resources changed left a duplicate entry behind: the index grew without bound, scans iterated dead ids, and hosts could be found through outdated buckets, degrading the bin-packing scan order. Entries for hosts removed by staleness cleanup also lingered forever. 2. HostStore::insert returned `upsert_sync(...).unwrap_or(host)`, but scc's upsert_sync returns the PREVIOUS value when overwriting. Every accepted update therefore handed the OLD host version back to check_in, which indexed the host at its previous resources (caught by the new regression test). Wrap the B-tree in a HostIndex that pairs it with a reverse HostId -> (core_key, memory_key) map: check_in now moves the entry, removals go through one helper that prunes empty buckets, and a periodic sweep (piggy-backing on the stale-host cleanup interval) drops index entries whose hosts are gone from the store. HostStore::insert returns the value actually stored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
atomic_remove_if_valid treated staleness as an unconditional success: it evicted the stale host and returned Ok(Some(host)), which the cache checkout paths interpret as "candidate committed" — handing a host whose reports stopped arriving (likely down or partitioned) straight to the dispatcher, bypassing both the CAS witness and the validation gate. Keep the eviction but return Err(()) so callers treat the host like any other failed candidate and move on to the next one. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
update_resources / restore_resources (and their host_stat variants) bound gpu_memory_reserved in raw bytes against int_gpu_mem_idle / int_gpu_mem_free, which are KB-denominated columns — the same columns the host fetch query reads back via ByteSize::kb(). Main memory on the adjacent bind is correctly divided by KB, as are the GPU memory binds in frame_dao and proc_dao. Booking a GPU frame therefore decremented host GPU memory 1000x too much: the `int_gpu_mem_idle >= $4` guard failed for any subsequent GPU booking on the host (spurious HostResourcesExhausted), and the restore path's `int_gpu_mem_idle + $4 <= int_gpu_mem` guard silently no-opped the compensation. Divide by KB like every other memory bind. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The host dispatch lock used pg_try_advisory_lock, which is session-scoped, while the unlock comment assumed "ending the transaction will automatically unlock" — that is only true for xact-scoped advisory locks. If the explicit pg_advisory_unlock ever failed (connection error, panic path losing the await), the lock stayed attached to the pooled session indefinitely: every later dispatch attempt for that host (or any host whose id hashes to the same 32-bit key) would fail with HostLock until the connection happened to be closed. Switch to pg_try_advisory_xact_lock so the lock is released by the commit/rollback that already ends every dispatch, and drop the manual unlock path entirely. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The cluster loop sent FeedMessage::Stop() whenever the pending-jobs query failed. Under load a single pool acquire timeout or a network blip was enough to permanently shut down the entire scheduler feed — the stream ends, run() returns, and the process exits while the database is perfectly healthy seconds later. Log the error and put the failing cluster to sleep for the same back-off used for empty passes; the next wake retries naturally. Persistent outages still surface through the all-clusters-sleeping back-off (and empty_job_cycles_before_quiting in test setups). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A cluster only slept when its job query returned nothing. On a saturated farm — pending jobs everywhere but no host candidate fits — every pass still found jobs, so the loop re-ran the pending-jobs query plus the heavy per-job layers/frames query for every cluster continuously, with only the 10ms round pause between rounds. With 10 shows and hundreds of thousands of waiting frames this is a large, permanent database load that accomplishes no dispatches. Propagate the dispatched-frame count up from process_layer through MatchingService::process, and put the cluster to sleep for the new queue.cluster_saturated_sleep (default 5s) when a pass processed jobs but placed zero frames. The default stays below the host cache refresh interval's order of magnitude, so freed hosts are still picked up promptly while no-op query load drops by orders of magnitude. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
reserve_host used scc insert_sync, which is a no-op when the key already exists. If a previous dispatch left an expired reservation behind (e.g. the dispatch task died before check-in), reserving the host again kept the OLD timestamp: the reservation was already expired the moment it was "taken", leaving the in-flight dispatch unprotected against a concurrent checkout re-introduced by the periodic cache refresh. Upsert instead so every reservation starts a fresh window. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Apply a series of optimizations suggested by Claude Fable. All have been reviewed line by line and tested on a dev environment.
[scheduler] Fix layer dispatch order lost when grouping query rows
[scheduler] Fix tag priority sort that was not a total order
[scheduler] Only fetch host groups from DB when absent or expired
[scheduler] Fix ghost entries and stale re-index in the host cache index
[scheduler] Don't return stale hosts as successful checkout candidates
[scheduler] Fix GPU memory unit mismatch in host resource updates
[scheduler] Use a transaction-scoped advisory lock for host dispatch
[scheduler] Don't stop the whole feed on a transient job query error
[scheduler] Back off clusters when a full pass dispatches zero frames
[scheduler] Refresh host reservations on re-reserve
[scheduler] Apply rustfmt to recently touched lines