Skip to content

[scheduler] Optimization pass#2411

Draft
DiegoTavares wants to merge 19 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:optimize
Draft

[scheduler] Optimization pass#2411
DiegoTavares wants to merge 19 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:optimize

Conversation

Add an opt-in `Epvm` host-booking strategy alongside the existing `Saturation`
strategy (which stays the default). When `Epvm` is enabled, the host cache
scores candidate hosts using E-PVM stranding and picks the lowest-scoring host
within a configurable scan cap, replacing today's first-fit-by-saturation
iteration for the same call.
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78322490-8646-48b4-8631-61e6c31dd9cb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

DiegoTavares and others added 12 commits June 11, 2026 14:42
QUERY_LAYERS_WITH_FRAMES orders rows by layer.int_dispatch_order, but
group_layers_and_frames collected them into a HashMap and returned
into_values(), which iterates in arbitrary order. Layers were therefore
processed in random order instead of dispatch order, starving
high-priority layers behind later ones within the same job.

Group into a Vec indexed by first-seen order instead, preserving the SQL
ordering for both layers and their frames. This also stops building a
full DispatchLayerModel (a dozen string clones plus job/layer env map
clones) for every frame row when only the first row per layer was kept.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The gen_cache_keys comparator returned Ordering::Less for both
(HostName, Manual) and (Manual, HostName) — the match arms only encoded
"X before everything" without ever yielding Greater for the right-hand
side. The intended MANUAL > HOSTNAME > HARDWARE > ALLOC checkout
priority silently depended on input order, and an inconsistent
comparator is allowed to panic in std's sort.

Replace it with a rank-based sorted_by_key, which is total by
construction and matches the documented priority.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
HostCacheService::check_out could not distinguish "group cached and
fresh but no host fits" from "group not cached / data expired" — both
came back as None, and both fell through to fetch_group_data, which runs
the full host/host_stat/subscription/host_tag join for the whole tag
group. On a busy farm the common case is precisely "no candidate right
now", so every failed checkout attempt (up to
host_candidate_attempts_per_layer per layer, per tag) was hammering
Postgres with full group fetches that the 10s refresh loop already
performs.

Probe the group with a three-way outcome (Found / NoCandidate /
Expired). NoCandidate now moves on to the next tag and leaves the
refresh loop responsible for data freshness; only absent or expired
groups trigger a database fetch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two related index-consistency bugs:

1. HostCache::check_in inserted hosts at the bucket matching their
   current (idle_cores, idle_memory) without removing the entry at their
   previous bucket. The periodic refresh re-checks-in every host of
   every group each monitoring interval, so any host whose idle
   resources changed left a duplicate entry behind: the index grew
   without bound, scans iterated dead ids, and hosts could be found
   through outdated buckets, degrading the bin-packing scan order.
   Entries for hosts removed by staleness cleanup also lingered forever.

2. HostStore::insert returned `upsert_sync(...).unwrap_or(host)`, but
   scc's upsert_sync returns the PREVIOUS value when overwriting. Every
   accepted update therefore handed the OLD host version back to
   check_in, which indexed the host at its previous resources (caught by
   the new regression test).

Wrap the B-tree in a HostIndex that pairs it with a reverse
HostId -> (core_key, memory_key) map: check_in now moves the entry,
removals go through one helper that prunes empty buckets, and a periodic
sweep (piggy-backing on the stale-host cleanup interval) drops index
entries whose hosts are gone from the store. HostStore::insert returns
the value actually stored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
atomic_remove_if_valid treated staleness as an unconditional success:
it evicted the stale host and returned Ok(Some(host)), which the cache
checkout paths interpret as "candidate committed" — handing a host whose
reports stopped arriving (likely down or partitioned) straight to the
dispatcher, bypassing both the CAS witness and the validation gate.

Keep the eviction but return Err(()) so callers treat the host like any
other failed candidate and move on to the next one.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
update_resources / restore_resources (and their host_stat variants)
bound gpu_memory_reserved in raw bytes against int_gpu_mem_idle /
int_gpu_mem_free, which are KB-denominated columns — the same columns
the host fetch query reads back via ByteSize::kb(). Main memory on the
adjacent bind is correctly divided by KB, as are the GPU memory binds in
frame_dao and proc_dao.

Booking a GPU frame therefore decremented host GPU memory 1000x too
much: the `int_gpu_mem_idle >= $4` guard failed for any subsequent GPU
booking on the host (spurious HostResourcesExhausted), and the restore
path's `int_gpu_mem_idle + $4 <= int_gpu_mem` guard silently no-opped
the compensation. Divide by KB like every other memory bind.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The host dispatch lock used pg_try_advisory_lock, which is
session-scoped, while the unlock comment assumed "ending the transaction
will automatically unlock" — that is only true for xact-scoped advisory
locks. If the explicit pg_advisory_unlock ever failed (connection error,
panic path losing the await), the lock stayed attached to the pooled
session indefinitely: every later dispatch attempt for that host (or any
host whose id hashes to the same 32-bit key) would fail with HostLock
until the connection happened to be closed.

Switch to pg_try_advisory_xact_lock so the lock is released by the
commit/rollback that already ends every dispatch, and drop the manual
unlock path entirely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The cluster loop sent FeedMessage::Stop() whenever the pending-jobs
query failed. Under load a single pool acquire timeout or a network blip
was enough to permanently shut down the entire scheduler feed — the
stream ends, run() returns, and the process exits while the database is
perfectly healthy seconds later.

Log the error and put the failing cluster to sleep for the same back-off
used for empty passes; the next wake retries naturally. Persistent
outages still surface through the all-clusters-sleeping back-off (and
empty_job_cycles_before_quiting in test setups).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A cluster only slept when its job query returned nothing. On a saturated
farm — pending jobs everywhere but no host candidate fits — every pass
still found jobs, so the loop re-ran the pending-jobs query plus the
heavy per-job layers/frames query for every cluster continuously, with
only the 10ms round pause between rounds. With 10 shows and hundreds of
thousands of waiting frames this is a large, permanent database load
that accomplishes no dispatches.

Propagate the dispatched-frame count up from process_layer through
MatchingService::process, and put the cluster to sleep for the new
queue.cluster_saturated_sleep (default 5s) when a pass processed jobs
but placed zero frames. The default stays below the host cache refresh
interval's order of magnitude, so freed hosts are still picked up
promptly while no-op query load drops by orders of magnitude.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
reserve_host used scc insert_sync, which is a no-op when the key already
exists. If a previous dispatch left an expired reservation behind (e.g.
the dispatch task died before check-in), reserving the host again kept
the OLD timestamp: the reservation was already expired the moment it was
"taken", leaving the in-flight dispatch unprotected against a concurrent
checkout re-introduced by the periodic cache refresh. Upsert instead so
every reservation starts a fresh window.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant