Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination #2034
Open
glaziermag wants to merge 4 commits into EricLBuehler:master
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request on Mar 28, 2026: … abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request on Mar 28, 2026: …nd Infinite Thread Spin
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 29, 2026
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026: …y (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 53da949.
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026: …ty (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 4713992.
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request on Apr 2, 2026: …duler Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling complexity when sequences are waiting, and add FCFS priority ordering to prevent starvation.
Issue
Closes #2033. Under heavy concurrent generation load with constrained VRAM, users observed severe request starvation that caused large-context requests to time out.
Resolution & Algorithm Overhaul
This PR re-architects the PagedAttention preemption pipeline to uphold First-Come-First-Served (FCFS) prioritization and to eliminate thread-spin lockups in the scheduler.
The primary driver of the starvation was a conflict between strict FCFS request priority and the varlen batch-padding restrictions inherent to Flash Attention. The bucketing logic effectively behaved as Shortest-Job-First (SJF), because it grouped sequences around the smallest length. As a result, large, older completion sequences (evicted to the waiting pool) stayed starved: they were longer than any freshly prefilled prompt. The discriminator now targets the bucket containing the oldest timestamp in the pool, so preemption suspends new short requests rather than locking out the oldest generations.
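To make the change in bucket selection concrete, here is a minimal Rust sketch. It is not the code from this PR: `Seq`, `bucket_by_len`, `pick_shortest`, and `pick_oldest` are hypothetical names, and bucketing by exact length is a simplifying assumption about how sequences are grouped.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a scheduler sequence; not the mistral.rs type.
#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: usize,
    len: usize,   // current token length, used here as the bucket key
    arrival: u64, // arrival timestamp; smaller means older
}

/// Group sequences by length so each bucket could be batched without padding.
fn bucket_by_len(seqs: &[Seq]) -> BTreeMap<usize, Vec<Seq>> {
    let mut buckets: BTreeMap<usize, Vec<Seq>> = BTreeMap::new();
    for s in seqs {
        buckets.entry(s.len).or_default().push(s.clone());
    }
    buckets
}

/// SJF-like choice: the bucket with the smallest length wins.
fn pick_shortest(seqs: &[Seq]) -> Option<Vec<Seq>> {
    // BTreeMap iterates in ascending key order, so the first entry is the shortest.
    bucket_by_len(seqs).into_iter().next().map(|(_, bucket)| bucket)
}

/// FCFS-like choice: the bucket that contains the oldest arrival wins.
fn pick_oldest(seqs: &[Seq]) -> Option<Vec<Seq>> {
    bucket_by_len(seqs)
        .into_values()
        .min_by_key(|bucket| bucket.iter().map(|s| s.arrival).min().unwrap_or(u64::MAX))
}
```

With one old 900-token sequence and two newer 16-token sequences, `pick_shortest` returns the two short sequences, while `pick_oldest` returns the bucket holding the long, older one. Under this assumption the short requests are the ones left to wait.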
To stabilize decoding throughput, the `bucket_and_preempt_sequences` constraint is dropped during completion (decode) stages. Multimodal compatibility checks are handled separately in the decode phase: incompatible sequences (e.g., image and text inputs in the same chunk) are appended to a `deferred_running` queue, eliminating the infinite sequence-checking CPU spin that stalled execution queues.
The core KV allocator has been hardened against drop-induced starvation. The memory preemption loop walks down the queue and clears multiple lower-priority sequences until enough KV cache blocks are available. Sequences that still cannot be allocated are suspended in a hold state instead of being terminated as `FinishedIgnored` after an isolated allocation failure.
Side Effect & Regression Analysis
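As a rough illustration of the deferral and multi-victim preemption described in the resolution text, the sketch below shows one way the two steps could look. It rests on assumptions: `Seq`, `Modality`, `split_compatible`, and `preempt_until_fit` are hypothetical names, compatibility is reduced to a single modality tag, and the back of the queue is assumed to hold the lowest-priority (newest) sequences. Only the queue name `deferred_running` comes from the PR description.

```rust
use std::collections::VecDeque;

/// Hypothetical modality tag; the real compatibility check is richer than this.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Modality {
    Text,
    Image,
}

/// Hypothetical stand-in for a running sequence.
#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: usize,
    modality: Modality,
    blocks: usize, // KV blocks currently held
}

/// Split running sequences into a batch compatible with the first sequence
/// and a deferred queue. Each sequence is visited exactly once, so the pass
/// terminates even when most sequences are incompatible.
fn split_compatible(running: VecDeque<Seq>) -> (Vec<Seq>, VecDeque<Seq>) {
    let mut batch: Vec<Seq> = Vec::new();
    let mut deferred_running: VecDeque<Seq> = VecDeque::new();
    for seq in running {
        let compatible = batch
            .first()
            .map_or(true, |first| first.modality == seq.modality);
        if compatible {
            batch.push(seq);
        } else {
            deferred_running.push_back(seq);
        }
    }
    (batch, deferred_running)
}

/// Preempt from the back of the queue until `needed` blocks are free.
/// Returns the preempted sequences so the caller can put them back into
/// the waiting pool instead of dropping them.
fn preempt_until_fit(
    running: &mut VecDeque<Seq>,
    free_blocks: &mut usize,
    needed: usize,
) -> Vec<Seq> {
    let mut preempted = Vec::new();
    while *free_blocks < needed {
        match running.pop_back() {
            Some(victim) => {
                *free_blocks += victim.blocks;
                preempted.push(victim);
            }
            // Nothing left to preempt; the caller keeps the request waiting.
            None => break,
        }
    }
    preempted
}
```

The point of the design is that neither function loops over the same sequence twice: incompatible sequences are set aside once, and preemption stops either when enough blocks are free or when the queue is empty.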
In heavy end-to-end stress testing on an L4 GPU (SM89):
- … `deferred_running`, the 100% vCPU iteration thread deadlock was eliminated.
Empirical Execution Ground Truth (Before vs After)
Before (Unmodified baseline):
Ran 40 identical large generation sequences (4000 total completion tokens) under local memory constraints. Sequence-length discrimination caused severe fragmentation, starving 24+ of the newest/longest sequences.
After (With this PR applied):
With the completion-stage bucketing bypass in place, the same run finishes without timeouts: 100% of sequences completed at peak capacity.