Fix PagedAttention Scheduler O(N^2) Thrashing #2031
Open
glaziermag wants to merge 2 commits into EricLBuehler:master from
Conversation
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 29, 2026
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026:
…y (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 53da949.
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026:
…ty (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 4713992.
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request on Apr 2, 2026:
…duler Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling complexity when sequences are waiting, and add FCFS priority ordering to prevent starvation.
Fixes #2024.
Context
The `PagedAttentionScheduler` previously invoked `bucket_and_preempt_sequences()` during the Completion scheduling phase. Since PagedAttention backends inherently support variable sequence lengths during decoding via block tables, strict length bucketing during the completion phase is unnecessary.

This behavior caused a severe $O(N^2)$ memory-thrashing issue: running completions with misaligned lengths were selectively filtered into non-matching buckets, preempted back to the `Waiting` state, and their KV caches were evicted. On the subsequent tick, the engine was forced to redundantly re-prefill their entire query contexts, severely degrading continuous batching performance.

Affected Workloads
This issue affected nearly 100% of continuous batching workloads that served >1 concurrent request. Because standard requests inevitably diverge in generated token length during decoding, they quickly drift into different length buckets, triggering the preemption loop. Single-batch workloads (batch size 1) were unaffected.
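To make the cost concrete, here is a minimal toy cost model (hypothetical, not code from this PR or from mistral.rs) contrasting the total decode-side token processing when a sequence is evicted and re-prefilled every tick versus when its KV cache is preserved:

```rust
// Toy model of the thrashing cost described above. If a running sequence
// is preempted on every tick, the engine must re-prefill all tokens
// processed so far before it can generate one more token.

fn tokens_processed_with_thrashing(target_len: usize) -> usize {
    // Prefill restarts from scratch each tick: 1 + 2 + ... + n => O(n^2).
    (1..=target_len).sum()
}

fn tokens_processed_without_thrashing(target_len: usize) -> usize {
    // KV cache is kept, so each tick processes exactly one new token: O(n).
    target_len
}

fn main() {
    let n = 1_000;
    // ~500x more work for a 1000-token generation when thrashing.
    println!("thrashing: {}", tokens_processed_with_thrashing(n)); // 500500
    println!("cached:    {}", tokens_processed_without_thrashing(n)); // 1000
}
```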
The Fix
This PR modifies `scheduler.rs` so that `bucket_and_preempt_sequences` behaves safely during completions. Instead of forcefully executing `_preempt()` across mismatched lengths or modalities, unmatched sequences are deferred to the next tick (`deferred_running`). This natively leverages PyTorch/Candle's ability to handle jagged context sizes without dropping generation caches or triggering PyTorch `index-select` bounds mismatches (which can occur in M-RoPE models like Qwen2-VL when modalities are improperly mixed).

Benchmarks & Stability Proofs
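The deferral strategy can be sketched roughly as follows. All names (`Seq`, `schedule_tick`, the bucket parameters) are illustrative, not the actual mistral.rs scheduler API; the point is that mismatched sequences are set aside with their caches intact rather than preempted:

```rust
// Hypothetical sketch of deferring rather than preempting. Sequences that
// don't match the current batch's length/modality stay in a deferred list
// and are retried next tick; their KV caches are never evicted.

#[derive(Debug)]
struct Seq {
    id: usize,
    len: usize,
    has_images: bool,
}

/// Split `running` into the batch scheduled this tick and the sequences
/// deferred (caches preserved) to the next tick.
fn schedule_tick(
    running: Vec<Seq>,
    batch_len: usize,
    batch_has_images: bool,
) -> (Vec<Seq>, Vec<Seq>) {
    running
        .into_iter()
        .partition(|s| s.len == batch_len && s.has_images == batch_has_images)
}

fn main() {
    let running = vec![
        Seq { id: 0, len: 128, has_images: false },
        Seq { id: 1, len: 131, has_images: false }, // length drifted
        Seq { id: 2, len: 128, has_images: true },  // different modality
    ];
    let (scheduled, deferred) = schedule_tick(running, 128, false);
    assert_eq!(scheduled.len(), 1);
    assert_eq!(scheduled[0].id, 0);
    // The mismatched sequences wait one tick instead of being preempted.
    assert_eq!(deferred.len(), 2);
}
```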
To confirm the fix and ensure no regressions in mixed-modality workloads, tests were run on an L4 GPU (CUDA 12.4.1) using `Qwen/Qwen2.5-0.5B-Instruct` and `Qwen/Qwen2-VL-2B-Instruct` with `--prefix-cache-n 10` enabled.

Case 1: The O(N^2) Allocator Thrashing (Issue #2024)
Testing with 6 concurrent, heavy text-generation requests of intentionally misaligned lengths:
Before Log:
After Log:
Case 2: Multi-Modal Stability
Testing an asynchronous VLM execution (`has_images=true`) alongside disparate prefix-cache text requests (`has_images=false`).

Before (Unpatched `master`) Bounds Panic:

Note: Without the subset deferrals, batching vision payloads alongside text lengths instantly triggered a CUDA assert panic across the M-RoPE boundary.
After (Patched `fix-issue-2024-pa-allocator`) Native Isolation:

Note: The unbucketing isolation routes modalities safely without preempting caches. The pipeline generated concurrent inferences successfully.
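The modality-isolation idea above can be sketched as a simple grouping step (illustrative names only, not mistral.rs internals): vision (M-RoPE) and text sequences are routed into separate sub-batches so they are never mixed in one forward pass.

```rust
// Hypothetical sketch: group a mixed batch by its `has_images` flag so
// each modality runs in its own sub-batch.
use std::collections::HashMap;

#[derive(Debug)]
struct Seq {
    id: usize,
    has_images: bool,
}

fn split_by_modality(batch: Vec<Seq>) -> HashMap<bool, Vec<Seq>> {
    let mut groups: HashMap<bool, Vec<Seq>> = HashMap::new();
    for seq in batch {
        groups.entry(seq.has_images).or_default().push(seq);
    }
    groups
}

fn main() {
    let batch = vec![
        Seq { id: 0, has_images: true },  // VLM request
        Seq { id: 1, has_images: false }, // text request
        Seq { id: 2, has_images: false }, // text request
    ];
    let groups = split_by_modality(batch);
    // One vision sub-batch, one text sub-batch; neither preempts the other.
    assert_eq!(groups[&true].len(), 1);
    assert_eq!(groups[&false].len(), 2);
}
```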
Case 2 Python Test Script