
Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination#2034

Open
glaziermag wants to merge 4 commits into EricLBuehler:master from glaziermag:fix-2033

Conversation


@glaziermag glaziermag commented Mar 28, 2026

Issue
Closes #2033. Under heavy concurrent generation load against constrained VRAM, users observed catastrophic request starvation: large-context requests were repeatedly preempted until they timed out.

Resolution & Algorithm Overhaul
This PR re-architects the PagedAttention preemption pipeline to uphold First-Come-First-Serve (FCFS) prioritization and eliminate busy-wait (thread-spin) scheduling locks.

The primary driver of the starvation was an architectural conflict between strict FCFS request priority and the varlen batch-padding restrictions inherent to Flash Attention. The bucketing logic effectively implemented Shortest-Job-First (SJF) scheduling by grouping sequences around the smallest matching length. As a result, large, older completion sequences (evicted to the waiting pool) stayed starved indefinitely, because they were always longer than any freshly prefilled sub-prompt. Changing the discriminator heuristic to select the bucket containing the oldest chronological timestamp in the pool pivots eviction toward suspending new short requests instead of locking out the oldest generations.
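The heuristic change can be sketched as follows. This is a minimal illustration, not the actual mistral.rs scheduler code: `SeqBucket` and its fields are hypothetical stand-ins for the real bucketing structures.

```rust
// Hypothetical sketch of the discriminator change: instead of picking the
// bucket with the shortest padded length (SJF), pick the bucket that
// contains the oldest arrival timestamp, preserving FCFS.

#[derive(Debug, Clone)]
pub struct SeqBucket {
    pub len: usize,            // padded sequence length shared by the bucket
    pub oldest_timestamp: u64, // earliest arrival time among its sequences
}

/// Old heuristic: shortest-length bucket wins (starves long, old sequences).
pub fn pick_bucket_sjf(buckets: &[SeqBucket]) -> Option<&SeqBucket> {
    buckets.iter().min_by_key(|b| b.len)
}

/// New heuristic: the bucket holding the chronologically oldest sequence wins.
pub fn pick_bucket_fcfs(buckets: &[SeqBucket]) -> Option<&SeqBucket> {
    buckets.iter().min_by_key(|b| b.oldest_timestamp)
}
```

With one old, large evicted sequence and a stream of fresh short prompts, the SJF rule keeps choosing the short bucket forever, while the FCFS rule immediately reschedules the old one.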

To stabilize decoding throughput, the bucket_and_preempt_sequences length constraint is dropped during the decoding (completion) stage. Multimodal compatibility checks have been moved out of the decoding-phase loop: incompatible structures (e.g., image vs. text constraints in the same chunk) are appended to a deferred_running queue, eliminating the infinite sequence-checking CPU spin loops that stalled execution queues.
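The deferral idea can be sketched like this. All names here (`Modality`, `Seq`, `split_compatible`, `deferred_running`) are illustrative assumptions, not the PR's actual types: the point is that an incompatible sequence is set aside for the next scheduler iteration rather than re-checked in a tight loop.

```rust
// Sketch: instead of spinning on an incompatible sequence, partition the
// running queue into a schedulable batch (uniform modality) and a
// `deferred_running` queue that is revisited on the next iteration.
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Modality {
    Text,
    Image,
}

#[derive(Debug)]
pub struct Seq {
    pub id: usize,
    pub modality: Modality,
}

/// Split the running queue: sequences matching the batch's modality are
/// scheduled now; mismatched ones are deferred instead of rechecked forever.
pub fn split_compatible(running: VecDeque<Seq>) -> (Vec<Seq>, VecDeque<Seq>) {
    let mut batch: Vec<Seq> = Vec::new();
    let mut deferred_running: VecDeque<Seq> = VecDeque::new();
    for seq in running {
        match batch.first() {
            // Mismatched modality: defer rather than block the whole batch.
            Some(first) if first.modality != seq.modality => {
                deferred_running.push_back(seq)
            }
            // First sequence, or a compatible one: schedule it.
            _ => batch.push(seq),
        }
    }
    (batch, deferred_running)
}
```

Because the deferred queue is drained on a later pass rather than polled in place, the scheduler thread makes forward progress every iteration instead of pegging a vCPU.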

The core KV allocator bounds have been fortified against drop-starvation triggers. The memory preemption loop now walks down the queue, clearing multiple lower-priority sequence contexts until sufficient KV-cache blocks are available. Threads suspend into an evaluation-hold state instead of terminating sequences as FinishedIgnored after an isolated allocation failure.
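A simplified sketch of that preemption loop is below. `RunningSeq`, `preempt_until_fits`, and the block accounting are hypothetical simplifications of the allocator's real bookkeeping; the behavior shown is the PR's described policy of preempting as many victims as needed (newest first) rather than giving up after one failed allocation.

```rust
// Sketch: walk the running queue from the back (newest sequences first)
// and preempt victims until enough KV-cache blocks are free, instead of
// aborting a sequence as `FinishedIgnored` after a single failure.

#[derive(Debug)]
pub struct RunningSeq {
    pub id: usize,
    pub blocks_held: usize, // KV-cache blocks this sequence occupies
}

/// Preempt sequences until `needed` blocks are available. Returns the ids
/// moved back to the waiting pool, or `None` if even preempting every
/// running sequence cannot satisfy the request.
pub fn preempt_until_fits(
    running: &mut Vec<RunningSeq>,
    free_blocks: &mut usize,
    needed: usize,
) -> Option<Vec<usize>> {
    let mut preempted = Vec::new();
    while *free_blocks < needed {
        // Newest sequence is preempted first, preserving FCFS priority.
        let victim = running.pop()?;
        *free_blocks += victim.blocks_held;
        preempted.push(victim.id);
    }
    Some(preempted)
}
```

Preempted sequences go back to the waiting pool in an evaluation-hold state, so they are rescheduled later instead of being lost.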

Side Effect & Regression Analysis

Based on heavy E2E boundary testing on an SM89 (NVIDIA L4) GPU under constrained memory:

  1. Background Error Rates (0% Regression): Zero runtime panics or preemption warnings logged during mass sequence stress loading.
  2. Memory/CPU Footprint (Stabilized): Moving mismatched sequence compatibility logic to deferred_running eliminated the 100%-vCPU scheduler-iteration deadlock.
  3. Throughput Impact (Acceleration): Allowing jagged decoding completions alongside strict chronological extraction boundaries let execution avoid iteration timeouts.

Empirical Execution Ground Truth (Before vs After)

Before (Unmodified baseline):
Running 40 massive identical generation sequences (4000 total completion tokens) over local memory constraints. Sequence length discrimination caused intense fragmentation—starving 24+ of the newest/longest sequences!

Spawning 40 concurrent generation tasks to forcibly trigger the VRAM Block allocator eviction preemption limit...
Request 15: 246.52s
Request 16: STARVED (Timeout)
Request 17: STARVED (Timeout)
Request 18: STARVED (Timeout)
...
Request 30: STARVED (Timeout)
...
Request 39: STARVED (Timeout)

After (With this PR applied):
With the completion-bucket bypass and strict chronological extraction in place, execution metrics are free of timeouts. 100% sequence completion achieved under peak capacity!

Spawning 40 concurrent generation tasks to forcibly trigger the VRAM Block allocator eviction preemption limit...
--- Benchmark Finished in 195.68 seconds ---
Average Latency: 148.58s
Min Latency (Fastest): 90.05s
Max Latency (Slowest/Starved): 192.26s
Seq 00: 90.15s
Seq 01: 90.05s
Seq 02: 90.99s
Seq 03: 90.89s
Seq 04: 91.83s
Seq 05: 91.73s
Seq 06: 91.63s
Seq 07: 92.62s
Seq 16: 181.44s
Seq 17: 181.34s
...
Seq 32: 192.26s
Seq 33: 192.16s
...
Seq 38: 191.78s
Seq 39: 191.74s

@glaziermag glaziermag marked this pull request as draft March 28, 2026 02:46
@glaziermag glaziermag changed the title Fix: Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Mar 28, 2026
@glaziermag glaziermag changed the title Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination Mar 28, 2026
@glaziermag glaziermag closed this Mar 28, 2026
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
… abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
@glaziermag glaziermag reopened this Mar 28, 2026
@glaziermag glaziermag marked this pull request as ready for review March 28, 2026 22:42
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request Mar 30, 2026
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request Apr 2, 2026
…duler

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling
complexity when sequences are waiting, and add FCFS priority ordering
to prevent starvation.


Development

Successfully merging this pull request may close these issues.

PagedAttentionScheduler priority sorting causes First-Come-Last-Served behavior
