
Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination#2034

Open
glaziermag wants to merge 4 commits into EricLBuehler:master from glaziermag:fix-2033

Conversation


@glaziermag glaziermag commented Mar 28, 2026

Issue
Closes #2033. Under heavy concurrent generation load against constrained VRAM, users observed catastrophic request starvation: large-context requests were repeatedly preempted until they timed out.

Resolution & Algorithm Overhaul
This PR re-architects the PagedAttention preemption pipeline to uphold First-Come-First-Serve (FCFS) prioritization and eliminate busy-wait (thread-spin) scheduling locks.

The primary driver of the starvation was an architectural conflict between strict FCFS request priority and the varlen batch-padding restrictions inherent to Flash Attention. The bucketing logic effectively implemented Shortest-Job-First (SJF) scheduling by grouping sequences around the smallest matching length. As a result, large, older completion sequences (evicted to the waiting pool) stayed starved indefinitely, because they were always longer than any freshly prefilled sub-prompt. Changing the discriminator heuristic to select the bucket containing the oldest chronological timestamp in the pool pivots eviction toward suspending new short requests instead of locking out the oldest generations.
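The heuristic change can be sketched as follows. This is a minimal illustration, not the actual mistral.rs scheduler code: `SeqBucket` and its fields are hypothetical stand-ins for the real bucketing structures.

```rust
// Hypothetical sketch of the discriminator change: instead of picking the
// bucket with the shortest padded length (SJF), pick the bucket that
// contains the oldest arrival timestamp, preserving FCFS.

#[derive(Debug, Clone)]
pub struct SeqBucket {
    pub len: usize,            // padded sequence length shared by the bucket
    pub oldest_timestamp: u64, // earliest arrival time among its sequences
}

/// Old heuristic: shortest-length bucket wins (starves long, old sequences).
pub fn pick_bucket_sjf(buckets: &[SeqBucket]) -> Option<&SeqBucket> {
    buckets.iter().min_by_key(|b| b.len)
}

/// New heuristic: the bucket holding the chronologically oldest sequence wins.
pub fn pick_bucket_fcfs(buckets: &[SeqBucket]) -> Option<&SeqBucket> {
    buckets.iter().min_by_key(|b| b.oldest_timestamp)
}
```

With one old, large evicted sequence and a stream of fresh short prompts, the SJF rule keeps choosing the short bucket forever, while the FCFS rule immediately reschedules the old one.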

To stabilize decoding throughput, the bucket_and_preempt_sequences length constraint is dropped during the decoding (completion) stage. Multimodal compatibility checks have been moved out of the decoding-phase loop: incompatible structures (e.g., image vs. text constraints in the same chunk) are appended to a deferred_running queue, eliminating the infinite sequence-checking CPU spin loops that stalled execution queues.
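The deferral idea can be sketched like this. All names here (`Modality`, `Seq`, `split_compatible`, `deferred_running`) are illustrative assumptions, not the PR's actual types: the point is that an incompatible sequence is set aside for the next scheduler iteration rather than re-checked in a tight loop.

```rust
// Sketch: instead of spinning on an incompatible sequence, partition the
// running queue into a schedulable batch (uniform modality) and a
// `deferred_running` queue that is revisited on the next iteration.
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Modality {
    Text,
    Image,
}

#[derive(Debug)]
pub struct Seq {
    pub id: usize,
    pub modality: Modality,
}

/// Split the running queue: sequences matching the batch's modality are
/// scheduled now; mismatched ones are deferred instead of rechecked forever.
pub fn split_compatible(running: VecDeque<Seq>) -> (Vec<Seq>, VecDeque<Seq>) {
    let mut batch: Vec<Seq> = Vec::new();
    let mut deferred_running: VecDeque<Seq> = VecDeque::new();
    for seq in running {
        match batch.first() {
            // Mismatched modality: defer rather than block the whole batch.
            Some(first) if first.modality != seq.modality => {
                deferred_running.push_back(seq)
            }
            // First sequence, or a compatible one: schedule it.
            _ => batch.push(seq),
        }
    }
    (batch, deferred_running)
}
```

Because the deferred queue is drained on a later pass rather than polled in place, the scheduler thread makes forward progress every iteration instead of pegging a vCPU.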

The core KV allocator bounds have been fortified against drop-starvation triggers. The memory preemption loop now walks down the queue, clearing multiple lower-priority sequence contexts until sufficient KV-cache blocks are available. Threads suspend into an evaluation-hold state instead of terminating sequences as FinishedIgnored after an isolated allocation failure.
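A simplified sketch of that preemption loop is below. `RunningSeq`, `preempt_until_fits`, and the block accounting are hypothetical simplifications of the allocator's real bookkeeping; the behavior shown is the PR's described policy of preempting as many victims as needed (newest first) rather than giving up after one failed allocation.

```rust
// Sketch: walk the running queue from the back (newest sequences first)
// and preempt victims until enough KV-cache blocks are free, instead of
// aborting a sequence as `FinishedIgnored` after a single failure.

#[derive(Debug)]
pub struct RunningSeq {
    pub id: usize,
    pub blocks_held: usize, // KV-cache blocks this sequence occupies
}

/// Preempt sequences until `needed` blocks are available. Returns the ids
/// moved back to the waiting pool, or `None` if even preempting every
/// running sequence cannot satisfy the request.
pub fn preempt_until_fits(
    running: &mut Vec<RunningSeq>,
    free_blocks: &mut usize,
    needed: usize,
) -> Option<Vec<usize>> {
    let mut preempted = Vec::new();
    while *free_blocks < needed {
        // Newest sequence is preempted first, preserving FCFS priority.
        let victim = running.pop()?;
        *free_blocks += victim.blocks_held;
        preempted.push(victim.id);
    }
    Some(preempted)
}
```

Preempted sequences go back to the waiting pool in an evaluation-hold state, so they are rescheduled later instead of being lost.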

Side Effect & Regression Analysis

Based on heavy E2E boundary testing on an SM89 (NVIDIA L4) GPU under constrained memory:

  1. Background Error Rates (0% Regression): Zero runtime panics or preemption warnings logged during mass sequence stress loading.
  2. Memory/CPU Footprint (Stabilized): Moving mismatched sequence compatibility logic to deferred_running eliminated the 100%-vCPU scheduler-iteration deadlock.
  3. Throughput Impact (Acceleration): Allowing jagged decoding completions alongside strict chronological extraction boundaries let execution avoid iteration timeouts.

Empirical Execution Ground Truth (Before vs After)

Before (Unmodified baseline):
Running 40 massive identical generation sequences (4000 total completion tokens) over local memory constraints. Sequence length discrimination caused intense fragmentation—starving 24+ of the newest/longest sequences!

Spawning 40 concurrent generation tasks to forcibly trigger the VRAM Block allocator eviction preemption limit...
Request 15: 246.52s
Request 16: STARVED (Timeout)
Request 17: STARVED (Timeout)
Request 18: STARVED (Timeout)
...
Request 30: STARVED (Timeout)
...
Request 39: STARVED (Timeout)

After (With this PR applied):
With the completion-bucket bypass and strict chronological extraction in place, execution metrics are free of timeouts. 100% sequence completion achieved under peak capacity!

Spawning 40 concurrent generation tasks to forcibly trigger the VRAM Block allocator eviction preemption limit...
--- Benchmark Finished in 195.68 seconds ---
Average Latency: 148.58s
Min Latency (Fastest): 90.05s
Max Latency (Slowest/Starved): 192.26s
Seq 00: 90.15s
Seq 01: 90.05s
Seq 02: 90.99s
Seq 03: 90.89s
Seq 04: 91.83s
Seq 05: 91.73s
Seq 06: 91.63s
Seq 07: 92.62s
Seq 16: 181.44s
Seq 17: 181.34s
...
Seq 32: 192.26s
Seq 33: 192.16s
...
Seq 38: 191.78s
Seq 39: 191.74s

@glaziermag glaziermag marked this pull request as draft March 28, 2026 02:46
@glaziermag glaziermag changed the title Fix: Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Mar 28, 2026
@glaziermag glaziermag changed the title Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination Mar 28, 2026
@glaziermag glaziermag closed this Mar 28, 2026
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
… abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
@glaziermag glaziermag reopened this Mar 28, 2026
@glaziermag glaziermag marked this pull request as ready for review March 28, 2026 22:42
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request Mar 30, 2026
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request Apr 2, 2026
…duler

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling
complexity when sequences are waiting, and add FCFS priority ordering
to prevent starvation.


Development

Successfully merging this pull request may close these issues.

PagedAttentionScheduler priority sorting causes First-Come-Last-Served behavior
