
fix(paged-attn): resolve scheduler queue loop deadlock under memory pressure#2043

Open
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix-paged-attention-deadlock-1470

Conversation


@glaziermag glaziermag commented Mar 31, 2026

Description

This PR addresses a deadlock issue in PagedAttentionScheduler where the inference engine could hang during preemption under heavy KV cache memory pressure (resolves #1470).

Previously, when a sequence was preempted in the schedule method via _preempt(), it was pushed to the front of the self.waiting queue. Because the main starvation loop iterated over the queue non-destructively using self.waiting.front(), the newly preempted sequence was immediately targeted for re-evaluation. This caused an infinite loop of preemption and failed allocations.

Changes

  • Refactored the waiting sequence queue iteration in scheduler.rs to proactively pop_front() the sequence before checking its allocation status.
  • If the scheduling loop must break early (due to max_num_seqs caps or unresolvable starvation conditions requiring a later retry), the sequence is safely re-inserted via self.waiting.push_front(seq).

Testing

  • Reproduced the deadlock condition under a heavy simulated workload (30 concurrent requests) on a g2-standard-32 instance.
  • Verified that the patched logic properly filters preempted sequences to the next pipeline tick, maintaining stable throughput without triggering the infinite WARN log cycle.

Blast Radius Verification

To verify the structural integrity of the scheduler state, we ran a concurrent barrage of large Qwen2.5-3B completions on a strictly bounded 4096-token KV cache environment specifically designed to exhaust the BlockPool.

Before (Unpatched):
The engine immediately hit the KV cache bottleneck and permanently hung at 0.00 T/s, thrashing the preemption loop as every allocation attempt returned AllocStatus::Later:

2026-03-30T23:52:01.962646Z  WARN mistralrs_core::paged_attention::scheduler: Sequence 5 with length of 351 tokens still exceeds KV cache size even after evicting another sequence.
2026-03-30T23:52:01.962654Z  WARN mistralrs_core::paged_attention::scheduler: Sequence 5 with length of 351 tokens still exceeds KV cache size even after evicting another sequence.
... <Infinitely spammed leading to 42MB deadlocked dump> ...

After (Patched):
Under identical aiohttp saturation tests, the preempted sequence was correctly deferred to the waiting queue and the scheduler advanced, letting independent completions run until block space was freed. The load finished successfully, with per-interval throughput climbing from roughly 1.3K to 1.8K T/s:

2026-03-31T00:05:18.231240Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 1328.40, Prefix cache hitrate 0.00%, 4 running, 26 waiting
2026-03-31T00:05:23.231414Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 1558.40, Prefix cache hitrate 0.00%, 15 running, 15 waiting
2026-03-31T00:05:38.231924Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 1848.60, Prefix cache hitrate 0.00%, 15 running, 15 waiting
... <Gracefully finishes execution> ...

@glaziermag glaziermag marked this pull request as ready for review March 31, 2026 00:27

Development

Successfully merging this pull request may close these issues.

Caching Logic Appears to Cause Model Hangs