Runtime CUDA Memory Expanding to OOM#2044

Open
glaziermag wants to merge 3 commits into EricLBuehler:master from glaziermag:fix-1589-memory-leak

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Mar 31, 2026

Resolves #1589 and resolves #1637.

This pull request fixes a memory-tracking leak that occurs during sequence preemption and termination, affecting both the PagedAttention scheduler and the default scheduler used for CPU/GPU layer-split configurations.

Changes Made

  • mistralrs-core/src/paged_attention/scheduler.rs: Updated free_finished_sequence_groups to also purge aborted sequences from the waiting queue, so their KV-cache blocks are reliably reclaimed.
  • mistralrs-core/src/scheduler/default_scheduler.rs: Applied the same .retain() cleanup loop to self.waiting. This covers CPU/GPU layer-split configurations served by the default scheduler, preventing stale entries from timed-out sequences from corrupting the queue state.
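The cleanup applied in both schedulers can be sketched as follows. This is a minimal illustration only: the `Sequence`, `SequenceState`, and `Scheduler` types here are hypothetical stand-ins, not the actual mistral.rs definitions.

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins for the scheduler's sequence state.
#[derive(Debug, PartialEq)]
enum SequenceState {
    Waiting,
    FinishedAborted,
    Done,
}

struct Sequence {
    id: usize,
    state: SequenceState,
}

struct Scheduler {
    waiting: VecDeque<Sequence>,
}

impl Scheduler {
    /// Purge aborted/finished sequences from the waiting queue so their
    /// KV-cache blocks can be reclaimed instead of accumulating until OOM.
    fn free_finished_sequence_groups(&mut self) {
        self.waiting.retain(|seq| {
            !matches!(
                seq.state,
                SequenceState::FinishedAborted | SequenceState::Done
            )
        });
    }
}

fn main() {
    let mut sched = Scheduler {
        waiting: VecDeque::from(vec![
            Sequence { id: 0, state: SequenceState::Waiting },
            Sequence { id: 1, state: SequenceState::FinishedAborted },
            Sequence { id: 2, state: SequenceState::Done },
            Sequence { id: 3, state: SequenceState::Waiting },
        ]),
    };
    sched.free_finished_sequence_groups();
    let ids: Vec<usize> = sched.waiting.iter().map(|s| s.id).collect();
    println!("{:?}", ids); // [0, 3]
}
```

The key point is that `retain` drops every entry whose sequence has already terminated, so nothing in the waiting queue keeps a reference to freed KV blocks.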

Empirical Load Testing

Environment A: PagedAttention Engine

Before Log (Baseline Crash): Under sustained HTTP load, mistralrs-server consistently hit an OutOfMemory error because orphaned sequences occupied KV-cache blocks indefinitely.

2026-03-30T10:14:02.126453Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 124.60, Prefix cache hitrate 99.12%, 16 running, 14 waiting
Error: Candle(Cuda(CudaError { kind: OutOfMemory }))

After Log (Patched Run): KV-cache allocations for dropped sequences are now reclaimed promptly, and memory usage stays bounded for the duration of the test.

2026-03-30T10:45:01.354453Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 124.60, Prefix cache hitrate 99.12%, 16 running, 14 waiting
2026-03-30T10:45:15.654689Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 123.00, Prefix cache hitrate 99.12%, 16 running, 14 waiting
2026-03-30T10:45:30.854847Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 122.90, Prefix cache hitrate 99.12%, 16 running, 14 waiting

Environment B: Legacy Layer-Split Engine (Fallback Scheduler)

Tested with mistralrs-server --no-paged-attn to simulate a cross-device CPU/GPU fallback configuration.

Before Log (Baseline Crash): A request timeout left stale entries in the waiting VecDeque, reproducing the shape-mismatch symptoms described in the issue:

2026-03-31T02:44:56.656349Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 376.00, Prefix cache hitrate 98.36%, 10 running, 5 waiting
2026-03-31T02:45:11.656774Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 119.00, Prefix cache hitrate 98.36%, 9 running, 5 waiting

thread '<unnamed>' (24987) panicked at mistralrs-core/src/kv_cache/mod.rs:296:54:
called `Result::unwrap()` on an `Err` value: shape mismatch on dim 2, 512 <> 1024

After Log (Patched Run): With the patch, aborted sequences are removed from the waiting queue and the load test runs to completion without shape mismatches:

2026-03-31T02:58:16.048573Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 117.80, Prefix cache hitrate 96.77%, 5 running, 24 waiting
2026-03-31T02:58:36.049727Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 112.40, Prefix cache hitrate 96.77%, 9 running, 19 waiting
2026-03-31T02:58:56.051019Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 128.40, Prefix cache hitrate 96.77%, 11 running, 8 waiting
2026-03-31T02:59:16.051782Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 122.80, Prefix cache hitrate 96.77%, 14 running, 3 waiting
2026-03-31T02:59:36.052510Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 119.60, Prefix cache hitrate 96.77%, 13 running, 3 waiting

@glaziermag glaziermag marked this pull request as draft March 31, 2026 01:42
@glaziermag glaziermag changed the title Fix: Runtime CUDA Memory Ballooning to OOM (#1589) Runtime CUDA Memory Ballooning to OOM Mar 31, 2026
… queue

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
@glaziermag glaziermag marked this pull request as ready for review March 31, 2026 03:15
@glaziermag glaziermag changed the title Runtime CUDA Memory Ballooning to OOM Runtime CUDA Memory Expanding to OOM Mar 31, 2026


Development

Successfully merging this pull request may close these issues.

  • Error with masking when putting layers on both gpu and cpu
  • Runtime CUDA Memory Ballooning to OOM in Mistralrs while Candle-vllm is Stable

1 participant