Runtime CUDA Memory Expanding to OOM#2044

Open
glaziermag wants to merge 3 commits into EricLBuehler:master from glaziermag:fix-1589-memory-leak

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Mar 31, 2026

Resolves #1589 and resolves #1637.

This pull request fixes a memory-tracking leak that occurs during sequence preemption and termination, affecting both the PagedAttention scheduler and the default scheduler used for CPU/GPU layer-split configurations.

Changes Made

  • mistralrs-core/src/paged_attention/scheduler.rs: Updated free_finished_sequence_groups to also purge aborted sequences from the waiting queue, so their KV-cache blocks are reliably reclaimed.
  • mistralrs-core/src/scheduler/default_scheduler.rs: Applied the same .retain() cleanup loop to self.waiting. This covers CPU/GPU layer-split configurations served by the default scheduler, preventing stale entries from timed-out sequences from corrupting the queue state.
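The cleanup applied in both schedulers can be sketched as follows. This is a minimal illustration only: the `Sequence`, `SequenceState`, and `Scheduler` types here are hypothetical stand-ins, not the actual mistral.rs definitions.

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins for the scheduler's sequence state.
#[derive(Debug, PartialEq)]
enum SequenceState {
    Waiting,
    FinishedAborted,
    Done,
}

struct Sequence {
    id: usize,
    state: SequenceState,
}

struct Scheduler {
    waiting: VecDeque<Sequence>,
}

impl Scheduler {
    /// Purge aborted/finished sequences from the waiting queue so their
    /// KV-cache blocks can be reclaimed instead of accumulating until OOM.
    fn free_finished_sequence_groups(&mut self) {
        self.waiting.retain(|seq| {
            !matches!(
                seq.state,
                SequenceState::FinishedAborted | SequenceState::Done
            )
        });
    }
}

fn main() {
    let mut sched = Scheduler {
        waiting: VecDeque::from(vec![
            Sequence { id: 0, state: SequenceState::Waiting },
            Sequence { id: 1, state: SequenceState::FinishedAborted },
            Sequence { id: 2, state: SequenceState::Done },
            Sequence { id: 3, state: SequenceState::Waiting },
        ]),
    };
    sched.free_finished_sequence_groups();
    let ids: Vec<usize> = sched.waiting.iter().map(|s| s.id).collect();
    println!("{:?}", ids); // [0, 3]
}
```

The key point is that `retain` drops every entry whose sequence has already terminated, so nothing in the waiting queue keeps a reference to freed KV blocks.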

Empirical Load Testing

Environment A: PagedAttention Engine

Before Log (Baseline Crash): Under sustained HTTP load, mistralrs-server consistently hit an OutOfMemory error because orphaned sequences occupied KV-cache blocks indefinitely.

2026-03-30T10:14:02.126453Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 124.60, Prefix cache hitrate 99.12%, 16 running, 14 waiting
Error: Candle(Cuda(CudaError { kind: OutOfMemory }))

After Log (Patched Run): KV-cache allocations for dropped sequences are now reclaimed promptly, and memory usage stays bounded for the duration of the test.

2026-03-30T10:45:01.354453Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 124.60, Prefix cache hitrate 99.12%, 16 running, 14 waiting
2026-03-30T10:45:15.654689Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 123.00, Prefix cache hitrate 99.12%, 16 running, 14 waiting
2026-03-30T10:45:30.854847Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 122.90, Prefix cache hitrate 99.12%, 16 running, 14 waiting

Environment B: Legacy Layer-Split Engine (Fallback Scheduler)

Tested with mistralrs-server --no-paged-attn to simulate a cross-device CPU/GPU fallback configuration.

Before Log (Baseline Crash): A request timeout left stale entries in the waiting VecDeque, reproducing the shape-mismatch symptoms described in the issue:

2026-03-31T02:44:56.656349Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 376.00, Prefix cache hitrate 98.36%, 10 running, 5 waiting
2026-03-31T02:45:11.656774Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 119.00, Prefix cache hitrate 98.36%, 9 running, 5 waiting

thread '<unnamed>' (24987) panicked at mistralrs-core/src/kv_cache/mod.rs:296:54:
called `Result::unwrap()` on an `Err` value: shape mismatch on dim 2, 512 <> 1024

After Log (Patched Run): With the patch, aborted sequences are removed from the waiting queue and the load test runs to completion without shape mismatches:

2026-03-31T02:58:16.048573Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 117.80, Prefix cache hitrate 96.77%, 5 running, 24 waiting
2026-03-31T02:58:36.049727Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 112.40, Prefix cache hitrate 96.77%, 9 running, 19 waiting
2026-03-31T02:58:56.051019Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 128.40, Prefix cache hitrate 96.77%, 11 running, 8 waiting
2026-03-31T02:59:16.051782Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 122.80, Prefix cache hitrate 96.77%, 14 running, 3 waiting
2026-03-31T02:59:36.052510Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 119.60, Prefix cache hitrate 96.77%, 13 running, 3 waiting

@glaziermag glaziermag marked this pull request as draft March 31, 2026 01:42
@glaziermag glaziermag changed the title Fix: Runtime CUDA Memory Ballooning to OOM (#1589) Runtime CUDA Memory Ballooning to OOM Mar 31, 2026
… queue

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
@glaziermag glaziermag marked this pull request as ready for review March 31, 2026 03:15
@glaziermag glaziermag changed the title Runtime CUDA Memory Ballooning to OOM Runtime CUDA Memory Expanding to OOM Mar 31, 2026


Development

Successfully merging this pull request may close these issues.

  • Error with masking when putting layers on both gpu and cpu
  • Runtime CUDA Memory Ballooning to OOM in Mistralrs while Candle-vllm is Stable

1 participant