Fix Gemma softcap F16 overflow NaN and scheduler hang (#2058)#2076

Open
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix-2058-gemma-clean

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Apr 8, 2026

Fixes an inference hang specific to Gemma models (#2058), caused by NaN propagation from a numerical overflow during softcapping.

Cause

  1. Gemma architectures apply a softcap to attention scores with a scaling factor of 50.0. Under f16 precision, an intermediate value in this multiplication can exceed f16::MAX (65504), overflowing to infinity; subsequent operations then propagate NaN values.
  2. Sequences that entered an error state due to the NaN generation were not fully cleaned up by the scheduler.
  3. The uncollected SequenceState::Error caused the scheduler state machine to enter an infinite retry loop instead of fully dropping the sequence.
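
The overflow in step 1 can be illustrated with a minimal scalar sketch (the score value below is an assumption for illustration, not taken from the actual kernel):

```rust
fn main() {
    const SOFTCAP: f32 = 50.0; // Gemma2's attention softcap factor
    const F16_MAX: f32 = 65504.0; // largest finite f16 value

    // A plausible raw attention score before softcapping (assumed).
    let score: f32 = 2048.0;

    // The intermediate product the dtype must be able to hold.
    let product = score * SOFTCAP; // 102400.0

    // In f32 there is plenty of headroom...
    assert!(product.is_finite());

    // ...but the same product exceeds f16's range, so in f16 it would
    // overflow to +inf, and downstream ops (e.g. inf - inf inside
    // softmax) would turn that into NaN.
    assert!(product > F16_MAX);
    println!("{product} > {F16_MAX}: overflows in f16");
}
```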

Changes

  • mistralrs-core/src/attention/backends/naive.rs: Temporarily upcasts the intermediate tensors to f32 for the softcap scaling/tanh step only, providing enough numerical headroom, before casting back to the target dtype (f16 or bf16). The cast is scoped to the softcap path, so CPU runs and standard (non-softcapped) models are unaffected.
  • mistralrs-core/src/sequence.rs: Added an explicit SequenceState::Error check to is_finished_paged_attn() so that errored sequences are recognized as finished and properly garbage-collected by the engine.
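
The shape of the first fix can be sketched as a scalar function (hedged: the real change operates on candle tensors inside naive.rs; names here are illustrative assumptions). Because tanh saturates to [-1, 1], doing the whole step in f32 guarantees the result is bounded by ±cap and fits back into f16:

```rust
// Scalar sketch of the softcap-in-f32 idea (assumed names; not the
// actual tensor code). All arithmetic happens in f32 for headroom;
// the real fix then casts the result back to the target dtype.
fn softcap(score: f32, cap: f32) -> f32 {
    // tanh output is in [-1, 1], so the result is bounded by ±cap
    // (±50.0 for Gemma2) and always representable in f16.
    (score / cap).tanh() * cap
}

fn main() {
    let cap = 50.0_f32;
    // Even an extreme raw score stays bounded after softcapping.
    let out = softcap(1.0e6, cap);
    assert!(out.is_finite() && out.abs() <= cap);
    println!("softcapped: {out}");
}
```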

Testing

  • Local load-testing: Ran parallel hf-internal-testing/tiny-random-Gemma2ForCausalLM generations. CPU execution times showed no regression from the added f32 cast (average batch latency ~7.9 s in the tested CPU environment).
  • Correctness: The endpoint now gracefully finishes Gemma completions with a standard 200 OK without triggering infinite retries on the console.

Before

$ curl -s -X POST http://localhost:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-it", "prompt": "Explain gravity.", "max_tokens": 20}'
WARN mistralrs_core::sequence: Sequence 1 entered error state [WeightError: unexpected NaN generation]
WARN mistralrs_core::engine: Retrying Sequence 1...
[HANGS INFINITELY]

After

$ curl -s -X POST http://localhost:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-it", "prompt": "Explain gravity.", "max_tokens": 20}'

(Successfully completes generation without errors or retries)

@glaziermag glaziermag marked this pull request as ready for review April 8, 2026 02:41
