
Fix inf grad_norm on Qwen3.5 at seq_len > 65536 with tighter SDPA guards#587

Open
danielhanchen wants to merge 2 commits into main from fix/issue-4906-sdpa-bool-mask-tighter-guards

Conversation

@danielhanchen
Contributor

Summary

  • Fixes #4906: LoRA training of Qwen3.5-4B/9B produces grad_norm=inf at seq_len > 65536 when flash-attn is not installed
  • Wraps sdpa_attention_forward to detect materialized bool causal masks and replace with attention_mask=None, is_causal=True
  • Includes tighter guards than #582 (Fix inf grad_norm on Qwen3.5 at seq_len > 65536 without flash-attn) to protect sliding-window, bidirectional, and packed-sequence masks
  • Adds bool-to-float fallback for non-pure-causal masks to avoid the Cutlass bug while preserving mask semantics

Root cause

patch_transformers_masks wraps create_causal_mask with torch.compile(dynamic=True). Under tracing, find_packed_sequence_indices takes the is_tracing branch, forcing allow_is_causal_skip=False. This materializes a dense [1,1,Q,K] bool causal mask. PyTorch SDPA's Cutlass backend with bf16 + bool mask at seq_len > 65536 produces wrong outputs/gradients (int16 sequence-index overflow at 2^16).

Guards (all O(1), no CUDA ops)

  1. module.is_causal check -- protects BERT/bidirectional encoders
  2. sliding_window kwarg check -- protects Gemma2/Mistral/Qwen2/3
  3. Dict-mask unwrap with safe fallback when layer_type not in dict
  4. 4D bool + square + Q==K shape check -- not cross-attention or kv-cache decode
  5. Upper-triangle spot-check m[0,0,0,1]==False -- distinguishes pure causal from packed-sequence masks
  6. Bool-to-float additive bias fallback for non-pure-causal masks -- avoids Cutlass bug while preserving mask semantics
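
The six guards above could be composed roughly as follows. This is a minimal sketch, not the PR's verbatim code: the wrapper name `make_guarded_sdpa`, the `sliding_window` kwarg plumbing, and the `layer_type` dict key are illustrative assumptions, and the pure-causal test here already includes both spot-checks (the `m[0, 0, -1, 0]` one is added in the second commit):

```python
import torch

def make_guarded_sdpa(orig_sdpa_forward):
    # Hypothetical wrapper illustrating the six O(1) guards.
    def sdpa_attention_forward_unsloth(module, query, key, value,
                                       attention_mask=None, **kwargs):
        m = attention_mask
        # Guard 3: per-layer_type dict masks -- unwrap, else keep original.
        if isinstance(m, dict):
            m = m.get(getattr(module, "layer_type", None), attention_mask)
        if (
            getattr(module, "is_causal", True)        # Guard 1: skip encoders
            and kwargs.get("sliding_window") is None  # Guard 2: skip SWA models
            and isinstance(m, torch.Tensor)
            and m.dtype == torch.bool
            and m.dim() == 4
            and m.shape[-1] == m.shape[-2]            # Guard 4: square ...
            and query.shape[-2] == key.shape[-2]      # ... and Q == K
        ):
            S = m.shape[-1]
            # Guard 5: two spot-checks for pure lower-triangular causality
            # (the second, m[0, 0, -1, 0], is the follow-up commit's addition).
            if S < 2 or (not m[0, 0, 0, 1].item() and m[0, 0, -1, 0].item()):
                return orig_sdpa_forward(module, query, key, value,
                                         attention_mask=None,
                                         is_causal=True, **kwargs)
            # Guard 6: non-pure-causal bool mask -> float additive bias.
            m = torch.where(m, 0.0, torch.finfo(query.dtype).min).to(query.dtype)
        return orig_sdpa_forward(module, query, key, value,
                                 attention_mask=m, **kwargs)
    return sdpa_attention_forward_unsloth
```

A pure causal bool mask takes the `attention_mask=None, is_causal=True` fast path; anything that fails the guards falls through to the original SDPA call, at most with its bool mask rewritten as a float bias.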

Test results (NVIDIA B200, torch 2.9.1, transformers 5.5.1, no flash-attn)

Qwen3.5-4B at seq_len=69632, 4-bit LoRA, 9 steps:

Method          Time    Peak Mem   Grad Norms [1st, 5th, 9th]
Without fix     3649s   74.27 GB   [inf, nan, nan]
With this fix   2765s   74.27 GB   [3.319, 3.219, 1.057]

Llama-3.2-1B regression at seq_len=2048, 21 steps:

  • Max loss delta vs baseline: 0.003
  • Max grad_norm delta vs baseline: 0.06
  • No regression

Test plan

  • Reproduce grad_norm=inf at seq_len=69632 without fix
  • Verify finite grad_norms with fix at seq_len=69632
  • Llama short-context regression (max loss delta < 0.01)
  • Sliding-window model test (Gemma2/Mistral)
  • Packed-sequence training test

When patch_transformers_masks wraps create_causal_mask with
torch.compile(dynamic=True), the is_tracing branch in
find_packed_sequence_indices materializes a dense [1,1,Q,K] bool causal
mask. PyTorch SDPA's Cutlass backend with bf16 + bool mask at
seq_len > 65536 produces wrong outputs and gradients (int16 overflow).

Wrap sdpa_attention_forward to detect materialized bool causal masks and
replace with attention_mask=None, is_causal=True. Unlike the original
approach, this version includes tighter guards:

- Check module.is_causal to protect BERT/bidirectional encoders
- Check sliding_window kwarg to protect Gemma2/Mistral/Qwen2
- Spot-check upper triangle to distinguish pure causal from packed masks
- Convert non-pure-causal bool masks to float additive bias as fallback

Fixes unslothai/unsloth#4906
```python
TEMPORARY_PATCHES.append(patch_transformers_masks)


def patch_sdpa_bool_causal_mask():
```

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37d5da764e


```python
# (first query cannot see second key). Packed-sequence masks and other
# non-trivial patterns will have True in the upper triangle.
S = m.shape[-1]
is_pure_causal = (S < 2) or (not m[0, 0, 0, 1].item())
```

P1: Do not classify packed masks via a single upper-triangle check

is_pure_causal is inferred from only m[0,0,0,1], but packed-sequence and padding-constrained causal masks also have this entry set to False due to the base causal rule (kv_idx > q_idx is always masked). In those cases this branch misclassifies the mask as “pure causal” and drops it, so SDPA runs with attention_mask=None, is_causal=True and loses sequence/padding boundaries. That can silently let tokens attend across packed samples (or into padded tokens) and corrupt training/inference whenever packed or left-padded batches are used.
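
The misclassification is easy to reproduce with the 4x4 packed mask from the follow-up discussion (a standalone sketch, not the PR's code; the mask encodes two packed sequences of two tokens each):

```python
import torch

# Two packed sequences: tokens 0-1 form one sample, tokens 2-3 another.
packed = torch.tensor([[1, 0, 0, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 1, 1]], dtype=torch.bool)[None, None]

# The single-cell check from the first commit: entry [0, 1] must be False.
single_check_says_pure_causal = not packed[0, 0, 0, 1].item()
print(single_check_says_pure_causal)  # True -- misclassified as pure causal

# A genuine lower-triangular causal mask gives the same answer:
causal = torch.tril(torch.ones(4, 4, dtype=torch.bool))[None, None]
print(not causal[0, 0, 0, 1].item())  # True
```

Both masks satisfy the check, so the packed mask would be dropped in favor of `is_causal=True`, letting tokens 2-3 attend to tokens 0-1 across the sample boundary.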


@gemini-code-assist
Contributor

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

The single-cell check m[0,0,0,1]==False is insufficient: packed masks
like [[1,0,0,0],[1,1,0,0],[0,0,1,0],[0,0,1,1]] also satisfy it.

Add m[0,0,-1,0]==True (last query can see first key) which is true for
pure lower-triangular causal masks but false for packed-sequence masks,
padded masks, and sliding-window masks. Both checks are O(1).
@danielhanchen
Contributor Author

Updated with a second commit to address the packed-sequence mask regression identified by review.

What changed: The single-cell spot-check m[0,0,0,1]==False was insufficient -- packed masks like [[1,0,0,0],[1,1,0,0],[0,0,1,0],[0,0,1,1]] also satisfy it. Added a second O(1) check: m[0,0,-1,0]==True (last query can see first key), which correctly rejects packed, padded, and sliding-window masks while accepting pure lower-triangular causal masks.

Verification:

  • Unit test confirms all 5 mask types are correctly classified:
    • Pure causal: accepted (is_causal=True fast path)
    • Packed sequence: rejected (falls through to float fallback)
    • Right-padded: rejected (falls through to float fallback)
    • Left-padded: rejected (falls through to float fallback)
    • Sliding window: rejected (falls through to float fallback)
  • GPU test: Qwen3.5-4B at seq_len=69632, 9 steps -- all grad_norms finite (3.319, 3.320, 3.012, 2.932, 3.209, 3.671, 2.848, 1.612, 1.058)
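
The classification in that unit test can be sketched as follows. This is a standalone illustration with my own mask constructions, not the PR's test code; in particular, the padded masks here zero out both the pad row and the pad column, which is an assumption about how the test builds them:

```python
import torch

def is_pure_causal(mask):
    # The two O(1) spot-checks: the first query must NOT see the second key,
    # and the last query MUST see the first key.
    m = mask[0, 0]
    if m.shape[-1] < 2:
        return True
    return (not m[0, 1].item()) and m[-1, 0].item()

S = 4
causal = torch.tril(torch.ones(S, S, dtype=torch.bool))
packed = torch.tensor([[1, 0, 0, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 1, 1]], dtype=torch.bool)
# Padded masks: pad row and pad column both masked (illustrative assumption).
right_padded = causal.clone(); right_padded[:, -1] = False; right_padded[-1, :] = False
left_padded  = causal.clone(); left_padded[:, 0]  = False; left_padded[0, :]  = False
# Sliding window of 2: each query sees itself and one previous key.
window = causal & torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=-1)

for name, m in [("causal", causal), ("packed", packed),
                ("right-padded", right_padded), ("left-padded", left_padded),
                ("sliding-window", window)]:
    print(name, is_pure_causal(m[None, None]))  # only "causal" prints True
```

Only the pure lower-triangular mask passes both checks; the other four fail `m[-1, 0]` and fall through to the float-bias fallback.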

```python
sdpa_attention_forward_unsloth.__unsloth_bool_causal_mask_fix__ = True
sdpa_mod.sdpa_attention_forward = sdpa_attention_forward_unsloth
ALL_ATTENTION_FUNCTIONS["sdpa"] = sdpa_attention_forward_unsloth
pass
```

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f54a06b00b


```python
# Convert bool -> float additive bias to avoid the Cutlass bool-mask
# bug while preserving mask semantics. SDPA dispatches to a different
# (working) kernel for float attn_mask inputs.
m_float = torch.where(m, 0.0, torch.finfo(query.dtype).min).to(query.dtype)
```

P2: Use -inf when converting bool masks to additive bias

The bool→float fallback uses torch.finfo(query.dtype).min, which is finite, so rows that are fully masked (a common case for left-padded causal masks) no longer behave like the original boolean mask. In SDPA, an all-finite masked row becomes an all-equal logit row and produces a uniform attention mix instead of zero output, so this path does not preserve mask semantics for padded/custom masks and can change hidden states whenever is_pure_causal is false.
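
The numerical difference is easy to see on a fully masked row (a standalone sketch of the softmax arithmetic, not the PR's code): with a finite `finfo.min` bias every logit in the row is equal, so softmax yields a uniform attention mix, whereas `-inf` leaves the row undefined.

```python
import torch

# All positions masked with a large-but-finite negative bias: the logits are
# equal, so softmax returns a uniform row.
finite = torch.full((4,), torch.finfo(torch.float32).min)
print(torch.softmax(finite, dim=-1))  # tensor([0.2500, 0.2500, 0.2500, 0.2500])

# All positions masked with -inf: the row has no valid key, softmax is NaN.
neg_inf = torch.full((4,), float("-inf"))
print(torch.softmax(neg_inf, dim=-1))  # tensor([nan, nan, nan, nan])
```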

