
Fix batch>1 causal masks in GLM4 and Qwen3 model variants #3611

Open
NahButch wants to merge 1 commit into huggingface:main from NahButch:causal-mask-batch-sweep

Conversation

@NahButch

Five models (glm4_new, quantized_glm4, quantized_qwen3, qwen3_moe, quantized_qwen3_moe) carried byte-identical copies of qwen3's causal_mask helper, sharing its bug: the mask data is built independently of the batch size but declared with shape (b, 1, tgt, tgt + offset), so only batch row 0 is valid and rows >= 1 attend over a corrupt pattern. Same bug as #3582, fixed for qwen3 in #3610.

Replaces the copies with a shared utils::additive_causal_mask that builds the data once and expands it across the batch (zero-copy view), with unit tests covering batching, KV offset, and sliding window. Net -26 lines.
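For readers who want the idea without opening the diff, here is a minimal, dependency-free sketch of "build once, expand across the batch". It is not the code in this PR: `causal_mask_data` and `BatchedMask` are hypothetical names, a plain `Vec<f32>` stands in for a tensor, and the sliding-window convention (a query sees keys at distance strictly less than `window`) is an assumption. The point it illustrates is that the data holds `tgt * (tgt + offset)` values regardless of `b`, and the batch dimension is exposed as a stride-0 view rather than a copy.

```rust
/// Additive causal mask for one sequence, row-major, shape (tgt, tgt + offset).
/// 0.0 means "may attend", -inf means "masked".
/// Assumed convention: with `window = Some(w)`, a query at absolute position q
/// may attend to key j only if q - j < w.
fn causal_mask_data(tgt: usize, offset: usize, window: Option<usize>) -> Vec<f32> {
    let kv = tgt + offset;
    let mut data = Vec::with_capacity(tgt * kv);
    for i in 0..tgt {
        let q = i + offset; // absolute position of query row i
        for j in 0..kv {
            // `j <= q` short-circuits, so `q - j` cannot underflow.
            let allowed = j <= q && window.map_or(true, |w| q - j < w);
            data.push(if allowed { 0.0 } else { f32::NEG_INFINITY });
        }
    }
    data
}

/// Hypothetical stand-in for an expanded tensor of shape (batch, 1, tgt, kv):
/// the data is stored once and every batch index reads the same values.
struct BatchedMask {
    data: Vec<f32>,
    batch: usize,
    tgt: usize,
    kv: usize,
}

impl BatchedMask {
    fn new(batch: usize, tgt: usize, offset: usize, window: Option<usize>) -> Self {
        let data = causal_mask_data(tgt, offset, window);
        BatchedMask { data, batch, tgt, kv: tgt + offset }
    }

    fn shape(&self) -> (usize, usize, usize, usize) {
        (self.batch, 1, self.tgt, self.kv)
    }

    /// Element lookup; the batch index is bounds-checked but does not
    /// contribute to the offset (batch stride is 0).
    fn get(&self, b: usize, i: usize, j: usize) -> f32 {
        assert!(b < self.batch && i < self.tgt && j < self.kv);
        self.data[i * self.kv + j]
    }
}
```

With `BatchedMask::new(3, 2, 1, None)`, the reported shape is `(3, 1, 2, 3)` while only 6 values are stored, and every batch row sees the same pattern. In a tensor library the same effect would typically come from creating a `(1, 1, tgt, tgt + offset)` tensor and broadcasting it to the batch size.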

🤖 Generated with Claude Code

Five models (glm4_new, quantized_glm4, quantized_qwen3, qwen3_moe,
quantized_qwen3_moe) carried byte-identical copies of the causal_mask
helper from qwen3, sharing its bug: the mask data is built independent
of the batch size but declared with shape (b, 1, tgt, tgt + offset), so
only batch row 0 is valid and rows >= 1 attend over a corrupt pattern.

Replace the copies with a shared utils::additive_causal_mask that
builds the data once and expands it across the batch (zero-copy view),
with unit tests covering batching, KV offset, and sliding window.

Same bug as huggingface#3582 (fixed for qwen3 in huggingface#3610).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>