Add backward pass to fused rope kernels #3612
Open
NahButch wants to merge 1 commit into
rope, rope_i, and rope_thd used apply_op3_no_bwd, so loss.backward() silently returned no gradient for any Var upstream of a rotary embedding, while the rope_*_slow compositions are differentiable. Same naming footgun as rms_norm (huggingface#3526) and softmax_last_dim (huggingface#3591).

The rotation is linear in xs, so the backward is the same fused rope applied to the incoming gradient with sin negated; cos/sin tables get no gradient.

Adds a gradient test comparing the fused path against the slow-path autograd for all three variants; it fails on the previous behavior with 'no gradient for rope input'.

Fixes huggingface#3568

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`rope`, `rope_i`, and `rope_thd` used `apply_op3_no_bwd`, so `loss.backward()` silently returned no gradient for any `Var` upstream of a rotary embedding, while the `rope_*_slow` compositions are differentiable. Same naming footgun as rms_norm (#3526) and softmax_last_dim (#3591).

The rotation is linear in xs, so the backward is the same fused rope applied to the incoming gradient with sin negated, which reuses the fast kernels on every backend; the cos/sin tables get no gradient.

Adds a gradient test comparing the fused path against slow-path autograd for all three variants; it fails on the previous behavior with 'no gradient for rope input'.
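For reviewers who want to check the "same kernel with sin negated" claim without building the crate, here is a minimal stdlib-only sketch. It is not the code in this PR: `rope_row`, `rope_row_bwd`, `dot`, and `adjoint_gap` are hypothetical names, and the layout assumed is the non-interleaved one where element `i` is paired with element `i + d/2`. It verifies the adjoint identity `<rope(x), g> == <x, rope_bwd(g)>`, which is what a correct backward for a linear map must satisfy.

```rust
/// Hypothetical reference: non-interleaved rotary embedding on a single row.
/// Pairs element i with element i + d/2 and rotates each pair by its angle.
fn rope_row(xs: &[f64], cos: &[f64], sin: &[f64]) -> Vec<f64> {
    let half = xs.len() / 2;
    let mut out = vec![0.0; xs.len()];
    for i in 0..half {
        let (x1, x2) = (xs[i], xs[i + half]);
        out[i] = x1 * cos[i] - x2 * sin[i];
        out[i + half] = x1 * sin[i] + x2 * cos[i];
    }
    out
}

/// Gradient w.r.t. xs: the same routine applied to the incoming gradient
/// with sin negated (the transpose of a rotation is the inverse rotation).
fn rope_row_bwd(grad: &[f64], cos: &[f64], sin: &[f64]) -> Vec<f64> {
    let neg_sin: Vec<f64> = sin.iter().map(|s| -s).collect();
    rope_row(grad, cos, &neg_sin)
}

/// Plain dot product of two equally sized slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// |<rope(x), g> - <x, rope_bwd(g)>| on arbitrary sample data.
/// This is zero (up to rounding) when rope_row_bwd is the transpose of rope_row.
fn adjoint_gap() -> f64 {
    let angles = [0.3_f64, 1.1];
    let cos: Vec<f64> = angles.iter().map(|a| a.cos()).collect();
    let sin: Vec<f64> = angles.iter().map(|a| a.sin()).collect();
    let xs = [1.0, -2.0, 0.5, 3.0];
    let g = [0.25, 1.5, -1.0, 2.0];
    let lhs = dot(&rope_row(&xs, &cos, &sin), &g);
    let rhs = dot(&xs, &rope_row_bwd(&g, &cos, &sin));
    (lhs - rhs).abs()
}
```

The interleaved variant should differ only in which indices form a pair, so the same argument would apply there; that is an assumption of this sketch rather than something it checks.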
Fixes #3568
🤖 Generated with Claude Code