Add backward pass to fused layer_norm #3613
Open
NahButch wants to merge 1 commit into
`candle_nn::ops::layer_norm` used `apply_op3_no_bwd`, so `loss.backward()` silently returned no gradient for any `Var` upstream of it, while `layer_norm_slow` is differentiable. Same naming footgun as rms_norm (huggingface#3526), softmax_last_dim (huggingface#3591), and rope (huggingface#3568).

Implements the standard layernorm backward with backend-agnostic tensor ops (computed in f32 for f16/bf16 inputs, matching the forward kernels), returning gradients for x, alpha, and beta. Adds a gradient test comparing the fused path against `layer_norm_slow` autograd; it fails on the previous behavior with a missing gradient.

Fixes huggingface#3011

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
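For reviewers who want the math spelled out, here is a minimal sketch of the standard layernorm backward for a single row, in plain Rust on `f32` slices. It is not the code in this PR and does not use candle's tensor API. The names `layer_norm_row` and `layer_norm_row_bwd` are hypothetical, and the sketch assumes normalization over the last dimension with a biased variance. With `xhat = (x - mean) * rstd` and `g = dy * alpha`, the input gradient is `dx = rstd * (g - mean(g) - xhat * mean(g * xhat))`.

```rust
/// Forward pass for one row (illustrative helper, not a candle API).
/// Returns the output `y` plus `xhat` and `rstd`, which the backward pass reuses.
fn layer_norm_row(x: &[f32], alpha: &[f32], beta: &[f32], eps: f32) -> (Vec<f32>, Vec<f32>, f32) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased variance (divide by n), as is usual for layer norm.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let rstd = 1.0 / (var + eps).sqrt();
    let xhat: Vec<f32> = x.iter().map(|v| (v - mean) * rstd).collect();
    let y: Vec<f32> = xhat
        .iter()
        .zip(alpha)
        .zip(beta)
        .map(|((h, a), b)| h * a + b)
        .collect();
    (y, xhat, rstd)
}

/// Backward pass for one row (illustrative helper, not a candle API).
/// Given the upstream gradient `dy`, returns `(dx, dalpha, dbeta)`.
/// For a batch, `dalpha` and `dbeta` would additionally be summed over rows.
fn layer_norm_row_bwd(
    dy: &[f32],
    xhat: &[f32],
    rstd: f32,
    alpha: &[f32],
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let n = dy.len() as f32;
    // Gradient with respect to the normalized activations.
    let g: Vec<f32> = dy.iter().zip(alpha).map(|(d, a)| d * a).collect();
    let mean_g = g.iter().sum::<f32>() / n;
    let mean_g_xhat = g.iter().zip(xhat).map(|(gi, h)| gi * h).sum::<f32>() / n;
    // dx = rstd * (g - mean(g) - xhat * mean(g * xhat))
    let dx: Vec<f32> = g
        .iter()
        .zip(xhat)
        .map(|(gi, h)| rstd * (gi - mean_g - h * mean_g_xhat))
        .collect();
    let dalpha: Vec<f32> = dy.iter().zip(xhat).map(|(d, h)| d * h).collect();
    let dbeta: Vec<f32> = dy.to_vec();
    (dx, dalpha, dbeta)
}
```

One quick sanity check on any implementation of this formula: because the forward output is unchanged when a constant is added to every element of `x`, the entries of `dx` for a row should sum to approximately zero.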
🤖 Generated with Claude Code