[OFT] Chebyshev-Optimized Newton-Schulz (CANS): Faster and Better #1512
Open
Koratahiu wants to merge 17 commits into
- Cast to BF16 - Decrease steps to 5
- torch.bmm() for batched 3d
- Dynamic steps based on dtype
- revert scaled oft change
This is ready for testing and review. It is also well suited to DOFT #1335, since the orthogonalization error is very small (about 1e-6, compared to 0.01-0.1 for Cayley-Neumann, assuming FP32).
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 14, 2026
Resolved conflicts with local DoRA-OFT work (oft_clipped_norm / spectral-norm clipping): kept both features side by side.
- OFTRotationModule: CANS Newton-Schulz iteration added alongside power-iteration spectral clipping; oft_cans disables Cayley-Neumann.
- OFTModule/TrainConfig/LoraTab: oft_cans field added next to oft_clipped_norm; CANS switch placed at row 5 (row 4 taken by DOFT).
- Conv2d forward, apply_to_module and oft_verify now pass oft_cans through to _cayley_batch so merge/verify match training math.
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 14, 2026
…t max-norm
Supersedes the local bool port of the same PR with its final upstream form: oft_clipped_norm is now float | None (the max spectral norm itself, default 0.95, None disables) instead of on/off at a hard-coded 0.999. The clip now applies before the orthogonalization branch (all methods, including CANS), and the clipped_oft marker buffer is persistent with the clip value embedded for inference tools.
Local adjustments:
- TrainConfig.from_dict coerces the legacy bool form (True -> 0.999, False -> None); float(False) would otherwise clip rotations to zero.
- DoRAOFTModule signature updated to mirror OFTModule (also fixes the positional oft_cans arg added by the PR Nerogar#1512 merge, which DOFT did not yet accept).
- UI entry placed at row 5 col 0/1; CANS switch stays at row 5 col 3/4.
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 20, 2026
Resolved conflicts with local DoRA-OFT work (oft_clipped_norm / spectral-norm clipping): kept both features side by side.
- OFTRotationModule: CANS Newton-Schulz iteration added alongside power-iteration spectral clipping; oft_cans disables Cayley-Neumann.
- OFTModule/TrainConfig/LoraTab: oft_cans field added next to oft_clipped_norm; CANS switch placed at row 5 (row 4 taken by DOFT).
- Conv2d forward, apply_to_module and oft_verify now pass oft_cans through to _cayley_batch so merge/verify match training math.
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 20, 2026
…t max-norm
Supersedes the local bool port of the same PR with its final upstream form: oft_clipped_norm is now float | None (the max spectral norm itself, default 0.95, None disables) instead of on/off at a hard-coded 0.999. The clip now applies before the orthogonalization branch (all methods, including CANS), and the clipped_oft marker buffer is persistent with the clip value embedded for inference tools.
Local adjustments:
- TrainConfig.from_dict coerces the legacy bool form (True -> 0.999, False -> None); float(False) would otherwise clip rotations to zero.
- DoRAOFTModule signature updated to mirror OFTModule (also fixes the positional oft_cans arg added by the PR Nerogar#1512 merge, which DOFT did not yet accept).
- UI entry placed at row 5 col 0/1; CANS switch stays at row 5 col 3/4.
The Issue
In OFT, we currently orthogonalize the weights using one of two methods: an exact solver or the Cayley-Neumann approximation.
The exact solver is typically excluded from practical use because it is computationally slow and scales poorly. The Cayley-Neumann method, on the other hand, exhibits a relatively high orthogonalization error: while the exact solver achieves an error of around $10^{-6}$, Cayley-Neumann's error ranges between $0.1$ and $0.5$. It is also unstable for matrices with higher norms (as noted in #1492) and converges poorly in those cases, which makes it scale-variant and error-prone.
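The trade-off described above can be reproduced with a small NumPy sketch. It is illustrative only: it assumes the rotation is the Cayley transform $(I - Q)^{-1}(I + Q)$ of a skew-symmetric $Q$ and that Cayley-Neumann replaces the inverse with a truncated Neumann series. The function names and the number of series terms are hypothetical and are not taken from the repository.

```python
import numpy as np

def random_skew(n, spectral_norm, seed=0):
    """Random skew-symmetric matrix rescaled to a given spectral norm."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    q = a - a.T
    return q * (spectral_norm / np.linalg.norm(q, 2))

def orth_error(r):
    """Spectral norm of R^T R - I (zero for an exactly orthogonal R)."""
    return np.linalg.norm(r.T @ r - np.eye(r.shape[0]), 2)

def cayley_exact(q):
    """R = (I - Q)^{-1} (I + Q), computed with a linear solve."""
    eye = np.eye(q.shape[0])
    return np.linalg.solve(eye - q, eye + q)

def cayley_neumann(q, terms=5):
    """Approximate (I - Q)^{-1} by I + Q + ... + Q^terms (assumed form)."""
    eye = np.eye(q.shape[0])
    inv_approx = eye.copy()
    power = eye.copy()
    for _ in range(terms):
        power = power @ q
        inv_approx += power
    return inv_approx @ (eye + q)

q_small = random_skew(8, 0.3)
q_large = random_skew(8, 0.9)

err_exact = orth_error(cayley_exact(q_large))
err_neumann_small = orth_error(cayley_neumann(q_small))
err_neumann_large = orth_error(cayley_neumann(q_large))
print(err_exact, err_neumann_small, err_neumann_large)
```

In this toy setting the truncated series stays accurate while the norm of $Q$ is small and degrades quickly as the norm approaches 1, which is the scale sensitivity described above.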
Alternative approximations, such as standard Newton-Schulz, were evaluated but did not resolve these issues.
Our Solution (CANS)
CANS is a variant of the Newton-Schulz (NS) algorithm designed to achieve strict orthogonality.
A known limitation of CANS is that it requires tuning a lower-bound parameter to converge optimally. However, for the OFT formulation ($I + Q$), we can define this lower bound simply as:
$$\frac{1}{\lVert I + Q \rVert_F}$$
Using this bound makes CANS highly suitable for OFT.
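One way to see why this bound is natural, assuming $Q$ is real skew-symmetric as in the usual OFT parametrization (an assumption made here, not something stated in this PR): the eigenvalues of $Q$ are purely imaginary, $\pm i\mu_j$, and $I + Q$ is normal, so its singular values are the moduli of its eigenvalues:

$$\sigma_j(I + Q) = \lvert 1 \pm i\mu_j \rvert = \sqrt{1 + \mu_j^2} \ge 1$$

The Frobenius norm is at least as large as the largest singular value, so dividing by it gives

$$\frac{1}{\lVert I + Q \rVert_F} \le \sigma_j\!\left(\frac{I + Q}{\lVert I + Q \rVert_F}\right) \le 1$$

Under that assumption, the singular values of the normalized input are bracketed from below without any manual tuning.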
Convergence compared to current methods
The plot above shows the performance on a random matrix. In this test case, CANS (red) achieves lower orthogonalization error than both Cayley-Neumann and the exact solver.
Block-size invariance and number of iterations (matmuls required)
CANS requires only 7 iterations (14 matmuls) to fully converge in FP32.
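The two-matmuls-per-iteration cost can be illustrated with a minimal NumPy sketch. It uses the classic cubic Newton-Schulz update $X \leftarrow 1.5X - 0.5\,XX^{\top}X$, not the Chebyshev-optimized CANS coefficients from this PR, and the function name, matrix size, and norm of $Q$ are hypothetical choices for the example.

```python
import numpy as np

def newton_schulz_orthogonalize(a, steps=7):
    """Classic cubic Newton-Schulz (not CANS); returns (matrix, matmul count)."""
    # Frobenius normalization puts all singular values in (0, 1].
    x = a / np.linalg.norm(a)
    matmuls = 0
    for _ in range(steps):
        gram = x @ x.T                   # matmul 1
        x = 1.5 * x - 0.5 * (gram @ x)   # matmul 2
        matmuls += 2
    return x, matmuls

rng = np.random.default_rng(0)
m = rng.standard_normal((8, 8))
q = m - m.T                              # skew-symmetric
q *= 0.5 / np.linalg.norm(q, 2)          # rescale to spectral norm 0.5

r, matmuls = newton_schulz_orthogonalize(np.eye(8) + q, steps=7)
err = np.linalg.norm(r.T @ r - np.eye(8), 2)
print(matmuls, err)
```

Each iteration costs one Gram product and one product with the current iterate, so 7 iterations use 14 matmuls. For the classic update, the number of iterations needed depends on how small the smallest normalized singular value is.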
Test plan
pre-commit run --all-files passes
AI assistance
Sources:
Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials