[OFT] Chebyshev-Optimized Newton-Schulz (CANS): Faster and Better #1512

Open
Koratahiu wants to merge 17 commits into Nerogar:master from Koratahiu:cans_oft

Conversation

@Koratahiu Koratahiu commented Jun 7, 2026

Contributor

The Issue

In OFT, we currently orthogonalize the weights using two different methods:

  • Default: Truncated Cayley-Neumann
  • Exact Solver: Exact matrix math

The exact solver is typically excluded from practical use because it is computationally slow and scales poorly. The Cayley-Neumann method, on the other hand, has a relatively high orthogonalization error: the exact solver achieves an error of around $10^{-6}$, while Cayley-Neumann's error ranges between $0.1$ and $0.5$. It is also unstable for matrices with higher norms (as noted in #1492) and converges poorly in those cases, which makes it scale-variant and error-prone.
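To make the scale dependence concrete, here is a minimal NumPy sketch. It assumes the common Cayley parameterization $R = (I - Q)^{-1}(I + Q)$ with a skew-symmetric $Q$ and a plain truncated Neumann series for $(I - Q)^{-1}$. The function names, the number of series terms, and the error metric $\|R^\top R - I\|_F$ are illustrative choices, not taken from the OneTrainer code.

```python
import numpy as np

def random_skew(n, spectral_norm, seed=0):
    """Random skew-symmetric matrix rescaled to a given spectral norm."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    q = a - a.T
    return q * (spectral_norm / np.linalg.norm(q, 2))

def cayley_exact(q):
    """Exact Cayley transform R = (I - Q)^{-1} (I + Q) via a linear solve."""
    eye = np.eye(q.shape[0])
    return np.linalg.solve(eye - q, eye + q)

def cayley_neumann(q, num_terms=5):
    """Truncated variant: (I - Q)^{-1} is replaced by I + Q + ... + Q^num_terms."""
    eye = np.eye(q.shape[0])
    series, power = eye.copy(), eye.copy()
    for _ in range(num_terms):
        power = power @ q
        series = series + power
    return (eye + q) @ series

def orth_error(r):
    """Frobenius norm of R^T R - I; zero for an exactly orthogonal R."""
    return np.linalg.norm(r.T @ r - np.eye(r.shape[0]))

# Same direction of Q, three different scales (spectral norms).
for norm in (0.3, 0.9, 1.5):
    q = random_skew(16, norm)
    print(norm, orth_error(cayley_exact(q)), orth_error(cayley_neumann(q)))
```

In this sketch the truncated series loses accuracy as the norm of $Q$ grows and stops converging once the spectral norm exceeds 1, while the exact solve stays orthogonal to machine precision.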

Standard alternative approximations, such as standard Newton-Schulz, were evaluated but did not resolve these issues.

Our Solution (CANS)

CANS is a variant of the Newton-Schulz (NS) algorithm designed to achieve strict orthogonality.

Note: Standard NS flattens singular values but is not explicitly optimized to reach true orthogonality.
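A scalar illustration of that note, under an assumption: "standard NS" is taken here to mean the quintic iteration with the tuned coefficients (3.4445, -4.7750, 2.0315) used by the Muon optimizer, which matches the 3 matmuls per iteration mentioned in this PR. The PR does not confirm which variant was evaluated. Because Newton-Schulz acts on each singular value independently, it is enough to iterate the polynomial on plain numbers.

```python
def quintic_ns_step(x, a=3.4445, b=-4.7750, c=2.0315):
    """One quintic Newton-Schulz step applied to a single singular value x.
    Coefficients are the Muon-style values (an assumption about 'standard NS')."""
    return a * x + b * x**3 + c * x**5

def cubic_ns_step(x):
    """One classical cubic Newton-Schulz step, 1.5 x - 0.5 x^3, for comparison."""
    return 1.5 * x - 0.5 * x**3

# Track a few starting singular values through 5 iterations of each scheme.
for start in (0.1, 0.5, 1.0):
    x_quintic, x_cubic = start, start
    for _ in range(5):
        x_quintic = quintic_ns_step(x_quintic)
        x_cubic = cubic_ns_step(x_cubic)
    print(start, x_quintic, x_cubic)
```

With these coefficients the value 1 is not a fixed point of the quintic map (one step sends 1.0 to about 0.701), so singular values are pushed into a band around 1 rather than onto 1 exactly. The classical cubic step does keep 1 fixed but approaches it slowly from small starting values.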

A known limitation of CANS is that it requires tuning a lower-bound parameter to converge optimally. However, for the OFT formulation ($I + Q$), we can define this lower bound simply as:
$1 / \|I + Q\|_F$ (the reciprocal of the Frobenius norm of $I + Q$)

Using this bound makes CANS highly suitable for OFT.
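One way to see why this bound is plausible, assuming $Q$ is skew-symmetric as in the usual OFT parameterization: the eigenvalues of $Q$ are purely imaginary, $\pm i\lambda$, so the singular values of $I + Q$ are $\sqrt{1 + \lambda^2} \ge 1$. After dividing by $\|I + Q\|_F$, every singular value is therefore at least $1 / \|I + Q\|_F$, and at most 1 because the spectral norm never exceeds the Frobenius norm. The NumPy check below is a sketch with an arbitrary block size and random generator, not code from this PR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32                                   # illustrative block size
a = rng.standard_normal((n, n))
q = 0.5 * (a - a.T)                      # skew-symmetric generator (assumed OFT form)
m = np.eye(n) + q

fro = np.linalg.norm(m)                  # Frobenius norm of I + Q
sigma = np.linalg.svd(m / fro, compute_uv=False)
lower_bound = 1.0 / fro

# All singular values of the normalized matrix should lie in [lower_bound, 1].
print(lower_bound, sigma.min(), sigma.max())
```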


Convergence compared to current methods

[Figure: orthogonalization error of Cayley-Neumann, the exact solver, and CANS (red) on a random matrix]

The plot above shows the performance on a random matrix. In this test case, CANS (red) achieves lower orthogonalization error than both Cayley-Neumann and the exact solver.


Block-size invariance and number of iterations (matmuls) required

[Figure: oft_cans_convergence]

CANS requires only 7 iterations (14 matmuls) to fully converge in FP32.

  • Note: CANS uses 2 matmuls per iteration, whereas standard NS uses 3. Seven CANS iterations (14 matmuls) are therefore cheaper than 5 standard NS iterations (15 matmuls).
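The matmul count can be illustrated with a CANS-style sketch. It assumes a degree-3 update $X \leftarrow \gamma\,(3e^2 X - X X^\top X)$ whose coefficients come from the equioscillation conditions for the best odd cubic approximating 1 on the current singular-value interval $[lo, hi]$, starting from $[1 / \|I + Q\|_F,\ 1]$. The function names and this specific derivation are assumptions for illustration; the PR's actual implementation may differ.

```python
import numpy as np

def cheb_cubic_coeffs(lo, hi):
    """Best odd cubic p(x) = gamma * (3 e^2 x - x^3) approximating 1 on [lo, hi].
    Returns (gamma, e_sq, err); p maps [lo, hi] into [1 - err, 1 + err]."""
    e_sq = (lo * lo + lo * hi + hi * hi) / 3.0       # p'(e) = 0 and p(lo) = p(hi)
    peak = 2.0 * e_sq ** 1.5                         # 3 e^2 x - x^3 at x = e
    gamma = 2.0 / (peak + lo * hi * (lo + hi))       # p(e) - 1 = 1 - p(lo)
    return gamma, e_sq, gamma * peak - 1.0

def cans_like_orthogonalize(m, num_iters=7):
    """Orthogonalize m = I + Q (Q assumed skew-symmetric) with 2 matmuls per step."""
    fro = np.linalg.norm(m)                          # Frobenius norm
    x = m / fro
    lo, hi = 1.0 / fro, 1.0                          # bound on the singular values
    matmuls = 0
    for _ in range(num_iters):
        gamma, e_sq, err = cheb_cubic_coeffs(lo, hi)
        gram = x.T @ x                               # matmul 1
        x = gamma * (3.0 * e_sq * x - x @ gram)      # matmul 2
        matmuls += 2
        lo, hi = 1.0 - err, 1.0 + err                # interval after this step
    return x, matmuls

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16))
q = 0.5 * (a - a.T)
r, matmuls = cans_like_orthogonalize(np.eye(16) + q)
print(matmuls, np.linalg.norm(r.T @ r - np.eye(16)))
```

Seven iterations of this update cost 7 × 2 = 14 matmuls, compared with 5 × 3 = 15 for five iterations of a 3-matmul scheme. The sketch runs in float64, so the printed error is not directly comparable to the FP32 figures quoted above.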

Test plan

  • pre-commit run --all-files passes
  • Launched the affected UI or script and exercised the change
  • Tested with at least one real preset / config when relevant (note which: ____)

AI assistance

  • No AI involvement

Sources:

Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials

@Koratahiu

Contributor Author

This is ready for testing and review.
I tested it myself, and it was the first time I have seen OFT reach norms of 20+ (using CANS + spectral scaling). This pushed it to its limits, which might result in an expected +50% to +80% increase in expressiveness for the same block size.

It is also well suited to DOFT #1335, since the orthogonalization error is very small (about 1e-6, compared with 0.01 to 0.1 for Cayley-Neumann, assuming FP32).

@Koratahiu Koratahiu marked this pull request as ready for review June 11, 2026 06:32
@dxqb dxqb added the "preview" label (merged in the preview branch) Jun 13, 2026
dxqb added a commit that referenced this pull request Jun 14, 2026
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 14, 2026
Resolved conflicts with local DoRA-OFT work (oft_clipped_norm /
spectral-norm clipping): kept both features side by side.
- OFTRotationModule: CANS Newton-Schulz iteration added alongside
  power-iteration spectral clipping; oft_cans disables Cayley-Neumann.
- OFTModule/TrainConfig/LoraTab: oft_cans field added next to
  oft_clipped_norm; CANS switch placed at row 5 (row 4 taken by DOFT).
- Conv2d forward, apply_to_module and oft_verify now pass oft_cans
  through to _cayley_batch so merge/verify match training math.
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 14, 2026
…t max-norm

Supersedes the local bool port of the same PR with its final upstream
form: oft_clipped_norm is now float | None (the max spectral norm
itself, default 0.95, None disables) instead of on/off at a hard-coded
0.999. The clip now applies before the orthogonalization branch (all
methods, including CANS), and the clipped_oft marker buffer is
persistent with the clip value embedded for inference tools.

Local adjustments:
- TrainConfig.from_dict coerces the legacy bool form (True -> 0.999,
  False -> None); float(False) would otherwise clip rotations to zero.
- DoRAOFTModule signature updated to mirror OFTModule (also fixes the
  positional oft_cans arg added by the PR Nerogar#1512 merge, which DOFT did
  not yet accept).
- UI entry placed at row 5 col 0/1; CANS switch stays at row 5 col 3/4.
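
The legacy-bool coercion described in that commit message can be sketched as follows. The helper name `coerce_clipped_norm` is hypothetical; only the mapping (True -> 0.999, False -> None) comes from the message above. The `bool` check has to run first because `bool` is a subclass of `int` in Python, so `float(False)` silently yields `0.0`.

```python
def coerce_clipped_norm(value):
    """Map a legacy bool setting onto the float | None form (hypothetical helper)."""
    if isinstance(value, bool):          # must be checked before any float() call
        return 0.999 if value else None  # old on/off switch at the hard-coded 0.999
    return None if value is None else float(value)

print(coerce_clipped_norm(True), coerce_clipped_norm(False), coerce_clipped_norm(0.95))
```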
dxqb added a commit that referenced this pull request Jun 19, 2026
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 20, 2026
BitcrushedHeart added a commit to BitcrushedHeart/OneTrainer that referenced this pull request Jun 20, 2026