[OFT] Rework and Matrix Exponential Mode by Koratahiu · Pull Request #1556 · Nerogar/OneTrainer

Koratahiu · 2026-06-26T22:46:58Z

Summary:

Refactored the OFT orthogonalization path.
Reworked #1512 ([OFT] Chebyshev-Optimized Newton-Schulz (CANS): Faster and Better): renamed Accelerated Newton-Schulz to Matrix Exponential CANS.
Built on #1492 (Clipped OFT Norm for Long-term Stability) to support auto mode (enabled when spectral norm clipping is set to -1).
Built on #1344 ([advoptm] New Features: Scaled Optimizers, Centered WD, Factored 2nd Moment & More) to support the new CANS.

Refactoring

Extracted the CANS/Matrix Exponential logic into its own dedicated function.
Updated variable names to better reflect the mathematical formulas.
Renamed _cayley_batch to _compute_orthogonal_matrix.

Matrix Exponential & Reworking CANS

I found a few issues with the Cayley transform in general (even when using CANS or exact math):

Cayley geometry is rigid and highly non-linear (e.g., the parameter step required to go from 140° to 141° is vastly different from the one required to go from 40° to 41°).
To achieve an effective learning rate of O(1) across all rotations, one must apply Riemannian preconditioning to the optimizer step (plus an additional, more complex one for weight decay).
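The non-linearity can be checked by hand in the 2×2 case. The sketch below assumes the convention R = (I − Q)⁻¹(I + Q) with Q = [[0, −a], [a, 0]], for which the rotation angle is θ = 2·atan(a); the function names are illustrative and are not taken from the PR.

```python
import math


def cayley_param_for_angle(theta_deg: float) -> float:
    """Entry a of Q = [[0, -a], [a, 0]] whose Cayley transform rotates by theta.

    For the 2x2 case, (I - Q)^-1 (I + Q) is a rotation by theta = 2 * atan(a),
    so a = tan(theta / 2).
    """
    return math.tan(math.radians(theta_deg) / 2.0)


def cayley_step(theta_from_deg: float, theta_to_deg: float) -> float:
    """Change in the parameter a needed to move between two rotation angles."""
    return cayley_param_for_angle(theta_to_deg) - cayley_param_for_angle(theta_from_deg)


step_low = cayley_step(40.0, 41.0)     # one degree at a small rotation
step_high = cayley_step(140.0, 141.0)  # one degree at a large rotation
ratio = step_high / step_low

print(f"parameter step for 40 -> 41 deg:   {step_low:.5f}")
print(f"parameter step for 140 -> 141 deg: {step_high:.5f}")
print(f"ratio: {ratio:.2f}")
```

The same one-degree change costs several times more parameter movement at 140° than at 40°, and the cost grows without bound as the angle approaches 180°.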

After searching for alternatives to Cayley-Neumann, I found the Matrix Exponential, which seems to solve all of the issues mentioned above:

It achieves the desired outcome of orthogonalization.
It is scale-invariant with respect to the Q norm.
It is linear and aligned with the optimizer's flat Euclidean metric (inherently achieving an effective learning rate of O(1) across all rotations).
It only requires a spectral norm of ~3.14 (π) to reach the maximum rotation, whereas the Cayley transform needs the norm to approach infinity to reach its maximum rotation.
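The linearity and the π bound are easy to verify in the 2×2 case, where exp([[0, −a], [a, 0]]) is a rotation by exactly a radians. The sketch below uses a plain truncated Taylor series in pure Python as a reference; the helper names are illustrative, and this is not the approximation pipeline used in the PR.

```python
import math


def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def expm_taylor(Q, terms=40):
    """Reference matrix exponential: sum of Q^k / k! for k = 0 .. terms - 1."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, Q)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def skew2(a):
    """2x2 skew-symmetric generator with spectral norm |a|."""
    return [[0.0, -a], [a, 0.0]]


def rotation_angle_deg(R):
    """Rotation angle of a 2x2 rotation matrix, in degrees."""
    return math.degrees(math.atan2(R[1][0], R[0][0]))


for a_deg in (40.0, 41.0, 140.0, 141.0):
    R = expm_taylor(skew2(math.radians(a_deg)))
    print(f"a = {a_deg:6.1f} deg -> rotation = {rotation_angle_deg(R):8.3f} deg")

# A spectral norm of pi gives the half turn, i.e. the maximum rotation.
R_pi = expm_taylor(skew2(math.pi))
print(f"cos(angle) at a = pi: {R_pi[0][0]:.6f}")
```

Each one-degree change in rotation costs the same change in the parameter, regardless of where on the circle it happens.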

One way to implement the Matrix Exponential is to use the exact math via torch.linalg.matrix_exp. However, matrix_exp is expensive and very unstable in BF16 (it exploded in my tests).
To resolve this, I applied a highly effective approximation pipeline:

Q Scaling step ➔ 4th-order Taylor expansion ➔ Modified CANS ➔ Recover Squaring step ➔ One correction iteration of Newton-Schulz

This might look complex and compute-heavy, but in reality it requires only 12 matmuls. Thanks to the 4th-order Taylor expansion, the matrix is already nearly orthogonal, so CANS needs only 3 iterations to converge.
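The order of the stages can be illustrated with a small pure-Python sketch. Several things here are assumptions: the classic Newton-Schulz iteration X ← ½·X(3I − XᵀX) stands in for the PR's modified CANS (whose coefficients are not given in this description), the number of squarings is a fixed hypothetical parameter, and no attempt is made to match the 12-matmul budget.

```python
def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def scale(A, c):
    return [[c * x for x in row] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def taylor4_exp(Q):
    """I + Q + Q^2/2 + Q^3/6 + Q^4/24, evaluated in Horner form."""
    I = eye(len(Q))
    T = add(I, scale(Q, 1.0 / 4.0))
    T = add(I, scale(matmul(Q, T), 1.0 / 3.0))
    T = add(I, scale(matmul(Q, T), 1.0 / 2.0))
    return add(I, matmul(Q, T))

def newton_schulz_step(X):
    """One classic Newton-Schulz step (a stand-in for CANS, not the PR's variant)."""
    G = matmul(transpose(X), X)
    n = len(X)
    M = [[(3.0 if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]
    return scale(matmul(X, M), 0.5)

def approx_matrix_exp(Q, squarings=3, ns_iters=3):
    """Scale -> 4th-order Taylor -> orthogonalize -> square back -> one correction."""
    X = taylor4_exp(scale(Q, 1.0 / 2 ** squarings))  # scaling + Taylor expansion
    for _ in range(ns_iters):                        # pull X onto the orthogonal group
        X = newton_schulz_step(X)
    for _ in range(squarings):                       # undo the scaling by squaring
        X = matmul(X, X)
    return newton_schulz_step(X)                     # final correction iteration

def reference_exp(Q, terms=30):
    """High-order Taylor series, used only as a reference for small matrices."""
    result, term = eye(len(Q)), eye(len(Q))
    for k in range(1, terms):
        term = scale(matmul(term, Q), 1.0 / k)
        result = add(result, term)
    return result

# Skew-symmetric test matrix with spectral norm close to 1.
Q = [[0.0, -0.8, 0.3],
     [0.8, 0.0, -0.5],
     [-0.3, 0.5, 0.0]]
R = approx_matrix_exp(Q)
error = max_abs_diff(R, reference_exp(Q))
orth_defect = max_abs_diff(matmul(transpose(R), R), eye(3))
print(f"max abs error vs reference: {error:.2e}")
print(f"orthogonality defect:       {orth_defect:.2e}")
```

Because the 4th-order Taylor output is already close to orthogonal, a few orthogonalization iterations are enough before squaring, which is the reason the description gives for the low iteration count.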

The relative error compared to exact math (torch.linalg.matrix_exp) is very small (~1e-6 to ~1e-3 for FP32 and ~1e-3 for BF16).

Auto clipping mode

When spectral norm clipping is set to -1, the recommended clipping value for each technique is now applied automatically:

New CANS: ~3.13
Truncated Cayley: 0.95
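A minimal sketch of how such an auto mode could be resolved is shown below. The technique keys, function names, and the rescaling helper are hypothetical and are not taken from the OneTrainer code; only the sentinel -1 and the two recommended values come from the description above.

```python
AUTO_CLIP = -1.0

# Recommended limits from the PR description; the dictionary keys are hypothetical.
RECOMMENDED_CLIP = {
    "matrix_exp_cans": 3.13,
    "truncated_cayley": 0.95,
}


def resolve_spectral_clip(clip_value: float, technique: str) -> float:
    """Return the user's clipping value, or the recommended one when set to -1."""
    if clip_value == AUTO_CLIP:
        return RECOMMENDED_CLIP[technique]
    return clip_value


def clip_scale(spectral_norm: float, limit: float) -> float:
    """Factor that rescales Q so its spectral norm does not exceed the limit.

    Assumption: clipping is applied as a uniform rescale of Q.
    """
    if spectral_norm <= limit:
        return 1.0
    return limit / spectral_norm


limit = resolve_spectral_clip(-1.0, "matrix_exp_cans")
print(f"resolved limit: {limit}")
print(f"scale for norm 6.26: {clip_scale(6.26, limit):.3f}")
```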

Usage:

Enable Matrix Exponential CANS.
It is highly recommended to set spectral norm clipping to -1 (auto).

Koratahiu added 30 commits January 19, 2026 00:11

initial

e432c14

dev1

b359a15

dev2

972ee77

dev3

1d49175

dev4

6ca6f52

add Chroma residual filter

e40579a

stable 2.2 and edit rms tooltip

2bc6ae1

remove the print

31287b2

use .values()

44cca26

Merge branch 'master' of https://github.com/Nerogar/OneTrainer into S…

14814d7

…N_MUON

initial

bee4b86

initial cwd, signed

1ebe54d

dev1

30f7b28

add factored_2nd

d72c03d

pre-commit

69f2417

fix CenteredWDMode

06c9e6e

maybe fix

6d62373

dev2

32ef49a

pre-commit

852f389

fix and remove CenteredWDMode enum

add186d

dev4

8e55171

Dev5: Add Fisher WD to Adam-variants

d15d915

depth_calculator

b9ece34

remove calculate_muon_n_layers

89ce64f

dev6: add scaled eps

fd669d3

add StatePrecision

7db66b5

add SGD_ADV, various changes

2499da1

dev8

535d206

fix

16eef8f

add nesterov

753d265

Koratahiu added 30 commits June 8, 2026 15:21

Merge branch 'master' of https://github.com/Nerogar/OneTrainer into s…

3a5cfaa

…caled_optm

- Remove hardcoded BF16

f2cdc33

- Dynamic steps based on dtype

- Double the rotation to align

cc3fc18

- revert scaled oft change

- remove cayley_neumann guards and accept > 1 values

6e93347

- Remove 0.999 - Default 0.95

2.5.2: Improved Spectral scaling for OFT

895e908

pre-commit

4099562

pre-commit

28337ce

Merge branch 'cans_oft' of https://github.com/Koratahiu/OneTrainer in…

a758e95

…to cans_oft

rename to R_half

e29d0f6

pre-commit

49f7cb6

Add detach to the norm

6c195b9

Remove the clamp

3b66998

2.5.3 Improve OFT spectral scaling

6d4f25f

2.5.4: Small SignSGD bugfix and improvement

db4e48e

fix state_precision to auto

984dc35

Merge branch 'scaled_optm' of https://github.com/Koratahiu/OneTrainer …

4af2490

…into scaled_optm

Fix invalid orthogonal_gradient and state_precision

4329c98

part of @BitcrushedHeart referenced commit

2.5.5: Muon Variants Bugfixes

9d7e858

version bump bugfixes

39f7ee2

version bump

68410a8

Remove from_dict and add __migration_10

266199a

Merge branch 'scaled_optm' of https://github.com/Koratahiu/OneTrainer …

3e7a759

…into scaled_optm

Revert and add sub-config migration

43c6560

2.5.9: Enhance compiled optimizer mode

885031c

Reorder the squaring to avoid noise amplification

7d415a4

Merge branch 'cans_oft' of https://github.com/Koratahiu/OneTrainer in…

7b9acd2

…to cans_oft

Merge branch 'clipped_oft' of https://github.com/Koratahiu/OneTrainer …

0c2249f

…into CANS_EXP

Merge branch 'scaled_optm' of https://github.com/Koratahiu/OneTrainer …

bbf2415

…into CANS_EXP

initial

298bb2d

pre-commit

4d77dcb

Koratahiu wants to merge 120 commits into Nerogar:master from Koratahiu:CANS_EXP
