[OFT] Rework and Matrix Exponential Mode #1556

Draft
Koratahiu wants to merge 120 commits into Nerogar:master from Koratahiu:CANS_EXP

@Koratahiu

Contributor

Summary:

Refactoring

  • Extracted the CANS/Matrix Exponential logic into its own dedicated function.
  • Updated variable names to better reflect the mathematical formulas.
  • Renamed _cayley_batch to _compute_orthogonal_matrix.

Matrix Exponential & Reworking CANS

I found a few issues with the Cayley transform in general (even when using CANS or exact math):

  • The Cayley geometry is rigid and highly non-linear (e.g., the parameter step required to go from 140° to 141° is vastly different from the one required to go from 40° to 41°).
  • To achieve an effective learning rate of O(1) across all rotations, one must apply Riemannian preconditioning to the optimizer step (and an additional, more complex one for weight decay).
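
To make the non-linearity concrete, here is a minimal sketch for the 2×2 case. It assumes the standard Cayley transform R = (I + Q)(I − Q)⁻¹ with Q = [[0, −t], [t, 0]], for which the rotation angle is 2·atan(t). The function name is hypothetical and is not part of this PR.

```python
import math

def cayley_param_for_angle(angle_deg: float) -> float:
    """Parameter t of the 2x2 skew-symmetric Q that yields a given rotation angle.

    Assumes the Cayley rotation angle is 2 * atan(t), so t = tan(angle / 2).
    """
    return math.tan(math.radians(angle_deg) / 2.0)

# Parameter change needed for a one-degree rotation step at two different angles.
step_low = cayley_param_for_angle(41.0) - cayley_param_for_angle(40.0)
step_high = cayley_param_for_angle(141.0) - cayley_param_for_angle(140.0)

# With the matrix exponential the angle equals t, so one degree always costs the same.
step_exp = math.radians(1.0)

print(f"Cayley step 40->41 deg:   {step_low:.5f}")
print(f"Cayley step 140->141 deg: {step_high:.5f}")
print(f"ratio high/low:           {step_high / step_low:.2f}")
print(f"matrix-exp step (any):    {step_exp:.5f}")
```

Under this parameterization, the same one-degree change costs several times more parameter movement near 140° than near 40°, which is the imbalance that preconditioning would otherwise have to correct.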

After searching for alternatives to Cayley-Neumann, I found the Matrix Exponential, which seems to solve all of the issues mentioned above:

  • It achieves the desired outcome of orthogonalization.
  • It is scale-invariant with respect to the norm of Q.
  • It is linear and aligned with the optimizer's flat Euclidean metric (inherently achieving an effective learning rate of O(1) across all rotations).
  • It only requires a spectral norm of ~3.14 (π) to reach the maximum rotation, whereas Cayley needs the norm to approach infinity to reach its maximum rotation.
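
The properties above can be checked numerically with a small NumPy sketch (NumPy is used here only for illustration; the PR itself works with PyTorch). The helper `expm_skew` is a hypothetical name. It relies on the fact that 1j·Q is Hermitian when Q is real skew-symmetric, so the exponential can be obtained from an eigendecomposition.

```python
import numpy as np

def expm_skew(Q):
    """Exact exp(Q) for a real skew-symmetric Q, plus its rotation angles.

    1j * Q is Hermitian, so eigh gives real eigenvalues w and unitary V with
    Q = V diag(-1j * w) V^H, hence exp(Q) = V diag(exp(-1j * w)) V^H.
    """
    w, V = np.linalg.eigh(1j * Q)
    R = (V * np.exp(-1j * w)) @ V.conj().T
    return R.real, w

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q = M - M.T                           # skew-symmetric generator
Q *= np.pi / np.linalg.norm(Q, 2)     # rescale to spectral norm pi

R, w = expm_skew(Q)
orth_err = np.linalg.norm(R.T @ R - np.eye(6))  # distance from orthogonality
max_angle = np.abs(w).max()                     # largest rotation angle (radians)

# Cayley rotation angle at the same spectral norm, assuming angle = 2 * atan(norm).
cayley_angle = 2.0 * np.arctan(np.pi)

print("orthogonality error:", orth_err)
print("max exp angle (deg):", np.degrees(max_angle))
print("Cayley angle at norm pi (deg):", np.degrees(cayley_angle))
```

In this sketch the exponential reaches a half-turn at spectral norm π, while the Cayley angle at the same norm stays at roughly 145°.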

One way to implement the Matrix Exponential is to use the exact math via torch.linalg.matrix_exp. However, matrix_exp is expensive and very unstable in BF16 (it exploded in my tests).
To resolve this, I applied a highly effective approximation pipeline:

Q Scaling step → 4th-order Taylor expansion → Modified CANS → Recover Squaring step → One correction iteration of Newton-Schulz

This might look complex and compute-heavy, but in practice it requires only 12 matmuls. Thanks to the 4th-order Taylor expansion, the matrix is already nearly orthogonal, so CANS needs only 3 iterations to converge.
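
The overall shape of such a pipeline can be sketched in NumPy as follows. This is not the PR's implementation: the classic Newton-Schulz iteration stands in for the modified CANS step (whose coefficients are not given here), the number of squarings is an assumed value, and the sketch does not try to match the 12-matmul budget. All function names are hypothetical.

```python
import numpy as np

def newton_schulz_step(X):
    """One classic Newton-Schulz polar step: X <- X (3I - X^T X) / 2."""
    I = np.eye(X.shape[-1])
    return 0.5 * X @ (3.0 * I - X.T @ X)

def approx_expm_skew(Q, num_squarings=3, ns_iters=3):
    """Approximate exp(Q) for skew-symmetric Q (illustrative sketch only)."""
    I = np.eye(Q.shape[-1])
    A = Q / (2 ** num_squarings)                  # 1) scale Q down
    A2 = A @ A
    R = I + A + A2 / 2 + (A2 @ A) / 6 + (A2 @ A2) / 24  # 2) 4th-order Taylor
    for _ in range(ns_iters):                     # 3) orthogonalize (stand-in for CANS)
        R = newton_schulz_step(R)
    for _ in range(num_squarings):                # 4) undo the scaling by squaring
        R = R @ R
    return newton_schulz_step(R)                  # 5) one correction iteration

def expm_skew_exact(Q):
    """Reference exp(Q) via the Hermitian eigendecomposition of 1j * Q."""
    w, V = np.linalg.eigh(1j * Q)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
Q = M - M.T
Q *= np.pi / np.linalg.norm(Q, 2)   # worst case: spectral norm pi

R_approx = approx_expm_skew(Q)
R_exact = expm_skew_exact(Q)
rel_err = np.linalg.norm(R_approx - R_exact) / np.linalg.norm(R_exact)
orth_err = np.linalg.norm(R_approx.T @ R_approx - np.eye(8))
print("relative error:", rel_err)
print("orthogonality error:", orth_err)
```

The scaling step keeps the Taylor truncation error small, and because the Taylor result is already close to orthogonal, a few orthogonalization iterations suffice before the squarings restore the full rotation.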

The relative error compared to exact math (torch.linalg.matrix_exp) is very small (~1e-6 to ~1e-3 for FP32 and ~1e-3 for BF16):

[Figure: CAN_exp]

Auto clipping mode

Setting spectral norm clipping to -1 now automatically applies the recommended clipping value for each technique:

  • New CANS: ~3.13
  • Truncated Cayley: 0.95
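
A minimal sketch of how the auto mode could resolve to these values is shown below. The dictionary keys, function names, and the rescaling-based clipping are assumptions for illustration. The actual option names and the way the spectral norm is estimated in this PR may differ.

```python
import numpy as np

# Recommended values quoted above; the keys are hypothetical labels.
RECOMMENDED_CLIP = {
    "matrix_exp_cans": 3.13,
    "truncated_cayley": 0.95,
}

def resolve_clip(clip: float, mode: str) -> float:
    """Return the clipping threshold, mapping -1 (auto) to the recommended value."""
    if clip == -1:
        return RECOMMENDED_CLIP[mode]
    return clip

def clip_spectral_norm(Q, max_norm: float):
    """Rescale Q so that its spectral norm does not exceed max_norm (assumed scheme)."""
    norm = np.linalg.norm(Q, 2)
    scale = min(1.0, max_norm / max(norm, 1e-12))
    return Q * scale

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = 10.0 * (M - M.T)                      # deliberately large generator

threshold = resolve_clip(-1, "matrix_exp_cans")
Q_clipped = clip_spectral_norm(Q, threshold)
print("threshold:", threshold)
print("clipped spectral norm:", np.linalg.norm(Q_clipped, 2))
```

An explicit value passed by the user is returned unchanged, so only the -1 sentinel triggers the per-technique default.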

Usage:

  • Enable Matrix Exponential CANS.
  • It is highly recommended to set spectral norm clipping to -1 (auto).

Koratahiu added 30 commits June 8, 2026 15:21
- Dynamic steps based on dtype
- revert scaled oft change