
[MISC] Add register-only tiled cholesky, and incremental H patching, for performance.#2659

Open
hughperkins wants to merge 67 commits into Genesis-Embodied-AI:main from hughperkins:hp/incremental-hessian-fuse-chol-tiles-api

Conversation

@hughperkins
Collaborator

@hughperkins hughperkins commented Apr 5, 2026

Description

  • upgrade Cholesky full to use 16x16 register-only from Quadrants
    • avoids the high latency of shared memory
    • portable across all Quadrants-supported gpus (amd, metal, vulkan, cuda)
  • use incremental H-patching for Hessian, instead of Givens rotations, or full Hessian
    • faster than full (because incremental)
    • avoids serial restrictions of Givens rotations
  • fuse Cholesky and H-patching
    • avoid pushing to and from global memory in between

Note that we need to combine both incremental H-patching and register tiled Cholesky, because the gains for each on their own are much more modest:

For dex_hand, with field, on RTX 6000 Blackwell:

| Approach | branch name | FPS | delta FPS |
| --- | --- | --- | --- |
| main | main | 13686 | +0% |
| tile16 cholesky | hp/cholesky-tile16 | 13813 | +0.93% |
| H-patching | hp/incremental-hessian | 13858 | +1.26% |
| fused H-patching + tile16 cholesky | this PR | 15256 | +11.47% |

This is because fusing both into a single kernel avoids having to make a round-trip to global memory in between them.
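The J^T D J delta update mentioned later in the thread can be sketched in NumPy. This is a minimal, hypothetical illustration of the incremental H-patching idea (function and variable names are mine, not the PR's actual API): instead of rebuilding the full Hessian, subtract the stale rows' contribution and add the updated rows'.

```python
import numpy as np

def patch_hessian(H, J_new, J_old, D_new, D_old):
    """Incrementally patch H = J^T D J when only a few constraint
    rows change: remove the old rows' contribution, add the new.
    Hypothetical sketch; the real kernel works on per-env fields."""
    H -= J_old.T @ (D_old[:, None] * J_old)   # remove stale rows
    H += J_new.T @ (D_new[:, None] * J_new)   # add updated rows
    return H

rng = np.random.default_rng(0)
n_con, n_dofs = 8, 6
J = rng.standard_normal((n_con, n_dofs))
D = rng.uniform(1.0, 2.0, n_con)
H = J.T @ (D[:, None] * J)        # full build once

# Two constraint rows change; patch instead of rebuilding
J2, D2 = J.copy(), D.copy()
J2[:2] = rng.standard_normal((2, n_dofs))
D2[:2] = rng.uniform(1.0, 2.0, 2)
H = patch_hessian(H, J2[:2], J[:2], D2[:2], D[:2])
assert np.allclose(H, J2.T @ (D2[:, None] * J2))
```

The patch costs O(k · n_dofs^2) for k changed rows instead of O(n_con · n_dofs^2) for a full rebuild, which is why it only pays off when few constraints change.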

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

The Cholesky factorization was writing L back into nt_H, overwriting
the Hessian. The H patching code then added J^T D J deltas to L
instead of H, producing a wrong Hessian and degrading convergence 2x.

Fix: add nt_L field to ConstraintState. Cholesky writes to nt_L,
forward/backward substitution reads from nt_L. nt_H always contains
the Hessian for correct patching.

Result: +4.6% speedup over baseline (14991 vs 14332 FPS on dex_hand).
Replace fixed _N_FULL_HESSIAN_ITERS threshold with adaptive per-env
policy: use H patching when less than half the constraints changed,
full tiled rebuild otherwise. +7.2% on RTX PRO 6000 (13944 vs 13005).
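The adaptive per-env policy above can be sketched as a one-line predicate (names are hypothetical; the actual gate lives in the solver's per-env state):

```python
def choose_full_rebuild(n_changed, n_constraints):
    """Adaptive policy sketch: full tiled rebuild when at least half
    the constraints changed, incremental H patching otherwise."""
    return n_changed * 2 >= n_constraints

assert choose_full_rebuild(5, 10) is True    # half changed: rebuild
assert choose_full_rebuild(4, 10) is False   # under half: patch
```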
…tiles

Replace the scalar Cholesky-Crout (64 threads, shared memory, N syncs) with
a blocked left-looking algorithm using 16x16 register-resident tiles:

- tile_potrf_16: diagonal factorization via warp shuffles
- tile_trsm_16: triangular solve via warp shuffles
- tile_gemm_sub_16: symmetric GEMM subtract via warp shuffles
- tile_gemm_sub_offdiag_16: off-diagonal GEMM subtract via warp shuffles

All cross-thread communication uses subgroup shuffles. No shared memory
or block synchronization needed. Occupancy limited only by registers
(~32 warps vs ~7 with shared memory), enabling 3x+ throughput improvement
on latency-bound workloads (benchmarked in perso_hugh/demos/cholesky_full.py).

Requires n_dofs to be a multiple of 16 (satisfied by dex_hand n_dofs=64).

Made-with: Cursor
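The blocked left-looking schedule above can be illustrated with a NumPy sketch. This is a hedged, tile-level model of the algorithm only; the actual kernel keeps each 16x16 tile in registers and communicates via warp shuffles rather than calling library routines. It assumes n is a multiple of the tile edge.

```python
import numpy as np

T = 16  # tile edge, matching the kernel's 16x16 register tiles

def blocked_cholesky(A):
    """Blocked left-looking Cholesky, A = L @ L.T with L lower triangular.
    Assumes A is SPD and n is a multiple of T."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, T):
        # GEMM lookback: subtract already-factored tile columns
        Akk = A[k:k+T, k:k+T] - L[k:k+T, :k] @ L[k:k+T, :k].T
        L[k:k+T, k:k+T] = np.linalg.cholesky(Akk)   # potrf on diagonal tile
        for i in range(k + T, n, T):
            Aik = A[i:i+T, k:k+T] - L[i:i+T, :k] @ L[k:k+T, :k].T
            # trsm: solve L_ik @ L_kk.T = Aik for the off-diagonal tile
            L[i:i+T, k:k+T] = np.linalg.solve(L[k:k+T, k:k+T], Aik.T).T
    return L

rng = np.random.default_rng(0)
n = 64
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # well-conditioned SPD test matrix
L = blocked_cholesky(A)
assert np.allclose(L @ L.T, A)
```

Each inner step maps onto one of the tile primitives named above: the diagonal factorization is tile_potrf_16, the triangular solve is tile_trsm_16, and the lookback subtractions are the GEMM-subtract variants.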
When n_dofs is not a multiple of 16, partial tiles are padded with identity
(diagonal=1, off-diagonal=0) so the factorization produces correct results
for the original n_dofs x n_dofs submatrix. Bounds-checked loads/stores are
extracted into _tile_load_diag_16, _tile_load_offdiag_16, and _tile_store_16.

Fixes test_stickman (humanoid, ~27 DOFs) which was hitting out-of-bounds
accesses with the previous tile16-only implementation.

Made-with: Cursor
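The identity-padding argument above can be checked with a small NumPy experiment (the helper name is mine; the kernel pads implicitly in its bounds-checked loads/stores rather than materializing a padded matrix): padding with the identity makes the extra block factor to the identity, leaving the original n x n factor untouched.

```python
import numpy as np

def pad_to_tile(A, T=16):
    """Embed an n x n SPD matrix in the next multiple-of-T size with
    identity padding (diagonal=1, off-diagonal=0)."""
    n = A.shape[0]
    m = -(-n // T) * T            # round up to a multiple of T
    P = np.eye(m, dtype=A.dtype)
    P[:n, :n] = A
    return P

rng = np.random.default_rng(1)
n = 27                            # e.g. the ~27-DOF humanoid case
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
L = np.linalg.cholesky(pad_to_tile(A))
# The leading n x n block of L is exactly the factor of A, and the
# padding region stays identity, so reads beyond n are harmless.
assert np.allclose(L[:n, :n] @ L[:n, :n].T, A)
assert np.allclose(L[n:, n:], np.eye(L.shape[0] - n))
```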
….func

Two issues fixed:
1. qd.func cannot accept array/field parameters (AnyArray), so the
   _tile_load_diag_16, _tile_load_offdiag_16, _tile_store_16 helpers
   caused a QuadrantsTypeError. Removed them and inlined all bounds-
   checked loads/stores directly in the kernel body.
2. qd.static cannot reference n_dofs (a Quadrants Expr inside @qd.func),
   so the _NEEDS_PAD compile-time branch didn't work. Removed it and
   always use bounds-checked loads/stores — GPU predication handles the
   common aligned case with negligible overhead.

Made-with: Cursor
Update dependency to quadrants 0.6.0b1 which adds subgroupShuffle
support needed by the tile16 Cholesky. Also fix the gpu_graph -> graph
keyword rename in solver_breakdown.py for API compatibility.

Made-with: Cursor
Preserves H in nt_H for H patching. GEMM lookback reads
from nt_L where prior L tiles were stored.
func_cholesky_and_solve_fused_tiled keeps L entirely in shared memory
during both factorization (for GEMM lookback) and forward/backward
solve. Never writes L to global memory, eliminating the nt_L
write-allocate overhead that limited the hp/incremental-hessian branch
to +1.7%.

Graph loop reordered: gradient computed first (no solve), then fused
kernel reads grad and writes Mgrad. Falls back to separate
Cholesky+solve for non-tiled or CG paths.

RTX PRO 6000 Blackwell, dex_hand 4096 envs:
  clean main:              13,489 FPS (303.7 ms/step)
  tile16 only:             13,722 FPS (298.5 ms/step, +1.7%)
  tile16 + H patch (nt_L): 14,061 FPS (291.3 ms/step, +4.2%)
  fused tile16 + H patch:  16,353 FPS (250.5 ms/step, +21.2%)
The batch path (n_envs > 16384 or n_dofs > 96) never uses H patching,
so there's no need to preserve H in nt_H. Revert to the original
in-place factorization on nt_H to eliminate the H→nt_L copy overhead
that caused 3-8% regressions on 30000-env scenes and box_pyramid_6.

The tiled path continues to use nt_L (or shared memory in the fused
kernel) where H patching is active.
The fused Cholesky+Solve kernel used `if tid < n_dofs` to load the
gradient and store Mgrad, which only handles elements 0..15 with 16
threads. For n_dofs > 16 (e.g. the humanoid's n_dofs=27), elements 16+
were uninitialized garbage, causing solver divergence (NaN in
test_stickman) and wrong contact forces.

Fix: use strided loops (`while k < n_dofs: ... k += _CHOL_TILE`)
for both gradient load and Mgrad store.

Also fix test_cholesky_tiling to read the Cholesky factor from the
correct field (nt_L for tiled path, nt_H for batch path).
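The coverage difference between the buggy guard and the strided-loop fix is easy to see in plain Python (a sketch of the indexing only, not the kernel code):

```python
TILE = 16  # thread count per env in the fused kernel

def strided_indices(tid, n_dofs):
    """Elements handled by thread `tid` under the strided-loop fix:
    while k < n_dofs: ... k += TILE"""
    k, out = tid, []
    while k < n_dofs:
        out.append(k)
        k += TILE
    return out

# Buggy `if tid < n_dofs` guard touches only elements 0..15; the
# strided loop covers every element exactly once for any n_dofs.
covered = sorted(i for t in range(TILE) for i in strided_indices(t, 27))
assert covered == list(range(27))
```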
Remove nt_L entirely. All Cholesky paths (batch, tiled, fused) use
nt_H consistently, matching origin/main's design:

- Init/monolith: Cholesky factorizes in-place on nt_H (L overwrites H)
- Graph loop fused kernel: reads H from nt_H, keeps L in shared
  memory only, writes Mgrad. nt_H is never modified by the fused
  kernel, so H is naturally preserved for patching.
- Backward: reads L from nt_H (unchanged from main)

The first graph-loop iteration always does a full Hessian rebuild
(use_full_hessian=1 set in func_solve_init), so nt_H transitions
from L (after init) back to H before the fused kernel runs.

Saves n_dofs^2 * n_envs * 4 bytes of memory (e.g. 12 MB for
dex_hand 4096 envs).
…irst iter

Two bugs introduced by the nt_L elimination:

1. Init contamination: func_solve_init's Cholesky overwrites nt_H with L.
   On the first graph iteration, envs flagged incremental would patch L
   instead of H.  Fixed by forcing use_full_hessian=1 when iter_count <= 1.

2. Non-fused path incompatibility: H patching was gated on Newton only,
   not Newton+tiled.  For the batch path, Cholesky writes L to nt_H every
   iteration, destroying H for subsequent patches.  Fixed by restricting
   H patching + fused Cholesky+Solve to the tiled path only; non-fused
   path now does full H+Cholesky every iteration (same as origin/main).
The fused Cholesky+Solve kernel keeps L in shared memory and never
writes it to nt_H (which retains H for patching).  The batch path
still writes L to nt_H in-place.  Comparing nt_H between paths now
compares H vs L, which differ by design.

Switch to comparing qacc (solver output) which validates end-to-end
correctness of both Cholesky paths.
The runtime use_full_hessian check in func_hessian_direct_tiled was
compiled into the kernel for all scenes, even those not using H
patching.  This added a global memory read + branch per block,
and likely changed register allocation, causing a -23% regression
on box_pyramid_6 (126 DOFs, batch/non-fused path).

Make it a compile-time gate via check_full_hessian template parameter
(default False).  Only the H patching caller (_func_newton_only_nt_hessian)
passes True; all other call sites get the check eliminated at compile time.
The tiled path now uses H patching + fused Cholesky (a different solver
strategy than the batch path's full H+Cholesky every iteration), so
internal solver state (qacc) can differ beyond fp32 tolerance after
a single step.  Compare robot qpos over 10 steps instead, which
validates equivalent physics output while tolerating internal solver
differences.
More accurate BLAS naming: syr (symmetric rank-1 update) for the
diagonal GEMM subtract, ger (general rank-1 update) for the
off-diagonal GEMM subtract.
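The BLAS distinction can be stated in one line each (a NumPy illustration of the naming, not the tile functions themselves): syr is a symmetric rank-1 update x x^T, ger a general rank-1 update x y^T.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(16), rng.standard_normal(16)
A = np.zeros((16, 16))

A_syr = A - np.outer(x, x)   # syr-style subtract: diagonal tiles stay symmetric
A_ger = A - np.outer(x, y)   # ger-style subtract: off-diagonal tiles, no symmetry

assert np.allclose(A_syr, A_syr.T)
```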
Replace 280+ lines of hand-rolled r0..r15 register management and
_tile_{syr_sub,ger_sub,potrf,trsm}_16 functions with Tile16 dataclass
from quadrants.lang.simt.tile16. Both func_cholesky_factor_direct_tiled
and func_cholesky_and_solve_fused_tiled are migrated.

Net: -521 lines, +41 lines. Bump quadrants to 0.6.0b2.

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@hughperkins hughperkins force-pushed the hp/incremental-hessian-fuse-chol-tiles-api branch 5 times, most recently from 710cf0c to b0babb2 (April 5, 2026 02:40)
@hughperkins hughperkins force-pushed the hp/incremental-hessian-fuse-chol-tiles-api branch from b0babb2 to 1de2b6c (April 5, 2026 02:41)
Comment thread genesis/engine/solvers/rigid/constraint/solver_breakdown.py Outdated
Comment thread tests/test_rigid_physics.py Outdated
@hughperkins
Collaborator Author

Note: updated the vector bits to use a VectorProxy too now, similar to the TileProxy:

                        v_own = L_sh[i0:n_dofs, j0 + t]
                        v_diag = L_sh[k0:n_dofs, j0 + t]
                        L_ik -= qd.outer(v_own, v_diag)

@hughperkins
Collaborator Author

Added some comments here:

        # --- Scalar triangular solve using L from shared memory ---
        # No longer using 16x16 tiles; the 16 threads parallelize each row's
        # dot product by striping across columns, then butterfly-reduce to
        # sum the partial products. Thread 0 writes each solved element.



@qd.func
def _butterfly_reduce_16(val, tid):
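For reference, what a 16-lane butterfly reduction computes can be simulated in NumPy (a hedged sketch only; the real function would use subgroup XOR shuffles, and the Quadrants API is not shown here):

```python
import numpy as np

def butterfly_reduce_16(vals):
    """Simulate a 16-lane XOR butterfly reduction: after log2(16)=4
    rounds, every lane holds the sum of all 16 inputs."""
    vals = np.asarray(vals, dtype=float).copy()
    offset = 8
    while offset >= 1:
        # lane i adds the value held by lane i ^ offset
        vals = vals + vals[np.arange(16) ^ offset]
        offset //= 2
    return vals

parts = np.arange(16, dtype=float)      # each thread's partial product
out = butterfly_reduce_16(parts)
assert np.allclose(out, parts.sum())    # all 16 lanes hold the total
```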
Collaborator Author


I wonder if this should be moved into quadrants somewhere?

Collaborator Author


Probably, right?

Collaborator Author

@hughperkins hughperkins Apr 6, 2026


(Also, I kind of lean towards the name describing the effect and behavior, rather than the detailed implementation?)

Collaborator


Yes, that would be much better in Genesis, where less expertise from maintainers can be expected. Nice to mention the implementation in the docstring, though.

Collaborator


Indeed, you should move it into Quadrants, I think. My previous comment still holds, since it would be public API.

@github-actions

github-actions bot commented Apr 7, 2026

🔴 Benchmark Regression Detected ➡️ Report

…n-fuse-chol-tiles-api

Made-with: Cursor

# Conflicts:
#	pyproject.toml
…16x16

qd.types.Tile16x16 was removed in quadrants hp/tiles-5. Use the new
public API: qd.simt.Tile16x16.zeros(dtype=...) for zero tiles.
Module-level alias `Tile16x16 = qd.simt.Tile16x16` caused the purity
checker to flag `Tile16x16.SIZE` as a global variable access. Instead:
- Use `_TILE = qd.simt.Tile16x16.SIZE` (int constant) for SIZE
- Use `qd.simt.Tile16x16.zeros(dtype=...)` directly in kernels
- Use `eye_()` (now public) instead of `_eye_()`
The purity checker rejects any module-level variable access inside
kernels, including plain int constants like _TILE = 16. Replace all
_TILE references with literal 16 inside kernel functions.
@github-actions

🔴 Benchmark Regression Detected ➡️ Report

@github-actions

🔴 Benchmark Regression Detected ➡️ Report
