
[MISC] Add register-only tiled cholesky, and incremental H patching, for performance.#2659

Open
hughperkins wants to merge 67 commits into Genesis-Embodied-AI:main from hughperkins:hp/incremental-hessian-fuse-chol-tiles-api

Conversation

@hughperkins
Collaborator

@hughperkins hughperkins commented Apr 5, 2026

Description

  • upgrade Cholesky full to use 16x16 register-only from Quadrants
    • avoids the high latency of shared memory
    • portable across all Quadrants-supported gpus (amd, metal, vulkan, cuda)
  • use incremental H-patching for Hessian, instead of Givens rotations, or full Hessian
    • faster than full (because incremental)
    • avoids serial restrictions of Givens rotations
  • fuse Cholesky and H-patching
    • avoid pushing to and from global memory in between

Note that we need to combine both incremental H-patching and register tiled Cholesky, because the gains for each on their own are much more modest:

For dex_hand, with field, on RTX 6000 Blackwell:

| Approach | branch name | FPS | delta FPS |
| --- | --- | --- | --- |
| main | main | 13686 | +0% |
| tile16 cholesky | hp/cholesky-tile16 | 13813 | +0.93% |
| H-patching | hp/incremental-hessian | 13858 | +1.26% |
| fused H-patching + tile16 cholesky | this PR | 15256 | +11.47% |

This is because fusing both into a single kernel avoids having to make a round-trip to global memory in between them.
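The J^T D J delta update mentioned later in the thread can be sketched in NumPy. This is a minimal, hypothetical illustration of the incremental H-patching idea (function and variable names are mine, not the PR's actual API): instead of rebuilding the full Hessian, subtract the stale rows' contribution and add the updated rows'.

```python
import numpy as np

def patch_hessian(H, J_new, J_old, D_new, D_old):
    """Incrementally patch H = J^T D J when only a few constraint
    rows change: remove the old rows' contribution, add the new.
    Hypothetical sketch; the real kernel works on per-env fields."""
    H -= J_old.T @ (D_old[:, None] * J_old)   # remove stale rows
    H += J_new.T @ (D_new[:, None] * J_new)   # add updated rows
    return H

rng = np.random.default_rng(0)
n_con, n_dofs = 8, 6
J = rng.standard_normal((n_con, n_dofs))
D = rng.uniform(1.0, 2.0, n_con)
H = J.T @ (D[:, None] * J)        # full build once

# Two constraint rows change; patch instead of rebuilding
J2, D2 = J.copy(), D.copy()
J2[:2] = rng.standard_normal((2, n_dofs))
D2[:2] = rng.uniform(1.0, 2.0, 2)
H = patch_hessian(H, J2[:2], J[:2], D2[:2], D[:2])
assert np.allclose(H, J2.T @ (D2[:, None] * J2))
```

The patch costs O(k · n_dofs^2) for k changed rows instead of O(n_con · n_dofs^2) for a full rebuild, which is why it only pays off when few constraints change.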

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

The Cholesky factorization was writing L back into nt_H, overwriting
the Hessian. The H patching code then added J^T D J deltas to L
instead of H, producing a wrong Hessian and degrading convergence 2x.

Fix: add nt_L field to ConstraintState. Cholesky writes to nt_L,
forward/backward substitution reads from nt_L. nt_H always contains
the Hessian for correct patching.

Result: +4.6% speedup over baseline (14991 vs 14332 FPS on dex_hand).
Replace fixed _N_FULL_HESSIAN_ITERS threshold with adaptive per-env
policy: use H patching when less than half the constraints changed,
full tiled rebuild otherwise. +7.2% on RTX PRO 6000 (13944 vs 13005).
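The adaptive per-env policy above can be sketched as a one-line predicate (names are hypothetical; the actual gate lives in the solver's per-env state):

```python
def choose_full_rebuild(n_changed, n_constraints):
    """Adaptive policy sketch: full tiled rebuild when at least half
    the constraints changed, incremental H patching otherwise."""
    return n_changed * 2 >= n_constraints

assert choose_full_rebuild(5, 10) is True    # half changed: rebuild
assert choose_full_rebuild(4, 10) is False   # under half: patch
```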
…tiles

Replace the scalar Cholesky-Crout (64 threads, shared memory, N syncs) with
a blocked left-looking algorithm using 16x16 register-resident tiles:

- tile_potrf_16: diagonal factorization via warp shuffles
- tile_trsm_16: triangular solve via warp shuffles
- tile_gemm_sub_16: symmetric GEMM subtract via warp shuffles
- tile_gemm_sub_offdiag_16: off-diagonal GEMM subtract via warp shuffles

All cross-thread communication uses subgroup shuffles. No shared memory
or block synchronization needed. Occupancy limited only by registers
(~32 warps vs ~7 with shared memory), enabling 3x+ throughput improvement
on latency-bound workloads (benchmarked in perso_hugh/demos/cholesky_full.py).

Requires n_dofs to be a multiple of 16 (satisfied by dex_hand n_dofs=64).

Made-with: Cursor
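The blocked left-looking schedule above can be illustrated with a NumPy sketch. This is a hedged, tile-level model of the algorithm only; the actual kernel keeps each 16x16 tile in registers and communicates via warp shuffles rather than calling library routines. It assumes n is a multiple of the tile edge.

```python
import numpy as np

T = 16  # tile edge, matching the kernel's 16x16 register tiles

def blocked_cholesky(A):
    """Blocked left-looking Cholesky, A = L @ L.T with L lower triangular.
    Assumes A is SPD and n is a multiple of T."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, T):
        # GEMM lookback: subtract already-factored tile columns
        Akk = A[k:k+T, k:k+T] - L[k:k+T, :k] @ L[k:k+T, :k].T
        L[k:k+T, k:k+T] = np.linalg.cholesky(Akk)   # potrf on diagonal tile
        for i in range(k + T, n, T):
            Aik = A[i:i+T, k:k+T] - L[i:i+T, :k] @ L[k:k+T, :k].T
            # trsm: solve L_ik @ L_kk.T = Aik for the off-diagonal tile
            L[i:i+T, k:k+T] = np.linalg.solve(L[k:k+T, k:k+T], Aik.T).T
    return L

rng = np.random.default_rng(0)
n = 64
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # well-conditioned SPD test matrix
L = blocked_cholesky(A)
assert np.allclose(L @ L.T, A)
```

Each inner step maps onto one of the tile primitives named above: the diagonal factorization is tile_potrf_16, the triangular solve is tile_trsm_16, and the lookback subtractions are the GEMM-subtract variants.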
When n_dofs is not a multiple of 16, partial tiles are padded with identity
(diagonal=1, off-diagonal=0) so the factorization produces correct results
for the original n_dofs x n_dofs submatrix. Bounds-checked loads/stores are
extracted into _tile_load_diag_16, _tile_load_offdiag_16, and _tile_store_16.

Fixes test_stickman (humanoid, ~27 DOFs) which was hitting out-of-bounds
accesses with the previous tile16-only implementation.

Made-with: Cursor
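The identity-padding argument above can be checked with a small NumPy experiment (the helper name is mine; the kernel pads implicitly in its bounds-checked loads/stores rather than materializing a padded matrix): padding with the identity makes the extra block factor to the identity, leaving the original n x n factor untouched.

```python
import numpy as np

def pad_to_tile(A, T=16):
    """Embed an n x n SPD matrix in the next multiple-of-T size with
    identity padding (diagonal=1, off-diagonal=0)."""
    n = A.shape[0]
    m = -(-n // T) * T            # round up to a multiple of T
    P = np.eye(m, dtype=A.dtype)
    P[:n, :n] = A
    return P

rng = np.random.default_rng(1)
n = 27                            # e.g. the ~27-DOF humanoid case
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
L = np.linalg.cholesky(pad_to_tile(A))
# The leading n x n block of L is exactly the factor of A, and the
# padding region stays identity, so reads beyond n are harmless.
assert np.allclose(L[:n, :n] @ L[:n, :n].T, A)
assert np.allclose(L[n:, n:], np.eye(L.shape[0] - n))
```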
….func

Two issues fixed:
1. qd.func cannot accept array/field parameters (AnyArray), so the
   _tile_load_diag_16, _tile_load_offdiag_16, _tile_store_16 helpers
   caused a QuadrantsTypeError. Removed them and inlined all bounds-
   checked loads/stores directly in the kernel body.
2. qd.static cannot reference n_dofs (a Quadrants Expr inside @qd.func),
   so the _NEEDS_PAD compile-time branch didn't work. Removed it and
   always use bounds-checked loads/stores — GPU predication handles the
   common aligned case with negligible overhead.

Made-with: Cursor
Update dependency to quadrants 0.6.0b1 which adds subgroupShuffle
support needed by the tile16 Cholesky. Also fix the gpu_graph -> graph
keyword rename in solver_breakdown.py for API compatibility.

Made-with: Cursor
Preserves H in nt_H for H patching. GEMM lookback reads
from nt_L where prior L tiles were stored.
func_cholesky_and_solve_fused_tiled keeps L entirely in shared memory
during both factorization (for GEMM lookback) and forward/backward
solve. Never writes L to global memory, eliminating the nt_L
write-allocate overhead that limited the hp/incremental-hessian branch
to +1.7%.

Graph loop reordered: gradient computed first (no solve), then fused
kernel reads grad and writes Mgrad. Falls back to separate
Cholesky+solve for non-tiled or CG paths.

RTX PRO 6000 Blackwell, dex_hand 4096 envs:
  clean main:              13,489 FPS (303.7 ms/step)
  tile16 only:             13,722 FPS (298.5 ms/step, +1.7%)
  tile16 + H patch (nt_L): 14,061 FPS (291.3 ms/step, +4.2%)
  fused tile16 + H patch:  16,353 FPS (250.5 ms/step, +21.2%)
The batch path (n_envs > 16384 or n_dofs > 96) never uses H patching,
so there's no need to preserve H in nt_H. Revert to the original
in-place factorization on nt_H to eliminate the H→nt_L copy overhead
that caused 3-8% regressions on 30000-env scenes and box_pyramid_6.

The tiled path continues to use nt_L (or shared memory in the fused
kernel) where H patching is active.
The fused Cholesky+Solve kernel used `if tid < n_dofs` to load the
gradient and store Mgrad, which only handles elements 0..15 with 16
threads. For n_dofs > 16 (e.g. the humanoid's n_dofs=27), elements 16+
were uninitialized garbage, causing solver divergence (NaN in
test_stickman) and wrong contact forces.

Fix: use strided loops (`while k < n_dofs: ... k += _CHOL_TILE`)
for both gradient load and Mgrad store.

Also fix test_cholesky_tiling to read the Cholesky factor from the
correct field (nt_L for tiled path, nt_H for batch path).
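The coverage difference between the buggy guard and the strided-loop fix is easy to see in plain Python (a sketch of the indexing only, not the kernel code):

```python
TILE = 16  # thread count per env in the fused kernel

def strided_indices(tid, n_dofs):
    """Elements handled by thread `tid` under the strided-loop fix:
    while k < n_dofs: ... k += TILE"""
    k, out = tid, []
    while k < n_dofs:
        out.append(k)
        k += TILE
    return out

# Buggy `if tid < n_dofs` guard touches only elements 0..15; the
# strided loop covers every element exactly once for any n_dofs.
covered = sorted(i for t in range(TILE) for i in strided_indices(t, 27))
assert covered == list(range(27))
```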
Remove nt_L entirely. All Cholesky paths (batch, tiled, fused) use
nt_H consistently, matching origin/main's design:

- Init/monolith: Cholesky factorizes in-place on nt_H (L overwrites H)
- Graph loop fused kernel: reads H from nt_H, keeps L in shared
  memory only, writes Mgrad. nt_H is never modified by the fused
  kernel, so H is naturally preserved for patching.
- Backward: reads L from nt_H (unchanged from main)

The first graph-loop iteration always does a full Hessian rebuild
(use_full_hessian=1 set in func_solve_init), so nt_H transitions
from L (after init) back to H before the fused kernel runs.

Saves n_dofs^2 * n_envs * 4 bytes of memory (e.g. 12 MB for
dex_hand 4096 envs).
…irst iter

Two bugs introduced by the nt_L elimination:

1. Init contamination: func_solve_init's Cholesky overwrites nt_H with L.
   On the first graph iteration, envs flagged incremental would patch L
   instead of H.  Fixed by forcing use_full_hessian=1 when iter_count <= 1.

2. Non-fused path incompatibility: H patching was gated on Newton only,
   not Newton+tiled.  For the batch path, Cholesky writes L to nt_H every
   iteration, destroying H for subsequent patches.  Fixed by restricting
   H patching + fused Cholesky+Solve to the tiled path only; non-fused
   path now does full H+Cholesky every iteration (same as origin/main).
The fused Cholesky+Solve kernel keeps L in shared memory and never
writes it to nt_H (which retains H for patching).  The batch path
still writes L to nt_H in-place.  Comparing nt_H between paths now
compares H vs L, which differ by design.

Switch to comparing qacc (solver output) which validates end-to-end
correctness of both Cholesky paths.
The runtime use_full_hessian check in func_hessian_direct_tiled was
compiled into the kernel for all scenes, even those not using H
patching.  This added a global memory read + branch per block,
and likely changed register allocation, causing a -23% regression
on box_pyramid_6 (126 DOFs, batch/non-fused path).

Make it a compile-time gate via check_full_hessian template parameter
(default False).  Only the H patching caller (_func_newton_only_nt_hessian)
passes True; all other call sites get the check eliminated at compile time.
The tiled path now uses H patching + fused Cholesky (a different solver
strategy than the batch path's full H+Cholesky every iteration), so
internal solver state (qacc) can differ beyond fp32 tolerance after
a single step.  Compare robot qpos over 10 steps instead, which
validates equivalent physics output while tolerating internal solver
differences.
More accurate BLAS naming: syr (symmetric rank-1 update) for the
diagonal GEMM subtract, ger (general rank-1 update) for the
off-diagonal GEMM subtract.
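The BLAS distinction can be stated in one line each (a NumPy illustration of the naming, not the tile functions themselves): syr is a symmetric rank-1 update x x^T, ger a general rank-1 update x y^T.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(16), rng.standard_normal(16)
A = np.zeros((16, 16))

A_syr = A - np.outer(x, x)   # syr-style subtract: diagonal tiles stay symmetric
A_ger = A - np.outer(x, y)   # ger-style subtract: off-diagonal tiles, no symmetry

assert np.allclose(A_syr, A_syr.T)
```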
Replace 280+ lines of hand-rolled r0..r15 register management and
_tile_{syr_sub,ger_sub,potrf,trsm}_16 functions with Tile16 dataclass
from quadrants.lang.simt.tile16. Both func_cholesky_factor_direct_tiled
and func_cholesky_and_solve_fused_tiled are migrated.

Net: -521 lines, +41 lines. Bump quadrants to 0.6.0b2.

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@hughperkins hughperkins force-pushed the hp/incremental-hessian-fuse-chol-tiles-api branch 5 times, most recently from 710cf0c to b0babb2 (April 5, 2026 02:40)
@hughperkins hughperkins force-pushed the hp/incremental-hessian-fuse-chol-tiles-api branch from b0babb2 to 1de2b6c (April 5, 2026 02:41)
Comment thread genesis/engine/solvers/rigid/constraint/solver_breakdown.py Outdated
Comment thread tests/test_rigid_physics.py Outdated
@hughperkins
Collaborator Author

Note: updated the vector bits to use a VectorProxy too now, similar to the TileProxy:

                        v_own = L_sh[i0:n_dofs, j0 + t]
                        v_diag = L_sh[k0:n_dofs, j0 + t]
                        L_ik -= qd.outer(v_own, v_diag)

@hughperkins
Collaborator Author

Added some comments here:

        # --- Scalar triangular solve using L from shared memory ---
        # No longer using 16x16 tiles; the 16 threads parallelize each row's
        # dot product by striping across columns, then butterfly-reduce to
        # sum the partial products. Thread 0 writes each solved element.



@qd.func
def _butterfly_reduce_16(val, tid):
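For reference, what a 16-lane butterfly reduction computes can be simulated in NumPy (a hedged sketch only; the real function would use subgroup XOR shuffles, and the Quadrants API is not shown here):

```python
import numpy as np

def butterfly_reduce_16(vals):
    """Simulate a 16-lane XOR butterfly reduction: after log2(16)=4
    rounds, every lane holds the sum of all 16 inputs."""
    vals = np.asarray(vals, dtype=float).copy()
    offset = 8
    while offset >= 1:
        # lane i adds the value held by lane i ^ offset
        vals = vals + vals[np.arange(16) ^ offset]
        offset //= 2
    return vals

parts = np.arange(16, dtype=float)      # each thread's partial product
out = butterfly_reduce_16(parts)
assert np.allclose(out, parts.sum())    # all 16 lanes hold the total
```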
Collaborator Author


I wonder if this should be moved into quadrants somewhere?

Collaborator Author


Probably, right?

Collaborator Author

@hughperkins hughperkins Apr 6, 2026


(Also, I kind of lean towards the name describing the effect and behavior, rather than the detailed implementation?)

Collaborator


Yes, that would be much better in Genesis, where less expertise from maintainers can be expected. Nice to mention the implementation in the docstring, though.

Collaborator


Indeed, you should move it into Quadrants, I think. My previous comment still holds, since it would be public API.

@github-actions

github-actions bot commented Apr 7, 2026

🔴 Benchmark Regression Detected ➡️ Report

…n-fuse-chol-tiles-api

Made-with: Cursor

# Conflicts:
#	pyproject.toml
…16x16

qd.types.Tile16x16 was removed in quadrants hp/tiles-5. Use the new
public API: qd.simt.Tile16x16.zeros(dtype=...) for zero tiles.
Module-level alias `Tile16x16 = qd.simt.Tile16x16` caused the purity
checker to flag `Tile16x16.SIZE` as a global variable access. Instead:
- Use `_TILE = qd.simt.Tile16x16.SIZE` (int constant) for SIZE
- Use `qd.simt.Tile16x16.zeros(dtype=...)` directly in kernels
- Use `eye_()` (now public) instead of `_eye_()`
The purity checker rejects any module-level variable access inside
kernels, including plain int constants like _TILE = 16. Replace all
_TILE references with literal 16 inside kernel functions.
@github-actions

🔴 Benchmark Regression Detected ➡️ Report

@github-actions

🔴 Benchmark Regression Detected ➡️ Report
