[MISC] Add register-only tiled Cholesky and incremental H patching for performance (#2659)
Conversation
The Cholesky factorization was writing L back into nt_H, overwriting the Hessian. The H patching code then added J^T D J deltas to L instead of H, producing a wrong Hessian and degrading convergence by 2x. Fix: add an nt_L field to ConstraintState. Cholesky writes to nt_L, and forward/backward substitution reads from nt_L, so nt_H always contains the Hessian for correct patching. Result: +4.6% speedup over baseline (14,991 vs 14,332 FPS on dex_hand).
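A minimal NumPy sketch of the fix (illustrative only, not the Quadrants kernel): factoring into a separate buffer instead of in place means the Hessian buffer is still valid when the incremental patches are applied to it.

```python
import numpy as np

def cholesky_keep_h(nt_H, nt_L):
    """Factor H into a separate buffer so nt_H keeps the Hessian.

    Sketch of the nt_L fix: writing L to nt_L instead of factorizing
    in place means later J^T D J patches are applied to the real
    Hessian in nt_H, not to its Cholesky factor.
    """
    nt_L[...] = np.linalg.cholesky(nt_H)  # nt_H is left untouched

# usage: a small SPD "Hessian"
H = np.array([[4., 2.],
              [2., 3.]])
L = np.empty_like(H)
cholesky_keep_h(H, L)
assert np.allclose(L @ L.T, H)          # L is a valid factor
assert np.allclose(H, [[4., 2.], [2., 3.]])  # H survives for patching
```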
Replace the fixed _N_FULL_HESSIAN_ITERS threshold with an adaptive per-env policy: use H patching when fewer than half the constraints changed, and a full tiled rebuild otherwise. +7.2% on RTX PRO 6000 (13,944 vs 13,005 FPS).
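The per-env decision reduces to a one-line predicate; a hedged sketch (the function name is illustrative, not the Genesis API):

```python
def use_h_patching(n_changed: int, n_total: int) -> bool:
    """Adaptive per-env policy sketch: patch the Hessian incrementally
    when fewer than half of the constraints changed since the last
    iteration; otherwise fall back to a full tiled rebuild."""
    return 2 * n_changed < n_total

# e.g. 3 of 10 constraints changed -> cheap incremental patch
print(use_h_patching(3, 10))   # True
print(use_h_patching(6, 10))   # False -> full rebuild
```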
…tiles Replace the scalar Cholesky-Crout (64 threads, shared memory, N syncs) with a blocked left-looking algorithm using 16x16 register-resident tiles:
- tile_potrf_16: diagonal factorization via warp shuffles
- tile_trsm_16: triangular solve via warp shuffles
- tile_gemm_sub_16: symmetric GEMM subtract via warp shuffles
- tile_gemm_sub_offdiag_16: off-diagonal GEMM subtract via warp shuffles

All cross-thread communication uses subgroup shuffles; no shared memory or block synchronization is needed. Occupancy is limited only by registers (~32 warps vs ~7 with shared memory), enabling a 3x+ throughput improvement on latency-bound workloads (benchmarked in perso_hugh/demos/cholesky_full.py). Requires n_dofs to be a multiple of 16 (satisfied by dex_hand, n_dofs=64).

Made-with: Cursor
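The blocked left-looking structure can be sketched in NumPy, with library calls standing in for the warp-shuffle tile primitives (potrf on the diagonal tile, trsm on the column below it, GEMM-subtract for the lookback); this is an illustrative host-side model, not the GPU kernel:

```python
import numpy as np

T = 16  # tile size; this sketch assumes n is a multiple of 16

def blocked_cholesky(A):
    """Left-looking blocked Cholesky sketch.

    For each 16-wide block column j:
      1. GEMM lookback: subtract contributions of prior L columns
      2. potrf on the 16x16 diagonal tile
      3. trsm: solve the off-diagonal tiles against the diagonal factor
    """
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(0, n, T):
        # 1. lookback subtract (tile_gemm_sub_* in the kernel)
        S = A[j:, j:j+T] - L[j:, :j] @ L[j:j+T, :j].T
        # 2. diagonal tile factorization (tile_potrf_16)
        L[j:j+T, j:j+T] = np.linalg.cholesky(S[:T])
        # 3. triangular solve for the tiles below (tile_trsm_16)
        if j + T < n:
            L[j+T:, j:j+T] = np.linalg.solve(L[j:j+T, j:j+T], S[T:].T).T
    return L
```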
When n_dofs is not a multiple of 16, partial tiles are padded with identity (diagonal=1, off-diagonal=0) so the factorization produces correct results for the original n_dofs x n_dofs submatrix. Bounds-checked loads/stores are extracted into _tile_load_diag_16, _tile_load_offdiag_16, and _tile_store_16. Fixes test_stickman (humanoid, ~27 DOFs) which was hitting out-of-bounds accesses with the previous tile16-only implementation. Made-with: Cursor
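The identity-padding rule for partial tiles can be shown as a small NumPy sketch (the helper name mirrors the ones described above but is illustrative): padding the out-of-range portion of a diagonal tile with 1 on the diagonal and 0 elsewhere leaves the factorization of the valid n_dofs x n_dofs submatrix exact.

```python
import numpy as np

def load_diag_tile(A, j, n, T=16):
    """Bounds-checked diagonal-tile load sketch.

    Elements past n (= n_dofs) are padded with identity, so a Cholesky
    of the padded tile factors the padding trivially (L = I there) and
    leaves the valid submatrix untouched.
    """
    tile = np.eye(T, dtype=A.dtype)
    m = min(T, n - j)          # number of valid rows/cols in this tile
    tile[:m, :m] = A[j:j+m, j:j+m]
    return tile
```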
…func Two issues fixed:
1. qd.func cannot accept array/field parameters (AnyArray), so the _tile_load_diag_16, _tile_load_offdiag_16, and _tile_store_16 helpers caused a QuadrantsTypeError. Removed them and inlined all bounds-checked loads/stores directly in the kernel body.
2. qd.static cannot reference n_dofs (a Quadrants Expr inside @qd.func), so the _NEEDS_PAD compile-time branch didn't work. Removed it and always use bounds-checked loads/stores — GPU predication handles the common aligned case with negligible overhead.

Made-with: Cursor
Update dependency to quadrants 0.6.0b1 which adds subgroupShuffle support needed by the tile16 Cholesky. Also fix the gpu_graph -> graph keyword rename in solver_breakdown.py for API compatibility. Made-with: Cursor
Preserves H in nt_H for H patching. GEMM lookback reads from nt_L where prior L tiles were stored.
func_cholesky_and_solve_fused_tiled keeps L entirely in shared memory during both factorization (for GEMM lookback) and the forward/backward solve. It never writes L to global memory, eliminating the nt_L write-allocate overhead that limited the hp/incremental-hessian branch to +1.7%. The graph loop is reordered: the gradient is computed first (no solve), then the fused kernel reads grad and writes Mgrad. Falls back to separate Cholesky+solve for non-tiled or CG paths.

RTX PRO 6000 Blackwell, dex_hand 4096 envs:
- clean main: 13,489 FPS (303.7 ms/step)
- tile16 only: 13,722 FPS (298.5 ms/step, +1.7%)
- tile16 + H patch (nt_L): 14,061 FPS (291.3 ms/step, +4.2%)
- fused tile16 + H patch: 16,353 FPS (250.5 ms/step, +21.2%)
The batch path (n_envs > 16384 or n_dofs > 96) never uses H patching, so there's no need to preserve H in nt_H. Revert to the original in-place factorization on nt_H to eliminate the H→nt_L copy overhead that caused 3-8% regressions on 30000-env scenes and box_pyramid_6. The tiled path continues to use nt_L (or shared memory in the fused kernel) where H patching is active.
The fused Cholesky+Solve kernel used `if tid < n_dofs` to load the gradient and store Mgrad, which only handles elements 0..15 with 16 threads. For n_dofs > 16 (e.g. the humanoid's n_dofs=27), elements 16+ were uninitialized garbage, causing solver divergence (NaN in test_stickman) and wrong contact forces. Fix: use strided loops (`while k < n_dofs: ... k += _CHOL_TILE`) for both the gradient load and the Mgrad store. Also fix test_cholesky_tiling to read the Cholesky factor from the correct field (nt_L for the tiled path, nt_H for the batch path).
Remove nt_L entirely. All Cholesky paths (batch, tiled, fused) use nt_H consistently, matching origin/main's design:
- Init/monolith: Cholesky factorizes in-place on nt_H (L overwrites H).
- Graph loop fused kernel: reads H from nt_H, keeps L in shared memory only, writes Mgrad. nt_H is never modified by the fused kernel, so H is naturally preserved for patching.
- Backward: reads L from nt_H (unchanged from main).

The first graph-loop iteration always does a full Hessian rebuild (use_full_hessian=1 set in func_solve_init), so nt_H transitions from L (after init) back to H before the fused kernel runs. Saves n_dofs^2 * n_envs * 4 bytes of memory (e.g. 12 MB for dex_hand 4096 envs).
…irst iter Two bugs introduced by the nt_L elimination:
1. Init contamination: func_solve_init's Cholesky overwrites nt_H with L. On the first graph iteration, envs flagged incremental would patch L instead of H. Fixed by forcing use_full_hessian=1 when iter_count <= 1.
2. Non-fused path incompatibility: H patching was gated on Newton only, not Newton+tiled. For the batch path, Cholesky writes L to nt_H every iteration, destroying H for subsequent patches. Fixed by restricting H patching + fused Cholesky+Solve to the tiled path only; the non-fused path now does a full H+Cholesky every iteration (same as origin/main).
The fused Cholesky+Solve kernel keeps L in shared memory and never writes it to nt_H (which retains H for patching). The batch path still writes L to nt_H in-place. Comparing nt_H between paths now compares H vs L, which differ by design. Switch to comparing qacc (solver output) which validates end-to-end correctness of both Cholesky paths.
The runtime use_full_hessian check in func_hessian_direct_tiled was compiled into the kernel for all scenes, even those not using H patching. This added a global memory read + branch per block, and likely changed register allocation, causing a -23% regression on box_pyramid_6 (126 DOFs, batch/non-fused path). Make it a compile-time gate via check_full_hessian template parameter (default False). Only the H patching caller (_func_newton_only_nt_hessian) passes True; all other call sites get the check eliminated at compile time.
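The compile-time gate pattern can be sketched in Python with kernel specialization at build time (names and the closure mechanism are illustrative; Quadrants uses a template parameter to the same effect): the runtime branch exists only in the variant that actually needs it.

```python
def make_hessian_kernel(check_full_hessian: bool):
    """Build-time specialization sketch: when check_full_hessian is
    False, the use_full_hessian read and branch are simply absent
    from the generated kernel, so non-patching scenes pay nothing."""
    if check_full_hessian:
        def kernel(env):
            # runtime gate, only compiled in for the H-patching caller
            if env["use_full_hessian"]:
                return "full_rebuild"
            return "patch"
    else:
        def kernel(env):
            return "full_rebuild"  # gate eliminated at build time
    return kernel
```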
The tiled path now uses H patching + fused Cholesky (a different solver strategy than the batch path's full H+Cholesky every iteration), so internal solver state (qacc) can differ beyond fp32 tolerance after a single step. Compare robot qpos over 10 steps instead, which validates equivalent physics output while tolerating internal solver differences.
More accurate BLAS naming: syr (symmetric rank-1 update) for the diagonal GEMM subtract, ger (general rank-1 update) for the off-diagonal GEMM subtract.
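In NumPy terms (an illustrative stand-in for the tile primitives), the two operations the new names describe are:

```python
import numpy as np

def syr_sub(A, x):
    """syr: symmetric rank-1 update, A -= x x^T (diagonal tile).
    Only valid because A is symmetric; the kernel exploits this."""
    A -= np.outer(x, x)

def ger_sub(B, x, y):
    """ger: general rank-1 update, B -= x y^T (off-diagonal tile),
    where x and y come from two different tile rows."""
    B -= np.outer(x, y)
```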
Replace 280+ lines of hand-rolled r0..r15 register management and
_tile_{syr_sub,ger_sub,potrf,trsm}_16 functions with Tile16 dataclass
from quadrants.lang.simt.tile16. Both func_cholesky_factor_direct_tiled
and func_cholesky_and_solve_fused_tiled are migrated.
Net: -521 lines, +41 lines. Bump quadrants to 0.6.0b2.
Replace manual tid-based column loads with arr[batch, r0:r_end, col] slice syntax for outer product updates in both func_cholesky_factor_direct_tiled and func_cholesky_and_solve_fused_tiled.
Note: updated the vector bits to use a VectorProxy too now, similar to the TileProxy.
added some comments here:
@qd.func
def _butterfly_reduce_16(val, tid):
I wonder if this should be moved into quadrants somewhere?
Probably right?
(also, I kind of lean towards the name describing the effect and behavior, rather than the detailed implementation?)
Yes, that would be much better in Genesis, where lower expertise from maintainers can be expected. It's nice to mention the implementation in the docstring though.
Indeed, you should move it into Quadrants I think. My previous comment still holds, since it would be public API.
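For context on what _butterfly_reduce_16 computes (the name under discussion), here is a plain-Python model of a 16-lane butterfly (XOR-shuffle) reduction; a list stands in for the subgroup shuffle, and this is a sketch of the pattern, not the Quadrants function:

```python
def butterfly_reduce_16(vals):
    """16-lane butterfly reduction sketch: after log2(16) = 4 rounds
    of pairwise XOR-offset exchanges, every lane holds the sum of
    all 16 lanes (no single broadcast step needed)."""
    n = 16
    vals = list(vals)
    offset = n // 2
    while offset >= 1:
        # each lane adds its partner's value at distance `offset`
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals
```

The effect-describing name suggested above would cover exactly this: every lane ends up with the full reduction result.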
…n-fuse-chol-tiles-api
🔴 Benchmark Regression Detected ➡️ Report
…n-fuse-chol-tiles-api

Made-with: Cursor

# Conflicts:
# pyproject.toml
…16x16 qd.types.Tile16x16 was removed in quadrants hp/tiles-5. Use the new public API: qd.simt.Tile16x16.zeros(dtype=...) for zero tiles.
Module-level alias `Tile16x16 = qd.simt.Tile16x16` caused the purity checker to flag `Tile16x16.SIZE` as a global variable access. Instead:
- Use `_TILE = qd.simt.Tile16x16.SIZE` (int constant) for SIZE
- Use `qd.simt.Tile16x16.zeros(dtype=...)` directly in kernels
- Use `eye_()` (now public) instead of `_eye_()`
The purity checker rejects any module-level variable access inside kernels, including plain int constants like _TILE = 16. Replace all _TILE references with literal 16 inside kernel functions.
…n-fuse-chol-tiles-api
Description
Note that we need to combine both incremental H-patching and register tiled Cholesky, because the gains for each on their own are much more modest:
For dex_hand, with field, on RTX 6000 Blackwell:
This is because fusing both into a single kernel avoids having to make a round-trip to global memory in between them.
Related Issue
Resolves Genesis-Embodied-AI/Genesis#
Motivation and Context
How Has This Been / Can This Be Tested?
Screenshots (if appropriate):
Checklist:
Submitting Code Changes section of the CONTRIBUTING document.