[MISC] Optimize GPU on low batch sizes for table busing scene. by hughperkins · Pull Request #2947 · Genesis-Embodied-AI/genesis-world

hughperkins · 2026-06-15T13:10:32Z

No description provided.

…oach A) Split func_noslip_batch's O(nefc) residual dot products across a 32-lane warp per env via subgroup.reduce_all_add_tiled, gated at compile time on enable_cooperative_constraint_kernels (GPU, small n_envs). The projected Gauss-Seidel order is preserved: all lanes recompute the scalar projection and write efc_force redundantly with the identical value, so each lane reads back its own writes in program order with no cross-lane fence. table_bus bs=1 GPU (QD_GRAPH=0 kernel profile): noslip sweep 90.5 -> 7.8 ms/step (~11.6x); total kernel 219 -> 128 ms/step. One-step output matches the serial sweep to ~3e-4 on qpos; 30-step controlled run tracks to ~1e-3.
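The idea above, splitting only the residual dot product across lanes while keeping the constraint order sequential, can be sketched on the CPU. This is a minimal emulation, not the Genesis kernel: `WARP`, `all_reduce_add`, and the simple non-negativity projection are assumptions made for illustration (the real no-slip sweep projects onto its own constraint set and uses `subgroup.reduce_all_add_tiled`).

```python
# CPU emulation of one cooperative projected Gauss-Seidel sweep.
# All names here are hypothetical; the projection max(0, .) is a placeholder.
WARP = 32

def all_reduce_add(partials):
    """Stand-in for a warp all-reduce: every lane receives the same total."""
    total = sum(partials)
    return [total] * len(partials)

def pgs_sweep_serial(A, b, f):
    f = list(f)
    n = len(f)
    for i in range(n):
        res = b[i] + sum(A[i][j] * f[j] for j in range(n))
        f[i] = max(0.0, f[i] - res / A[i][i])
    return f

def pgs_sweep_coop(A, b, f):
    f = list(f)
    n = len(f)
    for i in range(n):  # constraint order stays sequential (Gauss-Seidel)
        # Each lane sums the columns lane, lane + WARP, lane + 2*WARP, ...
        partials = [sum(A[i][j] * f[j] for j in range(lane, n, WARP))
                    for lane in range(WARP)]
        dot = all_reduce_add(partials)[0]  # identical on every lane
        # Every lane would compute and write this same value redundantly.
        f[i] = max(0.0, f[i] - (b[i] + dot) / A[i][i])
    return f

# Small diagonally dominant test problem.
n = 40
A = [[float(n) if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
     for i in range(n)]
b = [(-1) ** i * (i + 1) * 0.1 for i in range(n)]
f_serial = pgs_sweep_serial(A, b, [0.0] * n)
f_coop = pgs_sweep_coop(A, b, [0.0] * n)
```

The two sweeps agree only up to floating-point reordering of the dot product, which is consistent with the small nonzero differences reported above.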

…h C) Parallelize func_dual_finish_batch across a 32-lane warp per env: the O(n_dofs * nefc) qfrc = J^T f accumulation and the acc/force write-back are lane-strided over dofs, while the block-diagonal M^-1 solve runs redundantly on all lanes (SIMT lockstep -> ~1-lane wall cost, each lane reads back its own qacc writes). Gated, like approach A, on enable_cooperative_constraint_kernels. table_bus bs=1 GPU (QD_GRAPH=0 kernel profile): dual finish 15.4 -> 0.75 ms/step (~20x); total noslip now ~8.7 ms/step (was ~106). Total step kernel time 128 -> 113 ms/step on top of A (baseline 219). Correctness vs serial from identical state: 30-step controlled run tracks to ~1e-3 on positions.
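A lane-strided `qfrc = J^T f` accumulation can be sketched as follows. This is a CPU emulation under assumed names (`WARP`, `jt_f_serial`, `jt_f_lane_strided`); it only illustrates that striding over output dofs, unlike a reduction, leaves each dof's summation order untouched.

```python
# CPU emulation of a lane-strided J^T f accumulation (hypothetical names).
WARP = 32

def jt_f_serial(J, f, n_dofs):
    qfrc = [0.0] * n_dofs
    for d in range(n_dofs):
        for c in range(len(f)):
            qfrc[d] += J[c][d] * f[c]
    return qfrc

def jt_f_lane_strided(J, f, n_dofs):
    qfrc = [0.0] * n_dofs
    for lane in range(WARP):                  # lanes run concurrently on a GPU
        for d in range(lane, n_dofs, WARP):   # this lane owns dofs lane, lane+32, ...
            acc = 0.0
            for c in range(len(f)):           # same constraint order as the serial loop
                acc += J[c][d] * f[c]
            qfrc[d] = acc                     # no other lane writes this dof
    return qfrc

n_dofs, nefc = 70, 5
J = [[0.01 * (c + 1) * (d - 3) for d in range(n_dofs)] for c in range(nefc)]
f = [c + 0.5 for c in range(nefc)]
q_serial = jt_f_serial(J, f, n_dofs)
q_coop = jt_f_lane_strided(J, f, n_dofs)
```

Because each dof is accumulated by exactly one lane in the original order, this part of the computation reproduces the serial result exactly in the emulation.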

… bs=1) The CG gradient step (_func_update_gradient) ran one thread per env: a per-dof grad write plus the per-entity LDL M^-1 solve. At bs=1 that single thread was the largest solve-graph node (~32 ms/step, QD_GRAPH=0). Parallelize over a 32-lane warp per env: lane-strided grad, then distribute the block-diagonal mass solve across lanes by entity (each entity is an independent diagonal block, so lanes owning different entities never touch the same dofs; the LDL substitution itself stays sequential). Gated on enable_cooperative_constraint_kernels and solver_type==CG; Newton keeps the serial path. table_bus bs=1 GPU: update_gradient 32 -> 6.7 ms/step (~4.8x); total step kernel 113 -> 100 ms/step; GPU fps 9 -> 12. Bit-identical to the serial path (per-entity solves and per-dof writes are unchanged, just distributed).
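Distributing the block-diagonal mass solve by entity can be illustrated with a small CPU sketch. The block layout `(start, L, D)`, the function names, and the round-robin entity-to-lane assignment are assumptions for illustration; the point is that each block's LDL substitution stays sequential and unchanged, so the result matches the serial path exactly.

```python
# CPU emulation: per-entity LDL solves distributed across lanes (hypothetical names).
WARP = 32

def ldl_solve_block(L, D, x, start):
    """Solve (L D L^T) y = x[start:start+n] in place; L is unit lower triangular."""
    n = len(D)
    for i in range(n):                    # forward substitution: L z = x
        for j in range(i):
            x[start + i] -= L[i][j] * x[start + j]
    for i in range(n):                    # diagonal scaling: D w = z
        x[start + i] /= D[i]
    for i in reversed(range(n)):          # back substitution: L^T y = w
        for j in range(i + 1, n):
            x[start + i] -= L[j][i] * x[start + j]

def solve_serial(blocks, grad):
    x = list(grad)
    for start, L, D in blocks:
        ldl_solve_block(L, D, x, start)
    return x

def solve_by_entity(blocks, grad):
    x = list(grad)
    for lane in range(WARP):                       # lanes run concurrently on a GPU
        for e in range(lane, len(blocks), WARP):   # lane owns entities lane, lane+32, ...
            start, L, D = blocks[e]
            ldl_solve_block(L, D, x, start)        # touches only this entity's dofs
    return x

# 40 entities with 1, 2 or 3 dofs each.
blocks, start = [], 0
for e in range(40):
    size = 1 + e % 3
    L = [[0.1 * (i + j + 1) if j < i else 0.0 for j in range(size)] for i in range(size)]
    D = [2.0 + i for i in range(size)]
    blocks.append((start, L, D))
    start += size
grad = [0.5 * k - 3.0 for k in range(start)]
x_serial = solve_serial(blocks, grad)
x_coop = solve_by_entity(blocks, grad)
```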

…GPU bs=1) The two remaining serial 1-thread/env CG nodes.

- save_prev_grad (B): lane-strided copy over dofs; bit-identical. ~6.5 -> 2.6 ms/step (QD_GRAPH=0).
- search-direction (C): lane-strided grad-norm and the two CG-beta dot products via reduce_all_add_tiled, plus a lane-strided search update. Not bit-identical (the reductions reorder fp adds) but converges to the same CG fixed point.

Both gated on enable_cooperative_constraint_kernels (serial fallback kept). table_bus bs=1 GPU fps 12 -> 13; 30-step trajectory tracks the serial path within ~0.8 mm (links_pos).
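Why a lane-strided reduction is not bit-identical to a serial one can be seen in a tiny example. The functions below are illustrative only, and the input is deliberately extreme so the effect is visible; real gradients would show much smaller differences.

```python
# Floating-point addition is not associative, so reordering a sum can change it.
def serial_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def lane_strided_sum(values, n_lanes):
    partials = [0.0] * n_lanes
    for lane in range(n_lanes):
        for i in range(lane, len(values), n_lanes):
            partials[lane] += values[i]   # each lane sums its own stride
    total = 0.0
    for p in partials:                    # then the lane partials are combined
        total += p
    return total

values = [1e16, 1.0, -1e16, 1.0]
s = serial_sum(values)           # 1e16 + 1.0 rounds back to 1e16, so one 1.0 is lost
t = lane_strided_sum(values, 2)  # the large terms cancel first, both 1.0s survive
```

Here the serial order yields 1.0 and the two-lane order yields 2.0; both are valid roundings of the same mathematical sum.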

Widen the cooperative no-slip force-update sweep from a single warp (32 lanes) to a multi-warp block (NoslipCoop.BLOCK=128 lanes) per env. The sweep is memory-latency-bound reading dense efc_AR rows at bs=1; more lanes per env keeps more loads in flight, hiding the latency. The two reductions (per-constraint residual dot product and the iter-0 improvement sum) now warp-shuffle within each warp then combine across warps via a tiny shared-memory tree. Block width is an IntEnum (not a module-level int) so the cooperative kernels stay fastcache-pure: enum captures are exempt from the purity check, plain globals are not. table_bus bs=1 GPU: noslip_sweep 788 us -> 471 us/call (nsys, graphs on). Sweep cost saturates at >=64 lanes; 128 leaves headroom for larger constraint counts at negligible cost (memory-bound, not compute-bound).
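The two-level reduction described above (shuffle within each warp, then a small tree across warps) can be emulated on the CPU. This is a sketch with assumed names; `shared` stands in for shared memory, and the XOR butterfly stands in for warp shuffle instructions.

```python
# CPU emulation of a block reduction: per-warp butterfly, then a cross-warp tree.
WARP = 32
BLOCK = 128  # lanes per env, as in the text

def warp_all_reduce(vals):
    """Butterfly (XOR-shuffle) all-reduce over one 32-lane warp."""
    vals = list(vals)
    offset = WARP // 2
    while offset > 0:
        # Each lane adds the value held by its partner lane (lane XOR offset).
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP)]
        offset //= 2
    return vals  # every lane now holds the warp total

def block_reduce(lane_values):
    assert len(lane_values) == BLOCK
    n_warps = BLOCK // WARP
    shared = [0] * n_warps
    for w in range(n_warps):
        reduced = warp_all_reduce(lane_values[w * WARP:(w + 1) * WARP])
        shared[w] = reduced[0]  # lane 0 of each warp publishes its warp total
    stride = n_warps // 2
    while stride > 0:           # tree combine over the per-warp totals
        for w in range(stride):
            shared[w] += shared[w + stride]
        stride //= 2
    return shared[0]

warp_total = warp_all_reduce(list(range(WARP)))
block_total = block_reduce(list(range(BLOCK)))
```

Integers are used in the demo so the result is exact; with floats the tree order would again differ from a serial sum by rounding.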

Replace the NoslipCoop IntEnum with a baked int field on the static struct: RigidSimStaticConfig.noslip_coop_block_dim (default 128). The sweep kernels read it via qd.static(static_rigid_sim_config.noslip_coop_block_dim), the same pattern as cholesky_tile_size. A module-level int tripped the fastcache purity check; the IntEnum only sidestepped it via the enum-capture exemption. A static-config member is the right home: its value is baked into the kernel (fastcache-pure) and it is configurable per build alongside the other solver tunables. No behavior change - compiled kernel is identical (still 128 lanes/env).

The decomposed noslip build solved every entity's mass-matrix LDL block for every constraint row, but a contact row's Jacobian only touches the 1-2 bodies in the contact, so most blocks have an all-zero RHS and M^-1 @ 0 = 0 is already present in MinvJT. Gate each per-entity solve on the row actually touching that entity. Bit-identical (verified 30-step onestep_compare, worst diff 0.0). table_bus bs=1 RTX 5090, decomposed kernel_1 (M^-1 J^T solve): 105us -> 48us.
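The gating can be sketched as follows. The block layout, the `solve_block` callback, and the toy diagonal "mass" are hypothetical; the sketch only shows that skipping blocks whose right-hand side is all zero leaves the output unchanged while doing fewer solves.

```python
# CPU sketch: skip per-entity solves whose RHS slice of the Jacobian row is all zero.
def solve_row(blocks, jac_row, solve_block, gated):
    out = list(jac_row)
    solves = 0
    for start, size in blocks:
        touched = any(out[start + k] != 0.0 for k in range(size))
        if gated and not touched:
            continue  # M^-1 @ 0 = 0, and out already holds those zeros
        solve_block(out, start, size)
        solves += 1
    return out, solves

def toy_solve_block(x, start, size):
    # Placeholder for the LDL solve: a diagonal "mass" of 2, 3, 4, ...
    for k in range(size):
        x[start + k] /= 2.0 + k

# 10 entities with 3 dofs each; the contact row touches entities 2 and 7 only.
blocks = [(3 * e, 3) for e in range(10)]
row = [0.0] * 30
row[6:9] = [1.0, -2.0, 0.5]
row[21:24] = [-1.0, 2.0, -0.5]
full, n_full = solve_row(blocks, row, toy_solve_block, gated=False)
fast, n_fast = solve_row(blocks, row, toy_solve_block, gated=True)
```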

The coop dual-finish ran at block_dim=32, leaving its latency-bound J^T f accumulation (the dominant O(n_dofs * nefc) term) under-parallelized. Widen it to noslip_coop_block_dim (128) for the same latency-hiding win as the sweep (E11). The in-place block-diagonal mass solve relied on single-warp lockstep to be safe when run redundantly, so confine it to the first warp (tid < 32) and fence its result to the rest of the block. Bit-identical (30-step onestep_compare, worst diff 0.0). table_bus bs=1 RTX 5090, dual_finish (kernel_8): 74us -> 63us.

The cooperative clamp/prune/sort kernel is latency-bound: for n_con > 32 it insertion-sorts contacts on lane 0, reading contact_sort_key/idx from global DRAM on every comparison (~75% scoreboard stalls, E13). Stage the keys+indices into shared memory, sort there (~30-cycle access), then write back. Same insertion-sort algorithm -> bit-identical (30-step onestep_compare, worst 0.0). Falls back to the in-place global sort when n_con exceeds the smem cap (512). table_bus bs=1 RTX 5090: clamp_prune 277us -> 224us; fps 23.46 -> 23.87.
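The stage, sort, write-back pattern with a size-based fallback can be sketched on the CPU. `SMEM_CAP` mirrors the 512 cap mentioned above; the function names and the returned mode string are assumptions for illustration, and the local lists merely stand in for shared memory.

```python
# CPU sketch: sort contact keys in a staging buffer, falling back to in-place sort.
SMEM_CAP = 512

def insertion_sort(keys, idx, n):
    """Stable insertion sort of keys[:n], permuting idx[:n] alongside."""
    for i in range(1, n):
        k, v = keys[i], idx[i]
        j = i - 1
        while j >= 0 and keys[j] > k:
            keys[j + 1] = keys[j]
            idx[j + 1] = idx[j]
            j -= 1
        keys[j + 1] = k
        idx[j + 1] = v

def sort_contacts(global_keys, global_idx, n_con):
    if n_con > SMEM_CAP:
        insertion_sort(global_keys, global_idx, n_con)  # fallback: sort in place
        return "global"
    stage_keys = global_keys[:n_con]                    # stage into the fast buffer
    stage_idx = global_idx[:n_con]
    insertion_sort(stage_keys, stage_idx, n_con)        # same algorithm, same result
    global_keys[:n_con] = stage_keys                    # write back
    global_idx[:n_con] = stage_idx
    return "staged"

keys, idx = [5, 3, 9, 3, 1], [0, 1, 2, 3, 4]
mode_small = sort_contacts(keys, idx, 5)
big_keys, big_idx = list(range(600, 0, -1)), list(range(600))
mode_big = sort_contacts(big_keys, big_idx, 600)
```

Since both paths run the same comparison sequence, the staged path produces the same permutation as the in-place one, which matches the bit-identical claim.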

Replace the per-substep host `graph_counter.from_numpy(_n_iterations)` with a top-level for-loop inside `_kernel_solve_graph` that sets the counter on-device once per graph replay (enabled by the for-loop-mixed graph_do_while support in quadrants hp/qipc-integration). Eliminates ~10 synchronous host copies/step (nsys cuMemcpyDtoH 2269->159 calls). Bit-identical trajectory (qpos/vel sums match to 1e-8). Neutral on wall time: the drain it removed was overlapped, not on the critical path.

…% table_bus) At n_envs<=1 (PARA_LEVEL.PARTIAL) the two biggest GPU kernels were serialized onto a single thread: collision-Jacobian assembly (37.7% of step) and Jaref=J@qacc (19.6%), gated on `serialize = para_level < ALL`. These loops write disjoint per-(constraint,dof) outputs, so drop their threshold to PARTIAL via a `bs1_parallel_build` bitmask (env GS_PARA_BUILD, default 7=on; bit0 collision, bit1 Jaref, bit2 efc_force). CPU (NEVER) and multi-env (ALL) paths unchanged.

Validated: a 150-step table_bus bs=1 run at PARTIAL with this on is BIT-IDENTICAL to the fully-parallel GS_PARA_LEVEL=ALL trajectory (the serial bs=1 path was the FP-order outlier). bit1/bit2 are bit-identical to serial even one-step. Total GPU kernel time 33.2 -> 16.0 ms/step; runtime_fps 15 -> 23.

Also (E47, off by default, GS_NOSLIP_COMP=1): block-per-component noslip sweep using the block-scope reduce_all_add residual (one block per independent constraint-graph component, per-component convergence). Correct + stable, +7% in isolation; capped by redundant per-block label recompute. n_entities is now wired when the knob is on (was -1 unless requires_grad, the cause of earlier NaNs).

Named several previously-anonymous constraint loops via loop_config(name=) so the kernel profiler is readable.
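Decoding a bitmask like `GS_PARA_BUILD` could look like the sketch below. Only the environment variable name, its default of 7, and the bit meanings come from the text; the function name, the dictionary keys, and the parsing itself are assumptions, not the Genesis code.

```python
# Sketch: decode the GS_PARA_BUILD bitmask (bit0 collision, bit1 Jaref, bit2 efc_force).
import os

BIT_COLLISION = 1 << 0
BIT_JAREF = 1 << 1
BIT_EFC_FORCE = 1 << 2

def read_para_build(environ=os.environ):
    """Return which bs=1 build loops run in parallel; default 7 enables all three."""
    mask = int(environ.get("GS_PARA_BUILD", "7"))
    return {
        "collision": bool(mask & BIT_COLLISION),
        "jaref": bool(mask & BIT_JAREF),
        "efc_force": bool(mask & BIT_EFC_FORCE),
    }

default_flags = read_para_build({})
no_jaref = read_para_build({"GS_PARA_BUILD": "5"})  # 0b101: bit1 cleared
```

Under this reading, setting the variable to 0 would restore the fully serial build path, which is handy for A/B comparisons against the parallel one.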

…ile is the slow loop driver

Profiling (added block_size/grid_size to the kernel profiler dump) shows the in-kernel "serial" CG monolith runs at block_dim=32 - genuinely warp-cooperative, not serialized to 1 thread as the E55 docstring claimed. The strided phase loops need all 32 lanes and results are bit-identical, confirming cooperation. This validates that 32-threads/env is faster than the decomposed path at bs=1 (26.25 vs 24.33 fps), exactly as warp-cooperation wins at 4096 - same per-env work.

The real slow axis is the iteration driver, not thread count: the graph_do_while variant launches 4 kernels/iter (body + check_early_exit + 2 tiny serial counter kernels) adding ~6 ms/step, so it loses (23.4 fps) to mode 1's in-kernel for...break.

Also switch the single-warp monolith handoffs from block.sync (__syncthreads) to subgroup.sync (__syncwarp); add func_update_gradient_batch_coop_warp so the shared decomposed helper is untouched. Perf-neutral but the correct primitive.

3-way (GS_PARA_DYN=31, precise=400): decomposed 24.33 | mode1 26.25 | mode2 23.38. Mode 1 is the default coop monolith; mode 2 (==2) kept for reference.

… +3.6%) Finishes lever #1 from E57 - the forward-dynamics / contact-pipeline loops that still ran single-threaded at bs=1. All changes verified bit-identical (200-step table_bus qpos, GS_PARA_DYN=0 serial vs parallel -> max_abs_diff=0).

- bs1_parallel_dynamics bit5: func_integrate (vel_next per-dof + qpos per-link)
- bs1_parallel_dynamics bit6: func_torque_and_passive_force / func_update_force / func_bias_force / func_compute_qacc per-link/entity/dof loops (entity-tree passes parallel over independent entities). Default 31 -> 127.
- func_update_gradient: warp-coop path for CG (reuses func_update_gradient_batch_coop) instead of the 1-thread scalar grad + per-entity LDL solve in func_solve_init, gated on enable_cooperative_constraint_kernels.
- func_collision_clear: parallelize the non-hibernation per-contact clear over (contact, env); hibernation path (sequential compaction) left serial.

Graph-mode --precise 3000: 37.57 -> 36.21 ms/step (-1.36ms), 26.62 -> 27.62 fps. Does not touch the CG/linesearch core (still depth/latency-bound, E57); does not change the CPU-parity conclusion.
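The default change 31 -> 127 follows from adding bits 5 and 6 to the existing five-bit mask. A short check, with constant names that are assumptions for illustration:

```python
# Sketch: the bs1_parallel_dynamics default after adding bit5 and bit6.
BIT_INTEGRATE = 1 << 5       # bit5: func_integrate loops
BIT_FORCE_PASSES = 1 << 6    # bit6: force / bias / qacc loops
OLD_DEFAULT = 31             # bits 0..4 set
NEW_DEFAULT = OLD_DEFAULT | BIT_INTEGRATE | BIT_FORCE_PASSES

def enabled_bits(mask):
    """List the bit positions set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

old_bits = enabled_bits(OLD_DEFAULT)
new_bits = enabled_bits(NEW_DEFAULT)
```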

hughperkins and others added 24 commits June 13, 2026 11:55

perf(rigid): graph=True on kernel_step_1/step_2 (E28 incremental)

b4c8534

perf(rigid): env-tunable cooperative block size; _P0_BLOCK=64 (E30)

c42cc60

perf(rigid): fuse check_early_exit 3->1 kernel (E31, -3.6ms)

ca0c1bb

perf(rigid): fuse save_prev_grad into update_gradient (E32, -1.9ms)

337f50f

perf(rigid): fuse update_search_direction into gradient kernel (E33, -1.8ms)

5d27b5b

perf(rigid): fuse update_constraint_cost into fused CG kernel (E34, -1.1ms)

afc0343

feat(rigid): GS_NOSLIP_BLOCK env knob for noslip block-dim sweeps (E42)

cd03ed2

perf(rigid): parallelize bs=1 forward-dynamics CRBA loops (E52, GS_PARA_DYN, bit-identical, +2.9%)

34a40af

E53: parallelize remaining bs=1 serial kernels (geom_aabbs, qacc, solve_init), bit-identical +3.8%

f28a5fe

E54: add dbg_iter_accum to measure CG iteration count (diagnosis)

4876f50

E55: CG warp-per-env monolith - fuse whole iteration loop into one kernel (+7.9%)

39a545c

hughperkins wants to merge 24 commits into Genesis-Embodied-AI:main from hughperkins:hp/cg-monolith
