[MISC] Optimize GPU on low batch sizes for table busing scene. #2947

Draft
hughperkins wants to merge 24 commits into Genesis-Embodied-AI:main from hughperkins:hp/cg-monolith

Conversation

@hughperkins

Collaborator

No description provided.

hughperkins and others added 24 commits June 13, 2026 11:55
…oach A)

Split func_noslip_batch's O(nefc) residual dot products across a 32-lane
warp per env via subgroup.reduce_all_add_tiled, gated at compile time on
enable_cooperative_constraint_kernels (GPU, small n_envs). The projected
Gauss-Seidel order is preserved: all lanes recompute the scalar projection
and write efc_force redundantly with the identical value, so each lane reads
back its own writes in program order with no cross-lane fence.

table_bus bs=1 GPU (QD_GRAPH=0 kernel profile): noslip sweep 90.5 -> 7.8
ms/step (~11.6x); total kernel 219 -> 128 ms/step. One-step output matches
the serial sweep to ~3e-4 on qpos; 30-step controlled run tracks to ~1e-3.
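
As a rough illustration of the scheme above, the following plain-Python sketch emulates a 32-lane warp computing each constraint's residual dot product cooperatively while the Gauss-Seidel order stays sequential. The names `lane_strided_dot`, `pgs_sweep` and `make_system`, the non-negativity projection, and the test data are assumptions made for illustration, not code from this PR; `subgroup.reduce_all_add_tiled` is modeled simply as a sum over per-lane partials.

```python
WARP = 32  # lanes per env, as in the commit message


def lane_strided_dot(row, vec, lanes=WARP):
    """Emulate a warp all-reduce: each lane sums a strided slice, then the
    partials are added so that every lane would see the same total."""
    partials = [0.0] * lanes
    for lane in range(lanes):
        for j in range(lane, len(row), lanes):
            partials[lane] += row[j] * vec[j]
    total = 0.0
    for p in partials:
        total += p
    return total


def serial_dot(row, vec):
    """Reference one-thread dot product."""
    return sum(r * v for r, v in zip(row, vec))


def pgs_sweep(A, b, force, dot):
    """One projected Gauss-Seidel sweep. Constraints are visited in order,
    so each update sees the forces written by earlier constraints."""
    for i in range(len(b)):
        residual = dot(A[i], force) + b[i]
        # Hypothetical projection: clamp the updated force to be non-negative.
        force[i] = max(0.0, force[i] - residual / A[i][i])
    return force


def make_system(n):
    """Small symmetric test system (illustrative data only)."""
    A = [[8.0 if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
         for i in range(n)]
    b = [((-1.0) ** i) * (i + 1) / n for i in range(n)]
    return A, b
```

Only the O(nefc) dot product is split across lanes; the scalar projection is cheap enough to recompute on every lane, which is what lets the real kernel avoid a cross-lane fence.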
…h C)

Parallelize func_dual_finish_batch across a 32-lane warp per env: the
O(n_dofs * nefc) qfrc = J^T f accumulation and the acc/force write-back are
lane-strided over dofs, while the block-diagonal M^-1 solve runs redundantly
on all lanes (SIMT lockstep -> ~1-lane wall cost, each lane reads back its own
qacc writes). Gated, like approach A, on enable_cooperative_constraint_kernels.

table_bus bs=1 GPU (QD_GRAPH=0 kernel profile): dual finish 15.4 -> 0.75
ms/step (~20x); total noslip now ~8.7 ms/step (was ~106). Total step kernel
time 128 -> 113 ms/step on top of A (baseline 219). Correctness vs serial from
identical state: 30-step controlled run tracks to ~1e-3 on positions.
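
The lane-strided qfrc = J^T f accumulation can be sketched in plain Python as below. This is a minimal emulation under assumptions: the function names and data layout (`J[c][d]` as a list of rows) are hypothetical, and the lanes are simulated by an ordinary loop. Each dof has exactly one owner lane, so no two lanes write the same output. The sketch makes no claim about the summation order used by the real kernel.

```python
WARP = 32  # lanes per env


def jt_f_serial(J, f, n_dofs):
    """Reference: one thread accumulates qfrc[d] = sum_c J[c][d] * f[c]."""
    qfrc = [0.0] * n_dofs
    for d in range(n_dofs):
        for c in range(len(f)):
            qfrc[d] += J[c][d] * f[c]
    return qfrc


def jt_f_lane_strided(J, f, n_dofs, lanes=WARP):
    """Emulated warp: lane k handles dofs k, k + lanes, k + 2 * lanes, ..."""
    qfrc = [0.0] * n_dofs
    for lane in range(lanes):                 # lanes run concurrently on a GPU
        for d in range(lane, n_dofs, lanes):  # each dof has one owner lane
            acc = 0.0
            for c in range(len(f)):           # same constraint order as serial
                acc += J[c][d] * f[c]
            qfrc[d] = acc
    return qfrc
```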
… bs=1)

The CG gradient step (_func_update_gradient) ran one thread per env: a per-dof
grad write plus the per-entity LDL M^-1 solve. At bs=1 that single thread was
the largest solve-graph node (~32 ms/step, QD_GRAPH=0). Parallelize over a
32-lane warp per env: lane-strided grad, then distribute the block-diagonal
mass solve across lanes by entity (each entity is an independent diagonal
block, so lanes owning different entities never touch the same dofs; the LDL
substitution itself stays sequential). Gated on
enable_cooperative_constraint_kernels and solver_type==CG; Newton keeps the
serial path.

table_bus bs=1 GPU: update_gradient 32 -> 6.7 ms/step (~4.8x); total step
kernel 113 -> 100 ms/step; GPU fps 9 -> 12. Bit-identical to the serial path
(per-entity solves and per-dof writes are unchanged, just distributed).
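
The distribution of the block-diagonal solve by entity can be illustrated with the sketch below. It assumes each entity contributes one small symmetric positive-definite block, stored as a hypothetical `(start_dof, matrix)` pair, and uses a textbook LDL^T solve rather than the PR's actual helper. Because every block is solved by exactly one lane with unchanged arithmetic, the distributed result matches the serial one exactly.

```python
WARP = 32  # lanes per env


def ldl_solve(M, rhs):
    """Solve M x = rhs for a small symmetric positive-definite block via LDL^T."""
    n = len(rhs)
    L = [[0.0] * n for _ in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = M[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        L[j][j] = 1.0
        for i in range(j + 1, n):
            L[i][j] = (M[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    y = list(rhs)
    for i in range(n):              # forward substitution: L y = rhs
        y[i] -= sum(L[i][k] * y[k] for k in range(i))
    x = [y[i] / D[i] for i in range(n)]
    for i in reversed(range(n)):    # back substitution: L^T x = D^-1 y
        x[i] -= sum(L[k][i] * x[k] for k in range(i + 1, n))
    return x


def solve_serial(blocks, grad):
    """One thread solves every entity block in turn."""
    out = list(grad)
    for start, M in blocks:
        n = len(M)
        out[start:start + n] = ldl_solve(M, grad[start:start + n])
    return out


def solve_by_entity(blocks, grad, lanes=WARP):
    """Emulated warp: lane k owns entities k, k + lanes, ...; blocks are
    independent, so lanes never touch the same dofs."""
    out = list(grad)
    for lane in range(lanes):                      # concurrent on a GPU
        for e in range(lane, len(blocks), lanes):
            start, M = blocks[e]
            n = len(M)
            out[start:start + n] = ldl_solve(M, grad[start:start + n])
    return out
```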
…GPU bs=1)

Parallelize the two remaining serial 1-thread/env CG nodes:
- save_prev_grad (B): lane-strided copy over dofs; bit-identical. ~6.5 -> 2.6
  ms/step (QD_GRAPH=0).
- search-direction (C): lane-strided grad-norm and the two CG-beta dot products
  via reduce_all_add_tiled, plus a lane-strided search update. Not bit-identical
  (the reductions reorder fp adds) but converges to the same CG fixed point.

Both gated on enable_cooperative_constraint_kernels (serial fallback kept).
table_bus bs=1 GPU fps 12 -> 13; 30-step trajectory tracks the serial path
within ~0.8 mm (links_pos).
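
The reason a lane-strided reduction is not bit-identical to the serial sum is that floating-point addition is not associative. The sketch below uses deliberately extreme values to make the effect visible; the function names and the input are illustrative assumptions, and real solver data differs by far smaller amounts.

```python
def sequential_sum(values):
    """Left-to-right sum, as a single thread would compute it."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_strided_sum(values, lanes):
    """Each lane sums a strided slice; the per-lane partials are then added."""
    partials = [0.0] * lanes
    for lane in range(lanes):
        for i in range(lane, len(values), lanes):
            partials[lane] += values[i]
    return sequential_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
serial_result = sequential_sum(values)         # 1.0: the first 1.0 is absorbed by 1e16
strided_result = lane_strided_sum(values, 2)   # 2.0: the large terms cancel first
```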
Widen the cooperative no-slip force-update sweep from a single warp (32
lanes) to a multi-warp block (NoslipCoop.BLOCK=128 lanes) per env. The
sweep is memory-latency-bound reading dense efc_AR rows at bs=1; more
lanes per env keeps more loads in flight, hiding the latency. The two
reductions (per-constraint residual dot product and the iter-0
improvement sum) now warp-shuffle within each warp then combine across
warps via a tiny shared-memory tree.

Block width is an IntEnum (not a module-level int) so the cooperative
kernels stay fastcache-pure: enum captures are exempt from the purity
check, plain globals are not.

table_bus bs=1 GPU: noslip_sweep 788 us -> 471 us/call (nsys, graphs on).
Sweep cost saturates at >=64 lanes; 128 leaves headroom for larger
constraint counts at negligible cost (memory-bound, not compute-bound).
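
The two-stage reduction described above (shuffle within each warp, then a small tree across warps) can be emulated in plain Python as follows. This is a sketch under assumptions: `warp_reduce` and `block_reduce_dot` are hypothetical names, the `shared` list stands in for shared memory, and the number of warps per block is assumed to be a power of two.

```python
WARP = 32
BLOCK = 128  # lanes per env, matching the width chosen in the commit message


def warp_reduce(lane_values):
    """Emulate a shuffle-down tree over one warp; lane 0 ends with the sum."""
    vals = list(lane_values)
    offset = len(vals) // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]


def block_reduce_dot(row, vec, block=BLOCK, warp=WARP):
    """Dot product reduced across a multi-warp block."""
    # Each lane accumulates a block-strided slice of the dot product.
    lane_partials = [0.0] * block
    for tid in range(block):
        for j in range(tid, len(row), block):
            lane_partials[tid] += row[j] * vec[j]
    # Stage 1: reduce inside each warp (register shuffles on a GPU).
    shared = [warp_reduce(lane_partials[w:w + warp])
              for w in range(0, block, warp)]
    # Stage 2: combine the per-warp sums through a small tree.
    stride = len(shared) // 2
    while stride > 0:
        for w in range(stride):
            shared[w] += shared[w + stride]
        stride //= 2
    return shared[0]
```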
Replace the NoslipCoop IntEnum with a baked int field on the static
struct: RigidSimStaticConfig.noslip_coop_block_dim (default 128). The
sweep kernels read it via qd.static(static_rigid_sim_config.
noslip_coop_block_dim), the same pattern as cholesky_tile_size.

A module-level int tripped the fastcache purity check; the IntEnum only
sidestepped it via the enum-capture exemption. A static-config member is
the right home: its value is baked into the kernel (fastcache-pure) and
it is configurable per build alongside the other solver tunables. No
behavior change - compiled kernel is identical (still 128 lanes/env).
The decomposed noslip build solved every entity's mass-matrix LDL block for
every constraint row, but a contact row's Jacobian only touches the 1-2 bodies
in the contact, so most blocks have an all-zero RHS and M^-1 @ 0 = 0 is already
present in MinvJT. Gate each per-entity solve on the row actually touching that
entity. Bit-identical (verified 30-step onestep_compare, worst diff 0.0).

table_bus bs=1 RTX 5090, decomposed kernel_1 (M^-1 J^T solve): 105us -> 48us.
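
A minimal sketch of the gating idea, assuming a zero-initialised output and a hypothetical `solve_block(entity, rhs)` callback standing in for the per-entity LDL solve (here a diagonal stand-in, purely for illustration). Skipping a block whose slice of the Jacobian row is all zero leaves the output unchanged, because the skipped result would have been zero anyway.

```python
def build_minv_jt(J, entities, solve_block, gated=True):
    """Return (MinvJT rows, number of block solves performed).

    `entities` is a list of (start_dof, size) ranges, one per entity.
    """
    n_dofs = len(J[0])
    minv_jt = [[0.0] * n_dofs for _ in J]   # zero-initialised output
    n_solves = 0
    for r, row in enumerate(J):
        for e, (start, size) in enumerate(entities):
            rhs = row[start:start + size]
            if gated and all(v == 0.0 for v in rhs):
                continue   # M^-1 @ 0 = 0 is already in the zeroed output
            minv_jt[r][start:start + size] = solve_block(e, rhs)
            n_solves += 1
    return minv_jt, n_solves


def diag_solve(e, rhs):
    """Stand-in block solve: assumes a diagonal mass block (illustration only)."""
    return [v / (1.0 + e) for v in rhs]


# Illustrative data: 4 entities x 3 dofs; each of 5 rows touches 2 entities.
entities = [(3 * e, 3) for e in range(4)]
J = [[0.0] * 12 for _ in range(5)]
for r in range(5):
    for e in (r % 4, (r + 1) % 4):
        for k in range(3):
            J[r][3 * e + k] = 1.0 + k

gated_out, gated_solves = build_minv_jt(J, entities, diag_solve, gated=True)
full_out, full_solves = build_minv_jt(J, entities, diag_solve, gated=False)
```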
The coop dual-finish ran at block_dim=32, leaving its latency-bound J^T f
accumulation (the dominant O(n_dofs * nefc) term) under-parallelized. Widen it
to noslip_coop_block_dim (128) for the same latency-hiding win as the sweep
(E11). The in-place block-diagonal mass solve relied on single-warp lockstep to
be safe when run redundantly, so confine it to the first warp (tid < 32) and
fence its result to the rest of the block. Bit-identical (30-step
onestep_compare, worst diff 0.0).

table_bus bs=1 RTX 5090, dual_finish (kernel_8): 74us -> 63us.
The cooperative clamp/prune/sort kernel is latency-bound: for n_con > 32 it
insertion-sorts contacts on lane 0, reading contact_sort_key/idx from global
DRAM on every comparison (~75% scoreboard stalls, E13). Stage the keys+indices
into shared memory, sort there (~30-cycle access), then write back. Same
insertion-sort algorithm -> bit-identical (30-step onestep_compare, worst 0.0).
Falls back to the in-place global sort when n_con exceeds the smem cap (512).

table_bus bs=1 RTX 5090: clamp_prune 277us -> 224us; fps 23.46 -> 23.87.
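
The stage, sort, write-back pattern with a capacity fallback can be sketched as below. The Python lists only stand in for global and shared memory, and the function names and the returned path label are hypothetical. The sort routine is the same on both paths, which is why the staged variant yields identical output.

```python
SMEM_CAP = 512  # staging capacity quoted in the commit message


def insertion_sort_pairs(keys, idx, n):
    """Stable in-place insertion sort of (keys, idx) by key over the first n slots."""
    for i in range(1, n):
        k, v = keys[i], idx[i]
        j = i - 1
        while j >= 0 and keys[j] > k:
            keys[j + 1], idx[j + 1] = keys[j], idx[j]
            j -= 1
        keys[j + 1], idx[j + 1] = k, v


def sort_contacts(global_keys, global_idx, n_con, smem_cap=SMEM_CAP):
    """Sort through a staging buffer when it fits, else sort in place."""
    if n_con > smem_cap:
        insertion_sort_pairs(global_keys, global_idx, n_con)
        return "global"
    smem_keys = global_keys[:n_con]   # stage in (stands in for shared memory)
    smem_idx = global_idx[:n_con]
    insertion_sort_pairs(smem_keys, smem_idx, n_con)
    global_keys[:n_con] = smem_keys   # write back
    global_idx[:n_con] = smem_idx
    return "staged"
```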
Replace the per-substep host `graph_counter.from_numpy(_n_iterations)` with a
top-level for-loop inside `_kernel_solve_graph` that sets the counter on-device
once per graph replay (enabled by the for-loop-mixed graph_do_while support in
quadrants hp/qipc-integration). Eliminates ~10 synchronous host copies/step
(nsys cuMemcpyDtoH 2269->159 calls). Bit-identical trajectory (qpos/vel sums
match to 1e-8). Neutral on wall time: the drain it removed was overlapped, not
on the critical path.
…% table_bus)

At n_envs<=1 (PARA_LEVEL.PARTIAL) the two biggest GPU kernels were serialized
onto a single thread: collision-Jacobian assembly (37.7% of step) and
Jaref=J@qacc (19.6%), gated on `serialize = para_level < ALL`. These loops
write disjoint per-(constraint,dof) outputs, so drop their threshold to PARTIAL
via a `bs1_parallel_build` bitmask (env GS_PARA_BUILD, default 7=on; bit0
collision, bit1 Jaref, bit2 efc_force). CPU (NEVER) and multi-env (ALL) paths
unchanged.

Validated: a 150-step table_bus bs=1 run at PARTIAL with this on is bit-identical
to the fully-parallel GS_PARA_LEVEL=ALL trajectory (the serial bs=1 path was the
FP-order outlier). bit1/bit2 are bit-identical to serial even over a single step.
Total GPU kernel time 33.2 -> 16.0 ms/step; runtime_fps 15 -> 23.
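
For readers unfamiliar with the bit layout, the sketch below decodes a mask using the bit assignments and default stated above (bit0 collision, bit1 Jaref, bit2 efc_force, default 7). The helper names and the way the environment variable is parsed are assumptions; the PR's actual parsing code is not shown here.

```python
import os

# Bit layout as described in the commit message.
PARA_BUILD_BITS = {"collision": 0, "jaref": 1, "efc_force": 2}
DEFAULT_MASK = 7  # all three loops parallelized


def decode_para_build(mask):
    """Map each loop name to whether its bit is set in the mask."""
    return {name: bool((mask >> bit) & 1) for name, bit in PARA_BUILD_BITS.items()}


def read_para_build(environ=os.environ):
    """Hypothetical reader: falls back to the default when GS_PARA_BUILD is unset."""
    return decode_para_build(int(environ.get("GS_PARA_BUILD", DEFAULT_MASK)))
```

For example, a mask of 5 would keep the collision and efc_force loops parallel while leaving Jaref on the serial path.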

Also (E47, off by default, GS_NOSLIP_COMP=1): block-per-component noslip sweep
using the block-scope reduce_all_add residual (one block per independent
constraint-graph component, per-component convergence). Correct + stable, +7% in
isolation; capped by redundant per-block label recompute. n_entities is now wired
when the knob is on (was -1 unless requires_grad, the cause of earlier NaNs).

Named several previously-anonymous constraint loops via loop_config(name=) so
the kernel profiler is readable.
…ile is the slow loop driver

Profiling (added block_size/grid_size to the kernel profiler dump) shows the
in-kernel "serial" CG monolith runs at block_dim=32 - genuinely warp-cooperative,
not serialized to 1 thread as the E55 docstring claimed. The strided phase loops
need all 32 lanes and results are bit-identical, confirming cooperation.

This validates that 32-threads/env is faster than the decomposed path at bs=1
(26.25 vs 24.33 fps), exactly as warp-cooperation wins at 4096 - same per-env work.
The real slow axis is the iteration driver, not thread count: the graph_do_while
variant launches 4 kernels/iter (body + check_early_exit + 2 tiny serial counter
kernels) adding ~6 ms/step, so it loses (23.4 fps) to mode 1's in-kernel for...break.

Also switch the single-warp monolith handoffs from block.sync (__syncthreads) to
subgroup.sync (__syncwarp); add func_update_gradient_batch_coop_warp so the shared
decomposed helper is untouched. Perf-neutral but the correct primitive.

3-way (GS_PARA_DYN=31, precise=400): decomposed 24.33 | mode1 26.25 | mode2 23.38.
Mode 1 is the default coop monolith; mode 2 (==2) kept for reference.
… +3.6%)

Finishes lever #1 from E57 - the forward-dynamics / contact-pipeline loops
that still ran single-threaded at bs=1. All changes verified bit-identical
(200-step table_bus qpos, GS_PARA_DYN=0 serial vs parallel -> max_abs_diff=0).

- bs1_parallel_dynamics bit5: func_integrate (vel_next per-dof + qpos per-link)
- bs1_parallel_dynamics bit6: func_torque_and_passive_force / func_update_force /
  func_bias_force / func_compute_qacc per-link/entity/dof loops (entity-tree
  passes parallel over independent entities). Default 31 -> 127.
- func_update_gradient: warp-coop path for CG (reuses func_update_gradient_batch_coop)
  instead of the 1-thread scalar grad + per-entity LDL solve in func_solve_init,
  gated on enable_cooperative_constraint_kernels.
- func_collision_clear: parallelize the non-hibernation per-contact clear over
  (contact, env); hibernation path (sequential compaction) left serial.

Graph-mode --precise 3000: 37.57 -> 36.21 ms/step (-1.36ms), 26.62 -> 27.62 fps.
Does not touch the CG/linesearch core (still depth/latency-bound, E57); does not
change the CPU-parity conclusion.
1 participant