[PERF] improve MPM solver speed. #2720
Kashu7100 wants to merge 3 commits into Genesis-Embodied-AI:main from
Conversation
- Drop the unused [f+1] grid frame; the grid is only indexed at [f] in p2g/g2p.
- Fuse compute_F_tmp + svd on the forward pass to keep F_tmp in registers; keep them separate on the autodiff path so the backward composition is unchanged.
- Rate-limit _is_state_valid to every 10 substeps so the NaN check no longer forces a GPU->CPU sync every substep.
- Add benches/mpm_rigid_bench.py: a Franka-squeezes-elastic-cube harness that reports per-step/substep wall-clock, peak GPU memory, and a mean-position fingerprint so variants can be compared against a baseline run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
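The rate-limited validity check can be sketched in plain Python. This is a hypothetical illustration, not the solver's actual code: `run_substeps`, `advance`, and `CHECK_EVERY` are made-up names, and in the real solver the reduction inside `_is_state_valid` runs on-device, so every call forces a GPU->CPU sync.

```python
import numpy as np

CHECK_EVERY = 10  # matches the every-10-substeps cadence from the commit message

def _is_state_valid(positions: np.ndarray) -> bool:
    # In the real solver, reading this reduction back is what forces the
    # device->host sync; here it is just a NumPy all-finite check.
    return bool(np.isfinite(positions).all())

def run_substeps(positions: np.ndarray, n_substeps: int, advance) -> None:
    for i in range(n_substeps):
        advance(positions)
        # Pay the sync cost only once every CHECK_EVERY substeps. A NaN is
        # still caught, just up to CHECK_EVERY - 1 substeps later than before.
        if i % CHECK_EVERY == 0 and not _is_state_valid(positions):
            raise FloatingPointError(f"invalid state at substep {i}")
```

The trade-off is detection latency: a NaN introduced just after a check propagates for up to nine more substeps before the next check sees it, which is acceptable because the run is aborted either way.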
MPM materials now expose `needs_svd`. For Elastic(neohooken) and non-viscous Liquid, neither the F-update nor the stress uses U/V/S, so the SVD kernel is pure waste. When no registered material needs SVD, the solver dispatches to a new `compute_F_tmp_only` kernel and p2g reads J from `F_tmp.determinant()` via a qd.static branch (det(F_tmp) == det(S) for qd.svd's proper-rotation U/V).

Also split grid reset: the non-differentiable forward path uses `reset_grid`, which skips the grad-buffer writes that `reset_grid_and_grad` still performs on the autodiff path. This halves reset DRAM traffic for forward-only runs.

Bench impact (Franka gripper + Elastic(corotation) cube, 4 envs):

- baseline (3 runs): 9.77 / 9.71 / 9.24 ms/step, mean 9.58
- plan 1 (3 runs): 9.73 / 9.59 / 9.68 ms/step, mean 9.67

Within run-to-run variance; corotation still requires SVD on this scene, so the SVD-skip path isn't exercised. The fingerprint is identical across runs (0.641477, 0.001549, 0.064053). Verified that neohooken / non-viscous liquid scenes now take the SVD-free path end-to-end.

Harness: `benches/mpm_rigid_bench.py` now prepends the repo root to sys.path so the editable checkout wins over any site-packages `genesis` namespace stub left behind by a prior wheel install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
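The det(F_tmp) == det(S) identity that the qd.static branch relies on can be checked with a small NumPy sketch. `proper_svd` below is an illustrative stand-in for qd.svd's sign convention (it is not the solver's code): U and V are forced to be proper rotations (det = +1), with any reflection sign folded into the singular values, so det(F) = det(U) * prod(s) * det(Vt) = prod(s).

```python
import numpy as np

def proper_svd(F: np.ndarray):
    """SVD with proper-rotation U/V; reflection signs are folded into s."""
    U, s, Vt = np.linalg.svd(F)
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1.0
        s[-1] *= -1.0
    if np.linalg.det(Vt) < 0:
        Vt[-1, :] *= -1.0
        s[-1] *= -1.0
    return U, s, Vt

rng = np.random.default_rng(0)
F = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
U, s, Vt = proper_svd(F)
# J can be read off F directly instead of running the SVD at all:
assert np.isclose(np.linalg.det(F), np.prod(s))
```

Flipping the last column of U (or row of Vt) together with the sign of s[-1] leaves U @ diag(s) @ Vt unchanged, so the factorization still reconstructs F; this is why skipping the SVD and taking `F_tmp.determinant()` gives the same J.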
Dense reset_grid zeros all ~30K cells per batch every substep even when particles only touch a few percent of them. Replace it on the forward path with a per-slot dirty list:

- init_grid_fields now allocates grid_dirty_list (substeps_local, max_dirty_cells, B) and grid_dirty_count (substeps_local, B), sized from n_particles * 27 (each particle scatters to a 3^3 neighbourhood, so n_particles * 27 is the exact upper bound on distinct cells touched per batch).
- p2g's grid-scatter now captures the prior mass via atomic_add; the unique thread that sees prev_mass == 0 appends the flat cell index into grid_dirty_list[f, :, :]. Guarded by qd.static on `_sparse_reset_enabled = not requires_grad` so the autodiff composition of p2g.grad is untouched.
- sparse_reset_grid(f) zeros only the cells in grid_dirty_list[f, :, count], then clears the counter in the same kernel. Fields are zero-initialized, so the first pass through each slot is a correct no-op (the grid is already zero).
- substep_pre_coupling's forward path now calls sparse_reset_grid instead of reset_grid; the diff path still calls reset_grid_and_grad unchanged.

Bench impact (Franka gripper + Elastic(corotation) cube, 4 envs):

- plan 1 (3 runs): 9.73 / 9.59 / 9.68 ms/step, mean 9.67
- plan 2 (3 runs): 9.11 / 9.00 / 9.06 ms/step, mean 9.05

~6.5% speedup, fingerprint identical (0.641477, 0.001549, 0.064053). For larger grids or sparser particle occupancy the relative win scales with grid_size / (n_particles * 27).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| "fast_simplification>=0.1.12", | ||
| # Surface reconstruction library for particle data from SPH simulations | ||
| "pysplashsurf==0.14.*", | ||
| "torch>=2.11.0", |
@@ -0,0 +1,251 @@
"""
remove this file
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c117c2c89
| "fast_simplification>=0.1.12", | ||
| # Surface reconstruction library for particle data from SPH simulations | ||
| "pysplashsurf==0.14.*", | ||
| "torch>=2.11.0", |
Remove hard Torch dependency from base install
Adding torch>=2.11.0 to core project.dependencies makes uv sync resolve Torch from the default index before users can choose the platform-specific wheel, which conflicts with the documented setup flow (README.md installs Torch separately via CUDA/CPU/MPS indexes) and can fail or pull an incompatible backend build on GPU/Metal environments. This change can block environment setup for supported platforms, so Torch should remain an explicit post-sync install (or be moved to a backend-specific extra) rather than a mandatory base dependency.
🔴 Benchmark Regression Detected ➡️ Report
Description
Related Issue
Resolves Genesis-Embodied-AI/Genesis#
Motivation and Context
How Has This Been / Can This Be Tested?
Screenshots (if appropriate):
Checklist:
Submitting Code Changes section of the CONTRIBUTING document.