-
Notifications
You must be signed in to change notification settings - Fork 0
Backend Validation
📄 Canonical source:
docs/design/backend-validation.md— edit there, not in the wiki.
Validation of the omp/, mpi/, and cuda/ backends against the
state-of-the-art parallel-AMR literature (p4est, Dendro-5.01, t8code), from a
verified deep-research pass (2026-06-03). Citation keys in
REFERENCES.md. Companion: gpu-balance-notes.md.
The overall design is validated as SOTA. A linear Morton octree (sorted
leaf array + per-leaf level), SFC-based partitioning, 2:1 balance, and a
bisect/lower_bound reference oracle are exactly the design family of p4est
[BurstWG2011], Dendro-5.01, and t8code [Holke2018]. The gaps below are
performance/scalability upgrades, not correctness defects — our backends
produce correct, verified meshes today (cross-backend parity + geometric 2:1
checks pass).
Honesty notes from the research: (1) the strong claim that multi-round ripple balance is "strictly inferior / wrong" was refuted — ripple is genuinely correct and "inherently iterative" locally; the one-pass scheme is a recommended optimization, not a fix. (2) The verified corpus was distributed/MPI-centric; OpenMP-specific best-practice claims (scan, NUMA, false sharing) were not independently evidenced, so omp/ items below lean on general HPC practice, not cited AMR sources.
| Aspect | Our code | SOTA | Action |
|---|---|---|---|
| Data structure | SoA linear octree, Uninit<T> to skip zero-init |
matches p4est/t8code | none |
| Refinement scan | hand-rolled two-pass exclusive_scan (thread partials + serial fixup), omp/core.hpp
|
work-efficient; serial fixup is O(#threads), negligible | keep; optionally benchmark vs std::exclusive_scan(std::execution::par_unseq)
|
| 2:1 balance | ripple, re-scans all N leaves every pass via std::lower_bound (tree.hpp:358) |
same O(total)-per-pass cost the CUDA backend already removed | Port the CUDA active-front balance (re-check only the front; byte-identical, ~1.8× on thin features). Highest-value omp/ upgrade. |
| Balance flag write |
#pragma omp atomic write flag=1 (tree.hpp:418) |
monotonic, all writes identical → benign | none (the earlier "atomic-OR" concern is moot since every write is 1) |
| Work scheduling |
schedule(dynamic,1024) on the balance scan |
reasonable for graded load imbalance | keep; watch false sharing on wksp_flags (byte array, dense writes) |
| Aspect | Our code | SOTA | Action |
|---|---|---|---|
| Partition | equal-element SFC via MPI_Exscan partition map (repartition(), tree.hpp:272) |
p4est default is equal-element, computed without communication | validated — matches SOTA default |
| Partition (imbalance) | equal-count only | weighted partition via per-octant cost callback; Hilbert SFC (Dendro default) lowers ghost surface vs Morton | For crack work, add an optional weight/cost model (helps only if deep octants cost more per element). Consider Hilbert ordering to cut halo volume. |
| 2:1 balance | multi-round ripple: re-fetch_ghosts() + update_partition_map() every pass (tree.hpp:209-211), MPI_Allreduce for convergence |
p4est is non-iterative: local balance → one insulation-layer exchange → postbalance ([IBG2012]; insulation layer = 3^d envelope) | Recommended upgrade to single-round insulation-layer balance. Not a correctness fix (ripple is correct), but removes per-pass communication. |
| Ghost construction | per-leaf neighbor enumeration, loops over all ranks per needed node (fetch_ghosts(), tree.hpp:403) → O(needed·P) |
recursive top-down Search, runtime ∝ partition-boundary leaves, no 2:1 precondition ([IBWG2015]; Holke/Knapp/Burstedde 2021) |
Replace the O(needed·P) rank scan with a boundary-driven construction; bound work by boundary size, not P. |
| Ghost exchange | dense two-phase MPI_Alltoall + MPI_Alltoallv (exchange_data(), tree.hpp:437) |
p4est uses a dedicated ghost object + sparse symmetric non-blocking point-to-point (R_pq≠∅ ⟺ R_qp≠∅), not global Alltoallv | Move to sparse pairwise exchange or MPI_Neighbor_alltoallv on a distributed-graph comm. Biggest scalability win at high rank counts. |
| Empty partitions | all ranks reach collectives (coarsen-deadlock fixed, tree.hpp:151-156) |
required invariant | validated — correct |
The pure-Python Quadtree/Octree formerly served as the parity oracle: its
balance predicate ("o,r unbalanced only if o ∈ I(r)") is exactly a lower_bound
insulation check, validated as a trustworthy ground truth. The Python backend
was removed (2026-06-05) once the in-tree C++ balance_ref() oracle (the
brute-force reference in omp/tests.cpp / cuda/tree.cuh) subsumed that role;
cross-backend parity now compares omp ↔ mpi ↔ cuda directly. The canonical AMR
invariants (sorted+unique, volume=1, 2:1, Morton round-trip, idempotent balance)
remain asserted in the C++ Catch2 suites.
- omp/ active-front balance — port the verified CUDA optimization; byte-identical, biggest single-node win. (Low risk, high value.)
- mpi/ ghost exchange → sparse/neighborhood collectives — replace dense Alltoallv + O(needed·P) rank scan. (Scalability-critical at many ranks.)
- mpi/ single-round insulation-layer balance ([IBG2012]) — removes per-pass halo communication. (Recommended, not correctness.)
- mpi/ ghost construction ∝ boundary ([IBWG2015]) — boundary-driven, not per-rank. Pairs with #2.
- Weighted/Hilbert partitioning for crack imbalance — only if a per-octant cost model shows deep octants are costlier; Hilbert lowers halo surface.
- Oracle invariant coverage — confirm the five invariants are all asserted; cheap insurance for cross-backend parity.
- OpenMP-specific items (#1 aside) are general-practice, not cited-AMR-evidenced.
- "Ripple is obsolete" was refuted; treat balance upgrades as performance, not correctness.
- The Dendro 1.3×10¹² octant / 4 s benchmark is a historical self-reported figure (decommissioned Titan), scalability demonstration only.
- V&V specifics (exact tests p4est/Dendro/t8code ship) were not resolved — our cross-backend differential + geometric checks remain the practical ground truth.