Observation
conv_type_gps_set_01_seed42 (job 8640865, Cardinal H100) completed with:
| Signal | Value |
| --- | --- |
| peak_vram_mb | 3,414 MB |
| Approx. free VRAM at probe | ~85 GB (H100 94 GB, ~90% free pre-probe) |
| Utilization | ~4% of target |
| val_loss | 0.0097 (vs 0.0441 for GAT / 0.0441 for GATv2; GPS converges fine) |
| Wall time | 7:45 (vs 15 min for GAT/GATv2) |
| Epochs | 133 (early stopped) |
GAT and GATv2 in the same run peaked at ~47.5 GB (~56% utilization), which is healthy. The GPS path is the outlier.
Root cause
graphids/core/data/budget.py::_gps_budget fits a quadratic, peak = αV² + βV + γ, through three probe points at V ∈ (500, 1500, 4000), then solves it for the V_max at which the fitted curve reaches free × safety. The extrapolation from V = 4k out to the solved V appears to overestimate α (the quadratic growth term), producing a node budget far below what actually fits. As a result, the actual peak never approaches the target.
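The fit-and-solve step described above can be sketched in a few lines. This is not the code in _gps_budget: the function names, the synthetic probe values, the 20 MB perturbation, and the 0.9 safety factor are all assumptions made for illustration. It only shows how a small measurement error at one probe point can inflate α and shrink the solved V_max.

```python
import math

def fit_quadratic(points):
    """Exact quadratic through three (V, peak_mb) points via divided differences."""
    (x0, y0), (x1, y1), (x2, y2) = points
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    alpha = (d12 - d01) / (x2 - x0)
    beta = d01 - alpha * (x0 + x1)
    gamma = y0 - alpha * x0 ** 2 - beta * x0
    return alpha, beta, gamma

def solve_v_max(alpha, beta, gamma, target_mb):
    """Largest V such that alpha*V^2 + beta*V + gamma <= target_mb."""
    if alpha <= 0:
        # Degenerate fit: fall back to the linear solution.
        if beta <= 0:
            raise ValueError("fitted curve does not grow with V")
        return int((target_mb - gamma) / beta)
    disc = beta ** 2 - 4 * alpha * (gamma - target_mb)
    if disc < 0:
        raise ValueError("target lies below the fitted curve's minimum")
    return int((-beta + math.sqrt(disc)) / (2 * alpha))

# Synthetic memory curve (hypothetical coefficients, not measured values).
def true_peak_mb(v):
    return 1e-5 * v ** 2 + 0.5 * v + 300.0

PROBE_V = (500, 1500, 4000)          # probe sizes named in the issue
clean = [(v, true_peak_mb(v)) for v in PROBE_V]
noisy = list(clean)
noisy[1] = (1500, clean[1][1] - 20.0)  # assumed 20 MB error at the middle probe

target_mb = 85_000 * 0.9             # free x safety; 0.9 is an assumed safety factor
v_clean = solve_v_max(*fit_quadratic(clean), target_mb)
v_noisy = solve_v_max(*fit_quadratic(noisy), target_mb)
print(f"alpha clean={fit_quadratic(clean)[0]:.2e}, noisy={fit_quadratic(noisy)[0]:.2e}")
print(f"V_max clean={v_clean}, noisy={v_noisy}")
```

Because the probes span only 500 to 4000 nodes while the solved V is typically far larger, any error in α is amplified by V², which is consistent with the underutilization reported here.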
Impact
GPS ablations still produce valid checkpoints and reasonable val_loss, but:
- Wall clock is longer than necessary (more, smaller batches)
- GPU memory is ~96% wasted
- Extending to larger datasets (set_02+) will be disproportionately slow on GPS
Detection
This session added two MLflow tags, graphids.budget_underutilized = true and graphids.budget_utilization_pct, set in MLflowTrainingCallback._check_budget_utilization. Future GPS runs will self-flag. Query:
mlflow.search_runs(filter_string="tags.`graphids.budget_underutilized` = 'true'")
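For reference, the utilization check behind these tags might look roughly like the sketch below. The real logic lives in MLflowTrainingCallback._check_budget_utilization and is not reproduced here: the helper name, the 25% threshold, the string formatting, and the choice to emit "false" rather than omit the tag are assumptions. The inputs are the figures from the Observation table, with the ~85 GB free figure standing in for the target since the safety factor is not stated.

```python
def budget_utilization_tags(peak_vram_mb, target_mb, threshold_pct=25.0):
    """Build tag values from peak VRAM and the budget target.

    threshold_pct is a hypothetical cutoff; the issue does not state the real one.
    """
    if target_mb <= 0:
        raise ValueError("target_mb must be positive")
    pct = 100.0 * peak_vram_mb / target_mb
    return {
        "graphids.budget_utilization_pct": f"{pct:.1f}",
        "graphids.budget_underutilized": "true" if pct < threshold_pct else "false",
    }

# Figures from the Observation table (target approximated by ~85 GB free).
gps_tags = budget_utilization_tags(peak_vram_mb=3_414, target_mb=85_000)
gat_tags = budget_utilization_tags(peak_vram_mb=47_500, target_mb=85_000)
print(gps_tags)
print(gat_tags)
```

A dict of this shape could be passed to mlflow.set_tags inside an active run, which is what makes the search_runs filter above match.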
Suggested fixes (any one would likely suffice)
- Add a 4th, larger probe point (e.g. 8000 nodes) to anchor the quadratic at a higher V, reducing extrapolation noise. It needs to fit on the smallest supported cluster (V100 16 GB); a dense V × V attention score matrix at V = 8k is ~256 MB per head in fp32 for the forward pass (~1 GB for heads=4 if all heads are materialized), likely OK but worth a benchmark.
- Cap budget at a multiple of max probed V, e.g. min(solved_V, 4 × max_probe_V). Prevents wild extrapolation; conservative but bounded.
- Investigate α-vs-β balance — if γ (intercept) absorbs noise poorly at small V, the quadratic fit may overweight α. Consider fitting in log-space or regularizing the α term.
- Fall back to a single-axis node budget for GPS with a hardcoded coefficient calibrated empirically on H100 / V100, scaled by free. Less principled but matches observed behavior.
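The cap suggested in the list above is simple enough to sketch directly. The function name and the max_multiple parameter are hypothetical; only the min(solved_V, 4 × max_probe_V) rule comes from the suggestion itself.

```python
def capped_node_budget(solved_v, probe_sizes, max_multiple=4):
    """Clamp the solved node budget to max_multiple times the largest probed V."""
    if not probe_sizes:
        raise ValueError("probe_sizes must not be empty")
    if solved_v <= 0:
        raise ValueError("solved_v must be positive")
    return min(solved_v, max_multiple * max(probe_sizes))

PROBE_V = (500, 1500, 4000)  # probe sizes named in the issue

# A solved V far beyond the probed range is clamped to 4 x 4000.
print(capped_node_budget(60_000, PROBE_V))
# A solved V inside the bound passes through unchanged.
print(capped_node_budget(9_000, PROBE_V))
```

Note that a min-style cap only bounds extrapolation upward. It cannot raise a solved V that is already too small, so on its own it may not address the underutilization seen here and would likely need to be paired with one of the other fixes.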
Related
- Introduced by this session's commit (see conv_type validation — all 3 ablations completed).
- Linked to critical-constraints.md's dual-budget rule: edge_budget is derived from empirical edges-per-node, which is fine; the problem is purely the node-budget extrapolation.