GPS quadratic probe over-predicts memory, produces ~4% VRAM utilization #39

## Observation

`conv_type_gps_set_01_seed42` (job 8640865, Cardinal H100) completed with:

| Signal | Value |
|---|---|
| `peak_vram_mb` | 3,414 MB |
| Approx. free VRAM at probe | ~85 GB (H100 94 GB, ~90% free pre-probe) |
| Utilization | **~4% of target** |
| val_loss | 0.0097 (vs 0.0441 for GAT / 0.0441 for GATv2 — GPS converges fine) |
| Wall time | 7:45 (vs 15 min for GAT/GATv2) |
| Epochs | 133 (early stopped) |

GAT and GATv2 on the same run hit ~47.5 GB peak (~56% util) — healthy. GPS path is the outlier.

## Root cause

`graphids/core/data/budget.py::_gps_budget` does a 3-point quadratic fit `peak = αV² + βV + γ` at `V ∈ (500, 1500, 4000)`, then solves the quadratic for the `V_max` at which the fitted curve hits `free × safety`. Extrapolating from the largest probe (V = 4000) to the solved `V_max` appears to overestimate α (the quadratic growth term), so the solver returns a node budget far below what actually fits, and the actual peak never approaches the target.
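The fit-and-solve step described above can be sketched as follows. This is a minimal illustration, not the code in `_gps_budget`: the function names, the probe readings, and the 76,500 MB target are all made up for the example. It shows how a 40 MB under-read at a single probe point changes α and therefore the solved `V_max`, since three points determine the quadratic exactly and leave no room to average out noise.

```python
import numpy as np

def fit_quadratic(probe_nodes, probe_peak_mb):
    """Fit peak = alpha*V^2 + beta*V + gamma to the probe points."""
    alpha, beta, gamma = np.polyfit(probe_nodes, probe_peak_mb, deg=2)
    return float(alpha), float(beta), float(gamma)

def solve_v_max(alpha, beta, gamma, target_mb):
    """Positive root of alpha*V^2 + beta*V + (gamma - target_mb) = 0."""
    disc = beta**2 - 4.0 * alpha * (gamma - target_mb)
    if alpha <= 0 or disc < 0:
        raise ValueError("fit is not an upward quadratic that reaches the target")
    return int((-beta + np.sqrt(disc)) / (2.0 * alpha))

# Synthetic probe readings in MB (illustrative only, not measured on any GPU).
nodes = np.array([500.0, 1500.0, 4000.0])
clean_peak = 2e-5 * nodes**2 + 0.5 * nodes + 300.0
noisy_peak = clean_peak + np.array([0.0, -40.0, 0.0])  # 40 MB under-read at V=1500

target_mb = 76_500.0  # hypothetical value standing in for free * safety
v_clean = solve_v_max(*fit_quadratic(nodes, clean_peak), target_mb)
v_noisy = solve_v_max(*fit_quadratic(nodes, noisy_peak), target_mb)
print(v_clean, v_noisy)
```

In this toy setup the under-read inflates α and shrinks the solved budget. Whether the real probe fails the same way would need the actual probe readings from job 8640865.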

## Impact

GPS ablations still produce valid checkpoints and reasonable val_loss, but:
- Wall clock is longer than necessary (more, smaller batches)
- GPU memory is ~96% wasted
- Extending to larger datasets (set_02+) will be disproportionately slow on GPS

## Detection

This session added two MLflow tags, `graphids.budget_underutilized = true` and `graphids.budget_utilization_pct`, set in `MLflowTrainingCallback._check_budget_utilization`. Future GPS runs will self-flag. Query:

~~~
import mlflow
mlflow.search_runs(filter_string="tags.`graphids.budget_underutilized` = 'true'")
~~~
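For reference, the two tags could be derived along these lines. This is a sketch under assumptions, not the body of `_check_budget_utilization`: the function name, the 25% threshold, and the use of free VRAM as the target (ignoring the safety factor) are all hypothetical.

```python
def budget_utilization_tags(peak_vram_mb, target_mb, threshold_pct=25.0):
    """Build the two tag values for a finished run.

    threshold_pct is a hypothetical cutoff, not taken from the callback.
    """
    pct = 100.0 * peak_vram_mb / target_mb
    return {
        "graphids.budget_utilization_pct": f"{pct:.1f}",
        "graphids.budget_underutilized": "true" if pct < threshold_pct else "false",
    }

# GPS run from the Observation table: 3,414 MB peak against roughly 85 GB free.
gps_tags = budget_utilization_tags(3414, 85_000)
# GAT/GATv2: roughly 47.5 GB peak against the same target.
gat_tags = budget_utilization_tags(47_500, 85_000)
print(gps_tags)
print(gat_tags)
```

With the numbers reported in this issue, the GPS run is flagged and the GAT/GATv2 runs are not.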

## Suggested fixes (any one would likely suffice)

1. **Add a 4th larger probe point** (e.g. 8000 nodes) to anchor the quadratic at a higher V, reducing extrapolation noise. Needs to fit on the smallest supported cluster (V100 16 GB); a dense fp32 attention score matrix at V=8k is ~256 MB per head forward (~1 GB at heads=4), likely OK but worth a benchmark.
2. **Cap `budget` at a multiple of max probed V** — e.g. `min(solved_V, 4 × max_probe_V)`. Prevents wild extrapolation; conservative but bounded.
3. **Investigate α-vs-β balance** — if γ (intercept) absorbs noise poorly at small V, the quadratic fit may overweight α. Consider fitting in log-space or regularizing the α term.
4. **Fall back to single-axis node budget for GPS** with a hardcoded coefficient calibrated empirically on H100 / V100, scaled by `free`. Less principled but matches observed behavior.
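Fix 1 can be illustrated with a small experiment. The sketch below uses a made-up ground-truth curve and a single 40 MB under-read at V=1500, then compares the error in α for the 3-point fit against a 4-point least-squares fit that includes V=8000. The helper names and all numbers are hypothetical. The attention estimate assumes a dense, fully materialized fp32 score tensor, which may not match the GPS attention path used in this repo.

```python
import numpy as np

def attention_scores_mb(num_nodes, heads, bytes_per_elem=4):
    """Memory of one dense (heads, V, V) score tensor, in MB (1 MB = 1e6 bytes)."""
    return heads * num_nodes**2 * bytes_per_elem / 1e6

def fit_alpha(probe_nodes, probe_peak_mb):
    """Quadratic coefficient: exact interpolation for 3 points, least squares for more."""
    return float(np.polyfit(probe_nodes, probe_peak_mb, deg=2)[0])

def synthetic_peak(v):
    """Made-up ground-truth curve in MB, used only to generate example data."""
    return 2e-5 * v**2 + 0.5 * v + 300.0

nodes3 = np.array([500.0, 1500.0, 4000.0])
nodes4 = np.array([500.0, 1500.0, 4000.0, 8000.0])
noise3 = np.array([0.0, -40.0, 0.0])       # same under-read in both setups
noise4 = np.array([0.0, -40.0, 0.0, 0.0])

err3 = abs(fit_alpha(nodes3, synthetic_peak(nodes3) + noise3) - 2e-5)
err4 = abs(fit_alpha(nodes4, synthetic_peak(nodes4) + noise4) - 2e-5)
print(f"alpha error, 3 probes: {err3:.2e}; 4 probes: {err4:.2e}")
print(f"dense fp32 scores at V=8000: {attention_scores_mb(8000, 1):.0f} MB per head")
```

On this synthetic data the extra high-V point pulls α much closer to the true value, because the fit is no longer forced through every noisy reading. Real probe noise may be distributed differently, so a benchmark on the target GPUs is still needed.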

## Related

- Introduced by this session's commit (see conv_type validation — all 3 ablations completed).
- Linked to `critical-constraints.md`'s dual-budget rule: edge_budget is derived from empirical edges-per-node, which is fine; the problem is purely the node-budget extrapolation.
