GPS quadratic probe over-predicts memory, produces ~4% VRAM utilization #39

Description

@RobertFrenken

Observation

conv_type_gps_set_01_seed42 (job 8640865, Cardinal H100) completed with:

| Signal | Value |
|---|---|
| peak_vram_mb | 3,414 MB |
| Approx. free VRAM at probe | ~85 GB (H100 94 GB, ~90% free pre-probe) |
| Utilization | ~4% of target |
| val_loss | 0.0097 (vs 0.0441 for GAT and 0.0441 for GATv2; GPS converges fine) |
| Wall time | 7:45 (vs 15 min for GAT/GATv2) |
| Epochs | 133 (early stopped) |

GAT and GATv2 on the same run hit ~47.5 GB peak (~56% util) — healthy. GPS path is the outlier.

Root cause

graphids/core/data/budget.py::_gps_budget performs a 3-point quadratic fit, peak = αV² + βV + γ, at V ∈ {500, 1500, 4000}, then solves the quadratic for the V_max at which the fitted curve reaches free × safety. Extrapolating from the largest probe (V = 4000) out to the solved V_max appears to overestimate α (the quadratic growth term), producing a node budget far below what actually fits. The actual peak never approaches the target.
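To make the suspected failure mode concrete, here is a minimal sketch of a 3-point quadratic fit followed by a solve for V_max. It is not the code in `_gps_budget`: the probe measurements are synthetic (a purely linear memory curve with a 20 MB dip on the middle probe), and the `safety` value of 0.85 and the helper name `solve_v_max` are assumptions made for illustration only. The point is that a small measurement error at one probe can create a positive α that dominates once the curve is extrapolated far beyond V = 4000.

```python
import numpy as np

def solve_v_max(alpha, beta, gamma, target_mb):
    """Positive root of alpha*V^2 + beta*V + gamma = target_mb (requires alpha > 0)."""
    if alpha <= 0:
        raise ValueError("quadratic solve assumes alpha > 0")
    disc = beta**2 - 4.0 * alpha * (gamma - target_mb)
    return (-beta + np.sqrt(disc)) / (2.0 * alpha)

# Probe sizes quoted in the issue.
probe_v = np.array([500.0, 1500.0, 4000.0])

# SYNTHETIC measurements, not real GPS data: the "true" curve is linear,
# and the middle probe reads 20 MB low.
true_peak_mb = 0.5 * probe_v + 300.0
measured_mb = true_peak_mb + np.array([0.0, -20.0, 0.0])

# Three points and degree 2 give an exact interpolating quadratic.
alpha, beta, gamma = np.polyfit(probe_v, measured_mb, deg=2)

free_mb = 85_000.0   # ~85 GB free, as reported in the issue
safety = 0.85        # hypothetical safety factor
target_mb = free_mb * safety

v_fit = solve_v_max(alpha, beta, gamma, target_mb)
v_true = (target_mb - 300.0) / 0.5   # budget implied by the noise-free linear curve

print(f"alpha = {alpha:.2e}")
print(f"fitted V_max = {v_fit:,.0f}, noise-free V_max = {v_true:,.0f}")
```

In this toy setup the fitted budget comes out at roughly half of the noise-free one. Whether the real probes behave this way would need to be checked against the logged probe values.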

Impact

GPS ablations still produce valid checkpoints and reasonable val_loss, but:

  • Wall clock is longer than necessary (more, smaller batches)
  • GPU memory is ~96% wasted
  • Extending to larger datasets (set_02+) will be disproportionately slow on GPS

Detection

This session added two MLflow tags, graphids.budget_underutilized = true and graphids.budget_utilization_pct, set in MLflowTrainingCallback._check_budget_utilization. Future GPS runs will self-flag. Query:

mlflow.search_runs(filter_string="tags.`graphids.budget_underutilized` = 'true'")
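For readers who have not seen the callback, the following is a rough sketch of how the two tag values could be derived from the numbers reported above. It is not the body of `_check_budget_utilization`: the function name `budget_utilization_tags` and the 25% threshold are hypothetical, and only the tag keys and the string value 'true' come from this issue. MLflow stores tag values as strings, so the sketch returns strings.

```python
def budget_utilization_tags(peak_vram_mb, target_mb, threshold_pct=25.0):
    """Return MLflow-style string tags describing VRAM budget utilization.

    threshold_pct is a hypothetical cutoff; the real callback may use another.
    """
    pct = 100.0 * peak_vram_mb / target_mb
    return {
        "graphids.budget_utilization_pct": f"{pct:.1f}",
        "graphids.budget_underutilized": "true" if pct < threshold_pct else "false",
    }

# Figures from the Observation section, with ~85 GB taken as 85,000 MB.
gps_tags = budget_utilization_tags(peak_vram_mb=3_414, target_mb=85_000)
gat_tags = budget_utilization_tags(peak_vram_mb=47_500, target_mb=85_000)

print(gps_tags)
print(gat_tags)
```

With these inputs the GPS run is flagged and the GAT run is not, which matches the ~4% and ~56% figures quoted earlier.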

Suggested fixes (any one would likely suffice)

  1. Add a 4th, larger probe point (e.g. 8000 nodes) to anchor the quadratic at a higher V, reducing extrapolation noise. The probe itself needs to fit on the smallest supported cluster (V100 16 GB); attention at V=8k with heads=4 is ~256 MB fwd, likely OK but worth a benchmark.
  2. Cap budget at a multiple of max probed V — e.g. min(solved_V, 4 × max_probe_V). Prevents wild extrapolation; conservative but bounded.
  3. Investigate α-vs-β balance — if γ (intercept) absorbs noise poorly at small V, the quadratic fit may overweight α. Consider fitting in log-space or regularizing the α term.
  4. Fall back to single-axis node budget for GPS with a hardcoded coefficient calibrated empirically on H100 / V100, scaled by free. Less principled but matches observed behavior.
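As a rough illustration of fix 1, the sketch below compares an exact 3-point fit with a least-squares fit over four probes (adding V = 8000). Everything about the data is an assumption: the memory curve is synthetic and linear, the middle probe carries a 20 MB error, the added probe is taken as noise-free, and `safety = 0.85` is hypothetical. The helper name `fit_and_solve` is likewise invented for this example. It shows only the mechanism (an extra, larger probe turns interpolation into a least-squares fit that damps a spurious α), not how GPS memory actually scales.

```python
import numpy as np

def fit_and_solve(v, peak_mb, target_mb):
    """Least-squares quadratic fit, then the positive root where the fit hits target_mb."""
    alpha, beta, gamma = np.polyfit(v, peak_mb, deg=2)
    if alpha <= 0:
        raise ValueError("quadratic solve assumes alpha > 0")
    disc = beta**2 - 4.0 * alpha * (gamma - target_mb)
    return alpha, (-beta + np.sqrt(disc)) / (2.0 * alpha)

def synthetic_peak_mb(v):
    # SYNTHETIC linear memory model, not measured GPS behaviour.
    return 0.5 * v + 300.0

v3 = np.array([500.0, 1500.0, 4000.0])   # current probe sizes
v4 = np.append(v3, 8000.0)               # proposed extra probe

noise3 = np.array([0.0, -20.0, 0.0])     # 20 MB error on the middle probe
peak3 = synthetic_peak_mb(v3) + noise3
peak4 = synthetic_peak_mb(v4) + np.append(noise3, 0.0)

target_mb = 85_000.0 * 0.85              # free x safety; safety is hypothetical

alpha3, v_max3 = fit_and_solve(v3, peak3, target_mb)
alpha4, v_max4 = fit_and_solve(v4, peak4, target_mb)

print(f"3 probes: alpha = {alpha3:.2e}, V_max = {v_max3:,.0f}")
print(f"4 probes: alpha = {alpha4:.2e}, V_max = {v_max4:,.0f}")
```

In this toy case the fitted α drops by more than an order of magnitude and the solved budget roughly doubles. A real 8000-node probe would carry its own noise, so the gain on hardware may be smaller.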

Related

  • Introduced by this session's commit (see conv_type validation — all 3 ablations completed).
  • Linked to critical-constraints.md's dual-budget rule: edge_budget is derived from empirical edges-per-node, which is fine; the problem is purely the node-budget extrapolation.

Labels: models (Model code, fusion, KD), performance (GPU utilization, batching, workers)