Methodology: blended CO₂/token metric hides workload and utilization effects #2

@cboettig

Context

Following the work in #1 (now closed), the dashboard correctly measures GPU power from DCGM and divides by total token throughput (prompt + generation) from vLLM. The commercial baseline uses a physics-constrained envelope with separate prefill/decode J/token estimates blended by workload preset.

A close examination of real Prometheus data and the comparison logic reveals several areas where the methodology could be more transparent or more precise.

Findings

1. NRP measurement is sound but inherently blended

DCGM reports a single power reading per GPU. We cannot decompose it into "watts spent on prefill" vs "watts spent on decode." The formula:

CO₂/token = avg_power × grid_intensity / total_tokens_per_sec

gives the real amortized carbon cost per token, which is honest carbon accounting. But it is a token-weighted average of prefill and decode costs, so its value depends on the workload mix (prompt-heavy vs generation-heavy).
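To make the units explicit, here is a minimal sketch of the formula above. It assumes power in watts, throughput in tokens per second, and grid intensity in g CO₂/kWh (hence the division by 3.6e6 J/kWh). The function names, the 250 g/kWh grid intensity, and the per-phase J/token values in the blending example are illustrative assumptions, not values from the dashboard.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def joules_per_token(avg_power_w: float, tokens_per_sec: float) -> float:
    """Amortized energy per token: W / (tok/s) = J/token."""
    return avg_power_w / tokens_per_sec


def co2_g_per_token(avg_power_w: float, tokens_per_sec: float,
                    grid_g_per_kwh: float) -> float:
    """Grams of CO2 per token, given grid intensity in g CO2/kWh."""
    energy_j = joules_per_token(avg_power_w, tokens_per_sec)
    return energy_j * grid_g_per_kwh / JOULES_PER_KWH


def blended_joules_per_token(prompt_fraction: float, prefill_j: float,
                             decode_j: float) -> float:
    """Token-weighted average of per-phase energy costs."""
    return prompt_fraction * prefill_j + (1.0 - prompt_fraction) * decode_j


# Power and throughput taken from the GLM-4.7-FP8 row in this issue;
# the grid intensity is a hypothetical placeholder.
print(joules_per_token(1320, 3158))
print(co2_g_per_token(1320, 3158, grid_g_per_kwh=250.0))

# Hypothetical per-phase costs, to show how the blend moves with the mix.
for frac in (0.83, 0.99):
    print(frac, blended_joules_per_token(frac, prefill_j=1.0, decode_j=3.0))
```

The last loop only illustrates the sensitivity: with fixed per-phase costs, the blended figure shifts as the prompt fraction changes, even though nothing about the hardware has changed.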

Real NRP data shows the current workload is heavily prompt-dominated (83–99% prompt tokens across active models), which is consistent with the agentic preset's 95.2%. The blending is not currently distorting the comparison, but could if workload patterns shift.

2. Idle/underutilization inflates NRP's per-token cost

The NRP metric includes idle GPU power in the numerator. Example from current data:

| Model | Power (W) | Total tok/s | J/token |
| --- | --- | --- | --- |
| GLM-4.7-FP8 | 1320 | 3158 | 0.42 |
| Qwen3.5-397B | 1235 | 2392 | 0.52 |
| GPT-OSS-120B | 597 | 2937 | 0.20 |
| Kimi-K2.5 | 928 | 158 | 5.86 |
| MiniMax-M2.5 | 421 | 246 | 1.71 |

Kimi-K2.5 at 5.86 J/token exceeds the commercial mid estimate (2.93 J/token), not because the model is inefficient but because the GPU draws 928 W while serving only 158 tok/s. The ≥5 tok/s threshold filters out the worst idle cases but does not address underutilization.

The commercial J/token values from the literature assume near-peak utilization. This asymmetry is noted in the methodology but not quantified. It makes NRP look worse than it would at full utilization, although the effect is partially offset because the NRP figure does not include PUE.
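The comparison in this finding can be sketched as below: recompute J/token from the power and throughput figures listed above, skip anything under the ≥5 tok/s threshold, and flag models that exceed the commercial mid estimate. The function and variable names are hypothetical. Because the listed inputs are rounded, a recomputed value can differ from the listed J/token in the last digit.

```python
COMMERCIAL_MID_J_PER_TOKEN = 2.93  # commercial mid estimate quoted in this issue
MIN_TOK_PER_SEC = 5.0              # the dashboard's existing idle threshold

# (model, power in W, total tok/s), as listed in this issue
ROWS = [
    ("GLM-4.7-FP8", 1320.0, 3158.0),
    ("Qwen3.5-397B", 1235.0, 2392.0),
    ("GPT-OSS-120B", 597.0, 2937.0),
    ("Kimi-K2.5", 928.0, 158.0),
    ("MiniMax-M2.5", 421.0, 246.0),
]


def joules_per_token(power_w: float, tok_per_sec: float) -> float:
    """Amortized energy per token, idle power included."""
    return power_w / tok_per_sec


def flag_rows(rows, baseline=COMMERCIAL_MID_J_PER_TOKEN,
              min_tps=MIN_TOK_PER_SEC):
    """Return (model, J/token) for active models above the baseline."""
    flagged = []
    for model, power_w, tps in rows:
        if tps < min_tps:
            continue  # already excluded by the idle threshold
        jpt = joules_per_token(power_w, tps)
        if jpt > baseline:
            flagged.append((model, round(jpt, 2)))
    return flagged


for model, jpt in flag_rows(ROWS):
    print(f"{model}: {jpt} J/token exceeds {COMMERCIAL_MID_J_PER_TOKEN}")
```

A flag here says only that the amortized cost is above the baseline; it cannot tell an inefficient model from an underutilized GPU without a separate utilization signal.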

3. Commercial estimates are for frontier-scale models, NRP runs smaller ones

The prefill/decode energy bounds (1–10 J/token for prefill, 0.5–3 J/token for decode) are described as estimates for frontier-scale models (1T+ MoE) on H100 infrastructure. NRP runs models such as Qwen3.5-397B and GLM-4.7-FP8, which are large but not 1T+. The comparison therefore partly conflates infrastructure efficiency with model-size differences.

4. Workload mix isn't validated against the selected preset

The commercial estimate is computed for a modeled workload (e.g., agentic: 40k in / 2k out). The NRP measurement reflects the actual workload. These happen to be similar right now, but the dashboard doesn't show whether the observed prompt:generation ratio matches the selected preset.
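A check of this kind could be sketched as follows. The agentic preset's token counts (40k in / 2k out) come from this issue; the function names and the 5-percentage-point tolerance are assumptions chosen for illustration, not dashboard settings.

```python
def prompt_fraction(prompt_tokens: float, generation_tokens: float) -> float:
    """Share of all tokens that are prompt (prefill) tokens."""
    total = prompt_tokens + generation_tokens
    if total <= 0:
        raise ValueError("no tokens observed")
    return prompt_tokens / total


# Agentic preset from this issue: 40k tokens in, 2k tokens out.
AGENTIC_PRESET_FRACTION = prompt_fraction(40_000, 2_000)


def mix_matches_preset(observed: float, preset: float,
                       tolerance: float = 0.05) -> bool:
    """True if the observed prompt fraction is within `tolerance` of the preset.

    The default tolerance is a hypothetical value.
    """
    return abs(observed - preset) <= tolerance


print(f"preset prompt fraction: {AGENTIC_PRESET_FRACTION:.3f}")
# The ends of the observed 83-99% range reported in this issue.
for observed in (0.83, 0.99):
    print(observed, mix_matches_preset(observed, AGENTIC_PRESET_FRACTION))
```

Under this tolerance the two ends of the observed range give different answers, which is the kind of per-model signal the dashboard currently does not surface.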

Possible improvements (for discussion)

  1. Show utilization context. Annotate models where per-token cost is likely dominated by idle overhead (e.g., when tok/s is low relative to the GPU's theoretical throughput). This helps users distinguish "inefficient model" from "underutilized GPU."

  2. Show observed workload mix. Display the actual prompt:generation ratio alongside the preset assumption, so users can judge whether the comparison is well-matched.

  3. Consider model-scale-appropriate baselines. The current commercial envelope is calibrated for 1T+ frontier models. A narrower, lower range for "similar-scale open models on commercial cloud" would make the comparison more precise (though harder to source).

  4. Clarify what the metric means. The methodology should be more explicit that NRP's CO₂/token is the real amortized cost including standby power, while the commercial estimate represents marginal cost at efficient utilization. These are both valid but different quantities.

  5. Market-based vs location-based grid intensity. Commercial cloud providers claim low carbon via RECs/PPAs (market-based accounting). NRP uses location-based (physical grid mix). The methodology should note which accounting frame applies to each side.
