Methodology: blended CO₂/token metric hides workload and utilization effects

## Context

Following the work in #1 (now closed), the dashboard correctly measures GPU power from DCGM and divides by total token throughput (prompt + generation) from vLLM. The commercial baseline uses a physics-constrained envelope with separate prefill/decode J/token estimates blended by workload preset.

A close examination of real Prometheus data and the comparison logic reveals several areas where the methodology could be more transparent or more precise.

## Findings

### 1. NRP measurement is sound but inherently blended

DCGM reports a single power reading per GPU. We **cannot** decompose it into "watts spent on prefill" vs "watts spent on decode." The formula:

```
CO₂/token = avg_power × grid_intensity / total_tokens_per_sec
```

gives the real amortized energy cost per token, which is honest carbon accounting. It is nonetheless a weighted average, so its value depends on the workload mix (prompt-heavy vs generation-heavy).

Real NRP data shows the current workload is heavily prompt-dominated (83–99% prompt tokens across active models), which is consistent with the agentic preset's 95.2%. The blending is not currently distorting the comparison, but could if workload patterns shift.
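The formula above leaves units implicit. Here is a minimal sketch of the same calculation with units spelled out, assuming power in watts, grid intensity in gCO₂/kWh, and throughput in tokens per second. The function name and the example inputs are illustrative placeholders, not taken from the dashboard code or from NRP data.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def co2_per_token_g(avg_power_w: float,
                    grid_gco2_per_kwh: float,
                    total_tokens_per_sec: float) -> float:
    """Grams of CO2 per token, prompt and generation tokens combined."""
    if total_tokens_per_sec <= 0:
        raise ValueError("throughput must be positive")
    # W / (tok/s) = J/token: the blended energy cost per token.
    joules_per_token = avg_power_w / total_tokens_per_sec
    # Convert gCO2/kWh to gCO2/J before multiplying.
    return joules_per_token * grid_gco2_per_kwh / JOULES_PER_KWH


# Placeholder inputs: 1000 W, 360 gCO2/kWh, 2000 tok/s.
example = co2_per_token_g(1000.0, 360.0, 2000.0)
print(f"{example:.2e} gCO2/token")
```

Because the numerator is total GPU power and the denominator is total tokens, nothing in this calculation can separate prefill from decode. The mix only appears implicitly, through the throughput.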

### 2. Idle/underutilization inflates NRP's per-token cost

The NRP metric includes idle GPU power in the numerator. Example from current data:

| Model | Power (W) | Total tok/s | J/token |
|-------|-----------|-------------|---------|
| GLM-4.7-FP8 | 1320 | 3158 | 0.42 |
| Qwen3.5-397B | 1235 | 2392 | 0.52 |
| GPT-OSS-120B | 597 | 2937 | 0.20 |
| Kimi-K2.5 | 928 | 158 | 5.86 |
| MiniMax-M2.5 | 421 | 246 | 1.71 |

Kimi-K2.5 at 5.86 J/token exceeds the commercial mid estimate (2.93 J/token), not because the model is inefficient but because the GPU draws 928 W while serving only 158 tok/s. The ≥5 tok/s threshold filters out the worst idle cases but does not address underutilization.

The commercial J/token values from the literature assume near-peak utilization. This asymmetry is noted in the methodology but not quantified. It makes NRP look worse than it would at full utilization, an effect partially offset by the NRP figure not including PUE.
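The J/token column can be reproduced directly from the other two columns. The sketch below recomputes it and lists models above the 2.93 J/token commercial mid estimate. The helper names are illustrative. Recomputing from the rounded table inputs gives about 5.87 J/token for Kimi-K2.5 rather than the 5.86 shown, presumably because the displayed power and throughput were themselves rounded.

```python
# (model, average power in W, total tok/s), copied from the table above.
ROWS = [
    ("GLM-4.7-FP8", 1320, 3158),
    ("Qwen3.5-397B", 1235, 2392),
    ("GPT-OSS-120B", 597, 2937),
    ("Kimi-K2.5", 928, 158),
    ("MiniMax-M2.5", 421, 246),
]

COMMERCIAL_MID_J_PER_TOKEN = 2.93  # mid estimate quoted in the text above


def joules_per_token(power_w: float, tokens_per_sec: float) -> float:
    """Blended energy per token: W / (tok/s) = J/token."""
    return power_w / tokens_per_sec


def above_commercial_mid(rows, threshold=COMMERCIAL_MID_J_PER_TOKEN):
    """Names of models whose blended J/token exceeds the threshold."""
    return [name for name, power, tps in rows
            if joules_per_token(power, tps) > threshold]


for name, power, tps in ROWS:
    print(f"{name:14s} {joules_per_token(power, tps):.2f} J/token")
print("Above commercial mid:", above_commercial_mid(ROWS))
```

Only Kimi-K2.5 crosses the threshold, and it is also the row with the lowest throughput, which is the underutilization pattern described above.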

### 3. Commercial estimates are for frontier-scale models, NRP runs smaller ones

The prefill/decode energy bounds (1–10 / 0.5–3 J/token) are described as estimates for **frontier-scale models (1T+ MoE)** on H100 infrastructure. NRP runs models such as Qwen3.5-397B and GLM-4.7-FP8, which are large but not 1T+. The comparison therefore partly conflates infrastructure efficiency with model-size differences.
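To make the size of this envelope concrete, the sketch below blends the bounds for the agentic preset (40k in / 2k out). It rests on two assumptions: the first range (1–10) is prefill and the second (0.5–3) is decode, following the order in the text, and the blend is a token-weighted linear average. The dashboard's actual blending rule is not shown in this issue, so this is an illustration rather than a reproduction.

```python
# Assumed reading of the bounds above: (low, high) in J/token.
PREFILL_J_PER_TOKEN = (1.0, 10.0)
DECODE_J_PER_TOKEN = (0.5, 3.0)


def blended_bounds(prompt_tokens: int, generation_tokens: int):
    """Token-weighted (low, high) J/token for a given workload shape.

    Linear blending is an assumption made for illustration.
    """
    total = prompt_tokens + generation_tokens
    if total <= 0:
        raise ValueError("workload must contain at least one token")
    prompt_share = prompt_tokens / total
    decode_share = 1.0 - prompt_share
    low = prompt_share * PREFILL_J_PER_TOKEN[0] + decode_share * DECODE_J_PER_TOKEN[0]
    high = prompt_share * PREFILL_J_PER_TOKEN[1] + decode_share * DECODE_J_PER_TOKEN[1]
    return low, high


low, high = blended_bounds(40_000, 2_000)  # agentic preset
print(f"blended envelope: {low:.2f} to {high:.2f} J/token")
```

Under these assumptions the envelope spans roughly an order of magnitude. A baseline calibrated to models of similar scale would likely be narrower and lower.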

### 4. Workload mix isn't validated against the selected preset

The commercial estimate is computed for a *modeled* workload (e.g., agentic: 40k in / 2k out). The NRP measurement reflects the *actual* workload. These happen to be similar right now, but the dashboard doesn't show whether the observed prompt:generation ratio matches the selected preset.
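A check of this kind could be as simple as comparing prompt fractions. The sketch below is one hedged way to do it. The function names and the 0.05 tolerance are hypothetical choices rather than existing dashboard behavior; the observed fractions are the endpoints of the 83–99% range reported above.

```python
def prompt_fraction(prompt_tokens: float, generation_tokens: float) -> float:
    """Share of total tokens that are prompt tokens."""
    total = prompt_tokens + generation_tokens
    if total <= 0:
        raise ValueError("no tokens observed")
    return prompt_tokens / total


def matches_preset(observed: float, preset: float, tolerance: float = 0.05) -> bool:
    """True when the observed prompt fraction is within tolerance of the preset.

    The default tolerance is a placeholder, not a calibrated value.
    """
    return abs(observed - preset) <= tolerance


AGENTIC_PRESET = prompt_fraction(40_000, 2_000)  # 40k in / 2k out

for observed in (0.83, 0.99):
    status = "matches" if matches_preset(observed, AGENTIC_PRESET) else "differs from"
    print(f"observed {observed:.0%} {status} preset {AGENTIC_PRESET:.1%}")
```

With this placeholder tolerance, a model at 99% prompt tokens matches the agentic preset, while one at 83% would be flagged. The dashboard could surface exactly this distinction per model.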

## Possible improvements (for discussion)

1. **Show utilization context.** Annotate models where per-token cost is likely dominated by idle overhead (e.g., when tok/s is low relative to the GPU's theoretical throughput). This helps users distinguish "inefficient model" from "underutilized GPU."

2. **Show observed workload mix.** Display the actual prompt:generation ratio alongside the preset assumption, so users can judge whether the comparison is well-matched.

3. **Consider model-scale-appropriate baselines.** The current commercial envelope is calibrated for 1T+ frontier models. A narrower, lower range for "similar-scale open models on commercial cloud" would make the comparison more precise (though harder to source).

4. **Clarify what the metric means.** The methodology should be more explicit that NRP's CO₂/token is the *real amortized cost including standby power*, while the commercial estimate represents *marginal cost at efficient utilization*. Both are valid, but they are different quantities.

5. **Distinguish market-based from location-based grid intensity.** Commercial cloud providers claim low carbon via RECs/PPAs (market-based accounting). NRP uses location-based accounting (physical grid mix). The methodology should note which accounting frame applies to each side.
