Motivation
With the Opus-class comparison now grounded (#2), an interesting internal question remains: how does utilization intensity affect per-token energy efficiency across NRP's own models?
Current data shows non-obvious patterns:
| Model | Power (W) | GPUs | Total tok/s | J/token |
|---|---|---|---|---|
| gpt-oss-120b | 597 | 2× A6000 | 2341 | 0.26 |
| gemma-4-31B | 577 | 2× A6000 | 360 | 1.60 |
| Qwen3.5-397B | 783 | 8× A100 | 18 | 43.5 |
| Kimi-K2.5 | 670 | 8× A100 | ~0 | idle |
| MiniMax-M2.5 | 447 | 4× A100 | ~0 | idle |
Observations:
- Low-utilization models on large GPU allocations (Qwen3.5-397B at 18 tok/s on 8× A100) have extremely high J/token: the GPUs draw ~780 W regardless of traffic.
- High-utilization models on small allocations (gpt-oss-120b at 2341 tok/s on 2× A6000) achieve very low J/token.
- Several models sit idle drawing hundreds of watts (Kimi, MiniMax, Olmo); these contribute CO₂ with zero token output.
- Power doesn't scale linearly with throughput: gemma-4-31B at 360 tok/s draws nearly the same power as gpt-oss-120b at 2341 tok/s (577 W vs 597 W, both on 2× A6000), suggesting the GPU power floor dominates at moderate utilization.
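For reference, the J/token column is just measured power divided by total throughput (1 W = 1 J/s). A minimal sketch that reproduces the column from the snapshot values in the table; the `FLEET` dict, the function name, and the `min_tok_per_s` idle cutoff are illustrative assumptions, not part of the dashboard code:

```python
from typing import Optional

# Snapshot values copied from the table: (power in watts, total tokens per second).
FLEET = {
    "gpt-oss-120b": (597.0, 2341.0),
    "gemma-4-31B": (577.0, 360.0),
    "Qwen3.5-397B": (783.0, 18.0),
    "Kimi-K2.5": (670.0, 0.0),
    "MiniMax-M2.5": (447.0, 0.0),
}


def joules_per_token(power_w: float, tok_per_s: float,
                     min_tok_per_s: float = 0.5) -> Optional[float]:
    """Return energy per token in joules, or None if the model is effectively idle.

    min_tok_per_s is a hypothetical cutoff below which J/token is not meaningful.
    """
    if tok_per_s < min_tok_per_s:
        return None
    return power_w / tok_per_s


if __name__ == "__main__":
    for name, (power_w, tok_per_s) in FLEET.items():
        jpt = joules_per_token(power_w, tok_per_s)
        label = "idle" if jpt is None else f"{jpt:.2f} J/token"
        print(f"{name:>14}: {label}")
```

Returning `None` for idle models keeps them out of any averages instead of letting a division by near-zero throughput dominate the plot.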
Proposed analysis / dashboard features
- GPU efficiency curve per model: Plot J/token vs tok/s over time (from history data). This would show each model's efficiency curve: how much per-token cost drops as utilization increases, and where the diminishing returns are.
- Idle power attribution: Show what fraction of each model's CO₂/hr is "idle overhead" vs "productive work." Approximation: `idle_fraction ≈ (GPU_count × idle_watts_per_GPU_type) / measured_power`. The remainder is marginal energy actually used for inference.
- Right-sizing signal: Flag models where the GPU allocation seems oversized for the actual traffic. e.g., 8× A100 serving 18 tok/s could potentially be served on fewer GPUs (if the model fits), dramatically improving J/token.
- Consolidation opportunities: Identify whether idle models could share hardware or be spun down to save energy.
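The idle-attribution approximation and a crude right-sizing flag could be sketched as follows. Everything here is an assumption for illustration: the `ModelSample` type, the function names, and the 0.8 threshold are hypothetical, and the per-GPU A100 idle figure is simply back-derived from the idle Kimi-K2.5 row (670 W across 8 GPUs) rather than measured per GPU type.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModelSample:
    name: str
    power_w: float    # measured power draw in watts
    gpu_count: int
    gpu_type: str
    tok_per_s: float


def idle_fraction(sample: ModelSample, idle_watts_per_gpu: Dict[str, float]) -> float:
    """idle_fraction ≈ (GPU_count × idle_watts_per_GPU_type) / measured_power.

    Clamped to 1.0 because the approximation can overshoot when the
    assumed idle draw exceeds what was actually measured.
    """
    idle_w = sample.gpu_count * idle_watts_per_gpu[sample.gpu_type]
    return min(idle_w / sample.power_w, 1.0)


def looks_oversized(sample: ModelSample, idle_watts_per_gpu: Dict[str, float],
                    max_idle_fraction: float = 0.8) -> bool:
    """Flag a model whose power is mostly idle overhead (hypothetical threshold)."""
    return idle_fraction(sample, idle_watts_per_gpu) >= max_idle_fraction


# ASSUMPTION: treat the idle Kimi-K2.5 deployment as the A100 idle floor.
IDLE_WATTS_PER_GPU = {"A100": 670.0 / 8}

qwen = ModelSample("Qwen3.5-397B", power_w=783.0, gpu_count=8,
                   gpu_type="A100", tok_per_s=18.0)

if __name__ == "__main__":
    frac = idle_fraction(qwen, IDLE_WATTS_PER_GPU)
    print(f"{qwen.name}: idle fraction ≈ {frac:.2f}, "
          f"oversized={looks_oversized(qwen, IDLE_WATTS_PER_GPU)}")
```

Under that assumption, most of the Qwen3.5-397B power draw would be attributed to idle overhead, which matches the observation that the 8× A100 allocation draws roughly the same power regardless of traffic. A real implementation would want measured idle baselines per GPU type and a check that the model actually fits on fewer GPUs.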
Non-goals
This is about NRP's own fleet efficiency, not the Opus-class comparison. The frontier comparison is resolved in the current dashboard. This issue is about helping NRP operators understand and optimize their own GPU utilization patterns.