Model-specific utilization vs per-token energy efficiency analysis

## Motivation

With the Opus-class comparison now grounded (#2), an interesting internal question remains: how does utilization intensity affect per-token energy efficiency *across NRP's own models*?

Current data shows non-obvious patterns:

| Model | Power (W) | GPUs | Total tok/s | J/token |
|-------|-----------|------|-------------|---------|
| gpt-oss-120b | 597 | 2× A6000 | 2341 | 0.26 |
| gemma-4-31B | 577 | 2× A6000 | 360 | 1.60 |
| Qwen3.5-397B | 783 | 8× A100 | 18 | 43.5 |
| Kimi-K2.5 | 670 | 8× A100 | ~0 | idle |
| MiniMax-M2.5 | 447 | 4× A100 | ~0 | idle |

Observations:
- **Low-utilization models on large GPU allocations** (Qwen 397B at 18 tok/s on 8× A100) have extremely high J/token — the GPUs burn ~780W regardless of traffic
- **High-utilization models on small allocations** (gpt-oss at 2341 tok/s on 2× A6000) achieve very low J/token
- Several models sit idle while drawing hundreds of watts (Kimi, MiniMax, and Olmo, which is not listed in the table above). These contribute CO₂ with zero token output.
- Power doesn't scale linearly with throughput: gemma at 360 tok/s draws nearly the same power as gpt-oss at 2341 tok/s (577 W vs 597 W, both on 2× A6000), suggesting the GPU power floor dominates at moderate utilization
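
The J/token column follows directly from the other two numeric columns, since a watt is one joule per second. A minimal sketch that recomputes it from the table rows (the function and variable names are illustrative and not part of the dashboard code):

```python
# Rows copied from the table above: (model, power in W, total tok/s).
ROWS = [
    ("gpt-oss-120b", 597.0, 2341.0),
    ("gemma-4-31B", 577.0, 360.0),
    ("Qwen3.5-397B", 783.0, 18.0),
]


def joules_per_token(power_w, tok_per_s):
    """W / (tok/s) = (J/s) / (tok/s) = J/token.

    Returns None when throughput is zero, because per-token energy
    is undefined for an idle model.
    """
    if tok_per_s <= 0:
        return None
    return power_w / tok_per_s


j_per_token = {name: joules_per_token(p, t) for name, p, t in ROWS}

# Throughput and power ratios between the two models on 2x A6000.
throughput_ratio = 2341.0 / 360.0  # gpt-oss vs gemma, tok/s
power_ratio = 597.0 / 577.0        # gpt-oss vs gemma, W

for name, value in j_per_token.items():
    print(f"{name}: {value:.2f} J/token")
```

The two ratios at the end make the last observation concrete: gpt-oss produces roughly 6.5 times gemma's throughput while drawing only about 3% more power.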

## Proposed analysis / dashboard features

1. **GPU efficiency curve per model**: Plot J/token against tok/s using the history data. This would show how much each model's per-token cost drops as utilization increases, and where the diminishing returns are.

2. **Idle power attribution**: Show what fraction of each model's CO₂/hr is "idle overhead" vs "productive work." Approximation: `idle_fraction ≈ (GPU_count × idle_watts_per_GPU_type) / measured_power`. The remainder is marginal energy actually used for inference.

3. **Right-sizing signal**: Flag models where the GPU allocation seems oversized for the actual traffic. For example, a model serving 18 tok/s on 8× A100 could potentially run on fewer GPUs (if it fits in memory), which would dramatically improve J/token.

4. **Consolidation opportunities**: Identify whether idle models could share hardware or be spun down to save energy.
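
Items 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the per-GPU idle wattage and the per-GPU throughput threshold are hypothetical placeholders, not measured NRP values, and the function names are invented for this example.

```python
def idle_fraction(gpu_count, idle_watts_per_gpu, measured_power_w):
    """Approximate share of measured power that is idle overhead.

    Implements idle_fraction ~= (GPU_count * idle_watts_per_GPU_type)
    / measured_power, clamped to 1.0 so a fully idle model reports 100%.
    """
    if measured_power_w <= 0:
        raise ValueError("measured_power_w must be positive")
    return min(1.0, gpu_count * idle_watts_per_gpu / measured_power_w)


def looks_oversized(gpu_count, tok_per_s, min_tok_per_s_per_gpu):
    """Flag an allocation whose per-GPU throughput is below a threshold."""
    return tok_per_s / gpu_count < min_tok_per_s_per_gpu


# ASSUMPTION: 50 W idle draw per A100 is a placeholder, not a measurement.
ASSUMED_A100_IDLE_W = 50.0
# ASSUMPTION: 10 tok/s per GPU is an arbitrary example threshold.
ASSUMED_MIN_TOK_PER_S_PER_GPU = 10.0

# Qwen3.5-397B row from the table: 783 W, 8x A100, 18 tok/s.
qwen_idle = idle_fraction(8, ASSUMED_A100_IDLE_W, 783.0)
qwen_flag = looks_oversized(8, 18.0, ASSUMED_MIN_TOK_PER_S_PER_GPU)

# gpt-oss-120b row from the table: 2x A6000, 2341 tok/s.
gpt_oss_flag = looks_oversized(2, 2341.0, ASSUMED_MIN_TOK_PER_S_PER_GPU)

print(f"Qwen idle fraction (placeholder idle watts): {qwen_idle:.2f}")
print(f"Qwen oversized? {qwen_flag}; gpt-oss oversized? {gpt_oss_flag}")
```

A per-GPU throughput threshold is only one possible right-sizing heuristic. It ignores whether the model would actually fit on fewer GPUs, so a flag should be treated as a prompt for review rather than a recommendation.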

## Non-goals

This is about NRP's own fleet efficiency, not the Opus-class comparison. The frontier comparison is resolved in the current dashboard. This issue is about helping NRP operators understand and optimize their own GPU utilization patterns.
