Model-specific utilization vs per-token energy efficiency analysis #3

@cboettig

Motivation

With the Opus-class comparison now grounded (#2), an interesting internal question remains: how does utilization intensity affect per-token energy efficiency across NRP's own models?

Current data shows non-obvious patterns:

| Model | Power (W) | GPUs | Total tok/s | J/token |
|---|---|---|---|---|
| gpt-oss-120b | 597 | 2× A6000 | 2341 | 0.26 |
| gemma-4-31B | 577 | 2× A6000 | 360 | 1.60 |
| Qwen3.5-397B | 783 | 8× A100 | 18 | 43.5 |
| Kimi-K2.5 | 670 | 8× A100 | ~0 | idle |
| MiniMax-M2.5 | 447 | 4× A100 | ~0 | idle |

Observations:

  • Low-utilization models on large GPU allocations (Qwen 397B at 18 tok/s on 8× A100) have extremely high J/token, because the GPUs draw ~780 W regardless of traffic
  • High-utilization models on small allocations (gpt-oss at 2341 tok/s on 2× A6000) achieve very low J/token
  • Several models sit idle drawing hundreds of watts (Kimi, MiniMax, Olmo); these contribute CO₂ with zero token output
  • Power doesn't scale linearly with throughput: gemma at 360 tok/s draws nearly the same power (577 W) as gpt-oss at 2341 tok/s (597 W), both on 2× A6000, suggesting the GPU power floor dominates at moderate utilization
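For reference, the J/token column is just measured power divided by throughput (W ÷ tok/s = J/token). Below is a minimal sketch that reproduces the table values. The names `ModelSample`, `joules_per_token`, and the `min_tok_per_s` cutoff are hypothetical, not part of the dashboard code, and the "~0" rows are entered as 0.0 as a stand-in. Applying the same function to each history sample would give the points of a per-model efficiency curve.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSample:
    """One snapshot of a model's power draw and throughput (hypothetical type)."""
    name: str
    power_w: float
    gpus: str
    tok_per_s: float


def joules_per_token(power_w: float, tok_per_s: float,
                     min_tok_per_s: float = 1.0) -> Optional[float]:
    """Return energy per token in joules, or None when the model is idle.

    min_tok_per_s is an assumed cutoff below which the ratio is not
    meaningful (it diverges as throughput approaches zero).
    """
    if tok_per_s < min_tok_per_s:
        return None
    return power_w / tok_per_s


# Values copied from the table in this issue; "~0" is entered as 0.0.
SAMPLES = [
    ModelSample("gpt-oss-120b", 597, "2x A6000", 2341),
    ModelSample("gemma-4-31B", 577, "2x A6000", 360),
    ModelSample("Qwen3.5-397B", 783, "8x A100", 18),
    ModelSample("Kimi-K2.5", 670, "8x A100", 0.0),
    ModelSample("MiniMax-M2.5", 447, "4x A100", 0.0),
]

# Map each model name to its J/token (None means idle).
RESULTS = {s.name: joules_per_token(s.power_w, s.tok_per_s) for s in SAMPLES}
```

Returning `None` for idle models, rather than infinity, keeps them out of any averaged J/token figure while still allowing their power draw to be reported separately.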

Proposed analysis / dashboard features

  1. GPU efficiency curve per model: Plot J/token vs tok/s over time (from history data). This would show each model's efficiency curve — how much per-token cost drops as utilization increases, and where the diminishing returns are.

  2. Idle power attribution: Show what fraction of each model's CO₂/hr is "idle overhead" vs "productive work." Approximation: idle_fraction ≈ (GPU_count × idle_watts_per_GPU_type) / measured_power. The remainder is marginal energy actually used for inference.

  3. Right-sizing signal: Flag models where the GPU allocation seems oversized for the actual traffic. e.g., 8× A100 serving 18 tok/s could potentially be served on fewer GPUs (if the model fits), dramatically improving J/token.

  4. Consolidation opportunities: Identify whether idle models could share hardware or be spun down to save energy.
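The idle-power approximation in item 2 and the right-sizing flag in item 3 could be sketched roughly as follows. All function names are hypothetical, and both thresholds are assumptions: the per-GPU idle wattage here is taken from the idle Kimi-K2.5 row in the table (670 W across 8× A100), which may not be representative (the idle MiniMax row implies a higher per-GPU figure), and the tok/s-per-GPU cutoff is an arbitrary placeholder. The check does not verify that a model would actually fit on fewer GPUs.

```python
def idle_fraction(gpu_count: int, idle_watts_per_gpu: float,
                  measured_power_w: float) -> float:
    """Approximate share of measured power that is idle overhead.

    Implements idle_fraction ~= (GPU_count * idle_watts_per_GPU_type)
    / measured_power, clamped to 1.0 because the idle estimate can
    exceed the measured draw when the assumed idle wattage is too high.
    """
    if measured_power_w <= 0:
        raise ValueError("measured_power_w must be positive")
    estimated_idle_w = gpu_count * idle_watts_per_gpu
    return min(1.0, estimated_idle_w / measured_power_w)


def marginal_power_w(gpu_count: int, idle_watts_per_gpu: float,
                     measured_power_w: float) -> float:
    """Power attributed to inference work: measured minus estimated idle."""
    return max(0.0, measured_power_w - gpu_count * idle_watts_per_gpu)


def looks_oversized(gpu_count: int, tok_per_s: float,
                    min_tok_per_s_per_gpu: float) -> bool:
    """Flag an allocation whose per-GPU throughput is below a chosen cutoff.

    This is only a traffic-based signal; memory fit is not checked.
    """
    return tok_per_s / gpu_count < min_tok_per_s_per_gpu


# ASSUMPTION: per-A100 idle draw inferred from the idle Kimi-K2.5 row.
ASSUMED_A100_IDLE_W = 670 / 8

# ASSUMPTION: placeholder cutoff, not a measured or recommended value.
ASSUMED_MIN_TOK_PER_S_PER_GPU = 10.0

# Qwen3.5-397B row: 783 W on 8x A100 at 18 tok/s.
qwen_idle_fraction = idle_fraction(8, ASSUMED_A100_IDLE_W, 783)
qwen_marginal_w = marginal_power_w(8, ASSUMED_A100_IDLE_W, 783)
qwen_oversized = looks_oversized(8, 18, ASSUMED_MIN_TOK_PER_S_PER_GPU)
```

Under these assumptions, most of Qwen's measured draw would be attributed to idle overhead, which is consistent with the observation that the power floor dominates at low traffic. A real implementation would need measured idle wattage per GPU type.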

Non-goals

This issue concerns NRP's own fleet efficiency, not the Opus-class comparison, which is resolved in the current dashboard. The aim is to help NRP operators understand and optimize their own GPU utilization patterns.
