🚀 The feature, motivation and pitch
As we add support for more devices, benchmarking scripts increasingly rely on device-specific tensor shapes due to VRAM constraints. This leads to fragmented benchmark setups and scattered performance results that are difficult to compare or reproduce. To make benchmarking more scalable and consistent, I propose introducing a standardized benchmark model configuration.
Motivation
Today, different kernels (and sometimes different devices) use ad-hoc shapes, which:
- Require device-specific dispatching in benchmark scripts
- Produce results that are hard to compare across hardware
- Increase maintenance burden for contributors adding new benchmarks
Proposal
Define one or more representative “mainstream” model profiles (e.g., LLaMA-/GPT-like) with canonical parameters such as hidden_size, vocab_size, num_q_heads, and num_kv_heads.
All benchmark scripts would derive their shapes from this shared config, optionally using scaled-down subsets when needed to fit memory constraints, instead of inventing per-device shapes.
This would:
- Eliminate ad-hoc device-specific shapes
- Improve comparability across kernels and hardware
- Lower the barrier for contributors writing new benchmarks
- Improve reproducibility of performance results
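To make the idea concrete, here is a minimal sketch of what such a shared config could look like. All names and values (the BenchmarkModelConfig class, the LLaMA-like defaults, the scaled() helper) are illustrative assumptions, not a finalized design:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BenchmarkModelConfig:
    # Hypothetical canonical parameters for a LLaMA-/GPT-like profile;
    # the actual values would be agreed on when the config is finalized.
    hidden_size: int = 4096
    vocab_size: int = 32000
    num_q_heads: int = 32
    num_kv_heads: int = 8

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_q_heads

    def scaled(self, factor: float) -> "BenchmarkModelConfig":
        # Scaled-down subset for memory-constrained devices, instead of
        # inventing per-device shapes in each benchmark script.
        return replace(
            self,
            hidden_size=int(self.hidden_size * factor),
            num_q_heads=max(1, int(self.num_q_heads * factor)),
            num_kv_heads=max(1, int(self.num_kv_heads * factor)),
        )

# A benchmark script derives its tensor shapes from the shared profile:
LLAMA_LIKE = BenchmarkModelConfig()
qkv_weight_shape = (
    LLAMA_LIKE.hidden_size,
    (LLAMA_LIKE.num_q_heads + 2 * LLAMA_LIKE.num_kv_heads) * LLAMA_LIKE.head_dim,
)
```

A script targeting a smaller device would call, say, LLAMA_LIKE.scaled(0.5) rather than hard-coding its own shapes, so results remain traceable to the same canonical profile.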
Target Devices
Before finalizing the standardized config, it would be helpful to clarify which devices we officially want to support for benchmarking.
Here’s the list we currently have:
- NVIDIA H100 (80GB)
- Intel XPU GPU Max 1100 (48GB)
- NPU Atlas 900 A2 POD (64GB)
Please feel free to comment if I missed any devices, or if there are additional targets we should consider.
Next Steps