Quality Evals that stress longer decode

I love the new eval benchmarks idea, and I think you should pick one that is less saturated so it exposes quantization gaps.
AIME 2024/2025 is less saturated and requires more thinking (but still needs no sandbox), and in my experiments it exposes more quantization gaps than GSM8K or MMLU. It will take slightly longer to run, but it's worth it!
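
On the "no sandbox" point: AIME answers are integers from 0 to 999, so scoring can be plain exact match with no code execution. Here is a rough sketch of how the gap could be measured, assuming the prompt asks the model to put its final answer in `\boxed{...}` (that convention and all the function names below are my assumptions, not part of any existing harness):

```python
import re
from typing import List, Optional

# Assumption: the final answer appears as \boxed{<integer>} in the completion.
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")


def extract_answer(completion: str) -> Optional[int]:
    """Return the last boxed integer in a completion, or None if there is none."""
    matches = BOXED.findall(completion)
    return int(matches[-1]) if matches else None


def accuracy(completions: List[str], gold: List[int]) -> float:
    """Exact-match accuracy; a missing answer counts as wrong."""
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)


def quantization_gap(baseline: List[str], quantized: List[str], gold: List[int]) -> float:
    """Baseline accuracy minus quantized accuracy on the same problems."""
    return accuracy(baseline, gold) - accuracy(quantized, gold)


if __name__ == "__main__":
    # Toy, made-up completions purely to show the call shape.
    gold = [204, 25]
    baseline = ["... so the answer is \\boxed{204}", "thus \\boxed{25}"]
    quantized = ["\\boxed{204}", "I ran out of tokens"]
    print(quantization_gap(baseline, quantized, gold))
```

Since each AIME year has only 30 problems, it probably makes sense to sample several completions per problem and average before reading much into the gap.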