Quality Evals that stress longer decode #1920

Description

@sshleifer

I love the new Eval benchmarks idea, and I think you should pick a benchmark that is less saturated so that it exposes quantization gaps.
AIME 2024/2025 is less saturated and requires more thinking (but still no sandbox), and in my experiments it exposes more quantization gaps than gsm8k or MMLU. It will take slightly longer to run, but it's worth it!
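
To illustrate why no sandbox is needed: AIME answers are integers from 0 to 999, so scoring can be a plain exact match on the final answer. Below is a minimal sketch, assuming the model is prompted to put its final answer in `\boxed{...}` (that prompt convention is my assumption). The helper names `extract_answer` and `accuracy` are hypothetical and not part of this repo.

```python
import re
from typing import List, Optional

# Assumption: completions end with a final answer written as \boxed{N}.
# AIME answers are integers in [0, 999], so at most three digits are accepted.
_BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")


def extract_answer(completion: str) -> Optional[int]:
    """Return the last \\boxed{N} integer in the completion, or None if absent."""
    matches = _BOXED.findall(completion)
    return int(matches[-1]) if matches else None


def accuracy(completions: List[str], gold: List[int]) -> float:
    """Exact-match accuracy; a missing or malformed answer counts as wrong."""
    if len(completions) != len(gold):
        raise ValueError("completions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Running the same problems through the baseline and the quantized checkpoint and comparing the two `accuracy` values would give the gap. With only a few dozen problems per AIME year, averaging over several sampled completions per problem is probably needed to get a stable number.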
