📐 AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

This is the official repo for the paper AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.

Updates

2026.02.05: Leaderboard Update: Qwen3-Max-Thinking achieves a new SOTA with 65.1%, while GLM-4.7 sets a new open-source record at 62.4%!
2025.12.01: We have added a Token Efficiency section showing the number of output tokens used by the models on the leaderboard. Gemini 3 Pro achieves the highest token efficiency among top-performing models!
2025.11.24: Gemini 3 Pro achieves 63.1%, setting a new SOTA and breaking 60% for the first time! We have updated the Leaderboard with the results of Gemini 3 Pro and Qwen3-Max-Thinking (Preview).
2025.11.19: Kimi-K2-Thinking achieves 56.0%, a new SOTA on the Leaderboard!
2025.11.05: The statement of Problem 35 has been revised in the 🤗 Huggingface Dataset: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the feedback!
2025.10.31: We release the dataset, evaluation code, and technical report of AMO-Bench.

📊 Leaderboard

📈 Token Efficiency

📖 Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad-level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions to evaluate the mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring that all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original, to prevent potential performance leakage from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation.

⭐ Key Features

Original problems. To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.
Guaranteed difficulty. Each problem has undergone rigorous cross-validation by multiple experts to ensure it meets at least the difficulty standards of IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.
Final-answer based grading. Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.
Human-annotated reasoning paths. In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.

🛠️ Quick Start

Installation

Clone the repository:

git clone https://github.com/meituan-longcat/AMO-Bench.git
cd AMO-Bench

Install dependencies:

pip install -r requirements.txt

Running evaluations

Step 1: Format Model Response File

After obtaining model responses, format them as follows (one JSON object per line):

{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...

Save this file in the ./model_responses/ directory.
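The format above can be produced with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper name `write_response_file` and the file name `my_model.jsonl` are hypothetical, and it assumes you already hold your model's outputs in a dictionary keyed by integer `question_id`.

```python
import json
from pathlib import Path


def write_response_file(responses, path):
    """Write {question_id: model_response} pairs as one JSON object per line.

    `responses` is assumed to map integer question IDs to response strings.
    """
    path = Path(path)
    # Create the target directory (e.g. ./model_responses/) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for question_id, model_response in sorted(responses.items()):
            record = {"question_id": question_id, "model_response": model_response}
            # ensure_ascii=False keeps non-ASCII math symbols readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

For example, `write_response_file({1: "...", 2: "..."}, "model_responses/my_model.jsonl")` would write a two-line file into the ./model_responses/ directory, whose file name could then be passed to `--response_file` in the next step.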

Step 2: Grading Responses

Set your API key and URL in lines 13-14 of utils.py. Then run:

python grading.py --response_file example.jsonl

Evaluation results will be saved under the ./grading_results/ directory.
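Because grading relies on an external API, it may be worth checking the response file before running it. The sketch below is a hypothetical pre-flight check and is not part of `grading.py`: the function name `check_response_file` is an assumption, as is the requirement that `question_id` be an integer (inferred from the example in Step 1).

```python
import json


def check_response_file(path):
    """Return a list of human-readable problems found in a response file."""
    problems = []
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # this check skips blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            missing = {"question_id", "model_response"} - record.keys()
            if missing:
                problems.append(f"line {line_no}: missing {sorted(missing)}")
                continue
            qid = record["question_id"]
            # Assumption: IDs are integers, as in the Step 1 example.
            if not isinstance(qid, int):
                problems.append(f"line {line_no}: question_id is not an integer")
                continue
            if qid in seen_ids:
                problems.append(f"line {line_no}: duplicate question_id {qid}")
            seen_ids.add(qid)
    return problems
```

An empty list means the file matches the format from Step 1; otherwise each entry names the offending line.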

Step 3 (Optional): Grade on AMO-Bench-P Subset

For a quick evaluation using only the parser-based subset (39 problems), run:

python grading.py --response_file example.jsonl --only_parser True

Discussions and Feedback

Here we summarize the discussions and feedback on AMO-Bench from the open-source community. We will regularly update the dataset to address urgent data issues.

We welcome any feedback you may have!

Problem 26 appears to be effectively the same as an existing contest problem. Thanks to @applesilicon for pointing this out!
The problem statement for Problem 35 should be further clarified: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the suggestions!
Four problems involve complex numerical expressions (Problems 12, 13, 15, and 21). When tackling these problems, LLMs may struggle to perform accurate calculations without calling external tools. Thanks to @prnake for the feedback!
Problems 38 and 39 appear to be similar in content to two arXiv papers [1] [2].

🔎 Citation

If you find our work helpful or relevant to your research, please kindly cite our paper:

@misc{an2025amobench,
      title={AMO-Bench: Large Language Models Still Struggle in High School Math Competitions}, 
      author={Shengnan An and Xunliang Cai and Xuezhi Cao and Xiaoyu Li and Yehao Lin and Junlin Liu and Xinxuan Lv and Dan Ma and Xuanlin Wang and Ziwen Wang and Shuang Zhou},
      year={2025},
      eprint={2510.26768},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.26768}, 
}

🤗 Acknowledgement

The evaluation script utilizes Math-Verify to parse and verify model outputs. We greatly appreciate the contributors' efforts in providing this valuable tool.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📪 Support

For questions and support, please open an issue on GitHub or contact the maintainers.
