
Add AiScientist MLE-Bench Lite results #142

Open

survivi wants to merge 1 commit into openai:main from survivi:aiscientist-lite-results

Conversation


@survivi survivi commented Apr 22, 2026

Leaderboard Submission Disclosure

  • The scaffold/harness used in this leaderboard submission does not expose signals derived from the held-out test set to the agent during rollouts.

Hello MLE-Bench team,

We are the AweAI Team, and we would like to submit the latest public evaluation results of our open-source framework, AiScientist.

AiScientist is described in our paper here:

As part of this pull request, we provide:

  • Three independent grading reports under runs/aiscientist_glm5_lite_group1-3
  • The corresponding run_group_experiments.csv entries
  • A short runs/README.md note for this experiment
  • A proposed leaderboard row for AiScientist

This submission targets MLE-Bench Lite, i.e. the 22-task low-complexity split, which is also the setting reported in our paper. We chose Lite because it is the benchmark's recommended reduced-cost evaluation protocol while still allowing fair comparison on the Low == Lite column.

For this submission:

  • Framework: AiScientist
  • Team: AweAI Team
  • Model: GLM-5
  • Runtime budget: 24 hours per task
  • Hardware: 1 H20 GPU
  • Lite score (Low == Lite Any Medal %): 81.82 ± 0.00

We would like to be explicit that this PR reports Lite / low-split results only. We have therefore proposed placing this row under Additional Leaderboard Submissions with a Lite-only note rather than presenting it as a full Low / Medium / High / All submission. If you would prefer a different format for Lite-only entries, we are happy to adjust the PR accordingly. For reference, a sketch of how such a row could look is shown below.
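The following is only an illustrative sketch of the proposed Lite-only row; the column headers are our assumption and the unreported splits are left blank. The values are those listed above, and we will reformat to whatever column convention the leaderboard uses.

```markdown
| Framework   | Team       | Model | Low == Lite (Any Medal %) | Medium | High | All | Notes     |
|-------------|------------|-------|---------------------------|--------|------|-----|-----------|
| AiScientist | AweAI Team | GLM-5 | 81.82 ± 0.00              | --     | --   | --  | Lite only |
```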

Regarding data visibility and evaluation integrity: during solving, the agent does not use held-out test-set signals. Only public competition data is exposed to the agent; hidden / held-out information remains outside the agent-visible solving context and is used only for grading.

AiScientist is fully open-sourced, and the paper provides a system-level description of the framework, including its MLE-Bench Lite evaluation setting.

Thank you very much for building and maintaining MLE-Bench. We appreciate the benchmark and would be happy to revise the formatting or metadata if you have a preferred convention for Lite-only submissions.
