
Add AiScientist MLE-Bench Lite results #142

Open

survivi wants to merge 1 commit into openai:main from survivi:aiscientist-lite-results

Conversation


@survivi survivi commented Apr 22, 2026

Leaderboard Submission Disclosure

  • The scaffold/harness used in this leaderboard submission does not expose signals derived from the held-out test set to the agent during rollouts.

Hello MLE-Bench team,

We are the AweAI Team, and we would like to submit the latest public evaluation results of our open-source framework, AiScientist.

AiScientist is described in our paper here:

As part of this pull request, we provide:

  • Three independent grading reports under runs/aiscientist_glm5_lite_group1-3
  • The corresponding run_group_experiments.csv entries
  • A short runs/README.md note for this experiment
  • A proposed leaderboard row for AiScientist

This submission targets MLE-Bench Lite, i.e. the 22-task low-complexity split, which is also the setting reported in our paper. We chose Lite because it is the benchmark's recommended reduced-cost evaluation protocol while still allowing fair comparison on the Low == Lite column.

For this submission:

  • Framework: AiScientist
  • Team: AweAI Team
  • Model: GLM-5
  • Runtime budget: 24 hours per task
  • Hardware: 1 H20 GPU
  • Lite score (Low == Lite Any Medal %): 81.82 ± 0.00

We would like to be explicit that this PR reports Lite / low-split results only. We have therefore proposed placing this row under Additional Leaderboard Submissions with a Lite-only note rather than presenting it as a full Low / Medium / High / All submission. If you would prefer a different format for Lite-only entries, we are happy to adjust the PR accordingly. For reference, a sketch of how such a row could look is shown below.
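The following is only an illustrative sketch of the proposed Lite-only row; the column headers are our assumption and the unreported splits are left blank. The values are those listed above, and we will reformat to whatever column convention the leaderboard uses.

```markdown
| Framework   | Team       | Model | Low == Lite (Any Medal %) | Medium | High | All | Notes     |
|-------------|------------|-------|---------------------------|--------|------|-----|-----------|
| AiScientist | AweAI Team | GLM-5 | 81.82 ± 0.00              | --     | --   | --  | Lite only |
```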

Regarding data visibility and evaluation integrity: during solving, the agent does not use held-out test-set signals. Only public competition data is exposed to the agent; hidden / held-out information remains outside the agent-visible solving context and is used only for grading.

AiScientist is fully open-sourced, and the paper provides a system-level description of the framework, including its MLE-Bench Lite evaluation setting.

Thank you very much for building and maintaining MLE-Bench. We appreciate the benchmark and would be happy to revise the formatting or metadata if you have a preferred convention for Lite-only submissions.
