This repository was archived by the owner on Apr 23, 2026. It is now read-only.

Update benchmark scores across multiple leaderboards #302

Merged
insign merged 1 commit into main from feature/update-benchmarks-12684852319776782862 on Apr 19, 2026

@insign (Contributor) commented Apr 19, 2026

This PR updates the benchmark scores for several recent models (such as Claude 4.5/4.6, GPT-5 series, Gemini 3 series, DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3) using the latest verifiable data from official leaderboards: LiveBench, SWE-Bench Verified, Humanity's Last Exam (HLE), FrontierMath, BFCL, OSWorld, and MMMU-Pro. This keeps the showdown database's comparative evaluations accurate and current.

Changes:

  • LiveBench: Updated scores for Claude Sonnet 4.5 Thinking, DeepSeek V3.2 Thinking, Claude 4.5 Opus Medium Effort, and Claude Sonnet 4.5.
  • SWE-Bench Verified: Updated scores for the thinking variants (Claude Opus 4.5/4.6, Gemini 3 Flash, MiniMax M2.5, GLM-5, GPT-5.2, Claude Sonnet 4.5, Kimi K2.5, DeepSeek V3.2). Per the UPDATE.md rules, the corresponding base models were set to null so that inferior_of imputation fills them automatically.
  • Humanity's Last Exam (HLE): Updated Gemini 3 Pro, GPT-5.2, Claude Opus 4.5 Thinking, Kimi K2.5, and GPT-5.1 High.
  • FrontierMath: Updated scores for Claude Opus 4.7/4.6, GPT-5.2, Gemini 3 Pro/Flash, GPT-5.1 High, and Claude Sonnet 4.6.
  • BFCL: Updated scores for Claude Opus/Sonnet 4.5, Gemini 3 Pro, GLM 4.6, Grok 4.1 (and thinking), o3, DeepSeek V3.2 Thinking, and Kimi K2.5 Thinking.
  • OSWorld: Updated Claude Sonnet 4.6, Kimi K2.5, and Claude Sonnet 4.5.
  • MMMU-Pro: Updated GPT-5.2/5.1 High, o3, Claude Opus/Sonnet 4.6, and Claude Opus/Sonnet 4.5.
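
The SWE-Bench Verified rule in the list above (a family's single reported score goes to the thinking variant, and the base model is set to null) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's code: the record layout, the `assign_family_score` helper, the model ids, and the score value are all hypothetical.

```python
from typing import Optional

# Hypothetical layout: model id -> {benchmark key -> score or None}.
Scores = dict[str, dict[str, Optional[float]]]


def assign_family_score(scores: Scores, base_id: str, thinking_id: str,
                        benchmark: str, value: float) -> None:
    """Give a family's single reported score to the thinking variant.

    The base model is set to None so that a later imputation step
    (inferior_of in the PR description) can fill it in.
    """
    scores.setdefault(thinking_id, {})[benchmark] = value
    scores.setdefault(base_id, {})[benchmark] = None


# Placeholder ids and value, for illustration only.
scores: Scores = {}
assign_family_score(scores, "example-model", "example-model-thinking",
                    "swe_bench", 70.0)
print(scores)
```

Keeping the base entry present but null, rather than omitting it, makes the "not reported separately" case explicit in the data.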

Certain fields were purposefully left null where data was absent or was reported only for the thinking variants, to maintain database integrity. Both meta.version and meta.last_update were bumped.
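
As a rough sketch of how a null field might later be resolved through `inferior_of`, the snippet below follows the link to the superior model and reuses its score. The record shape, the `impute_upper_bound` helper, and the choice to reuse the superior model's score unchanged are assumptions made for illustration; the actual rule in UPDATE.md may differ (for example, it might apply a penalty).

```python
from typing import Optional

# Hypothetical records; real field names and values may differ.
models = {
    "example-model-thinking": {"inferior_of": None, "swe_bench": 70.0},
    "example-model": {"inferior_of": "example-model-thinking", "swe_bench": None},
}


def impute_upper_bound(models: dict, model_id: str,
                       benchmark: str) -> Optional[float]:
    """Return the model's own score, or the first non-null score found
    by following inferior_of links.

    Assumption: the superior model's score is reused as-is.
    """
    seen = set()
    current = model_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        record = models.get(current, {})
        score = record.get(benchmark)
        if score is not None:
            return score
        current = record.get("inferior_of")
    return None


print(impute_upper_bound(models, "example-model", "swe_bench"))
```

A field with no data anywhere along the chain stays `None`, which matches the description's intent of leaving genuinely absent values null.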


PR created automatically by Jules for task 12684852319776782862 started by @insign

Sources verified:
- LiveBench leaderboard (livebench_data.txt)
- SWE-Bench Verified scores (swebench_data.txt)
- HLE leaderboard (hle_data.txt)
- FrontierMath leaderboard (epoch.ai/frontiermath)
- BFCL leaderboard (gorilla.cs.berkeley.edu/leaderboard.html)
- OSWorld leaderboard (os-world.github.io)
- MMMU-Pro leaderboard (mmmu-benchmark.github.io)

Changes include updating livebench, swe_bench, humanity_last_exam,
frontiermath, bfcl, osworld, and mmmu_pro values for recent models
such as Claude Opus/Sonnet 4.5/4.6, GPT-5 series, Gemini 3 series,
DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3.

Note: Where a SWE-Bench family reports a single score, it was
assigned to the thinking variant per UPDATE.md guidelines. Base
models were set to null for automatic inferior_of imputation.

Co-authored-by: insign <1113045+insign@users.noreply.github.com>

@google-labs-jules (Contributor)

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@cloudflare-workers-and-pages

Deploying showd with Cloudflare Pages

Latest commit: b281743
Status: ✅  Deploy successful!
Preview URL: https://dc9fc857.showd.pages.dev
Branch Preview URL: https://feature-update-benchmarks-12.showd.pages.dev

@insign merged commit 71ad5b0 into main on Apr 19, 2026
3 checks passed
@insign deleted the feature/update-benchmarks-12684852319776782862 branch on April 19, 2026 20:48