Update benchmark scores across multiple leaderboards #302
Sources verified:
- LiveBench leaderboard (livebench_data.txt)
- SWE-Bench Verified scores (swebench_data.txt)
- HLE leaderboard (hle_data.txt)
- FrontierMath leaderboard (epoch.ai/frontiermath)
- BFCL leaderboard (gorilla.cs.berkeley.edu/leaderboard.html)
- OSWorld leaderboard (os-world.github.io)
- MMMU-Pro leaderboard (mmmu-benchmark.github.io)

Changes include updating livebench, swe_bench, humanity_last_exam, frontiermath, bfcl, osworld, and mmmu_pro values for recent models such as Claude Opus/Sonnet 4.5/4.6, GPT-5 series, Gemini 3 series, DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3.

Note: Where a SWE-Bench family reported only a single score, that score was assigned to the thinking variant per UPDATE.md guidelines. Base models were set to null for automatic inferior_of imputation.

Co-authored-by: insign <1113045+insign@users.noreply.github.com>
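The assignment rule in the note above can be sketched roughly as follows. This is only an illustration under assumptions: the entry layout, the function name `assign_family_score`, and the model ids are hypothetical and are not taken from the repository or from UPDATE.md. The score used is a placeholder, not a real leaderboard value.

```python
from typing import Optional

# Hypothetical layout: model id -> {benchmark field -> score or None}.
Entries = dict[str, dict[str, Optional[float]]]


def assign_family_score(
    entries: Entries,
    base_id: str,
    thinking_id: str,
    field: str,
    score: float,
) -> None:
    """Give a single family-level score to the thinking variant.

    The base model's field is set to None (null in JSON) so that a later
    imputation step, such as the inferior_of pass mentioned in the note,
    can fill it in. That step is not implemented here.
    """
    entries[thinking_id][field] = score
    entries[base_id][field] = None


# Placeholder ids and score, for illustration only.
models: Entries = {
    "model-x": {"swe_bench": 50.0},
    "model-x-thinking": {"swe_bench": None},
}
assign_family_score(models, "model-x", "model-x-thinking", "swe_bench", 70.0)
```

After the call, the thinking variant carries the reported score and the base entry is null, which matches the behavior the note describes.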
Deploying showd with
| Latest commit: | b281743 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://dc9fc857.showd.pages.dev |
| Branch Preview URL: | https://feature-update-benchmarks-12.showd.pages.dev |
This PR updates the benchmark scores for several recent models (such as Claude 4.5/4.6, GPT-5 series, Gemini 3 series, DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3) using the latest verifiable data from official leaderboards: LiveBench, SWE-Bench, Humanity's Last Exam (HLE), FrontierMath, BFCL, OSWorld, and MMMU-Pro. This keeps the comparative evaluations in the showdown database accurate and current.
Changes:
Certain fields were purposefully left `null` where data was absent or explicitly reported only for the thinking variants, to maintain database integrity. Both `meta.version` and `meta.last_update` timestamps were correctly bumped.

PR created automatically by Jules for task 12684852319776782862 started by @insign