Update benchmark scores across multiple leaderboards by insign · Pull Request #302 · verseles/showdown

insign · 2026-04-19T15:49:09Z

This PR updates the benchmark scores for several recent models (such as Claude 4.5/4.6, GPT-5 series, Gemini 3 series, DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3) using the latest verifiable data from official leaderboards including LiveBench, SWE-Bench, Humanity's Last Exam (HLE), FrontierMath, BFCL, OSWorld, and MMMU. This ensures the showdown database provides accurate and current comparative evaluations.

Changes:

LiveBench: Updated scores for Claude Sonnet 4.5 Thinking, DeepSeek V3.2 Thinking, Claude 4.5 Opus Medium Effort, and Claude Sonnet 4.5.
SWE-Bench Verified: Updated scores for thinking variants (Claude Opus 4.5/4.6, Gemini 3 Flash, MiniMax M2.5, GLM-5, GPT-5.2, Claude Sonnet 4.5, Kimi K2.5, DeepSeek V3.2) and set the corresponding base-model scores to null so that they are imputed automatically via inferior_of, as per UPDATE.md rules.
Humanity's Last Exam (HLE): Updated Gemini 3 Pro, GPT-5.2, Claude Opus 4.5 Thinking, Kimi K2.5, and GPT-5.1 High.
FrontierMath: Updated scores for Claude Opus 4.7/4.6, GPT-5.2, Gemini 3 Pro/Flash, GPT-5.1 High, and Claude Sonnet 4.6.
BFCL: Updated scores for Claude Opus/Sonnet 4.5, Gemini 3 Pro, GLM 4.6, Grok 4.1 (and thinking), o3, DeepSeek V3.2 Thinking, and Kimi K2.5 Thinking.
OSWorld: Updated Claude Sonnet 4.6, Kimi K2.5, and Claude Sonnet 4.5.
MMMU-Pro: Updated GPT-5.2/5.1 High, o3, Claude Opus/Sonnet 4.6, and Claude Opus/Sonnet 4.5.

Some fields were intentionally left null where no data was available or where a score was reported only for the thinking variant, to maintain database integrity. Both meta.version and meta.last_update were bumped.
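As a rough illustration of the null-plus-inferior_of convention described above, here is a minimal Python sketch. Only the `inferior_of` and `swe_bench` field names come from this PR; the model ids, the record layout, the placeholder scores, and the `ASSUMED_RATIO` discount are all hypothetical. The actual imputation rule is defined in UPDATE.md and may well differ.

```python
from typing import Optional

# Hypothetical records: ids and scores are placeholders, not real leaderboard data.
MODELS = {
    "example-model-thinking": {"inferior_of": None, "swe_bench": 80.0},
    "example-model": {"inferior_of": "example-model-thinking", "swe_bench": None},
}

# Assumed discount applied to the superior model's score; this is an
# illustration only, not the rule the showdown project actually uses.
ASSUMED_RATIO = 0.9


def resolve_score(model_id: str, benchmark: str, models: dict) -> Optional[float]:
    """Return the stored score, or an estimate derived from the superior model."""
    record = models[model_id]
    score = record.get(benchmark)
    if score is not None:
        return score  # an explicitly reported score always wins
    superior_id = record.get("inferior_of")
    if superior_id is None:
        return None  # no superior model to impute from, so the value stays null
    superior_score = resolve_score(superior_id, benchmark, models)
    if superior_score is None:
        return None
    return superior_score * ASSUMED_RATIO
```

Under these assumptions, a single reported family score is stored on the thinking variant, and the base model's value is derived from it on read instead of being duplicated in the data.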

PR created automatically by Jules for task 12684852319776782862 started by @insign

Sources verified:
- LiveBench leaderboard (livebench_data.txt)
- SWE-Bench Verified scores (swebench_data.txt)
- HLE leaderboard (hle_data.txt)
- FrontierMath leaderboard (epoch.ai/frontiermath)
- BFCL leaderboard (gorilla.cs.berkeley.edu/leaderboard.html)
- OSWorld leaderboard (os-world.github.io)
- MMMU-Pro leaderboard (mmmu-benchmark.github.io)

Changes include updating livebench, swe_bench, humanity_last_exam, frontiermath, bfcl, osworld, and mmmu_pro values for recent models such as Claude Opus/Sonnet 4.5/4.6, GPT-5 series, Gemini 3 series, DeepSeek V3.2, Grok 4.1, Kimi K2.5, and o3.

Note: Where a single score was reported for a SWE-Bench model family, it was assigned to the thinking variant per UPDATE.md guidelines. Base models were set to null for automatic inferior_of imputation.

Co-authored-by: insign <1113045+insign@users.noreply.github.com>

google-labs-jules · 2026-04-19T15:49:11Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

chatgpt-codex-connector · 2026-04-19T15:49:15Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

cloudflare-workers-and-pages · 2026-04-19T15:49:55Z

Deploying showd with Cloudflare Pages

Latest commit:	`b281743`
Status:	✅ Deploy successful!
Preview URL:	https://dc9fc857.showd.pages.dev
Branch Preview URL:	https://feature-update-benchmarks-12.showd.pages.dev


insign merged commit 71ad5b0 into main Apr 19, 2026
3 checks passed

insign deleted the feature/update-benchmarks-12684852319776782862 branch April 19, 2026 20:48

insign merged 1 commit into main from feature/update-benchmarks-12684852319776782862