Add Jina Test + GPT-5.4 to Lite unit-test leaderboard (69.9%) by keon · Pull Request #48 · logic-star-ai/swt-bench

keon · 2026-04-16T21:33:29Z

Summary

Adds Jina Test + GPT-5.4 to the SWT-Bench Lite unit-test leaderboard.

Field	Value
Method	Jina Test
Base Model	GPT-5.4 (`openai/gpt-5.4-2026-03-05`)
Score	69.9% F→P on SWT-Bench Lite unit-test mode
Organization	Om Labs (new entry)

Approach

Jina Test is a cross-validation proxy oracle pipeline built on OpenHands 1.16.1:

Test Generation — K=10 test candidates via 5 prompt strategies (direct, snippet, assertflip, contract, exception) × 2 temperatures
Fix Generation — J=3 fix candidates at temperatures (0.0, 0.5, 1.0)
CXV Matrix — 30-cell cross-validation matrix classified by a 10-verdict regex engine
Selection — frequency-rank: test with highest GOOD_FAIL count becomes proxy oracle; fix surviving that test is selected
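The selection step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `select_oracle_and_fix`, the matrix layout (one row per test candidate, one column per fix candidate), the placeholder verdict `OTHER`, the tie-breaking rule, and the reading of "surviving" as a `GOOD_FAIL` cell are all assumptions. Only the `GOOD_FAIL` verdict name comes from the description above.

```python
GOOD_FAIL = "GOOD_FAIL"  # the only verdict name given in the PR description


def select_oracle_and_fix(matrix):
    """Pick a proxy-oracle test and a fix from a verdict matrix.

    matrix[i][j] is the verdict string for test candidate i evaluated
    against fix candidate j (hypothetical layout).
    Returns (test_index, fix_index), with fix_index None if no fix survives.
    """
    if not matrix or not matrix[0]:
        raise ValueError("matrix needs at least one test and one fix")

    # Frequency rank: count GOOD_FAIL cells in each test's row.
    counts = [sum(1 for verdict in row if verdict == GOOD_FAIL) for row in matrix]

    # max() returns the first maximal index, so ties resolve deterministically.
    best_test = max(range(len(matrix)), key=lambda i: counts[i])

    # Assumption: a fix "survives" the oracle when its cell is GOOD_FAIL.
    for fix_index, verdict in enumerate(matrix[best_test]):
        if verdict == GOOD_FAIL:
            return best_test, fix_index
    return best_test, None


# Toy 3x3 matrix for brevity (the described pipeline uses K=10 tests x J=3 fixes).
toy_matrix = [
    ["OTHER", "GOOD_FAIL", "OTHER"],
    ["OTHER", "GOOD_FAIL", "GOOD_FAIL"],
    ["OTHER", "OTHER", "OTHER"],
]
print(select_oracle_and_fix(toy_matrix))  # (1, 1)
```

Because the sketch uses only counting and a fixed tie-break, the same matrix always yields the same choice.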

The 276-instance submission keeps baseline predictions for 175 already-resolved instances and applies the CXV pipeline to the 100 failures. Verdict classification and selection are deterministic (no LLM).
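As a quick sanity check on the headline number, the reported score can be reproduced from counts that appear in this PR, assuming the 18 resolved instances listed under the agent traces are the ones newly resolved by the CXV pipeline (that reading is an assumption, not something the description states outright):

```python
total_instances = 276    # size of the submission
baseline_resolved = 175  # baseline predictions kept as-is
traced_resolved = 18     # assumption: newly resolved by the CXV pipeline

resolved = baseline_resolved + traced_resolved
score = 100 * resolved / total_instances
print(f"{resolved}/{total_instances} = {score:.1f}%")  # 193/276 = 69.9%
```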

Files changed

docs/approaches.csv — +Jina Test entry
docs/orgs.csv — +Om Labs org entry
docs/runs.csv — +69.9% result on lite unittest (2026-04-16)
docs/static/images/logos/omlabs.png — Om Labs logo

Artifacts

Project: https://github.com/omxyz/swt-bentch
Predictions (276 instances): https://github.com/omxyz/swt-bentch/blob/main/submission/predictions.jsonl
Agent traces (18 resolved instances): https://github.com/omxyz/swt-bentch/tree/main/trajectories
Paper: https://github.com/omxyz/swt-bentch/blob/main/arxiv/main.pdf
Reproduction: https://github.com/omxyz/swt-bentch/blob/main/SETUP.md

Notes

The coverage-increase column is blank in `runs.csv`; happy to fill it in once the official SWT-Bench harness run (in progress) completes. I spot-checked a few instances with `python -m src.main --filter_swt`, and the patches apply correctly. I am also sending the submission to submit@swtbench.com in parallel for independent verification.

Happy to address any feedback before merge.


keon wants to merge 1 commit into logic-star-ai:master from omxyz:add-jina-test
