Add Jina Test + GPT-5.4 to Lite unit-test leaderboard (69.9%) #48

Open
keon wants to merge 1 commit into logic-star-ai:master from omxyz:add-jina-test

keon commented Apr 16, 2026

Summary

Adds Jina Test + GPT-5.4 to the SWT-Bench Lite unit-test leaderboard.

| Field | Value |
| --- | --- |
| Method | Jina Test |
| Base Model | GPT-5.4 (openai/gpt-5.4-2026-03-05) |
| Score | 69.9% F→P on SWT-Bench Lite unit-test mode |
| Organization | Om Labs (new entry) |

Approach

Jina Test is a cross-validation proxy oracle pipeline built on OpenHands 1.16.1:

  1. Test Generation — K=10 test candidates via 5 prompt strategies (direct, snippet, assertflip, contract, exception) × 2 temperatures
  2. Fix Generation — J=3 fix candidates at temperatures (0.0, 0.5, 1.0)
  3. CXV Matrix — 30-cell cross-validation matrix classified by a 10-verdict regex engine
  4. Selection — frequency-rank: test with highest GOOD_FAIL count becomes proxy oracle; fix surviving that test is selected
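The enumeration and selection steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the submission's code: the two test-generation temperatures, every verdict label other than GOOD_FAIL, the lowest-index tie-break, and the reading of "surviving" as "the chosen test's cell for that fix is GOOD_FAIL" are all guesses, and all function names are hypothetical.

```python
from itertools import product

STRATEGIES = ["direct", "snippet", "assertflip", "contract", "exception"]
TEST_TEMPERATURES = [0.0, 1.0]      # assumption: the PR does not state the two values
FIX_TEMPERATURES = [0.0, 0.5, 1.0]  # from the PR: J=3 fix candidates


def enumerate_test_configs():
    """Return the K=10 (strategy, temperature) settings for test generation."""
    return list(product(STRATEGIES, TEST_TEMPERATURES))


def build_matrix(num_tests, num_fixes, run_cell):
    """Build the K x J matrix; run_cell(t, f) returns a verdict string.

    run_cell is a hypothetical stand-in for running test t against fix f
    and classifying the output with the deterministic verdict engine.
    """
    return {(t, f): run_cell(t, f)
            for t in range(num_tests)
            for f in range(num_fixes)}


def select_proxy_oracle(matrix, num_tests, num_fixes):
    """Frequency-rank selection: pick the test with the most GOOD_FAIL cells.

    Ties go to the lowest test index (assumption). The selected fix is the
    first fix whose cell for that test is GOOD_FAIL (assumption), or None.
    """
    counts = [
        sum(matrix[(t, f)] == "GOOD_FAIL" for f in range(num_fixes))
        for t in range(num_tests)
    ]
    best_test = max(range(num_tests), key=lambda t: (counts[t], -t))
    surviving = [f for f in range(num_fixes)
                 if matrix[(best_test, f)] == "GOOD_FAIL"]
    best_fix = surviving[0] if surviving else None
    return best_test, best_fix


if __name__ == "__main__":
    # Toy verdicts: test 4 has two GOOD_FAIL cells, test 7 has one.
    toy = {(4, 1): "GOOD_FAIL", (4, 2): "GOOD_FAIL", (7, 0): "GOOD_FAIL"}
    matrix = build_matrix(
        len(enumerate_test_configs()),
        len(FIX_TEMPERATURES),
        lambda t, f: toy.get((t, f), "OTHER"),  # "OTHER" is a placeholder label
    )
    print(len(matrix), select_proxy_oracle(matrix, 10, 3))
```

Because the selection only counts verdict strings, it stays deterministic once the matrix is filled, which matches the PR's claim that no LLM is involved at this stage.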

The 276-instance submission keeps baseline predictions for 175 already-resolved instances and applies the CXV pipeline to the 100 failures. Verdict classification and selection are deterministic (no LLM).
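The way the submission combines the two prediction sources might look like the sketch below. The dictionary layout, the instance IDs, and the fallback to the baseline prediction when the pipeline produced no output are assumptions made for illustration; the PR does not describe the actual file format or merge logic.

```python
def merge_predictions(baseline, pipeline, resolved_ids):
    """Combine baseline and CXV-pipeline predictions into one submission.

    baseline and pipeline map instance ID -> predicted test patch.
    Instances in resolved_ids keep their baseline prediction; the rest
    take the pipeline output, falling back to the baseline prediction
    if the pipeline has none (assumption).
    """
    merged = {}
    for instance_id, prediction in baseline.items():
        if instance_id in resolved_ids or instance_id not in pipeline:
            merged[instance_id] = prediction
        else:
            merged[instance_id] = pipeline[instance_id]
    return merged


if __name__ == "__main__":
    # Hypothetical instance IDs and patch placeholders.
    baseline = {"a": "base-a", "b": "base-b", "c": "base-c"}
    pipeline = {"b": "cxv-b"}
    print(merge_predictions(baseline, pipeline, resolved_ids={"a"}))
```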

Files changed

  • docs/approaches.csv — +Jina Test entry
  • docs/orgs.csv — +Om Labs org entry
  • docs/runs.csv — +69.9% result on lite unittest (2026-04-16)
  • docs/static/images/logos/omlabs.png — Om Labs logo

Artifacts

Notes

The coverage-increase column is blank in runs.csv; happy to fill it in once the official SWT-Bench harness run (in progress) completes. I spot-checked a few instances with python -m src.main --filter_swt, and the patches apply correctly. I am also sending the submission to submit@swtbench.com in parallel for independent verification.

Happy to address any feedback before merge.

Method name: Jina Test
Organization: Om Labs (new)
Score: 69.9% F→P on SWT-Bench Lite unit-test mode
Model: GPT-5.4 (openai/gpt-5.4-2026-03-05)
Approach: Cross-validation proxy oracle — K=10 test candidates × J=3 fix
candidates with frequency-rank selection

Project: https://github.com/omxyz/swt-bentch
Paper: https://github.com/omxyz/swt-bentch/blob/main/arxiv/main.pdf
Traces: https://github.com/omxyz/swt-bentch/tree/main/trajectories