Add community-reported results section: duvo-eye-1 (78.0 standard subset / 70.6 full set, self-reported) by tomascupr · Pull Request #24 · xlang-ai/OSWorld-G

tomascupr · 2026-06-12T20:06:18Z

Community-reported results: duvo-eye-1

This PR adds a small "Community-reported results" table under the Benchmark section, seeded with our model. If you'd prefer a different format (or an issue/discussion instead of a README table), happy to adapt — there is currently no documented way to report results on OSWorld-G.

Model: duvoai/duvo-eye-1 (public weights) — Holo-3.1-35B-A3B (3B active) with a LoRA adapter (rank 64, alpha 128, 1 epoch) trained on duvoai/SynthUI (14.9k synthetic enterprise-UI rows). No benchmark data in training.

Results on `benchmark/OSWorld-G.json`

| Split | Accuracy |
| --- | --- |
| Standard subset (510 samples, refusal excluded) | 78.0 (398/510) |
| Full set (564 samples, all 54 refusal items scored as misses) | 70.6 (398/564) |
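The two figures share the same hit count and differ only in the denominator, because no refusal item can be scored as a hit. A minimal Python sketch of that arithmetic, using the counts from the table (the helper name `accuracy_pct` is illustrative and is not taken from bench_eval.py):

```python
def accuracy_pct(hits: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * hits / total, 1)


HITS = 398       # hits reported in the table above
STANDARD = 510   # non-refusal samples
REFUSALS = 54    # refusal items, all scored as misses

standard_acc = accuracy_pct(HITS, STANDARD)          # 398 / 510
full_acc = accuracy_pct(HITS, STANDARD + REFUSALS)   # 398 / 564
print(standard_acc, full_acc)  # prints: 78.0 70.6
```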

Capability breakdown (joined with benchmark/classification_result.json + benchmark/buckets.json, refusal rows excluded): Text Matching 85.0 (204/240), Element Recognition 84.3 (258/306), Layout Understanding 81.6 (195/239), Fine-grained Manipulation 68.9 (91/132).

Base model (Holo-3.1-35B-A3B) under the identical harness: 64.9 on the standard subset. Restricted to answered samples, the base model scores 75.4 vs 78.0 for duvo-eye-1; duvo-eye-1 had 0.0% unparseable outputs.
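The gap between the two base-model figures comes from the denominator: the headline number counts every sample, while the "answered only" number drops samples without a usable answer. A rough sketch of both metrics, assuming "answered" means the output parsed to a point and that each record carries `parsed` and `hit` flags (hypothetical field names; the published prediction files may be laid out differently). The records below are toy data, not benchmark results:

```python
from typing import Iterable, Mapping, Tuple


def overall_and_answered(records: Iterable[Mapping[str, bool]]) -> Tuple[float, float]:
    """Return (accuracy over all samples, accuracy over answered samples) in percent."""
    records = list(records)
    answered = [r for r in records if r["parsed"]]
    # An unparsed sample can never be a hit, so hits are counted among answered ones.
    hits = sum(1 for r in answered if r["hit"])
    overall = 100.0 * hits / len(records) if records else 0.0
    answered_only = 100.0 * hits / len(answered) if answered else 0.0
    return overall, answered_only


# Toy example: two hits, one parsed miss, one unparseable output.
toy = [
    {"parsed": True, "hit": True},
    {"parsed": True, "hit": True},
    {"parsed": True, "hit": False},
    {"parsed": False, "hit": False},
]
overall, answered_only = overall_and_answered(toy)
```

When a model has no unparseable outputs, the two values coincide, which is consistent with duvo-eye-1 showing 78.0 in both cases.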

Protocol (honest notes)

- Self-reported, single-shot, temperature 0, max_tokens 64, thinking disabled (chat_template_kwargs: {"enable_thinking": false}). Guided JSON decoding into {"x": int, "y": int} with both values in [0, 1000], scaled to absolute pixels via the original image size. A sample is scored as a hit iff the point falls inside the ground-truth region; unparseable output counts as a miss.
- Evaluated with our own harness (bench_eval.py, included in the predictions dataset), not this repo's evaluation/ scripts.
- Because decoding is constrained to a coordinate, the model cannot earn refusal credit, hence both numbers above; the 564-sample figure is the one comparable to the paper's "Overall" column.
- Per-sample predictions (raw output, parsed point, hit/miss) for both duvo-eye-1 and the base model are published for independent rescoring: duvoai/duvo-eye-1-evals (osworld-g/).
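The scoring rule described in the protocol can be sketched in Python as follows. This is an illustration and not the contents of bench_eval.py: the function names are made up, and the ground-truth region is assumed to be an axis-aligned box given as `(x_min, y_min, x_max, y_max)` in pixels, which may not match the benchmark's own annotation format.

```python
import json
from typing import Optional, Tuple


def parse_point(raw: str) -> Optional[Tuple[int, int]]:
    """Parse '{"x": int, "y": int}' with both values in [0, 1000]; None if invalid."""
    try:
        obj = json.loads(raw)
        x, y = obj["x"], obj["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if type(x) is not int or type(y) is not int:
        return None
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y


def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[float, float]:
    """Scale a [0, 1000] point to absolute pixels using the original image size."""
    x, y = point
    return x / 1000.0 * width, y / 1000.0 * height


def is_hit(raw: str, width: int, height: int,
           box: Tuple[float, float, float, float]) -> bool:
    """Hit iff the scaled point lies inside the box; unparseable output is a miss."""
    point = parse_point(raw)
    if point is None:
        return False
    px, py = to_pixels(point, width, height)
    x_min, y_min, x_max, y_max = box
    return x_min <= px <= x_max and y_min <= py <= y_max
```

For example, on a 1920x1080 screenshot the output `{"x": 500, "y": 500}` maps to pixel (960, 540), so it is a hit for a box covering that point and a miss otherwise.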

tomascupr · 2026-06-12T21:39:40Z

Update: the full-564 number is now measured, not derived — we ran all 564 samples including the 54 refusal items: 70.6 (398/564), matching the arithmetic in the PR body exactly. Per-sample predictions added to the same public dataset (osworld-g/bench_oswg564_duvo-eye-1.predictions.jsonl). We also completed ScreenSpot-v2 (95.05, 1,272 samples) and ScreenSpot-Pro (72.9, submitted to that repo as likaixin2000/ScreenSpot-Pro-GUI-Grounding#29), so the model now has a full row for the three-benchmark comparison table on the project site.

tomascupr · 2026-06-13T07:13:48Z

Friendly follow-up for the maintainers (@Timothyxxx and team): since there isn't a documented way to report community results on OSWorld-G, what format would you prefer — this README table, a discussion, or a verification run on your side? Happy to match whatever is easiest to trust.

For completeness, we also evaluated duvo-eye-1 on the refined-instruction split (benchmark/OSWorld-G_refined.json): 75.0 on the full 564 (refusals scored as misses) and 82.9 on the 510 non-refusal subset, vs 70.6 / 78.0 on the original instructions. All per-sample predictions for both splits — and the base model — are public for independent rescoring at https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (osworld-g/ and osworld-g-refined/).

Glad to adapt the PR or provide anything that helps verification.

tomascupr · 2026-06-13T08:10:11Z

Reproduced under your own evaluation code: I ran duvo-eye-1 through evaluation/eval.py's scorer (point-in-rectangle / point-in-polygon / refusal, with the relative-coordinate scaling your _eval applies). Results match what's in this PR:

510 standard subset: 78.0%   |   full 564 (refusals as miss): 70.6%
by type: bbox 76.4 (470), polygon 97.5 (40), refusal 0.0 (54)

Official result file added to the public evidence dataset: https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (osworld-g/oswg_official_harness.json), next to the per-sample predictions. Happy to run anything else that helps verification.
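For readers who want to rescore the polygon items independently, a point-in-polygon test can be written with the standard ray-casting technique. The sketch below is a generic implementation and is not claimed to be what evaluation/eval.py does internally; the vertex format (a list of `(x, y)` pixel pairs) is an assumption.

```python
from typing import Sequence, Tuple


def point_in_polygon(x: float, y: float,
                     polygon: Sequence[Tuple[float, float]]) -> bool:
    """Ray casting: count how many edges a rightward ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
center_inside = point_in_polygon(5, 5, square)     # True
outside_right = point_in_polygon(15, 5, square)    # False
```

Points lying exactly on an edge are handled arbitrarily by this variant, so a rescoring run that disagrees on a handful of boundary cases would not be surprising.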

Add community-reported results section (duvo-eye-1)

7c96c45

tomascupr wants to merge 1 commit into xlang-ai:main from tomascupr:results/duvo-eye-1
