Add community-reported results section: duvo-eye-1 (78.0 standard subset / 70.6 full set, self-reported)#24

Open
tomascupr wants to merge 1 commit into
xlang-ai:mainfrom
tomascupr:results/duvo-eye-1
@tomascupr

Community-reported results: duvo-eye-1

This PR adds a small "Community-reported results" table under the Benchmark section, seeded with our model. If you'd prefer a different format (or an issue/discussion instead of a README table), happy to adapt — there is currently no documented way to report results on OSWorld-G.

Model: duvoai/duvo-eye-1 (public weights) — Holo-3.1-35B-A3B (3B active) with a LoRA adapter (rank 64, alpha 128, 1 epoch) trained on duvoai/SynthUI (14.9k synthetic enterprise-UI rows). No benchmark data in training.

Results on benchmark/OSWorld-G.json

| Split | Accuracy |
| --- | --- |
| Standard subset (510 samples, refusal excluded) | 78.0 (398/510) |
| Full set (564 samples, all 54 refusal items scored as misses) | 70.6 (398/564) |
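
For anyone checking the two headline numbers, the arithmetic can be reproduced from the counts in the table. This is only a sanity-check sketch: the helper name `accuracy_pct` is made up for illustration and is not part of bench_eval.py or this repo's evaluation/ scripts.

```python
def accuracy_pct(hits: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)

# Counts copied from the table above.
HITS = 398
STANDARD_TOTAL = 510            # non-refusal samples
REFUSAL_ITEMS = 54              # all scored as misses on the full set
FULL_TOTAL = STANDARD_TOTAL + REFUSAL_ITEMS

standard_acc = accuracy_pct(HITS, STANDARD_TOTAL)
full_acc = accuracy_pct(HITS, FULL_TOTAL)

print(f"standard subset: {standard_acc} ({HITS}/{STANDARD_TOTAL})")
print(f"full set:        {full_acc} ({HITS}/{FULL_TOTAL})")
```

The hit count is the same in both rows because a coordinate-only model cannot score on refusal items, so only the denominator changes.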

Capability breakdown (joined with benchmark/classification_result.json + benchmark/buckets.json, refusal rows excluded): Text Matching 85.0 (204/240), Element Recognition 84.3 (258/306), Layout Understanding 81.6 (195/239), Fine-grained Manipulation 68.9 (91/132).

Base model (Holo-3.1-35B-A3B) under the identical harness: 64.9 on the standard subset; among answered samples only, base 75.4 vs duvo-eye-1 78.0. duvo-eye-1 had 0.0% unparseable outputs.

Protocol (honest notes)

  • Self-reported, single-shot, temperature 0, max_tokens 64, thinking disabled (chat_template_kwargs: {"enable_thinking": false}), guided JSON decoding into {"x": int, "y": int} in [0, 1000], scaled to absolute pixels via original image size; scored as a hit iff the point falls inside the ground-truth region. Unparseable output = miss.
  • Evaluated with our own harness (bench_eval.py, included in the predictions dataset), not this repo's evaluation/ scripts.
  • Because decoding is constrained to a coordinate, the model cannot earn refusal credit — hence both numbers above; the 564-sample figure is the one comparable to the paper's "Overall" column.
  • Per-sample predictions (raw output, parsed point, hit/miss) for both duvo-eye-1 and the base model are published for independent rescoring: duvoai/duvo-eye-1-evals (osworld-g/).
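
The scoring rule in the first bullet can be sketched roughly as follows. This is not the code in bench_eval.py: the function names and the `(left, top, right, bottom)` pixel box layout are assumptions made for illustration, and only the rectangle case is shown.

```python
import json
from typing import Optional, Tuple

def parse_point(raw: str) -> Optional[Tuple[int, int]]:
    """Parse {"x": int, "y": int} with both values in [0, 1000]; None if unparseable."""
    try:
        obj = json.loads(raw)
        x, y = obj["x"], obj["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not (isinstance(x, int) and isinstance(y, int)):
        return None
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y

def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[float, float]:
    """Scale a [0, 1000] point to absolute pixels using the original image size."""
    x, y = point
    return x / 1000.0 * width, y / 1000.0 * height

def is_hit(raw: str, image_size: Tuple[int, int],
           box: Tuple[float, float, float, float]) -> bool:
    """Hit iff the scaled point lies inside the box; unparseable output is a miss."""
    point = parse_point(raw)
    if point is None:
        return False
    px, py = to_pixels(point, *image_size)
    left, top, right, bottom = box  # assumed layout, in pixels
    return left <= px <= right and top <= py <= bottom

# Example: a 1920x1080 screenshot with a hypothetical target box.
print(is_hit('{"x": 500, "y": 500}', (1920, 1080), (900, 500, 1000, 600)))
print(is_hit('click the button', (1920, 1080), (900, 500, 1000, 600)))
```

Treating a parse failure as an ordinary miss keeps the denominator fixed at the full sample count, which is what makes the 0.0% unparseable figure above relevant to the comparison with the base model.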

@tomascupr

Author

Update: the full-564 number is now measured, not derived — we ran all 564 samples including the 54 refusal items: 70.6 (398/564), matching the arithmetic in the PR body exactly. Per-sample predictions added to the same public dataset (osworld-g/bench_oswg564_duvo-eye-1.predictions.jsonl). We also completed ScreenSpot-v2 (95.05, 1,272 samples) and ScreenSpot-Pro (72.9, submitted to that repo as likaixin2000/ScreenSpot-Pro-GUI-Grounding#29), so the model now has a full row for the three-benchmark comparison table on the project site.

@tomascupr

Author

Friendly follow-up for the maintainers (@Timothyxxx and team): since there isn't a documented way to report community results on OSWorld-G, what format would you prefer — this README table, a discussion, or a verification run on your side? Happy to match whatever is easiest to trust.

For completeness, we also evaluated duvo-eye-1 on the refined-instruction split (benchmark/OSWorld-G_refined.json): 75.0 on the full 564 (refusals scored as misses) and 82.9 on the 510 non-refusal subset, vs 70.6 / 78.0 on the original instructions. All per-sample predictions for both splits — and the base model — are public for independent rescoring at https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (osworld-g/ and osworld-g-refined/).

Glad to adapt the PR or provide anything that helps verification.

@tomascupr

Author

Reproduced under your own evaluation code: I ran duvo-eye-1 through evaluation/eval.py's scorer (point-in-rectangle / point-in-polygon / refusal, with the relative-coordinate scaling your _eval applies). Results match what's in this PR:

510 standard subset: 78.0%   |   full 564 (refusals as miss): 70.6%
by type: bbox 76.4 (470), polygon 97.5 (40), refusal 0.0 (54)
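
As a consistency check, the by-type line can be recombined into the two headline figures. Note the assumption: per-type hit counts are inferred here from the rounded percentages and sample counts above, not read from oswg_official_harness.json, so treat them as a reconstruction.

```python
# (accuracy in percent, sample count) per annotation type, copied from the line above.
by_type = {"bbox": (76.4, 470), "polygon": (97.5, 40), "refusal": (0.0, 54)}

# Inferred hit counts: nearest integer to pct * n / 100 (an assumption, see lead-in).
hits = {name: round(pct * n / 100) for name, (pct, n) in by_type.items()}

total_hits = sum(hits.values())
non_refusal = sum(n for name, (_, n) in by_type.items() if name != "refusal")
full = sum(n for _, n in by_type.values())

standard_pct = round(100.0 * total_hits / non_refusal, 1)
full_pct = round(100.0 * total_hits / full, 1)

print(hits, total_hits)
print(f"standard: {standard_pct} ({total_hits}/{non_refusal})")
print(f"full:     {full_pct} ({total_hits}/{full})")
```

Averaging the rounded percentages directly would drift slightly because of rounding in the bbox figure, which is why the check goes through integer hit counts.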

Official result file added to the public evidence dataset: https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (osworld-g/oswg_official_harness.json), next to the per-sample predictions. Happy to run anything else that helps verification.
