Add community-reported results section: duvo-eye-1 (78.0 standard subset / 70.6 full set, self-reported) #24
Update: the full-564 number is now measured, not derived. We ran all 564 samples, including the 54 refusal items: 70.6 (398/564), matching the arithmetic in the PR body exactly. Per-sample predictions added to the same public dataset (
Friendly follow-up for the maintainers (@Timothyxxx and team): since there isn't a documented way to report community results on OSWorld-G, what format would you prefer: this README table, a discussion, or a verification run on your side? Happy to match whatever is easiest to trust. For completeness, we also evaluated duvo-eye-1 on the refined-instruction split (
Glad to adapt the PR or provide anything that helps verification.
Reproduced under your own evaluation code: I ran duvo-eye-1 through
Official result file added to the public evidence dataset: https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (
Community-reported results: duvo-eye-1
This PR adds a small "Community-reported results" table under the Benchmark section, seeded with our model. If you'd prefer a different format (or an issue/discussion instead of a README table), happy to adapt — there is currently no documented way to report results on OSWorld-G.
Model: duvoai/duvo-eye-1 (public weights) — Holo-3.1-35B-A3B (3B active) with a LoRA adapter (rank 64, alpha 128, 1 epoch) trained on duvoai/SynthUI (14.9k synthetic enterprise-UI rows). No benchmark data in training.
Results on benchmark/OSWorld-G.json
Capability breakdown (joined with benchmark/classification_result.json + benchmark/buckets.json, refusal rows excluded): Text Matching 85.0 (204/240), Element Recognition 84.3 (258/306), Layout Understanding 81.6 (195/239), Fine-grained Manipulation 68.9 (91/132).
Base model (Holo-3.1-35B-A3B) under the identical harness: 64.9 on the standard subset; among answered samples only, base 75.4 vs duvo-eye-1 78.0. duvo-eye-1 had 0.0% unparseable outputs.
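For anyone checking the arithmetic, the reported percentages can be recomputed from the counts quoted in this thread. The sketch below is not part of the evaluation harness. It assumes the standard subset is the 564 samples minus the 54 refusal items (510, a figure inferred here rather than stated) and that the same 398 hits apply to both denominators. The answered-fraction estimate at the end is likewise only implied by the two rounded base-model numbers.

```python
def pct(hits: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * hits / total, 1)

FULL_SET = 564
REFUSAL_ITEMS = 54
HITS = 398
# Assumption: standard subset = full set minus refusal items (inferred, not stated).
STANDARD_SUBSET = FULL_SET - REFUSAL_ITEMS

headline = {
    "standard subset": pct(HITS, STANDARD_SUBSET),
    "full set": pct(HITS, FULL_SET),
}

# (hits, total) pairs as quoted in the capability breakdown.
breakdown_counts = {
    "Text Matching": (204, 240),
    "Element Recognition": (258, 306),
    "Layout Understanding": (195, 239),
    "Fine-grained Manipulation": (91, 132),
}
breakdown = {name: pct(h, n) for name, (h, n) in breakdown_counts.items()}

# Base model: 64.9 overall vs 75.4 among answered samples implies the share
# of samples it answered (approximate, since both inputs are rounded).
base_answered_fraction = round(64.9 / 75.4, 3)

if __name__ == "__main__":
    print(headline)
    print(breakdown)
    print(base_answered_fraction)
```

The category denominators sum to more than 510, so a sample can evidently belong to more than one capability bucket; the breakdown is not a partition of the standard subset.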
Protocol (honest notes)
chat_template_kwargs: {"enable_thinking": false}), guided JSON decoding into {"x": int, "y": int} in [0, 1000], scaled to absolute pixels via original image size; scored as a hit iff the point falls inside the ground-truth region. Unparseable output = miss.
bench_eval.py, included in the predictions dataset), not this repo's evaluation/ scripts.
osworld-g/).
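The scoring rule described in these notes can be sketched as follows. This is an illustration only, not the contents of bench_eval.py: the function names are hypothetical, and the ground-truth region is assumed to be an axis-aligned box in pixel coordinates, whereas the benchmark's actual regions may have other shapes.

```python
import json
from typing import Optional, Tuple

# Assumed region format: (x_min, y_min, x_max, y_max) in absolute pixels.
Box = Tuple[float, float, float, float]

def parse_point(raw: str) -> Optional[Tuple[int, int]]:
    """Parse model output shaped like {"x": int, "y": int}; None if unparseable."""
    try:
        obj = json.loads(raw)
        x, y = obj["x"], obj["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    for value in (x, y):
        if isinstance(value, bool) or not isinstance(value, int):
            return None
    return x, y

def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[float, float]:
    """Scale a point from the [0, 1000] grid to absolute pixels."""
    x, y = point
    return x / 1000 * width, y / 1000 * height

def is_hit(raw: str, width: int, height: int, box: Box) -> bool:
    """Hit iff the scaled point lies inside the box; unparseable output is a miss."""
    point = parse_point(raw)
    if point is None:
        return False
    px, py = to_pixels(point, width, height)
    x_min, y_min, x_max, y_max = box
    return x_min <= px <= x_max and y_min <= py <= y_max

if __name__ == "__main__":
    box = (900, 500, 1000, 600)
    print(is_hit('{"x": 500, "y": 500}', 1920, 1080, box))
    print(is_hit("not json", 1920, 1080, box))
```

Counting unparseable output as a miss keeps the denominator fixed, which is why the "answered samples only" comparison is reported separately.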