Voice2Task Post-Training

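Voice2Task is a post-training project that maps Chinese spoken commands / ASR transcripts to browser task contracts. The task is not to control the browser. It is to convert a user's spoken command into a strict Browser Task Contract JSON, which a downstream browser agent uses to decide whether to search, open a URL, fill in a form, extract page information, ask for clarification, or refuse a high-risk action.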

Recruiter Summary

Area	What this project built
Problem	Chinese speech/ASR browser commands -> schema-valid browser task contract JSON
Model pipeline	Qwen2.5-7B-Instruct + LoRA SFT; the adapter is private and is not released with the repository
Data/training	247 seeds / 696 SFT rows / 2,100 preference pairs; final SFT uses existing training data only
Prompt/eval hardening	Unified gold-free prompt policy `unified_gold_free_v1`; layered validation: strict JSON parse -> strict schema -> semantic contract -> exact match
Frozen evaluation	120-row `lockbox-v1`, 120 semantic families, manifest frozen, one-look final evaluation
Result boundary	Final SFT did not improve strict contract exact match; `no overall model improvement claim`

Final Lockbox v1 Result

Frozen protocol: lockbox_hash=06114cf3ad6029930284af5f2245fb2c4a8174fd35c6a1107f4c73482b555b33, prompt policy unified_gold_free_v1, greedy decoding, schema guard + one schema retry, strict evaluator, two pre-registered arms only.
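A frozen-lockbox check can be sketched as a file digest comparison. This is a minimal sketch under stated assumptions: the README does not say how `lockbox_hash` is computed, so SHA-256 over the raw file bytes is assumed here (the 64 hex digits are consistent with that, but do not prove it), and the helper names `sha256_of_file` and `lockbox_is_frozen` are hypothetical, not part of the repository.

```python
import hashlib
from pathlib import Path

# Value copied from the frozen protocol line above.
EXPECTED_LOCKBOX_HASH = (
    "06114cf3ad6029930284af5f2245fb2c4a8174fd35c6a1107f4c73482b555b33"
)


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lockbox_is_frozen(path: Path, expected: str = EXPECTED_LOCKBOX_HASH) -> bool:
    """True when the file digest equals the pre-registered hash.

    Assumption: the hash covers the raw bytes of a single lockbox file.
    The real project may hash a manifest or a canonicalized form instead.
    """
    return sha256_of_file(path) == expected
```

An evaluation script could call `lockbox_is_frozen(...)` on its local lockbox path before scoring and abort on `False`, so that any edit to the rows after freezing is caught before the one-look run.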

Metric	Base Qwen2.5-7B	Final SFT adapter	Delta
`contract_exact_match`	0.0167	0.0083	-0.0083
`semantic_contract_valid_rate`	0.8250	0.8667	+0.0417
`task_type_accuracy`	0.7917	0.8583	+0.0667
`route_accuracy`	0.8000	0.8583	+0.0583
`confirmation_accuracy`	0.7083	0.7917	+0.0833
`strict_schema_valid_rate`	1.0000	0.9833	-0.0167
`slot_f1`	0.0417	0.0500	+0.0083
`slot_f1_soft`	0.3783	0.3867	+0.0084

Interpretation:

Final SFT did not improve strict contract exact match on the frozen lockbox.
Final SFT did improve several semantic/channel metrics: semantic_contract_valid_rate +0.0417, task_type_accuracy +0.0667, route_accuracy +0.0583, confirmation_accuracy +0.0833.
This is aggregate-only one-look evidence. Public reports do not include row-level failure analysis.
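The deltas in the table can be cross-checked against the 120-row lockbox size. The sketch below is illustrative only: the per-metric row counts are inferred from the published rates (for example, 0.0167 and 0.0083 are consistent with 2 and 1 matching rows out of 120) and are not published by the project. The function names are hypothetical.

```python
LOCKBOX_ROWS = 120  # lockbox-v1 size stated in this README


def rate_from_rows(matching_rows: int, total_rows: int = LOCKBOX_ROWS) -> float:
    """Aggregate rate rounded to four decimals, as in the table."""
    return round(matching_rows / total_rows, 4)


def delta_from_rows(base_rows: int, sft_rows: int,
                    total_rows: int = LOCKBOX_ROWS) -> float:
    """Delta computed from unrounded counts, then rounded once."""
    return round((sft_rows - base_rows) / total_rows, 4)


# Assumption: (base, final SFT) row counts inferred from the reported rates.
inferred_rows = {
    "contract_exact_match": (2, 1),
    "semantic_contract_valid_rate": (99, 104),
    "task_type_accuracy": (95, 103),
    "route_accuracy": (96, 103),
    "confirmation_accuracy": (85, 95),
    "strict_schema_valid_rate": (120, 118),
}

for metric, (base_rows, sft_rows) in inferred_rows.items():
    print(
        f"{metric}: base={rate_from_rows(base_rows):.4f} "
        f"sft={rate_from_rows(sft_rows):.4f} "
        f"delta={delta_from_rows(base_rows, sft_rows):+.4f}"
    )
```

Computing the delta before rounding explains why `contract_exact_match` shows -0.0083 rather than the -0.0084 obtained by subtracting the two rounded rates. Under these inferred counts, the exact-match difference is a single row, which is why the result is read as aggregate-only evidence.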

Evidence links:

Explicit Non-Claims

This repository does not claim:

overall model improvement from final SFT (no overall model improvement claim);
production readiness;
safety readiness;
executable browser quality;
DPO success;
adapter/checkpoint release;
live-browser benchmark improvement.

The strongest supported claim is narrower: under a frozen 120-row lockbox and a gold-free strict evaluator, final SFT improved several semantic/channel aggregate metrics but reduced strict full-contract exact match.

Repository Role

This repo is	This repo is not
A speech/ASR-to-contract post-training evidence repository	A generic chat fine-tuning project
A strict JSON contract generation and evaluation pipeline	A GUI action policy or browser controller
A public-safe SFT/DPO data, training, prediction, and evaluation workflow	A checkpoint or adapter release
A place where negative, blocked, and superseded evidence stays auditable	A success story built by deleting inconvenient results

Method Overview

Build public-safe Voice2Task data from seed traces into SFT and preference rows.
Render Qwen chat prompts with no gold contract in prediction prompts.
Train LoRA SFT adapters on existing training data only.
Decode greedily with max_new_tokens=256, schema guard enabled, and at most one schema retry.
Score with strict layered metrics: JSON parse, strict schema validity, semantic contract validity, exact match, slot-level metrics, route/task/confirmation/safety metrics.
Freeze lockbox rows and manifest before the final one-look evaluation.
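The decoding step (schema guard plus at most one schema retry) can be sketched as follows. This is a hedged illustration, not the repository's implementation: the `generate` callable stands in for the model call, `REQUIRED_KEYS` is a hypothetical stand-in for the real schema, and the retry hint text is invented. Because decoding is greedy, the sketch assumes the retry changes the prompt, since repeating an identical prompt would reproduce the same output.

```python
import json
from typing import Callable, Tuple

# Hypothetical minimal field set; the real contract schema is not shown here.
REQUIRED_KEYS = {"task_type", "route", "normalized_command", "slots"}

# Invented hint appended on retry (assumption, not the project's wording).
RETRY_HINT = "\nReturn exactly one JSON object that matches the schema."


def passes_schema_guard(raw_text: str) -> bool:
    """Cheap guard: output must be a JSON object containing the required keys."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload)


def decode_with_schema_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_schema_retries: int = 1,
) -> Tuple[str, int]:
    """Generate once; regenerate with a hint while the guard fails.

    Returns the last raw output and the number of retries used. The output
    is NOT repaired: if it still fails, the strict evaluator scores it as is.
    """
    raw_text = generate(prompt)
    retries_used = 0
    while not passes_schema_guard(raw_text) and retries_used < max_schema_retries:
        retries_used += 1
        raw_text = generate(prompt + RETRY_HINT)
    return raw_text, retries_used
```

Returning the unrepaired output keeps the guard from masking failures, which is consistent with a `strict_schema_valid_rate` below 1.0 remaining visible in the final table.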

Quick Start

Install local tooling:

python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev,dataset]'

Rebuild and validate the committed public sample:

PYTHONPATH=src python -m voice2task.cli.data build-public \
  --seed data/public-samples/seed_traces.jsonl \
  --output data/public-samples

PYTHONPATH=src python -m voice2task.cli.data validate \
  --sft data/public-samples/sft_public_sample.jsonl \
  --dpo data/public-samples/dpo_public_sample.jsonl \
  --manifest data/public-samples/manifest_public_sample.json \
  --public

Run local baselines and metrics:

PYTHONPATH=src python -m voice2task.cli.eval baseline \
  --gold data/public-samples/sft_public_sample.jsonl \
  --output reports/public-sample/rule_baseline_predictions.jsonl

PYTHONPATH=src python -m voice2task.cli.eval metrics \
  --gold data/public-samples/sft_public_sample.jsonl \
  --predictions reports/public-sample/rule_baseline_predictions.jsonl \
  --output reports/public-sample

Dry-run training metadata export remains available, but real heavy training is gated by explicit config:

PYTHONPATH=src python -m voice2task.cli.train sft \
  --config configs/sft-dev.json \
  --manifest data/public-samples/manifest_public_sample.json \
  --output-dir reports/public-sample/sft-dry-run \
  --dry-run

PYTHONPATH=src python -m voice2task.cli.train dpo \
  --config configs/dpo-dev.json \
  --manifest data/public-samples/manifest_public_sample.json \
  --output-dir reports/public-sample/dpo-dry-run \
  --dry-run

Metric Interpretation Boundaries

contract_exact_match is a hard full-contract exact-match metric. normalized_command string-mismatch diagnostics are explanatory row-level evidence only: they do not relax, normalize, semantically score, repair, replace, or re-score predictions, and they do not automatically mark Chinese phrase differences such as 搜索/查询 or 明天的天气/明天天气 as equivalent.

normalized_command gold targets are canonical Chinese intent phrases, not verbatim transcripts or ASR text. This is target-writing guidance for SFT/DPO data and prompts, not evaluator-side normalization, semantic-equivalence scoring, prediction repair, or re-scoring.
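The boundary described above can be made concrete with a small layered scorer. This is a minimal sketch, assuming each layer gates the next and assuming a hypothetical four-field contract; `REQUIRED_FIELDS`, `KNOWN_TASK_TYPES`, and the example contracts are illustrative and are not the project's actual schema or data.

```python
import json
from typing import Any, Dict

# Hypothetical contract fields and task types, for illustration only.
REQUIRED_FIELDS = {
    "task_type": str,
    "route": str,
    "normalized_command": str,
    "slots": dict,
}
KNOWN_TASK_TYPES = {"search", "open_url", "fill_form", "extract", "clarify", "refuse"}


def score_layers(raw_prediction: str, gold: Dict[str, Any]) -> Dict[str, bool]:
    """Score one prediction layer by layer; a failed layer stops the rest."""
    layers = {
        "json_parse": False,
        "strict_schema": False,
        "semantic_contract": False,
        "contract_exact_match": False,
    }
    try:
        predicted = json.loads(raw_prediction)
    except json.JSONDecodeError:
        return layers
    layers["json_parse"] = True

    # Strict schema: exactly the expected keys, each with the expected type.
    if not isinstance(predicted, dict) or set(predicted) != set(REQUIRED_FIELDS):
        return layers
    if not all(isinstance(predicted[k], t) for k, t in REQUIRED_FIELDS.items()):
        return layers
    layers["strict_schema"] = True

    # Semantic validity (simplified): the task type must be a known value.
    if predicted["task_type"] not in KNOWN_TASK_TYPES:
        return layers
    layers["semantic_contract"] = True

    # Hard exact match: plain equality, no normalization or synonym handling.
    layers["contract_exact_match"] = predicted == gold
    return layers


gold_contract = {
    "task_type": "search",
    "route": "web_search",
    "normalized_command": "搜索明天天气",
    "slots": {"query": "明天天气"},
}
near_miss = dict(gold_contract, normalized_command="查询明天的天气")

print(score_layers(json.dumps(near_miss, ensure_ascii=False), gold_contract))
print(score_layers(json.dumps(gold_contract, ensure_ascii=False), gold_contract))
```

In this sketch the near-miss prediction passes the parse, schema, and semantic layers but fails `contract_exact_match`, because 搜索/查询 and 明天天气/明天的天气 are compared as different strings. That is the behavior the section describes: diagnostics may explain such a mismatch, but they do not change the score.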

Evidence Archive

Longer-running internal evidence remains documented below the headline result:

Contract V2 projection: PARTIAL_SCHEMA_BENEFIT; derived-field-only strict failures are 14.65%, normalized-command-only strict failures are 14.65%, and core slot failures remain 68.79% of V1 strict failures. This is useful schema-burden evidence, not model-quality evidence.
Copy-backed verification and shadow mode: observe-only provenance/interface evidence, not runtime enforcement.
Copy-shadow template-disjoint challenge v1: adversarial verifier fixture, not a naturalistic language benchmark.
Earlier step-matched SFT ablations: mixed/inconclusive; no stable broad canonical-slot benefit.

See current status and public evidence index for the complete archived map.

A100 Boundary

GPU-heavy training and prediction are designed for a private A100 development machine. Public repo artifacts intentionally omit checkpoints, LoRA adapters, raw logs, remote caches, private corpus rows, hostnames, SSH details, credentials, private paths, private override configs, and production-readiness claims.

Validation

Useful local checks:

PYTHONPATH=src pytest -q
PYTHONPATH=src ruff check src tests
OPENSPEC_TELEMETRY=0 openspec validate --all --strict
PYTHONPATH=src python scripts/check_current_truth_surface.py
git diff --check

License

This project is licensed under the MIT License.
