Skip to content

Raidriar7170/voice2task-posttraining

Repository files navigation

Voice2Task Post-Training

中文 | English

Python Status Model Scope

Voice2Task 是一个中文 spoken command / ASR transcript 到浏览器任务合约的 post-training 项目。 任务不是控制浏览器,而是把用户口语命令转换成严格的 Browser Task Contract JSON,供后续浏览器 agent 决定搜索、打开 URL、 填写表单、抽取页面信息、澄清或拒绝高风险动作。

Recruiter Summary

Area What this project built
Problem 中文语音/ASR 浏览器命令 -> schema-valid browser task contract JSON
Model pipeline Qwen2.5-7B-Instruct + LoRA SFT;adapter 私有,不随仓库发布
Data/training 247 seeds / 696 SFT rows / 2,100 preference pairs;final SFT 只使用既有训练数据
Prompt/eval hardening 统一 gold-free prompt policy unified_gold_free_v1;严格 JSON parse -> strict schema -> semantic contract -> exact match 分层验证
Frozen evaluation 120-row lockbox-v1,120 semantic families,manifest frozen,one-look final evaluation
Result boundary final SFT 没有提升 strict contract exact match;no overall model improvement claim

Final Lockbox v1 Result

Frozen protocol: lockbox_hash=06114cf3ad6029930284af5f2245fb2c4a8174fd35c6a1107f4c73482b555b33, prompt policy unified_gold_free_v1, greedy decoding, schema guard + one schema retry, strict evaluator, two pre-registered arms only.

Metric Base Qwen2.5-7B Final SFT adapter Delta
contract_exact_match 0.0167 0.0083 -0.0083
semantic_contract_valid_rate 0.8250 0.8667 +0.0417
task_type_accuracy 0.7917 0.8583 +0.0667
route_accuracy 0.8000 0.8583 +0.0583
confirmation_accuracy 0.7083 0.7917 +0.0833
strict_schema_valid_rate 1.0000 0.9833 -0.0167
slot_f1 0.0417 0.0500 +0.0083
slot_f1_soft 0.3783 0.3867 +0.0084

Interpretation:

  • Final SFT did not improve strict contract exact match on the frozen lockbox.
  • Final SFT did improve several semantic/channel metrics: semantic_contract_valid_rate +0.0417, task_type_accuracy +0.0667, route_accuracy +0.0583, confirmation_accuracy +0.0833.
  • This is aggregate-only one-look evidence. Public reports do not include row-level failure analysis.

Evidence links:

Explicit Non-Claims

This repository does not claim:

  • overall model improvement from final SFT (no overall model improvement claim);
  • production readiness;
  • safety readiness;
  • executable browser quality;
  • DPO success;
  • adapter/checkpoint release;
  • live-browser benchmark improvement.

The strongest supported claim is narrower: under a frozen 120-row lockbox and a gold-free strict evaluator, final SFT improved several semantic/channel aggregate metrics but reduced strict full-contract exact match.

Repository Role

This repo is This repo is not
A speech/ASR-to-contract post-training evidence repository A generic chat fine-tuning project
A strict JSON contract generation and evaluation pipeline A GUI action policy or browser controller
A public-safe SFT/DPO data, training, prediction, and evaluation workflow A checkpoint or adapter release
A place where negative, blocked, and superseded evidence stays auditable A success story built by deleting inconvenient results

Method Overview

  1. Build public-safe Voice2Task data from seed traces into SFT and preference rows.
  2. Render Qwen chat prompts with no gold contract in prediction prompts.
  3. Train LoRA SFT adapters on existing training data only.
  4. Decode greedily with max_new_tokens=256, schema guard enabled, and at most one schema retry.
  5. Score with strict layered metrics: JSON parse, strict schema validity, semantic contract validity, exact match, slot-level metrics, route/task/confirmation/safety metrics.
  6. Freeze lockbox rows and manifest before the final one-look evaluation.

Quick Start

Install local tooling:

python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev,dataset]'

Rebuild and validate the committed public sample:

PYTHONPATH=src python -m voice2task.cli.data build-public \
  --seed data/public-samples/seed_traces.jsonl \
  --output data/public-samples

PYTHONPATH=src python -m voice2task.cli.data validate \
  --sft data/public-samples/sft_public_sample.jsonl \
  --dpo data/public-samples/dpo_public_sample.jsonl \
  --manifest data/public-samples/manifest_public_sample.json \
  --public

Run local baselines and metrics:

PYTHONPATH=src python -m voice2task.cli.eval baseline \
  --gold data/public-samples/sft_public_sample.jsonl \
  --output reports/public-sample/rule_baseline_predictions.jsonl

PYTHONPATH=src python -m voice2task.cli.eval metrics \
  --gold data/public-samples/sft_public_sample.jsonl \
  --predictions reports/public-sample/rule_baseline_predictions.jsonl \
  --output reports/public-sample

Dry-run training metadata export remains available, but real heavy training is gated by explicit config:

PYTHONPATH=src python -m voice2task.cli.train sft \
  --config configs/sft-dev.json \
  --manifest data/public-samples/manifest_public_sample.json \
  --output-dir reports/public-sample/sft-dry-run \
  --dry-run

PYTHONPATH=src python -m voice2task.cli.train dpo \
  --config configs/dpo-dev.json \
  --manifest data/public-samples/manifest_public_sample.json \
  --output-dir reports/public-sample/dpo-dry-run \
  --dry-run

Metric Interpretation Boundaries

contract_exact_match is a hard full-contract exact-match metric. normalized_command string-mismatch diagnostics are explanatory row-level evidence only: they do not relax, normalize, semantically score, repair, replace, or re-score predictions, and they do not automatically mark Chinese phrase differences such as 搜索/查询 or 明天的天气/明天天气 as equivalent.

normalized_command gold targets are canonical Chinese intent phrases, not verbatim transcripts or ASR text. This is target-writing guidance for SFT/DPO data and prompts, not evaluator-side normalization, semantic-equivalence scoring, prediction repair, or re-scoring.

Evidence Archive

Longer-running internal evidence remains documented below the headline result:

  • Contract V2 projection: PARTIAL_SCHEMA_BENEFIT; derived-field-only strict failures are 14.65%, normalized-command-only strict failures are 14.65%, and core slot failures remain 68.79% of V1 strict failures. This is useful schema-burden evidence, not model-quality evidence.
  • Copy-backed verification and shadow mode: observe-only provenance/interface evidence, not runtime enforcement.
  • Copy-shadow template-disjoint challenge v1: adversarial verifier fixture, not a naturalistic language benchmark.
  • Earlier step-matched SFT ablations: mixed/inconclusive; no stable broad canonical-slot benefit.

See current status and public evidence index for the complete archived map.

A100 Boundary

GPU-heavy training and prediction are designed for a private A100 development machine. Public repo artifacts intentionally omit checkpoints, LoRA adapters, raw logs, remote caches, private corpus rows, hostnames, SSH details, credentials, private paths, private override configs, and production-readiness claims.

Validation

Useful local checks:

PYTHONPATH=src pytest -q
PYTHONPATH=src ruff check src tests
OPENSPEC_TELEMETRY=0 openspec validate --all --strict
PYTHONPATH=src python scripts/check_current_truth_surface.py
git diff --check

License

本项目采用 MIT License

About

Evidence-first post-training and evaluation for Chinese ASR commands to schema-valid browser task contracts.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages