SkillRevise

Official release for SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision.

SkillRevise improves cold-start agent skills by treating a skill as an execution-grounded artifact. Starting from an initial LLM-authored skill, it executes the task, diagnoses verifier-facing failures, retrieves reusable repair principles, revises the skill with execution anchors, re-executes the candidate, and returns the best observed skill under a utility gate.
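
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: `Outcome`, `revise_skill`, and the four callables are hypothetical names, and continuing each round from the latest candidate (rather than from the best skill so far) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Outcome:
    """Result of executing a task with one skill (hypothetical shape)."""
    utility: float  # verifier-measured score, higher is better
    trace: str      # execution trace plus verifier feedback


def revise_skill(
    initial_skill: str,
    execute: Callable[[str], Outcome],
    diagnose: Callable[[str, Outcome], str],
    retrieve: Callable[[str], List[str]],
    revise: Callable[[str, str, List[str], Outcome], str],
    max_revisions: int = 3,
) -> Tuple[str, float]:
    """Return the best observed skill and its utility within the budget."""
    current = initial_skill
    outcome = execute(current)
    best_skill, best_utility = current, outcome.utility

    for _ in range(max_revisions):
        diagnosis = diagnose(current, outcome)   # task-specific failure diagnosis
        principles = retrieve(diagnosis)         # reusable repair principles
        candidate = revise(current, diagnosis, principles, outcome)
        candidate_outcome = execute(candidate)   # re-execute the candidate

        # Utility gate: a candidate replaces the incumbent only if it
        # measurably improves on the best utility seen so far.
        if candidate_outcome.utility > best_utility:
            best_skill, best_utility = candidate, candidate_outcome.utility

        # Assumption: the next round revises the latest candidate.
        current, outcome = candidate, candidate_outcome

    return best_skill, best_utility
```

Because the gate compares against the best utility seen so far, a revision that makes things worse can never be returned; at worst the caller gets the initial skill back.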

Highlights

Trace-conditioned skill revision. SkillRevise revises imperfect skills from execution traces and verifier feedback instead of relying only on expert authoring, retrieval, or one-shot skill generation.
Three-part method. The framework combines task-specific Diagnosis, reusable Principle Memory, and an anchored Revision Operator.
Utility-gated selection. Each candidate skill is re-executed and selected by measured utility, so the returned skill is the best observed version within the revision budget.
Unified benchmark interface. The release includes final SkillsBench-style bundles for SkillsBench, SkillLearnBench-Random, and SWE-Skills-Bench-Hard.

Main Results

The paper evaluates SkillRevise across three verifier-driven benchmarks and five executors. The table below shows one representative executor per benchmark from the main results, alongside the no-skill and one-shot Skill-Creator baselines for the same executor. Each cell reports tasks solved out of the benchmark's total:

Benchmark	Executor	No skill	Skill-Creator	SkillRevise v3
SkillsBench	GPT-5.5	31/86	34/86	53/86
SkillLearnBench-Random	Opus-4.7	7/50	7/50	25/50
SWE-Skills-Bench-Hard	Qwen-3.6-Plus	22/70	24/70	35/70
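
As a quick arithmetic check, the counts in the table can be converted to solve rates. The snippet below is only a convenience sketch: it reads each cell as tasks solved out of the total (which matches the bundle task counts), and `RESULTS` and `solve_rates` are illustrative names, not part of the package.

```python
# Counts copied from the table: (no skill, Skill-Creator, SkillRevise, total tasks).
RESULTS = {
    ("SkillsBench", "GPT-5.5"): (31, 34, 53, 86),
    ("SkillLearnBench-Random", "Opus-4.7"): (7, 7, 25, 50),
    ("SWE-Skills-Bench-Hard", "Qwen-3.6-Plus"): (22, 24, 35, 70),
}


def solve_rates(counts):
    """Convert (solved..., total) counts to percentages with one decimal."""
    *solved, total = counts
    return [round(100 * s / total, 1) for s in solved]


for (bench, executor), counts in RESULTS.items():
    no_skill, creator, revised = solve_rates(counts)
    gain = counts[2] - counts[1]  # extra tasks solved versus Skill-Creator
    print(
        f"{bench} ({executor}): {no_skill}% -> {creator}% -> {revised}% "
        f"(+{gain} tasks over Skill-Creator)"
    )
```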

The revised skills also transfer across models: skills produced with GPT-5.5 and then held fixed improve several target executors on the 57-task GPT-5.5 source-success subset, although executor-specific revision remains strongest in most cases.

Repository Contents

skillrevise/: Python package and CLI.
- core/: task/result models, execution loop, runners, metrics, artifacts, and reporting.
- method/: skill authoring, diagnosis, revision, skill parsing, and principle memory.
- benchmarks/: SkillsBench loading, SkillLearnBench and SWE-Skills-Bench conversion, ALFWorld loading, verifier helpers, and task selection.
- llm/: command-based LLM client and provider wrapper used by skillrevise-llm.
data/: final SkillsBench-style benchmark bundles used by the main experiments.
scripts/: benchmark export, task selection, and skill-audit helpers.
docs/: method, benchmark, and running documentation.
tests/: unit tests for the public package and scripts.

Quick Start

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[dev,benchmarks]"
pytest

Run a one-task baseline smoke check:

skillrevise data/skillsbench/skillsbench_tasks.json \
  --manifest-kind skillsbench \
  --workspace-root . \
  --baseline-only \
  --limit 1 \
  --output runs/baseline_smoke.json

Run an LLM-backed SkillRevise episode:

skillrevise data/skillsbench/skillsbench_tasks.json \
  --manifest-kind skillsbench \
  --workspace-root . \
  --limit 1 \
  --author-mode llm-principle-bank \
  --diagnosis-mode llm \
  --revision-mode llm-principle-bank \
  --llm-command "python -m skillrevise.llm.command" \
  --output runs/llm_run.json \
  --summary-output runs/llm_summary.json \
  --max-revisions 3

The bundled skillrevise-llm command reads prompts from stdin and writes completions to stdout. Configure provider credentials with environment variables such as:

export SKILL_REVISE_REVISION_LLM_PROVIDER=openai
export SKILL_REVISE_REVISION_LLM_MODEL="<model-name>"
export SKILL_REVISE_REVISION_LLM_API_KEY="<api-key>"
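
Since the command contract is prompt on stdin and completion on stdout, a custom program can in principle stand in for the bundled one. The sketch below is a hypothetical stub, not the released `skillrevise.llm.command`: `complete` and `run` are illustrative names, and it assumes the whole prompt arrives in a single stdin read and that only the model variable is consulted.

```python
import io
import os
import sys
from typing import Mapping, TextIO


def complete(prompt: str, model: str) -> str:
    """Placeholder completion; a real command would call a provider here."""
    return f"[{model}] received {len(prompt)} prompt characters"


def run(stdin: TextIO, stdout: TextIO, env: Mapping[str, str]) -> int:
    """Read one prompt from stdin, write one completion to stdout."""
    prompt = stdin.read()
    model = env.get("SKILL_REVISE_REVISION_LLM_MODEL", "unset-model")
    stdout.write(complete(prompt, model))
    return 0


# A script entry point would call: sys.exit(run(sys.stdin, sys.stdout, os.environ))
# Local check with in-memory streams instead of a real pipe:
buffer = io.StringIO()
run(io.StringIO("Revise this skill."), buffer, {"SKILL_REVISE_REVISION_LLM_MODEL": "demo"})
print(buffer.getvalue())
```

Keeping the streams and environment as parameters makes the stub easy to test without spawning a subprocess.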

Benchmark Bundles

All three main benchmark bundles are already materialized in a SkillsBench-style layout and can be loaded with --manifest-kind skillsbench:

Bundle	Tasks	Manifest
`data/skillsbench/`	86	`data/skillsbench/skillsbench_tasks.json`
`data/skilllearnbench/`	50	`data/skilllearnbench/skillsbench_tasks.json`
`data/swe-skills-bench/`	70	`data/swe-skills-bench/skillsbench_tasks.json`

See docs/benchmarks.md for bundle structure and loading details.
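
Before a long run it can be useful to confirm that the three manifests are present in the checkout. The helper below is a small sketch that only checks file paths taken from the table; `BUNDLES`, `manifest_paths`, and `missing_manifests` are illustrative names, and nothing is assumed about the manifest JSON schema.

```python
from pathlib import Path

# Bundle directories and task counts copied from the table above.
BUNDLES = {
    "data/skillsbench": 86,
    "data/skilllearnbench": 50,
    "data/swe-skills-bench": 70,
}


def manifest_paths(repo_root="."):
    """Map each bundle directory to its expected manifest path."""
    root = Path(repo_root)
    return {name: root / name / "skillsbench_tasks.json" for name in BUNDLES}


def missing_manifests(repo_root="."):
    """Return the bundle directories whose manifest file is not on disk."""
    return [
        name
        for name, path in manifest_paths(repo_root).items()
        if not path.is_file()
    ]


missing = missing_manifests(".")
if missing:
    print("Missing manifests:", ", ".join(missing))
else:
    print(f"All {len(BUNDLES)} manifests found ({sum(BUNDLES.values())} tasks listed).")
```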

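Documentation

The docs/ directory contains the method, benchmark, and running documentation; docs/benchmarks.md covers bundle structure and loading.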

Citation

If you use SkillRevise in research, please cite the paper:

@misc{liu2026skillrevise,
  title = {SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision},
  author = {Liu, Yuxuan and Su, Zhaochen and Xie, Lingyun and Zhang, Yuhao and Zong, Qing and Guo, Jiahe and Xie, Zhongwei and Ji, Yiyan and Yim, Yauwai and Luo, Hongyu and Ren, Xiyu and Ruan, Chenyu and Li, Haoran and Song, Yangqiu},
  year = {2026},
  eprint = {2606.01139},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2606.01139}
}
