feat(skill-creator): add --eval-model option for separate eval and improvement models#851

Open
heshaofu2 wants to merge 1 commit into anthropics:main from heshaofu2:feat/skill-creator-eval-model-option
@heshaofu2

Summary

Adds an optional --eval-model CLI argument to run_loop.py, allowing users to specify a different model for the eval trigger-testing phase while --model remains the primary model for description improvement.

Motivation

The run_loop.py script has two distinct phases with very different cost/quality profiles:

| Phase | Calls | Concurrency | Priority |
|---|---|---|---|
| Eval (trigger testing) | Many (num_workers × runs_per_query × eval_set_size) | High (default 10 workers) | Speed & cost |
| Improve (description rewrite) | 1 per iteration | None | Quality |

Currently both phases are forced to use the same --model. In practice, users often want a lighter/cheaper model for the high-volume eval phase and a stronger model for the single-call description improvement. For example:

```shell
# Before: forced to pick one model for both phases
python3 -m scripts.run_loop --model claude-opus-4-6 ...   # expensive for eval
python3 -m scripts.run_loop --model claude-haiku-4-5 ...  # weak for improvement

# After: best of both worlds
python3 -m scripts.run_loop \
  --model claude-opus-4-6 \
  --eval-model claude-haiku-4-5 \
  ...
# --model:      strong model for description improvement
# --eval-model: fast/cheap model for trigger testing
```

Changes

4 minimal changes in skills/skill-creator/scripts/run_loop.py:

  1. run_loop() function: Added eval_model: str | None = None parameter
  2. run_eval() call: Changed model=model to model=eval_model or model (falls back to --model when --eval-model is not set)
  3. improve_description() call: Unchanged — always uses --model (the primary model)
  4. CLI: Added --eval-model argument; updated --model help text from "Model for improvement" to "Model for all stages (improvement and eval)"

Backward Compatibility

Fully backward compatible. Without --eval-model, behavior is identical to the current version — both phases use --model.

Testing

Tested locally with:

```shell
# Without --eval-model (backward compat): works identically to before
python3 -m scripts.run_loop --model claude-sonnet-4-6 \
  --eval-set evals/eval_set.json --skill-path . --verbose

# With --eval-model: eval uses haiku, improvement uses opus
python3 -m scripts.run_loop --model claude-opus-4-6 --eval-model claude-haiku-4-5 \
  --eval-set evals/eval_set.json --skill-path . --verbose
```

Both modes produce correct results. The --eval-model variant completes eval phases significantly faster while maintaining improvement quality.

Allow using a separate model for eval trigger testing while keeping
--model as the primary model for description improvement. This enables
cost-effective workflows where a lighter model handles the high-volume
parallel eval phase and a stronger model handles the single-call
description rewrite phase.