feat(skill-creator): add --eval-model option for separate eval and improvement models#851

Open
heshaofu2 wants to merge 1 commit into anthropics:main from heshaofu2:feat/skill-creator-eval-model-option
@heshaofu2

Summary

Adds an optional --eval-model CLI argument to run_loop.py, allowing users to specify a different model for the eval trigger-testing phase while --model remains the primary model for description improvement.

Motivation

The run_loop.py script has two distinct phases with very different cost/quality profiles:

| Phase | Calls | Concurrency | Priority |
|---|---|---|---|
| Eval (trigger testing) | Many (num_workers × runs_per_query × eval_set_size) | High (default 10 workers) | Speed & cost |
| Improve (description rewrite) | 1 per iteration | None | Quality |

Currently both phases are forced to use the same --model. In practice, users often want a lighter/cheaper model for the high-volume eval phase and a stronger model for the single-call description improvement. For example:

```shell
# Before: forced to pick one model for both phases
python3 -m scripts.run_loop --model claude-opus-4-6 ...   # expensive for eval
python3 -m scripts.run_loop --model claude-haiku-4-5 ...  # weak for improvement

# After: best of both worlds
python3 -m scripts.run_loop \
  --model claude-opus-4-6 \
  --eval-model claude-haiku-4-5 \
  ...
# --model:      strong model for description improvement
# --eval-model: fast/cheap model for trigger testing
```

Changes

4 minimal changes in skills/skill-creator/scripts/run_loop.py:

  1. run_loop() function: Added eval_model: str | None = None parameter
  2. run_eval() call: Changed model=model to model=eval_model or model (falls back to --model when --eval-model is not set)
  3. improve_description() call: Unchanged — always uses --model (the primary model)
  4. CLI: Added --eval-model argument; updated --model help text from "Model for improvement" to "Model for all stages (improvement and eval)"

Backward Compatibility

Fully backward compatible. Without --eval-model, behavior is identical to the current version — both phases use --model.

Testing

Tested locally with:

```shell
# Without --eval-model (backward compat): works identically to before
python3 -m scripts.run_loop --model claude-sonnet-4-6 \
  --eval-set evals/eval_set.json --skill-path . --verbose

# With --eval-model: eval uses haiku, improvement uses opus
python3 -m scripts.run_loop --model claude-opus-4-6 --eval-model claude-haiku-4-5 \
  --eval-set evals/eval_set.json --skill-path . --verbose
```

Both modes produce correct results. The --eval-model variant completes eval phases significantly faster while maintaining improvement quality.

Allow using a separate model for eval trigger testing while keeping
--model as the primary model for description improvement. This enables
cost-effective workflows where a lighter model handles the high-volume
parallel eval phase and a stronger model handles the single-call
description rewrite phase.