Summary
The clawker-support skill (claude-plugin/clawker-support/skills/clawker-support/) has no eval test cases. Changes to SKILL.md or the reference/ files are currently verified only by manual review — there is no way to measure whether an edit improves or regresses the skill's triggering accuracy or answer quality.
Add a skill eval suite so the skill can be benchmarked with the skill-creator eval tooling (/skill-creator supports running evals, benchmarking performance with variance analysis, and optimizing the description for triggering accuracy).
Why
- The skill is delivered to external users via the plugin marketplace, so every patch bump ships behavior changes that no eval has checked.
- The skill's core design principle is "minimal concrete details, look up sources just in time" — easy to silently break (e.g. an edit that makes the agent answer from memory instead of fetching docs).
- Triggering is intentionally broad (fires on .clawker.yaml, blocked domains, build errors, post_init, etc. without the word "clawker"); description edits risk both under- and over-triggering, which evals can catch.
Scope
Create an evals/ directory for the skill with test cases covering at least:
Triggering accuracy
- Positive: questions that should trigger without saying "clawker" (e.g. "my domain is blocked", "why does my .clawker.yaml not add this package", "post_init didn't run")
- Negative: generic Docker/firewall questions that should NOT trigger the skill
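A minimal sketch of how triggering cases could be recorded and scored. It does not reflect the actual skill-creator eval format, which this issue does not specify: `TriggerCase`, `CASES`, and `score_triggering` are hypothetical names, and the two negative prompts are invented examples of generic Docker/firewall questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerCase:
    """One triggering test: a user prompt and whether the skill should fire."""
    prompt: str
    should_trigger: bool


# Positive prompts come from this issue; the negative prompts are
# hypothetical stand-ins for generic Docker/firewall questions.
CASES = [
    TriggerCase("my domain is blocked", True),
    TriggerCase("why does my .clawker.yaml not add this package", True),
    TriggerCase("post_init didn't run", True),
    TriggerCase("how do I write a multi-stage Dockerfile?", False),
    TriggerCase("how do I open port 443 in ufw?", False),
]


def score_triggering(cases, triggered):
    """Compare observed triggering (list of bools aligned with cases) to expectations."""
    pairs = list(zip(cases, triggered))
    tp = sum(1 for c, t in pairs if c.should_trigger and t)
    fp = sum(1 for c, t in pairs if not c.should_trigger and t)
    fn = sum(1 for c, t in pairs if c.should_trigger and not t)
    return {
        # With no predictions (or no positives) the ratio is undefined; report 1.0.
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
        "under_triggered": [c.prompt for c, t in pairs if c.should_trigger and not t],
        "over_triggered": [c.prompt for c, t in pairs if not c.should_trigger and t],
    }
```

Reporting the under- and over-triggered prompts alongside precision and recall makes it easier to see which description edit caused a regression.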
Answer quality / methodology adherence
- Config-level awareness: project-specific config must land in project-level .clawker.yaml, never user-level (~/.config/clawker/clawker.yaml)
- Source discipline: recommendations grounded in fetched docs/reference files, not training data
- Firewall path-scoping: narrowest-scope rule recommended (--path/--action, path_rules) rather than bare domain allows
- Troubleshooting routing: symptom → correct domain reference file (build vs firewall vs MCP vs monitoring vs known-issues)
- Known-issues lookup: a reported symptom that matches reference/known-issues.md should surface the documented workaround
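As an illustration of how one of these criteria might be asserted mechanically, here is a rough sketch of a config-level check. It assumes the eval harness can extract the file paths an answer recommends editing. That extraction step, and the names `config_level` and `user_level_violations`, are hypothetical; only the two path shapes come from this issue.

```python
from pathlib import PurePosixPath


def config_level(path: str) -> str:
    """Classify a recommended config path as 'project', 'user', or 'unknown'."""
    p = PurePosixPath(path)
    # Project-level config: a .clawker.yaml file in any directory.
    if p.name == ".clawker.yaml":
        return "project"
    # User-level config: .../.config/clawker/clawker.yaml
    if p.parts[-3:] == (".config", "clawker", "clawker.yaml"):
        return "user"
    return "unknown"


def user_level_violations(recommended_paths):
    """Return the paths that would place project-specific config at user level."""
    return [p for p in recommended_paths if config_level(p) == "user"]
```

An eval for a project-specific question would then pass only if `user_level_violations` returns an empty list for the paths found in the answer.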
Regression cases
- Past failure modes worth pinning (e.g. recommending user-level config pollution, inventing config field names, hardcoding firewall domains)
Acceptance criteria
- claude-plugin/clawker-support/CLAUDE.md completion gate updated to mention running evals after skill changes