chore(clawker-support): add skill eval test cases for the support skill #358

Description

@schmitthub

Summary

The clawker-support skill (claude-plugin/clawker-support/skills/clawker-support/) has no eval test cases. Changes to SKILL.md or the reference/ files are currently verified only by manual review — there is no way to measure whether an edit improves or regresses the skill's triggering accuracy or answer quality.

Add a skill eval suite so the skill can be benchmarked with the skill-creator eval tooling (/skill-creator supports running evals, benchmarking performance with variance analysis, and optimizing the description for triggering accuracy).

Why

  • The skill is delivered to external users via the plugin marketplace, so every patch bump ships behavior changes that no eval has checked.
  • The skill's core design principle is "minimal concrete details, look up sources just in time", which is easy to break silently (e.g. an edit that makes the agent answer from memory instead of fetching docs).
  • Triggering is intentionally broad (fires on .clawker.yaml, blocked domains, build errors, post_init, etc. without the word "clawker") — description edits risk both under- and over-triggering, which evals can catch.

Scope

Create an evals/ directory for the skill with test cases covering at least:

Triggering accuracy

  • Positive: questions that should trigger without saying "clawker" (e.g. "my domain is blocked", "why does my .clawker.yaml not add this package", "post_init didn't run")
  • Negative: generic Docker/firewall questions that should NOT trigger the skill
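As a rough illustration of how triggering cases could be scored, here is a minimal Python sketch. It does not reproduce the skill-creator eval format; `TriggerCase`, `CASES`, and `score` are hypothetical names, the two negative prompts are made-up examples of generic Docker/firewall questions, and the `triggered` mapping stands in for whatever the real eval run reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerCase:
    """One triggering case: a user prompt and whether the skill should fire."""
    prompt: str
    should_trigger: bool


# Positive prompts are the examples from this issue; the negative prompts
# are illustrative assumptions, not agreed test cases.
CASES = [
    TriggerCase("my domain is blocked", True),
    TriggerCase("why does my .clawker.yaml not add this package", True),
    TriggerCase("post_init didn't run", True),
    TriggerCase("how do I write a multi-stage Dockerfile", False),
    TriggerCase("how do I open port 443 in ufw", False),
]


def score(cases, triggered):
    """Compare observed triggering (prompt -> bool) against expectations."""
    tp = sum(1 for c in cases if c.should_trigger and triggered[c.prompt])
    fp = sum(1 for c in cases if not c.should_trigger and triggered[c.prompt])
    fn = sum(1 for c in cases if c.should_trigger and not triggered[c.prompt])
    # With no positives predicted (or expected), treat the ratio as perfect.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": fp,
        "false_negatives": fn,
    }
```

Over-triggering shows up as lower precision and under-triggering as lower recall, so a description edit can be judged on both axes at once.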

Answer quality / methodology adherence

  • Config-level awareness: project-specific config must land in project-level .clawker.yaml, never user-level (~/.config/clawker/clawker.yaml)
  • Source discipline: recommendations grounded in fetched docs/reference files, not training data
  • Firewall path-scoping: narrowest-scope rule recommended (--path/--action, path_rules) rather than bare domain allows
  • Troubleshooting routing: symptom → correct domain reference file (build vs firewall vs MCP vs monitoring vs known-issues)
  • Known-issues lookup: a reported symptom that matches reference/known-issues.md should surface the documented workaround
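The config-level awareness check above could be approximated with a simple string assertion. This is a deliberately crude sketch: `check_config_level` is a hypothetical helper, and a plain substring match would also flag an answer that mentions the user-level path only to warn against it, so a real eval would probably need a model-graded assertion instead.

```python
# Paths taken from this issue.
PROJECT_CONFIG = ".clawker.yaml"
USER_CONFIG = "~/.config/clawker/clawker.yaml"


def check_config_level(answer: str) -> list[str]:
    """Return a list of problems found in an answer about project-specific config."""
    problems = []
    # Project-specific settings should never be sent to the user-level file.
    if USER_CONFIG in answer:
        problems.append("recommends user-level config")
    # The answer should point at the project-level file.
    if PROJECT_CONFIG not in answer:
        problems.append("does not mention project-level config")
    return problems
```

An empty list means the answer passed both checks; anything else names the failure mode, which makes it easy to pin as a regression case.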

Regression cases

  • Past failure modes worth pinning (e.g. recommending user-level config pollution, inventing config field names, hardcoding firewall domains)

Acceptance criteria

  • Eval cases live alongside the skill and follow the skill-creator eval format
  • At least one positive + one negative triggering case, plus the methodology cases listed above
  • Baseline run documented (so future SKILL.md edits can be compared against it)
  • claude-plugin/clawker-support/CLAUDE.md completion gate updated to mention running evals after skill changes
  • Patch version bump per plugin versioning policy
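For the documented baseline, one possible way to compare a later run against it is sketched below. This is an assumption about how the comparison might work, not what the skill-creator tooling does: `summarize`, `regressed`, and the `tolerance` parameter are hypothetical, and the inputs are per-run pass rates between 0 and 1.

```python
from statistics import mean, stdev


def summarize(pass_rates):
    """Mean and sample standard deviation of per-run pass rates."""
    spread = stdev(pass_rates) if len(pass_rates) > 1 else 0.0
    return {"mean": mean(pass_rates), "stdev": spread}


def regressed(baseline, candidate, tolerance=1.0):
    """Flag a regression when the candidate mean falls more than
    `tolerance` baseline standard deviations below the baseline mean."""
    base = summarize(baseline)
    cand = summarize(candidate)
    return cand["mean"] < base["mean"] - tolerance * base["stdev"]
```

Recording several baseline runs rather than one gives the comparison a noise floor, so a SKILL.md edit is only flagged when it drops below normal run-to-run variation.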

Metadata

Assignees

No one assigned

    Labels

    tech-debt: Code quality, refactoring, and architectural cleanup

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
