Problem
The current HERM taxonomy covers H1-H7. Experiment L2-003 surfaced a new failure mode: evaluation pipeline bias, where the scoring or evaluation instructions in an agent config systematically reward certain output styles rather than output quality.
Evidence (L2-003): A keyword-based scorer in an agent's evaluation config matched "rate limiting" but missed "size/rate limiting"; Opus had used richer vocabulary for the same concept. As a result, the evaluation reported Sonnet > Opus when the opposite was true. The bug was in the scoring instruction, not the model.
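The miss described above can be reproduced with a minimal sketch. L2-003 does not say how the scorer matched phrases, so the whitespace tokenization below is an assumption, and `keyword_hit` and both sample answers are hypothetical names and data made up for illustration.

```python
def keyword_hit(answer: str, keyword: str) -> bool:
    """Hypothetical scorer: credit the keyword only if its words appear
    as consecutive whitespace-delimited tokens in the answer."""
    tokens = answer.lower().split()
    target = keyword.lower().split()
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


# Illustrative answers (not taken from L2-003).
plain = "The service should apply rate limiting to uploads."
richer = "The service should apply size/rate limiting to uploads."

print(keyword_hit(plain, "rate limiting"))   # True
print(keyword_hit(richer, "rate limiting"))  # False: "size/rate" is a single token
```

Both answers name the same concept, but only the plainer wording earns the point, which is the kind of inversion H8 is meant to flag.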
Proposed H8 Rule
H8: Evaluation Methodology Bias — Agent configs that include evaluation or scoring instructions should be checked for:
- Keyword-match dependency — scoring that relies on exact phrase matching
- Length bias — rubrics that penalize or reward response length without semantic grounding
- Format preference — evaluation instructions that prefer bullet lists over prose regardless of task type
- Model-specific artifacts — scoring criteria calibrated to one model's output style
H8_SIGNALS = [
    r"score.*if.*contains",    # keyword-match scoring
    r"points.*for.*bullet",    # format preference
    r"penalize.*longer",       # length bias
    r"compare.*to.*baseline",  # potentially model-specific
]
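One way the H8_SIGNALS patterns above might be applied to a config is sketched below. The line-by-line scan, the case-insensitive matching, the `find_h8_signals` helper, and the sample config text are all assumptions made for illustration; the proposal itself only defines the patterns.

```python
import re

# Patterns copied from the proposed H8_SIGNALS list, paired with their labels.
H8_SIGNALS = [
    (r"score.*if.*contains", "keyword-match scoring"),
    (r"points.*for.*bullet", "format preference"),
    (r"penalize.*longer", "length bias"),
    (r"compare.*to.*baseline", "potentially model-specific"),
]


def find_h8_signals(config_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, line) for every config line matching a signal."""
    # Assumption: case-insensitive matching, one line at a time.
    compiled = [(re.compile(p, re.IGNORECASE), label) for p, label in H8_SIGNALS]
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for pattern, label in compiled:
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits


# Hypothetical agent config used only to exercise the scan.
sample_config = """\
You are a judge. Score 1 if the answer contains "rate limiting".
Award 2 points for bullet lists.
Explain your reasoning briefly.
"""

for hit in find_h8_signals(sample_config):
    print(hit)
```

A match is only a prompt for human review, not proof of bias: a pattern such as `compare.*to.*baseline` can also fire on a legitimate instruction.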
Why This Matters
As agents are used to evaluate other agents (the LLM-as-judge pattern), the quality of evaluation configs becomes critical. H8 catches bias at the config layer, before it inverts conclusions in production.