feat: HERM v2 — H8 rule for evaluation pipeline bias detection

## Problem

The current HERM taxonomy covers H1-H7. Experiment L2-003 surfaced a failure mode it does not cover: **evaluation pipeline bias**, where the scoring or evaluation instructions in agent configs systematically favor certain output styles over quality.

**Evidence (L2-003):** A keyword-based scorer in an agent's evaluation config matched "rate limiting" but missed "size/rate limiting" — Opus used richer vocabulary for the same concept. This caused the evaluation to report Sonnet > Opus when the opposite was true. The scoring instruction was the bug, not the model.
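A minimal sketch of how such a miss can happen. The issue does not say how the L2-003 scorer matched phrases, so the whole-token matching in `keyword_score` is an assumption. The function name and both sample answers are made up for illustration and are not taken from the experiment.

```python
def keyword_score(answer: str, keywords: list[str]) -> int:
    """Hypothetical scorer: one point per keyword phrase that appears
    as a run of whole whitespace-delimited tokens in the answer."""
    tokens = answer.lower().split()
    score = 0
    for kw in keywords:
        kw_tokens = kw.lower().split()
        n = len(kw_tokens)
        # Slide a window of n tokens over the answer and compare exactly.
        if any(tokens[i:i + n] == kw_tokens for i in range(len(tokens) - n + 1)):
            score += 1
    return score


keywords = ["rate limiting"]

# Illustrative answers (assumed, not from L2-003): same concept, different wording.
plain = "Add rate limiting to the endpoint"
richer = "Add size/rate limiting to the endpoint"

print(keyword_score(plain, keywords))   # 1
print(keyword_score(richer, keywords))  # 0: "size/rate" is a single token, so the phrase is missed
```

Under this assumed matching scheme the richer answer scores lower even though it covers the same concept, which is the kind of inversion described above.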

## Proposed H8 Rule

**H8: Evaluation Methodology Bias** — Agent configs that include evaluation or scoring instructions should be checked for:

1. **Keyword-match dependency** — scoring that relies on exact phrase matching
2. **Length bias** — rubrics that penalize or reward response length without semantic grounding  
3. **Format preference** — evaluation instructions that prefer bullet lists over prose regardless of task type
4. **Model-specific artifacts** — scoring criteria calibrated to one model's output style

```python
H8_SIGNALS = [
    r"score.*if.*contains",         # keyword-match scoring
    r"points.*for.*bullet",         # format preference
    r"penalize.*longer",            # length bias
    r"compare.*to.*baseline",       # potentially model-specific
]
```
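One way these signals might be applied is a line-by-line scan of a config's text. The sketch below is an assumption about usage, not part of the proposal: `find_h8_signals`, the labels, and `sample_config` are hypothetical names and data, and case-insensitive matching is an added choice (the patterns as listed are case-sensitive).

```python
import re

# Patterns copied from H8_SIGNALS; the labels are added here for reporting only.
H8_SIGNAL_LABELS = {
    "keyword-match scoring": r"score.*if.*contains",
    "format preference": r"points.*for.*bullet",
    "length bias": r"penalize.*longer",
    "possibly model-specific": r"compare.*to.*baseline",
}


def find_h8_signals(config_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, line) for each config line matching an H8 pattern."""
    # Assumption: ignore case so "Score ..." and "score ..." are both caught.
    compiled = {
        label: re.compile(pattern, re.IGNORECASE)
        for label, pattern in H8_SIGNAL_LABELS.items()
    }
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for label, pattern in compiled.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits


# Hypothetical evaluation config used only to exercise the scan.
sample_config = """\
You are an evaluator.
Score 1 if the answer contains 'rate limiting'.
Award 2 points for bullet lists.
Judge correctness against the task description.
"""

for hit in find_h8_signals(sample_config):
    print(hit)
```

A hit is a prompt for human review rather than proof of bias: `compare.*to.*baseline`, for example, would also match a legitimate baseline comparison. Because `.` does not cross newlines, an instruction split over several lines would not be flagged by this sketch.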

## Why This Matters

As agents are used to evaluate other agents (the LLM-as-judge pattern), the quality of evaluation configs becomes critical. H8 catches bias at the config layer, before it inverts conclusions in production.
