Rubric refinement is fundamentally an iterative cycle designed to bring AI grading into alignment with human judgment. Each iteration follows a systematic process:
run → review → grade → refine → save → re-run → repeat
You'll typically need several iterations before achieving acceptable alignment. This is expected and normal—rubric refinement is a process of progressive improvement, not a one-time configuration.
Prerequisites:
- A test set containing Generative Answer test cases
- A rubric (initial or previous iteration) to refine
1. Create or duplicate a test run:
   - Navigate to your test set
   - Click "New" or "Duplicate Run" if refining from a previous iteration
2. Configure the test run:
   - Name: Give it a descriptive name (e.g., "Automotive Repair Rubric - Iteration 2")
   - Agent Test Set: Select your test set with Generative Answer test cases
   - Agent Configuration: Choose the agent configuration to test
   - Rubric: Select the rubric to refine
   - Passing Grade: Set the passing grade (default: 5)
     - Remember: In refinement mode, this is informational only
     - It helps identify responses below your target threshold
3. Execute the test run:
   - Click "Save & Close", then execute the run
   - The system runs each Generative Answer test case
   - AI grading is performed with detailed reasoning
During the run:
- For each Generative Answer test case:
  - The agent responds to the test utterance
  - The AI judge evaluates the response using the rubric
  - A grade (1-5) is assigned
  - A detailed rationale explaining the grade is generated
- Non-Generative Answer test cases are skipped:
  - Only GA test types are evaluated in refinement mode
After the test run completes, you'll use specialized rubric refinement views to assess alignment. The interface provides two main views to support unbiased human evaluation.
Research shows that seeing AI grades can bias human judgment. To avoid this:
- Standard Refinement View: Hides AI grades for unbiased human evaluation
- Full Refinement View: Shows AI grades for alignment comparison
You'll switch between these views depending on your task.
Standard Refinement View (AI Grades Hidden)
Purpose: Provide unbiased human evaluation without AI influence
When to Use:
- When first reviewing responses and providing human grades
- When setting good/bad examples
- When you want to avoid bias from AI assessments
What's Visible:
- Test utterance (what the user asked)
- Agent response (what the copilot answered)
- Human grading fields (initially empty):
  - Human Grade selector (1-5)
  - Human Reasoning text field
  - Marked as Example toggle
  - Example Type selector (Good Example / Bad Example)
What's Hidden:
- AI grade
- AI rationale
Navigation:
- List view shows all test cases in a table
- Click any row to open detailed view for longer responses
Full Refinement View (AI Grades Visible)
Purpose: Compare AI and human assessments to understand alignment
When to Use:
- After providing human grades, to compare with AI
- To understand why misalignment occurred
- To identify patterns in AI reasoning
- When analyzing which responses need examples
What's Visible:
- Everything from the Standard view, plus:
  - AI Grade (1-5)
  - AI Rationale (detailed reasoning)
- Alignment Indicator: Shows whether human and AI grades match
  - ✅ Aligned: AI grade = Human grade
  - ❌ Misaligned: AI grade ≠ Human grade
Key Insight: The Full view helps you understand misalignment patterns:
- Is the AI consistently too lenient or too strict?
- What criteria does the AI emphasize or miss?
- Which types of responses confuse the AI?
Providing accurate, thoughtful human grades is critical for effective rubric refinement.
Use the list view (Standard or Full) for quick grading:
- Review each test case in the table
- Read the test utterance and agent response
- Assign a grade (1-5) based on the rubric criteria
- Enter reasoning explaining your grade
- Move to the next test case
Advantages:
- Fast for short responses
- Good overview of all cases
- Efficient for large test sets
Disadvantages:
- Limited space for viewing longer responses
- Less detailed context
Use the detailed view for longer, complex responses:
1. Click any test case in the list view to open the detailed view
2. Review the full conversation context:
   - Test utterance displayed prominently
   - Complete agent response with full formatting
3. Navigate between test cases:
   - Use the record browser on the left to select different test cases
   - Switch between "Refinement" and "Refinement (full)" tabs
4. Assign a human grade:
   - Select a grade from the dropdown (1-5)
   - Required: You must provide a grade
5. Enter human reasoning:
   - Explain why you assigned this grade
   - Reference specific rubric criteria
   - Note any strengths or weaknesses
   - Required: Reasoning is mandatory for refinement
6. Mark as example if applicable (see Step 4)
Advantages:
- Full context for complex responses
- Better for careful, thoughtful evaluation
- Easier to reference rubric while grading
The detailed view includes two tabs:
Refinement Tab (AI Hidden):
- Test utterance
- Agent response
- Human Grade selector (required)
- Human Reasoning text field (required)
- Marked as Example toggle
- Example Type selector
Refinement (Full) Tab (AI Visible):
- All of the above, plus:
  - AI Grade
  - AI Rationale
  - Alignment indicator
Best practices for human grading:
1. Grade without seeing the AI assessment first:
   - Use the Standard Refinement View or "Refinement" tab
   - Form your own opinion before comparing with AI
2. Be consistent with rubric criteria:
   - Reference the rubric definitions
   - Apply the same standards across all test cases
   - Don't grade "on a curve"; use absolute standards
3. Write detailed reasoning:
   - Explain which rubric criteria were met or missed
   - Be specific about strengths and weaknesses
   - Your reasoning helps the AI understand what to focus on
   - Example Good Reasoning: "Grade 4 (Strong): Response includes accurate technical information about the issue and provides clear next steps. Tone is professional and reassuring. However, it lacks the specific timeline estimates that would make it exemplary (Grade 5). All major IR elements are present, but the response could be more investor-ready."
   - Example Poor Reasoning: "Grade 4: Pretty good."
4. Don't be influenced by pass/fail:
   - Remember: The goal is alignment, not passing scores
   - Grade honestly based on quality, not desired outcomes
5. Grade all test cases consistently:
   - Complete grading for the entire test set
   - Inconsistent grading reduces refinement effectiveness
Examples significantly improve rubric quality by providing concrete illustrations of good and bad responses.
Examples help the AI understand:
- What good actually looks like in practice
- What bad responses contain that makes them unacceptable
- Edge cases where criteria might be unclear
- Context and nuance that text descriptions can't fully capture
In either the list view or detailed view:
1. Toggle "Marked as Example" to Yes
2. Select the "Example Type":
   - Good Example: Illustrates a high-quality response
   - Bad Example: Illustrates a poor-quality response
3. The system captures:
   - Test utterance
   - Agent response
   - Your example designation (good/bad)
Consider marking a test case as an example if:
✅ Good Example:
- Response is Grade 5 (Exemplary) and demonstrates ideal quality
- Response is Grade 4 and shows all key criteria even if not perfect
- Response handles a complex scenario particularly well
- You want to reinforce specific positive behaviors
✅ Bad Example:
- Response is Grade 1 (Needs Improvement) and clearly fails expectations
- Response is Grade 2 and shows common errors to avoid
- Response demonstrates a specific pitfall or failure mode
- You want to clarify what "not acceptable" looks like
✅ Either:
- The test case represents an edge case or ambiguous scenario
- There's misalignment between AI and human grades (helps clarify your standards)
- The test case is particularly representative of common patterns
General guidelines:
- Quality over quantity: A few well-chosen examples are better than many mediocre ones
- Balance good and bad: Include both types to set clear boundaries
- Represent variety: Cover different types of queries and responses
Tip: Focus on test cases where there's misalignment—these are often the most valuable examples because they clarify where the rubric is ambiguous.
Alignment indicates whether the AI and human grades match.
For each test case:
- ✅ Aligned: AI grade = Human grade
- ❌ Misaligned: AI grade ≠ Human grade
For each test case, alignment between the AI grade and the human grade is calculated by linear interpolation, from 100% (perfect match) down to 0% (maximum possible difference).
Interpretation:
- When AI grade = Human grade, alignment = 100%
- As the grade difference increases, alignment decreases linearly
- Maximum difference on a 1–5 scale is 4 steps, which maps to 0%
Formula:
alignment = 100% * (1 - |AI - Human| / 4)
Resulting Alignment Matrix:
| AI ↓ / Human → | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 100% | 75% | 50% | 25% | 0% |
| 2 | 75% | 100% | 75% | 50% | 25% |
| 3 | 50% | 75% | 100% | 75% | 50% |
| 4 | 25% | 50% | 75% | 100% | 75% |
| 5 | 0% | 25% | 50% | 75% | 100% |
Properties:
- Symmetric: The alignment is the same regardless of which value is AI and which is Human
- Diagonal = 100%: Perfect alignment when AI and Human grades match
- Far edges = 0%: Maximum misalignment when grades are at opposite ends of the scale
- Smooth, linear interpolation: Alignment decreases uniformly as the difference increases
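To make the formula concrete, here is a minimal Python sketch of the calculation (illustrative only; the platform computes alignment for you, and the function names are ours):

```python
def alignment(ai_grade: int, human_grade: int) -> float:
    """Alignment between two grades on the 1-5 scale: 100% for a perfect
    match, dropping 25 points per step of difference, reaching 0% at the
    maximum difference of 4 steps."""
    return 100.0 * (1 - abs(ai_grade - human_grade) / 4)

def average_alignment(pairs: list[tuple[int, int]]) -> float:
    """Run-level alignment: the average over all (AI, human) grade pairs."""
    return sum(alignment(ai, human) for ai, human in pairs) / len(pairs)

# Reproduce the AI=2 row of the matrix above, for human grades 1 through 5
print([alignment(2, h) for h in range(1, 6)])  # [75.0, 100.0, 75.0, 50.0, 25.0]
```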
There's no absolute standard, but these general guidelines apply to the average alignment percentage across an entire rubric refinement run:
| Alignment % | Assessment | Action |
|---|---|---|
| 90-100% | Excellent | Rubric is highly reliable; consider moving to testing mode |
| 75-89% | Good | Rubric is mostly aligned; refine edge cases |
| 60-74% | Fair | Rubric needs improvement; focus on misalignment patterns |
| < 60% | Poor | Rubric requires significant refinement or redesign |
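As a sketch only (the bands and messages come straight from the guidelines table above, not from any platform API), the mapping could look like:

```python
def assess_alignment(avg_pct: float) -> str:
    """Map a run's average alignment percentage to the guideline bands above."""
    if avg_pct >= 90:
        return "Excellent: rubric is highly reliable; consider testing mode"
    if avg_pct >= 75:
        return "Good: mostly aligned; refine edge cases"
    if avg_pct >= 60:
        return "Fair: needs improvement; focus on misalignment patterns"
    return "Poor: requires significant refinement or redesign"

print(assess_alignment(87.5))  # Good: mostly aligned; refine edge cases
```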
When misalignment occurs, investigate the following (a code sketch follows this list):
1. Direction of misalignment:
   - Is the AI consistently grading higher or lower than humans?
   - This indicates systematic bias in the rubric
2. Patterns in misaligned cases:
   - Do certain types of responses cause misalignment?
   - Are specific criteria being misinterpreted?
3. Magnitude of misalignment:
   - Off by 1 point (e.g., AI=4, Human=5) → Minor refinement needed
   - Off by 2+ points (e.g., AI=2, Human=5) → Major clarification needed
4. Read AI rationales (Full view):
   - What criteria did the AI focus on?
   - What did the AI miss that humans noticed?
   - Is the AI misinterpreting rubric language?
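A minimal sketch of this kind of pattern analysis, assuming you have exported your (AI, human) grade pairs; the thresholds mirror the magnitude guideline above and are not a platform feature:

```python
from statistics import mean

def misalignment_report(pairs: list[tuple[int, int]]) -> dict:
    """Summarize the direction and magnitude of AI-vs-human misalignment."""
    diffs = [ai - human for ai, human in pairs]
    return {
        # Positive bias: AI grades higher than humans; negative: lower
        "mean_bias": mean(diffs),
        "off_by_1": sum(1 for d in diffs if abs(d) == 1),       # minor refinement
        "off_by_2_plus": sum(1 for d in diffs if abs(d) >= 2),  # major clarification
    }

print(misalignment_report([(5, 5), (4, 5), (2, 5), (3, 3)]))
# {'mean_bias': -1.0, 'off_by_1': 1, 'off_by_2_plus': 1}
```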
Once you've provided human grades, marked examples, and analyzed alignment, you're ready to refine the rubric.
When you click "Refine Rubric," an AI analyzes:
1. Current rubric content:
   - "What good response looks like" description
   - Grade definitions (1-5)
   - Existing examples
2. Human grades and reasoning:
   - How you assessed each response
   - Your explanations for grades
3. AI grades and rationale:
   - How the AI interpreted the rubric
   - Where AI reasoning diverged from humans
4. Marked examples:
   - Test utterance + response pairs
   - Good/bad designations
   - Context about what makes them exemplary or poor
5. Alignment patterns:
   - Which criteria cause consistent misalignment
   - Where the rubric is ambiguous
   - What needs clarification
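To picture these inputs together, here is a hypothetical record shape (field names are illustrative, not the product's schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GradedTestCase:
    """Hypothetical bundle of one test case's refinement inputs."""
    utterance: str                       # what the user asked
    response: str                        # what the agent answered
    ai_grade: int                        # 1-5, assigned by the AI judge
    ai_rationale: str                    # the judge's detailed reasoning
    human_grade: int                     # 1-5, your assessment (required)
    human_reasoning: str                 # your explanation (required)
    example_type: Optional[str] = None   # "good", "bad", or None if unmarked

    @property
    def aligned(self) -> bool:
        """True when the AI and human grades match."""
        return self.ai_grade == self.human_grade
```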
The AI then updates the rubric to improve alignment:
- Clarifies ambiguous language
- Adds specificity to grade definitions
- Incorporates examples
- Adjusts emphasis on criteria
- Removes contradictions or confusion
- In the rubric refinement view, click "Refine Rubric" button
- The system processes all grading data and generates refinement
- Review the refined rubric (the system may show a preview)
- Choose save option:
Option 1: Save (Overwrite)
- Replaces the existing rubric with the refined version
- Original version is lost (no version history in MVP)
- Use when: You're iterating on the same rubric and don't need to preserve history
Option 2: Save As (Create New)
- Creates a new rubric with a different name
- Preserves the original rubric
- Use when: You want to compare versions or preserve a baseline
Recommendation: For early iterations, use "Save As" to preserve history (e.g., "IR Report v1", "IR Report v2"). Once the rubric stabilizes, use "Save" to iterate.
After saving the refined rubric, test whether alignment has improved:
1. Duplicate the test run:
   - Navigate to the completed test run
   - Click "Duplicate Run"
2. Execute with the updated rubric:
   - The duplicated run uses the refined rubric
   - Run it against the same test set
3. Compare alignment metrics (see the sketch after these steps):
   - Did the alignment percentage increase?
   - Are previously misaligned cases now aligned?
4. Review the results:
   - Use the Full Refinement View to check alignment
   - Note any remaining misalignment
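As a sketch of that comparison (it assumes per-case alignment percentages keyed by test case ID; this is not a built-in report):

```python
from statistics import mean

def compare_runs(before: dict[str, float], after: dict[str, float]) -> dict:
    """Compare per-test-case alignment between two runs of the same test set."""
    return {
        "avg_before": mean(before.values()),
        "avg_after": mean(after.values()),
        "improved": [c for c in before if after.get(c, 0.0) > before[c]],
        "regressed": [c for c in before if after.get(c, 0.0) < before[c]],
    }

# Example: "tc-3" moves from off-by-two (50%) to a perfect match (100%)
print(compare_runs({"tc-1": 100.0, "tc-2": 75.0, "tc-3": 50.0},
                   {"tc-1": 100.0, "tc-2": 75.0, "tc-3": 100.0}))
```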
Rubric refinement is rarely complete in one cycle. Continue iterating:
1. Re-run → Review → Grade → Refine → Save → Repeat
2. Track progress:
   - Monitor alignment percentage across iterations
   - Note which types of misalignment decrease
3. Adjust strategy:
   - If alignment isn't improving, reconsider the rubric approach
   - Add more examples for persistent misalignment patterns
   - Simplify overly complex criteria
4. Know when to stop (see the sketch after this list):
   - When alignment reaches an acceptable level (typically 75-90%+)
   - When additional iterations show diminishing returns
   - When the rubric reliably identifies quality issues
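One way to operationalize "diminishing returns", as a sketch with assumed thresholds (an 85% target and a minimum gain of 2 points per iteration):

```python
def should_stop(history: list[float], target: float = 85.0,
                min_gain: float = 2.0) -> bool:
    """Decide whether to stop iterating, given run-level alignment
    percentages per iteration (oldest first)."""
    if history and history[-1] >= target:
        return True   # acceptable alignment reached
    if len(history) >= 2 and history[-1] - history[-2] < min_gain:
        return True   # diminishing returns between iterations
    return False

print(should_stop([62.0, 74.5, 81.0]))  # False: still improving, below target
print(should_stop([62.0, 80.0, 81.0]))  # True: last gain was under 2 points
```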
| Step | Action | Output |
|---|---|---|
| 1. Start Run | Configure test run with rubric at test run level | AI grades + rationales generated |
| 2. Review | Open rubric refinement view (Standard or Full) | Context for grading |
| 3. Grade (Human) | Provide grades (1-5) and reasoning for all test cases | Human assessments recorded |
| 4. Mark Examples | Select good/bad examples | Examples captured for refinement |
| 5. Understand Alignment | Compare AI and human grades; analyze patterns | Alignment percentage and insights |
| 6. Refine & Save | Click "Refine Rubric"; save or save as | Updated rubric |
| 7. Re-run | Duplicate run and execute with refined rubric | New alignment data |
| 8. Iterate | Repeat until alignment is acceptable | Reliable, trustworthy rubric |
Don't expect perfection from your initial rubric. Focus on:
- Clear "what good looks like" statement
- Basic grade definitions with observable criteria
- You'll refine from there
Use the same test set across iterations to track alignment improvement. Changing test cases makes it hard to measure progress.
Your human grades are the "ground truth" for refinement. Invest time in:
- Thoughtful evaluation
- Detailed reasoning
- Consistent application of standards
One misaligned test case isn't a crisis. Look for:
- Systematic patterns in misalignment
- Repeated errors across multiple test cases
Examples are powerful but don't overdo it:
- Focus on misaligned cases and edge cases
- Balance good and bad examples
The AI's reasoning (Full view) reveals how it interprets your rubric:
- If AI focuses on the wrong criteria, clarify the rubric
- If AI misses key points, make them more prominent
- If AI uses odd logic, add examples to guide it
Each iteration should target specific improvements:
- Iteration 1: Fix major misalignment patterns
- Iteration 2: Add examples for edge cases
- Iteration 3: Fine-tune language for remaining misalignment
Perfect alignment is rarely achievable or necessary:
- Some subjectivity is inherent in evaluation
- Diminishing returns after ~85-90% alignment
- Focus on trustworthy grading, not perfection
After several iterations, test the rubric on new test cases:
- Ensures the rubric generalizes beyond your training set
- Prevents overfitting to specific examples
Problem: Alignment stays low across iterations
Possible Causes:
- Rubric criteria are too subjective or vague
- Human grading is inconsistent
- Examples don't clearly illustrate standards
Solutions:
- Simplify rubric language; use concrete, observable criteria
- Re-grade test cases with more consistent standards
- Add more examples, especially for edge cases
Problem: AI grades are consistently higher or lower than human grades
Possible Causes:
- Grade definitions don't clearly differentiate levels
- "What good looks like" sets wrong expectations
Solutions:
- Sharpen distinctions between grade levels
- Add examples showing borderline cases
- Revise "what good looks like" to set proper baseline
Problem: AI emphasizes the wrong criteria
Possible Causes:
- Rubric lists too many criteria without prioritization
- Important criteria are buried or unclear
Solutions:
- Prioritize criteria explicitly in grade definitions
- Lead with most important criteria
- Use examples to demonstrate priority
Problem: Long or complex responses are hard to grade consistently
Solutions:
- Use Detailed View for full context
- Break evaluation into dimensions (accuracy, tone, completeness) and assess each
- Consider whether rubric should focus on specific aspects rather than overall quality
Problem: The test set doesn't cover enough variety
Solutions:
- Expand test set to include variety
- Add test cases covering edge cases, different query types, complex scenarios
- Use test case generation (P1 feature) when available