Conversation
WilliamBerryiii
left a comment
Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.
After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.
```markdown
<!-- <interview-phase-1> -->
1. What is the name of the AI agent you are evaluating? If it does not have a name yet, give it one.
2. What specific business problem or scenario does this agent address?
3. What are the business KPIs associated with this agent (for example, increase revenue, decrease costs, transform business process)?
4. What tasks is this agent designed to perform? What is explicitly out of scope?
5. What are key risks (Responsible AI Framework) in implementing this agent (for example, PII vulnerabilities, negative impact from model inaccuracy)?
6. Who are the primary users of this agent? How likely is this agent to be adopted by primary users? What are barriers to adoption?
<!-- </interview-phase-1> -->
```
The XML comment boundaries (<!-- <interview-phase-1> --> … <!-- </interview-phase-1> -->) work as section markers, but the pattern used by other agents in this repo is to express the workflow as an enumerated Required Protocol that spells out each rule or constraint as a numbered item. The current Required Protocol section at the bottom of this file has four items, which is a good start.
Consider moving more of the behavioral expectations from the XML-bounded sections into the protocol list or into the phase headings themselves. For examples of how other agents structure this, see:
- `.github/agents/hve-core/subagents/phase-implementor.agent.md` — Required Protocol with numbered invariants that are referenced from the Required Steps.
- `.github/agents/hve-core/subagents/prompt-evaluator.agent.md` — Required Protocol for evaluation-specific constraints paired with Required Steps.
This would make the constraints directly visible and enumerable rather than embedded in template comment tags.
@WilliamBerryiii The workflow is already expressed as an enumerated Required Protocol, and it also has XML comment boundaries. I can remove the XML comment boundaries, but it is unclear how to move more of the behavioral expectations into the protocol list or into the phase headings themselves.
@bjcmit The interview questions stay in the phases — those are content. What moves is the behavioral rules currently scattered through the prose. For example:
- "Ask questions one at a time and wait for user responses" (line 36)
- "Proceed to Phase 2 after all six questions are answered" (line 49)
- "Return to Phase 5 if the user requests regeneration" (line 175)
These are execution constraints, not phase content. When they're buried in narrative, the model can miss them. Pulling them into the Required Protocol makes them enumerable and auditable in one place.
Here's what the updated Required Protocol would look like:
Required Protocol
1. Do not skip interview questions or assume answers.
2. Present interview questions one at a time and wait for the user's response before asking the next question.
3. Do not proceed to the next phase until all questions in the current phase are answered.
4. Do not generate any artifacts until the interview (Phases 1–4) is complete.
5. Create the `data/evaluation/` directory structure if it does not exist.
6. Generate both JSON and CSV dataset formats.
7. During dataset review (Phase 6), present 5–8 representative Q&A pairs; return to Phase 5 if the user requests regeneration.
8. Tailor metric selection based on agent characteristics discovered during the interview, and recommend tooling based on the stated persona.
9. After generating all documentation, present a summary listing every artifact created with its path.
Then remove the inline transition sentences ("Proceed to Phase 2 after all six questions are answered", etc.) from the phase bodies since the protocol already covers them. The XML comment boundaries can go too — the protocol governs the flow now.
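The one-question-at-a-time, phase-gated flow the protocol describes can be sketched as a small loop. This is a minimal illustration only: the phase contents are abbreviated, and the `ask` callback is a hypothetical stand-in for the agent's turn-taking.

```python
# Minimal sketch of the phase-gated interview flow: one question at a
# time, and no advancing to the next phase until every question in the
# current phase is answered. Phase contents here are abbreviated.
PHASES = [
    ("Agent Context", [
        "What is the name of the AI agent you are evaluating?",
        "What specific business problem does this agent address?",
    ]),
    ("Agent Capabilities", [
        "What tasks is this agent designed to perform?",
    ]),
]

def run_interview(phases, ask):
    """`ask` stands in for a single agent turn: pose one question, wait."""
    answers = {}
    for name, questions in phases:
        # Phase gate: collect an answer for each question before moving on.
        answers[name] = [ask(q) for q in questions]
        assert all(answers[name]), f"phase '{name}' incomplete"
    return answers
```

A real agent would interleave model turns here; the point is only that the transition rules live in one enumerable place rather than in narrative prose.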
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1279      +/-   ##
==========================================
- Coverage   87.66%   87.65%   -0.02%
==========================================
  Files          61       61
  Lines        9328     9328
==========================================
- Hits         8177     8176       -1
- Misses       1151     1152       +1
```
Flags with carried forward coverage won't be shown.
Description
This pull request adds a comprehensive new prompt, `eval-dataset-creator.md`, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:
Evaluation Dataset Creation Workflow:
Dataset and Documentation Artifacts:
Artifacts are written to `data/evaluation/`, with separate subfolders for datasets (`.json`, `.csv`) and documentation (`curation-notes.md`, `metric-selection.md`, `tool-recommendations.md`).

Tooling and Persona Guidance:
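The output layout above can be created ahead of time. A minimal shell sketch: the directory names and `docs/` file names come from this description, while `example-agent` is an illustrative agent name, not something the prompt prescribes.

```shell
# Sketch: create the output layout described above. The layout follows
# the PR description; "example-agent" is an illustrative agent name.
mkdir -p data/evaluation/datasets data/evaluation/docs
touch data/evaluation/datasets/example-agent-eval-dataset.json \
      data/evaluation/datasets/example-agent-eval-dataset.csv
touch data/evaluation/docs/curation-notes.md \
      data/evaluation/docs/metric-selection.md \
      data/evaluation/docs/tool-recommendations.md
ls -R data/evaluation
```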
Related Issue(s)
Closes #1267
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
Ran the `prompt-builder` agent and addressed all feedback (`.github/agents/*.agent.md`).

Sample Prompts (for AI Artifact Contributions)
User Request:
Execution Flow:
Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:
Purpose: Gather all necessary context before generating any artifacts.
Phase 1: Agent Context
Phase 2: Agent Capabilities
Phase 3: Evaluation Scenarios
Phase 4: Persona & Tooling
Datasets are written to `data/evaluation/datasets/`; documentation is written to `data/evaluation/docs/`.

Decision Points & Tool Usage Summary
Output Artifacts:
- `data/evaluation/datasets/-eval-dataset.json`:

```json
{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
```

- `data/evaluation/docs/-curation-notes.md`
- `data/evaluation/docs/-metric-selection.md`
- `data/evaluation/docs/-tool-recommendations.md`
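One quick consistency check this metadata enables: the per-category distribution should sum to `total_pairs`. A minimal sketch, with field names and values copied from the sample JSON above (`evaluation_pairs` omitted):

```python
import json

# Metadata copied from the sample dataset above (evaluation_pairs omitted).
raw = """{
  "metadata": {
    "total_pairs": 30,
    "distribution": {
      "easy": 6, "grounding_source_checks": 3,
      "hard": 12, "negative": 6, "safety": 3
    }
  }
}"""

def distribution_consistent(meta: dict) -> bool:
    """True when the per-category counts add up to total_pairs."""
    return sum(meta["distribution"].values()) == meta["total_pairs"]

meta = json.loads(raw)["metadata"]
print(distribution_consistent(meta))  # → True (6+3+12+6+3 == 30)
```

A check like this could run in CI against generated datasets, catching regenerations that drop or duplicate pairs.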
Success Indicators:
Testing
- `/prompt-analyze` run 3 times with all findings addressed
- `npm run lint:all` ✅
- `npm run lint:md-links` ✅
- `npm run validate:copyright` ✅ (148/148 files, 100%)
- `npm run spell-check` ✅ (281 files, 0 issues)
- `npm run plugin:generate` ✅ (14 plugins, 0 errors)
- `npm run plugin:validate` ✅ (0 errors)
- `npm run lint:collections-metadata` ✅ (0 errors)

Checklist
Required Checks
AI Artifact Contributions
- Used `/prompt-analyze` to review contribution
- Completed `prompt-builder` review
The following validation commands must pass before merging:
- `npm run lint:md`
- `npm run spell-check`
- `npm run lint:frontmatter`
- `npm run validate:skills`
- `npm run lint:md-links`
- `npm run lint:ps`
- `npm run plugin:generate`

Security Considerations