Consolidate SKILL-REPORT filename matching into shared API module #1765
Conversation
Pull request overview
Adds a new “Confidence Level per Skill” panel to the repository health dashboard by extracting the “Average Confidence” metric from per-skill SKILL-REPORT markdown files and surfacing it via the dashboard API.
Changes:
- Adds a new dashboard panel section for “Confidence Level per Skill”.
- Extends the dashboard client to fetch/render the per-skill confidence metric with pass/fail filtering and a progress-bar visualization.
- Updates the `/api/test-results/{date}` function to parse SKILL-REPORT.md blobs and include `averageConfidence` in the per-skill JSON response.
Summary per file:
| File | Description |
|---|---|
| dashboard/index.html | Adds the new “Confidence Level per Skill” panel container to the dashboard layout. |
| dashboard/assets/dashboard.js | Implements loading + rendering logic for the confidence-per-skill panel and hooks it into initialization. |
| dashboard/api/src/functions/getTestResults.ts | Parses SKILL-REPORT.md files to compute averageConfidence and returns it alongside existing skill stats. |
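The API change parses the "Average Confidence" metric out of SKILL-REPORT.md markdown. The exact report line format is not shown in this thread, so the following is only a minimal sketch of such an extraction, assuming a line shaped like `**Average Confidence:** 87.5%`; the function name is hypothetical.

```typescript
// Hypothetical sketch: extract "Average Confidence" from a SKILL-REPORT.md blob.
// The assumed line shape ("**Average Confidence:** 87.5%") is not confirmed by this PR.
function parseAverageConfidence(markdown: string): number | null {
  const match = markdown.match(/Average Confidence[:*\s]*([\d.]+)\s*%/i);
  if (!match) return null;
  const value = parseFloat(match[1]);
  // Return the value on the 0–100 scale the API response uses, or null if unparsable.
  return Number.isFinite(value) ? value : null;
}
```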
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 3
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Adds a new Dashboard panel that surfaces “Confidence Level per Skill” by extracting the Average Confidence metric from per-skill *-SKILL-REPORT.md artifacts and exposing it via the existing test-results API.
Changes:
- Added a new “Confidence Level per Skill” panel section to the dashboard HTML layout.
- Implemented client-side loading/rendering for the confidence table (including filtering/collapsing behavior).
- Extended `GET /api/test-results/{date}` to parse `*-SKILL-REPORT.md` files and return `averageConfidence` per skill.
Summary per file:
| File | Description |
|---|---|
| dashboard/index.html | Adds a new dashboard panel container for “Confidence Level per Skill”. |
| dashboard/assets/dashboard.js | Fetches averageConfidence from /api/test-results/{date} and renders a sortable, filterable list with progress bars. |
| dashboard/api/src/functions/getTestResults.ts | Collects SKILL-REPORT blobs, extracts “Average Confidence”, and returns it as averageConfidence in the API response. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 2
dashboard/assets/dashboard.js (outdated)
```js
 * A skill is considered passing when its average rate is >= SIR_THRESHOLD (80%).
 */
async function loadConfidenceLevelPerSkill() {
  const section = document.getElementById("panel-confidence-level");
  if (!section) return;

  try {
    const { latestDate, skillResults } = await fetchLatestTestResults();

    if (!latestDate) {
      renderConfidenceLevelPanel(section, [], "skip", 0, 0, null);
      return;
    }

    // Build one entry per skill using its averageConfidence from the SKILL-REPORT
    const skills = [];
    for (const [skillName, stats] of Object.entries(skillResults)) {
      if (stats.averageConfidence === null || stats.averageConfidence === undefined) continue;
      // averageConfidence is stored as 0–100 in the API response
      skills.push({ skillName, rate: stats.averageConfidence / 100 });
    }

    // Sort: below threshold first (worst first), then ascending within each group
    skills.sort(function (a, b) {
      const aPass = a.rate >= SIR_THRESHOLD;
      const bPass = b.rate >= SIR_THRESHOLD;
      if (aPass !== bPass) return aPass ? 1 : -1;
      return a.rate - b.rate;
    });
```
This panel reuses SIR_THRESHOLD / SIR_THRESHOLD_PCT (documented as the minimum passing rate for skill invocation tests) to judge/label “confidence level”. Even if the numeric threshold is intentionally the same (80%), it couples two different metrics and makes the code harder to reason about. Consider introducing a dedicated CONFIDENCE_THRESHOLD (and % constant) or renaming the shared constant to something metric-agnostic if it’s meant to apply to both.
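The suggested decoupling could be sketched as follows. `CONFIDENCE_THRESHOLD` and `CONFIDENCE_THRESHOLD_PCT` are hypothetical names for the proposed dedicated constants; only `SIR_THRESHOLD` exists in the current code.

```typescript
// Sketch of the review suggestion: a dedicated constant for the confidence
// metric, even though the value currently matches SIR_THRESHOLD (80%).
// The constant and function names here are hypothetical.
const CONFIDENCE_THRESHOLD = 0.8;                             // minimum passing confidence (0–1)
const CONFIDENCE_THRESHOLD_PCT = CONFIDENCE_THRESHOLD * 100;  // for display labels

function isConfidencePassing(rate: number): boolean {
  return rate >= CONFIDENCE_THRESHOLD;
}
```

With a dedicated constant, the two metrics can later diverge without touching shared code.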
Agent-Logs-Url: https://github.com/microsoft/GitHub-Copilot-for-Azure/sessions/58a5e2d6-7576-4192-9b1e-c91d17393cef
Co-authored-by: msalaman <28122166+msalaman@users.noreply.github.com>
Pull request overview
Adds a new dashboard panel to surface per-skill “Average Confidence” values (sourced from SKILL-REPORT artifacts) alongside existing skill health panels.
Changes:
- Add a new “Confidence Level per Skill” panel to the dashboard UI and initialize it on page load.
- Extend the dashboard API's `GET /api/test-results/{date}` response to include `averageConfidence` per skill by parsing `*-SKILL-REPORT.md` blobs.
- Render a per-skill confidence table with pass/fail thresholding and sorting.
Summary per file:
| File | Description |
|---|---|
| dashboard/index.html | Adds a new panel section for “Confidence Level per Skill”. |
| dashboard/assets/dashboard.js | Fetches averageConfidence and renders the new confidence panel with threshold-based status. |
| dashboard/api/src/functions/getTestResults.ts | Parses SKILL-REPORT markdown to compute and return averageConfidence per skill. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 1
```ts
const stats = computeSkillStats(results);
const confValues = confidenceBySkill.get(skillName);
if (confValues && confValues.length > 0) {
  stats.averageConfidence = confValues.reduce((a, b) => a + b, 0) / confValues.length;
}
```
averageConfidence is computed as an unweighted average across all matching *-SKILL-REPORT.md files for a skill. On dates with multiple runs (or multiple azure-deploy test groups), this can skew the final value because each report contributes equally regardless of how many tests/runs it represents. Consider either selecting a single canonical report per skill (e.g., most recent) or computing a weighted average (e.g., weight by Total Individual Runs / Total Test Cases extracted from the report).
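The reviewer's weighted-average alternative could look like the sketch below. The `{ confidence, runs }` sample shape is an assumption; the actual reports expose "Total Individual Runs" / "Total Test Cases", which would need to be parsed alongside the confidence value.

```typescript
// Sketch of the weighted-average suggestion: weight each report's confidence
// by the number of runs it represents, instead of averaging reports equally.
// The ReportSample shape is hypothetical.
interface ReportSample {
  confidence: number; // 0–100, from the report's "Average Confidence"
  runs: number;       // e.g. the report's "Total Individual Runs"
}

function weightedAverageConfidence(samples: ReportSample[]): number | null {
  const totalRuns = samples.reduce((sum, s) => sum + s.runs, 0);
  if (totalRuns === 0) return null; // no runs recorded → no meaningful average
  const weighted = samples.reduce((sum, s) => sum + s.confidence * s.runs, 0);
  return weighted / totalRuns;
}
```

A report covering nine runs then contributes nine times the weight of a single-run report, instead of skewing the result 50/50.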
This PR is a follow-up to the dashboard confidence-level work (which reads the Average Confidence metric from per-skill `*-SKILL-REPORT.md` files). It addresses review feedback by removing duplicated filename-matching logic in the API to avoid drift between endpoints.

What changed
- Added dashboard/api/src/skillReport.ts with a single exported SKILL_REPORT_PATTERN.
- Updated getReports.ts to import the shared pattern instead of defining a local regex.
- Updated getTestResults.ts to import the same shared pattern.

Result
- Both endpoints now match SKILL-REPORT blobs with the same shared pattern, reducing maintenance risk as report naming evolves.
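The shared module could be as small as the sketch below. The PR only states that a single `SKILL_REPORT_PATTERN` is exported from dashboard/api/src/skillReport.ts; the exact regex and the helper function are assumptions for illustration.

```typescript
// Sketch of the shared skillReport.ts module. The regex body is an assumption
// based on the "*-SKILL-REPORT.md" naming mentioned in this thread.
export const SKILL_REPORT_PATTERN = /-SKILL-REPORT\.md$/i;

// Hypothetical convenience helper: both endpoints can filter blob names
// through the one shared pattern instead of maintaining local copies.
export function isSkillReportBlob(blobName: string): boolean {
  return SKILL_REPORT_PATTERN.test(blobName);
}
```

Centralizing the pattern means a future rename of the report files is a one-line change instead of a multi-endpoint hunt.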