Add sample_log_num to Metric for debugging accuracy evaluation by jiafatom · Pull Request #2530 · microsoft/Olive

jiafatom · 2026-06-19T18:35:02Z

Describe your changes

Add two new fields to the Metric config class for debugging accuracy evaluation:

sample_log_num (int, default 0): Number of sample predictions to log alongside ground truth. When > 0, saves a JSONL file with the first N sample results.
sample_log_dir (Optional[str], default None): Directory to save the sample log file. Defaults to the current working directory.

When sample_log_num > 0, a JSONL file ({metric_name}_samples.jsonl) is written with each line containing the sample index, prediction, and target. For vision and audio (GenAI) tasks, each record is additionally enriched with the prompt and the media file name, so failures can be inspected without re-deriving the input:

{"index": 14, "prompt": "This diagram shows the life cycle of an insect...\n1. B-A-C\n2. C-A-B\n3. A-B-C\n4. B-C-A", "image": "14", "prediction": "1", "target": "3"}

Text metrics keep the original {"index", "prediction", "target"} shape.
Vision tasks add prompt and image; audio tasks add audio.

This helps debug accuracy evaluation results by inspecting individual sample predictions vs ground truth. Works with tensor data (converted to Python values) and string data (text-based metrics like WER, exact_match, etc.).
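As a rough illustration of the logging behavior described above, here is a minimal sketch. It is not the code from this PR: the function name `write_sample_log`, its signature, and the `to_python` helper are hypothetical, and the sketch assumes that tensor-like values expose a `.tolist()` method.

```python
import json
from pathlib import Path
from typing import Any, Optional, Sequence


def to_python(value: Any) -> Any:
    """Convert tensor-like values (anything exposing .tolist()) to plain Python values."""
    return value.tolist() if hasattr(value, "tolist") else value


def write_sample_log(
    metric_name: str,
    predictions: Sequence[Any],
    targets: Sequence[Any],
    sample_log_num: int,
    sample_log_dir: Optional[str] = None,
) -> Optional[Path]:
    """Write the first N prediction/target pairs to {metric_name}_samples.jsonl.

    Hypothetical helper for illustration only; returns None when logging is disabled.
    """
    if sample_log_num <= 0:
        return None
    # Fall back to the current working directory when no directory is given.
    out_dir = Path(sample_log_dir) if sample_log_dir else Path.cwd()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{metric_name}_samples.jsonl"
    # Never log more samples than are actually available.
    count = min(sample_log_num, len(predictions), len(targets))
    with out_path.open("w", encoding="utf-8") as f:
        for i in range(count):
            record = {
                "index": i,
                "prediction": to_python(predictions[i]),
                "target": to_python(targets[i]),
            }
            f.write(json.dumps(record) + "\n")
    return out_path
```

Strings pass through unchanged, so the same path serves text-based metrics such as WER or exact_match.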

The feature is integrated into all four evaluator backends: ONNX, PyTorch, OpenVINO, and QNN.

Sourcing the media file name (`id_col`)

The vision (vision_vqa_pre_process) and audio (speech_transcription_pre_process) preprocessors accept an optional id_col parameter naming a dataset column to use as the media file name. When id_col is unset (or the column is absent from the row), the file name falls back to the Hugging Face (HF) audio path basename (audio) or to the dataset row index (vision/audio). This keeps the field useful even for datasets that embed media in memory without a path.
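The fallback could look roughly like the sketch below. The function `resolve_file_name` and its parameters are hypothetical, and the precedence shown (id column first, then audio path basename, then row index) is an assumption inferred from the description rather than taken from the PR's code.

```python
import os
from typing import Any, Mapping, Optional


def resolve_file_name(
    row: Mapping[str, Any],
    row_index: int,
    id_col: Optional[str] = None,
    audio_path: Optional[str] = None,
) -> str:
    """Pick a media file name for a dataset row (hypothetical helper).

    Assumed precedence: id_col value, then audio path basename, then row index.
    """
    # Prefer the user-specified id column when it is set and present in the row.
    if id_col and row.get(id_col) is not None:
        return str(row[id_col])
    # Audio datasets may carry a source path; use only its basename.
    if audio_path:
        return os.path.basename(audio_path)
    # Last resort: the dataset row index, which always exists.
    return str(row_index)
```

Under this assumed ordering, a record such as `"image": "14"` for sample index 14 is what a vision dataset without an id column would produce.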

Implementation notes

A generic per-sample extras channel was added to OliveModelOutput and merged into save_sample_log between index and prediction. It is backward-compatible (extras defaults to None).
GenAI inference loops populate extras directly; audio metadata travels with the array as a {audio, file_name} dict and is unwrapped back to raw arrays for non-GenAI ONNX paths before format_data.
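To make the two notes above concrete, here is a small sketch of how extras might be merged into a record and how an audio dict might be unwrapped. Both helpers (`build_record`, `unwrap_audio`) are hypothetical names for illustration and are not the PR's `save_sample_log` or `_unwrap_audio_input` implementations.

```python
from typing import Any, Dict, Optional


def build_record(
    index: int,
    prediction: Any,
    target: Any,
    extras: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one log record with extras placed between index and prediction."""
    record: Dict[str, Any] = {"index": index}
    if extras:
        # Python dicts preserve insertion order, so extras land after "index".
        record.update(extras)
    record["prediction"] = prediction
    record["target"] = target
    return record


def unwrap_audio(item: Any) -> Any:
    """Return the raw array from an {audio, file_name} dict, or the item unchanged."""
    if isinstance(item, dict) and "audio" in item:
        return item["audio"]
    return item
```

With `extras=None` the record keeps the original three-key shape, which is what keeps text metrics unaffected.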

Example config usage

{
  "name": "accuracy",
  "type": "accuracy",
  "sample_log_num": 10,
  "sample_log_dir": "/path/to/output",
  "sub_types": [{"name": "accuracy_score"}]
}
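With a config like this, the log would be expected at /path/to/output/accuracy_samples.jsonl, following the {metric_name}_samples.jsonl naming described earlier. A short script along these lines could then surface the failing samples; `load_mismatches` is a hypothetical helper, not part of Olive.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_mismatches(log_path: str) -> List[Dict[str, Any]]:
    """Return the logged records whose prediction differs from the target."""
    mismatches = []
    with Path(log_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the file
            record = json.loads(line)
            if record["prediction"] != record["target"]:
                mismatches.append(record)
    return mismatches
```

Because each line is a standalone JSON object, the file can also be inspected with line-oriented tools without loading it whole.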

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: Added sample_log_num and sample_log_dir fields to Metric config, allowing users to save the first N sample predictions alongside ground truth to a JSONL file during accuracy evaluation for debugging purposes. For vision/audio tasks the log is enriched with the prompt and media file name, and the vision/audio preprocessors gain an optional id_col parameter to source the file name.

Copilot

Pull request overview

This PR adds an opt-in “sample prediction logging” capability to Olive’s accuracy evaluation flow by extending the Metric config and wiring a JSONL logger into evaluator backends, enabling debugging of accuracy results by inspecting per-sample predictions vs. targets.

Changes:

Extend Metric config with sample_log_num and sample_log_dir to control sample logging output.
Add OliveEvaluator.save_sample_log(...) and invoke it from accuracy evaluation paths for ONNX, PyTorch, OpenVINO, and QNN evaluators.
Add unit tests covering basic JSONL writing behavior for tensor and string predictions/targets.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`olive/evaluator/metric.py`	Adds `sample_log_num` / `sample_log_dir` fields to the `Metric` config model.
`olive/evaluator/olive_evaluator.py`	Implements `save_sample_log` and calls it from evaluator accuracy paths.
`test/evaluator/test_olive_evaluator.py`	Adds tests validating JSONL output for tensor/string data and sample-count capping.

Comments suppressed due to low confidence (1)

olive/evaluator/olive_evaluator.py:688

sample_log_num is applied in _evaluate_onnx_accuracy, but the distributed ONNX accuracy path (_evaluate_distributed_accuracy) still returns compute_accuracy(...) directly and never calls save_sample_log. This means users won't get sample logs when using DistributedOnnxModelHandler, which seems inconsistent with the PR description that the feature is integrated into the ONNX evaluator backend.

            else:
                inference_output, targets = self._inference_text(

Extend the per-sample accuracy log (PR #2530) beyond index/prediction/target to include the prompt and the vision/audio file name.

- Add a generic per-sample `extras` channel to OliveModelOutput and merge it into save_sample_log between index and prediction.
- Vision: add an `id_col` preprocessor param and emit {prompt, image} extras; file name falls back to the dataset row index when no id column is set.
- Audio: speech_transcription_pre_process now yields {audio, file_name}; genai inference loops read it via _normalize_audio_batch, and non-genai ONNX paths revert to raw arrays via _unwrap_audio_input before format_data.
- Fix idx shadowing in VisionVQADataset.__getitem__ (rename inner answer index to answer_idx) so the file-name fallback uses the real row index.
- Add unit tests for extras merging, audio input helpers, and file-name sourcing for both vision and audio.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom · 2026-06-23T00:00:21Z

/azp run Olive CI

azure-pipelines · 2026-06-23T00:00:31Z

Azure Pipelines successfully started running 1 pipeline(s).

Add two new fields to the Metric config:

- sample_log_num: number of sample predictions to save (default 0, disabled)
- sample_log_dir: directory for the sample log file (defaults to CWD)

When sample_log_num > 0, a JSONL file ({metric_name}_samples.jsonl) is saved with the first N predictions alongside their ground truth values. This helps debug accuracy evaluation results by inspecting individual sample predictions.

The feature is hooked into all four evaluator backends: ONNX, PyTorch, OpenVINO, and QNN.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Drop four function-local `import json` statements now that json is imported at module level (fixes pylint W0404 reimported / W0621 redefined-outer-name).
- Wrap the long VisionVQADataset return for ruff-format.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 19, 2026 18:35

Copilot started reviewing on behalf of jiafatom June 19, 2026 18:35

jiafatom force-pushed the jiafa/add-sample-log-evaluator branch from 4386010 to 9df0b52 on June 19, 2026 18:36

Copilot AI reviewed Jun 19, 2026

Comment thread olive/evaluator/olive_evaluator.py Outdated

github-advanced-security AI found potential problems Jun 19, 2026

Comment thread olive/evaluator/olive_evaluator.py Fixed

Comment thread test/evaluator/test_olive_evaluator.py Fixed

Comment thread test/evaluator/test_olive_evaluator.py Fixed

jiafatom force-pushed the jiafa/add-sample-log-evaluator branch 3 times, most recently from 81aadef to 5fcacb8 on June 20, 2026 01:33

github-advanced-security AI found potential problems Jun 20, 2026

Comment thread olive/evaluator/olive_evaluator.py Fixed

jiafatom and others added 5 commits June 24, 2026 15:30

Fix vision detection tests by setting sample_log_num on mock metric

e5ae6b9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix lintrunner errors: remove unused noqa and reformat save_sample_log

aee12d3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom force-pushed the jiafa/add-sample-log-evaluator branch from 7646b0e to 2e018f5 on June 24, 2026 15:30

xiaoyu-work approved these changes Jun 24, 2026

jiafatom merged commit 8e45835 into main Jun 24, 2026
13 checks passed

jiafatom deleted the jiafa/add-sample-log-evaluator branch June 24, 2026 19:26

jiafatom merged 5 commits into main from jiafa/add-sample-log-evaluator

4 participants
