Add sample_log_num to Metric for debugging accuracy evaluation #2530
Merged
4386010 to
9df0b52
Compare
Pull request overview
This PR adds an opt-in “sample prediction logging” capability to Olive’s accuracy evaluation flow by extending the Metric config and wiring a JSONL logger into evaluator backends, enabling debugging of accuracy results by inspecting per-sample predictions vs. targets.
Changes:
- Extend `Metric` config with `sample_log_num` and `sample_log_dir` to control sample logging output.
- Add `OliveEvaluator.save_sample_log(...)` and invoke it from accuracy evaluation paths for ONNX, PyTorch, OpenVINO, and QNN evaluators.
- Add unit tests covering basic JSONL writing behavior for tensor and string predictions/targets.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `olive/evaluator/metric.py` | Adds `sample_log_num` / `sample_log_dir` fields to the `Metric` config model. |
| `olive/evaluator/olive_evaluator.py` | Implements `save_sample_log` and calls it from evaluator accuracy paths. |
| `test/evaluator/test_olive_evaluator.py` | Adds tests validating JSONL output for tensor/string data and sample-count capping. |
Comments suppressed due to low confidence (1)
olive/evaluator/olive_evaluator.py:688
`sample_log_num` is applied in `_evaluate_onnx_accuracy`, but the distributed ONNX accuracy path (`_evaluate_distributed_accuracy`) still returns `compute_accuracy(...)` directly and never calls `save_sample_log`. This means users won't get sample logs when using `DistributedOnnxModelHandler`, which seems inconsistent with the PR description that the feature is integrated into the ONNX evaluator backend.
else:
inference_output, targets = self._inference_text(
81aadef to
5fcacb8
Compare
jiafatom added a commit that referenced this pull request Jun 22, 2026
Extend the per-sample accuracy log (PR #2530) beyond index/prediction/target to include the prompt and the vision/audio file name.
- Add a generic per-sample `extras` channel to OliveModelOutput and merge it into save_sample_log between index and prediction.
- Vision: add an `id_col` preprocessor param and emit {prompt, image} extras; file name falls back to the dataset row index when no id column is set.
- Audio: speech_transcription_pre_process now yields {audio, file_name}; genai inference loops read it via _normalize_audio_batch, and non-genai ONNX paths revert to raw arrays via _unwrap_audio_input before format_data.
- Fix idx shadowing in VisionVQADataset.__getitem__ (rename inner answer index to answer_idx) so the file-name fallback uses the real row index.
- Add unit tests for extras merging, audio input helpers, and file-name sourcing for both vision and audio.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
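The file-name fallback order described for `id_col` in this PR can be sketched as follows. This is a minimal illustration, not Olive's implementation: `resolve_file_name` is a hypothetical helper, and the assumption that an audio row stores its source path under `row["audio"]["path"]` follows the usual Hugging Face audio layout rather than anything confirmed by the diff.

```python
import os
from typing import Any, Mapping, Optional


def resolve_file_name(row: Mapping[str, Any], row_index: int, id_col: Optional[str] = None) -> str:
    """Hypothetical helper: choose a media file name for a sample-log record."""
    # 1. Prefer the user-specified id column when it is set and present in the row.
    if id_col and row.get(id_col) is not None:
        return str(row[id_col])
    # 2. Assumed layout: an audio row may carry its source path under row["audio"]["path"].
    audio = row.get("audio")
    if isinstance(audio, Mapping) and audio.get("path"):
        return os.path.basename(audio["path"])
    # 3. Otherwise fall back to the dataset row index.
    return str(row_index)
```

With no `id_col` and no path, a vision row at index 14 would be named `"14"`, which is consistent with the `"image": "14"` value in the example record from the PR description.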
/azp run Olive CI
Azure Pipelines successfully started running 1 pipeline(s).
Add two new fields to the Metric config:
- sample_log_num: number of sample predictions to save (default 0, disabled)
- sample_log_dir: directory for the sample log file (defaults to CWD)
When sample_log_num > 0, a JSONL file ({metric_name}_samples.jsonl) is
saved with the first N predictions alongside their ground truth values.
This helps debug accuracy evaluation results by inspecting individual
sample predictions.
The feature is hooked into all four evaluator backends:
ONNX, PyTorch, OpenVINO, and QNN.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
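The behavior this commit message describes (write the first N predictions next to their targets as JSONL, disabled when N is 0) could look roughly like the sketch below. It is an assumption-laden stand-in, not the body of `OliveEvaluator.save_sample_log`: `write_sample_log` and `to_python` are hypothetical names, and tensor conversion is approximated by calling `.tolist()` on anything that provides it.

```python
import json
from pathlib import Path
from typing import Any, Optional, Sequence


def to_python(value: Any) -> Any:
    """Convert tensor-like values to plain Python values; leave strings and scalars as-is."""
    # Assumption: tensors and arrays expose .tolist(); plain strings do not.
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


def write_sample_log(
    metric_name: str,
    predictions: Sequence[Any],
    targets: Sequence[Any],
    sample_log_num: int,
    sample_log_dir: Optional[str] = None,
) -> Optional[Path]:
    """Hypothetical sketch: save the first N prediction/target pairs as JSONL."""
    if sample_log_num <= 0:
        return None  # logging is disabled by default
    out_dir = Path(sample_log_dir) if sample_log_dir else Path.cwd()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{metric_name}_samples.jsonl"
    # Never read past the end of either sequence.
    count = min(sample_log_num, len(predictions), len(targets))
    with out_path.open("w", encoding="utf-8") as f:
        for i in range(count):
            record = {
                "index": i,
                "prediction": to_python(predictions[i]),
                "target": to_python(targets[i]),
            }
            f.write(json.dumps(record) + "\n")
    return out_path
```

Capping the count at the shortest input keeps the sketch safe when fewer than `sample_log_num` samples are evaluated.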
- Drop four function-local `import json` statements now that json is imported at module level (fixes pylint W0404 reimported / W0621 redefined-outer-name).
- Wrap the long VisionVQADataset return for ruff-format.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7646b0e to
2e018f5
Compare
xiaoyu-work approved these changes Jun 24, 2026
Describe your changes
Add two new fields to the `Metric` config class for debugging accuracy evaluation:
- `sample_log_num` (int, default `0`): Number of sample predictions to log alongside ground truth. When > 0, saves a JSONL file with the first N sample results.
- `sample_log_dir` (Optional[str], default `None`): Directory to save the sample log file. Defaults to the current working directory.

When `sample_log_num > 0`, a JSONL file (`{metric_name}_samples.jsonl`) is written with each line containing the sample `index`, `prediction`, and `target`. For vision and audio (GenAI) tasks, each record is additionally enriched with the prompt and the media file name, so failures can be inspected without re-deriving the input:

    {"index": 14, "prompt": "This diagram shows the life cycle of an insect...\n1. B-A-C\n2. C-A-B\n3. A-B-C\n4. B-C-A", "image": "14", "prediction": "1", "target": "3"}

- Tasks without extras keep the `{"index", "prediction", "target"}` shape.
- Vision tasks add `prompt` and `image`; audio tasks add `audio`.

This helps debug accuracy evaluation results by inspecting individual sample predictions vs. ground truth. It works with tensor data (converted to Python values) and string data (text-based metrics like WER, exact_match, etc.).
The feature is integrated into all four evaluator backends: ONNX, PyTorch, OpenVINO, and QNN.

Sourcing the media file name (`id_col`)
The vision (`vision_vqa_pre_process`) and audio (`speech_transcription_pre_process`) preprocessors accept an optional `id_col` parameter naming a dataset column to use as the media file name. When unset (or absent in the row), the file name falls back to the HF audio `path` basename (audio) or the dataset row index (vision/audio). This makes the field useful even for datasets that embed media in-memory without a path.

Implementation notes
- A generic per-sample `extras` channel was added to `OliveModelOutput` and merged into `save_sample_log` between `index` and `prediction`. It is backward-compatible (`extras` defaults to `None`).
- Vision emits `extras` directly; audio metadata travels with the array as a `{audio, file_name}` dict and is unwrapped back to raw arrays for non-GenAI ONNX paths before `format_data`.

Example config usage

    {
      "name": "accuracy",
      "type": "accuracy",
      "sample_log_num": 10,
      "sample_log_dir": "/path/to/output",
      "sub_types": [{"name": "accuracy_score"}]
    }

Checklist before requesting a review
- `lintrunner -a`

Release note: Added `sample_log_num` and `sample_log_dir` fields to `Metric` config, allowing users to save the first N sample predictions alongside ground truth to a JSONL file during accuracy evaluation for debugging purposes. For vision/audio tasks the log is enriched with the prompt and media file name, and the vision/audio preprocessors gain an optional `id_col` parameter to source the file name.
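Once a sample log exists, a short script is enough to pull out the failing samples. The sketch below assumes only the record shape given in this description (`index`, `prediction`, `target`, plus optional extras); `load_mismatches` is a hypothetical helper and not part of Olive.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_mismatches(log_path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: return records whose prediction differs from the target."""
    mismatches = []
    with Path(log_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record["prediction"] != record["target"]:
                mismatches.append(record)
    return mismatches
```

For example, `load_mismatches("/path/to/output/accuracy_samples.jsonl")` would return the index-14 record shown above, since its prediction `"1"` differs from its target `"3"`. Extra keys such as `prompt` or `image` are carried along untouched, so the failing input can be inspected directly.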