Add sample_log_num to Metric for debugging accuracy evaluation #2530

Merged
jiafatom merged 5 commits into main from jiafa/add-sample-log-evaluator
Jun 24, 2026

@jiafatom jiafatom commented Jun 19, 2026

Describe your changes

Add two new fields to the Metric config class for debugging accuracy evaluation:

  • sample_log_num (int, default 0): Number of sample predictions to log alongside ground truth. When > 0, saves a JSONL file with the first N sample results.
  • sample_log_dir (Optional[str], default None): Directory to save the sample log file. When None, the file is written to the current working directory.

When sample_log_num > 0, a JSONL file ({metric_name}_samples.jsonl) is written with each line containing the sample index, prediction, and target. For vision and audio (GenAI) tasks, each record is additionally enriched with the prompt and the media file name, so failures can be inspected without re-deriving the input:

{"index": 14, "prompt": "This diagram shows the life cycle of an insect...\n1. B-A-C\n2. C-A-B\n3. A-B-C\n4. B-C-A", "image": "14", "prediction": "1", "target": "3"}
  • Text metrics keep the original {"index", "prediction", "target"} shape.
  • Vision tasks add prompt and image; audio tasks add audio.

This helps debug accuracy evaluation results by comparing individual sample predictions with the ground truth. Logging works with both tensor data (converted to Python values) and string data (text-based metrics such as WER and exact_match).
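
The conversion of tensor data to Python values could look roughly like the sketch below. It relies on the fact that both torch.Tensor and numpy.ndarray expose a tolist() method; to_python is a hypothetical name, and the PR's actual conversion code may differ.

```python
import json


def to_python(value):
    """Convert tensor-like values to JSON-serializable Python values."""
    # torch.Tensor and numpy.ndarray both provide tolist(); strings and
    # plain Python numbers do not, so they pass through unchanged.
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


# Plain Python values stand in for tensors so the sketch has no dependencies.
record = {"index": 0, "prediction": to_python([0.1, 0.9]), "target": to_python(1)}
line = json.dumps(record)
```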

The feature is integrated into all four evaluator backends: ONNX, PyTorch, OpenVINO, and QNN.

Sourcing the media file name (id_col)

The vision (vision_vqa_pre_process) and audio (speech_transcription_pre_process) preprocessors accept an optional id_col parameter that names the dataset column to use as the media file name. When id_col is unset (or the column is absent from the row), the file name falls back to the basename of the Hugging Face audio path (audio only) or, failing that, to the dataset row index (vision and audio). This keeps the field useful even for datasets that embed media in memory without a path.
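
The fallback order can be sketched as follows. This is an illustration under assumptions: resolve_media_name and its parameters are hypothetical and do not reflect the preprocessors' real signatures.

```python
import os


def resolve_media_name(row, row_index, id_col=None, audio_path=None):
    """Hypothetical helper mirroring the described fallback order."""
    # 1. Prefer the user-named id column when it is set and present in the row.
    if id_col is not None and row.get(id_col) is not None:
        return str(row[id_col])
    # 2. For audio, fall back to the basename of the source path, if there is one.
    if audio_path:
        return os.path.basename(audio_path)
    # 3. Otherwise use the dataset row index.
    return str(row_index)
```

With no id column and no path, row 14 resolves to "14", which matches the "image": "14" value in the sample record above.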

Implementation notes

  • A generic per-sample extras channel was added to OliveModelOutput; save_sample_log merges each sample's extras into its record between index and prediction. The change is backward-compatible (extras defaults to None).
  • GenAI inference loops populate extras directly. Audio metadata travels with the array as an {audio, file_name} dict, which non-GenAI ONNX paths unwrap back to raw arrays before format_data.
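
A rough sketch of the merge order, assuming extras is a list with one optional dict per sample. build_sample_records is a hypothetical stand-in, not the actual save_sample_log implementation.

```python
def build_sample_records(predictions, targets, num_samples, extras=None):
    """Build log records with extras placed between index and prediction."""
    count = min(num_samples, len(predictions), len(targets))
    records = []
    for i in range(count):
        # Python dicts preserve insertion order, so key order here is the
        # order the keys appear in the serialized JSON line.
        record = {"index": i}
        if extras is not None and i < len(extras) and extras[i]:
            record.update(extras[i])
        record["prediction"] = predictions[i]
        record["target"] = targets[i]
        records.append(record)
    return records
```

When extras is None the records keep the original {"index", "prediction", "target"} shape, which is what makes the channel backward-compatible in this sketch.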

Example config usage

{
  "name": "accuracy",
  "type": "accuracy",
  "sample_log_num": 10,
  "sample_log_dir": "/path/to/output",
  "sub_types": [{"name": "accuracy_score"}]
}
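
As an illustration of where the log ends up for a config like the one above, the sketch below derives the output path from the {metric_name}_samples.jsonl naming and the current-working-directory default described earlier. sample_log_path is a hypothetical helper that reads a plain dict, not the Metric class.

```python
from pathlib import Path


def sample_log_path(metric_config):
    """Return the expected sample log path, or None when logging is disabled."""
    if metric_config.get("sample_log_num", 0) <= 0:
        return None
    # An unset or None sample_log_dir falls back to the current working directory.
    log_dir = metric_config.get("sample_log_dir") or Path.cwd()
    return Path(log_dir) / f"{metric_config['name']}_samples.jsonl"


config = {
    "name": "accuracy",
    "type": "accuracy",
    "sample_log_num": 10,
    "sample_log_dir": "/path/to/output",
    "sub_types": [{"name": "accuracy_score"}],
}
log_path = sample_log_path(config)
```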

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: Added sample_log_num and sample_log_dir fields to Metric config, allowing users to save the first N sample predictions alongside ground truth to a JSONL file during accuracy evaluation for debugging purposes. For vision/audio tasks the log is enriched with the prompt and media file name, and the vision/audio preprocessors gain an optional id_col parameter to source the file name.

Copilot AI review requested due to automatic review settings June 19, 2026 18:35
@jiafatom jiafatom force-pushed the jiafa/add-sample-log-evaluator branch from 4386010 to 9df0b52 on June 19, 2026 18:36

Copilot AI left a comment

Pull request overview

This PR adds an opt-in “sample prediction logging” capability to Olive’s accuracy evaluation flow by extending the Metric config and wiring a JSONL logger into evaluator backends, enabling debugging of accuracy results by inspecting per-sample predictions vs. targets.

Changes:

  • Extend Metric config with sample_log_num and sample_log_dir to control sample logging output.
  • Add OliveEvaluator.save_sample_log(...) and invoke it from accuracy evaluation paths for ONNX, PyTorch, OpenVINO, and QNN evaluators.
  • Add unit tests covering basic JSONL writing behavior for tensor and string predictions/targets.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| olive/evaluator/metric.py | Adds sample_log_num / sample_log_dir fields to the Metric config model. |
| olive/evaluator/olive_evaluator.py | Implements save_sample_log and calls it from evaluator accuracy paths. |
| test/evaluator/test_olive_evaluator.py | Adds tests validating JSONL output for tensor/string data and sample-count capping. |

Comments suppressed due to low confidence (1)

olive/evaluator/olive_evaluator.py:688

  • sample_log_num is applied in _evaluate_onnx_accuracy, but the distributed ONNX accuracy path (_evaluate_distributed_accuracy) still returns compute_accuracy(...) directly and never calls save_sample_log. This means users won't get sample logs when using DistributedOnnxModelHandler, which seems inconsistent with the PR description that the feature is integrated into the ONNX evaluator backend.
            else:
                inference_output, targets = self._inference_text(

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment thread olive/evaluator/olive_evaluator.py Fixed
Comment thread test/evaluator/test_olive_evaluator.py Fixed
Comment thread test/evaluator/test_olive_evaluator.py Fixed
@jiafatom jiafatom force-pushed the jiafa/add-sample-log-evaluator branch 3 times, most recently from 81aadef to 5fcacb8 on June 20, 2026 01:33
Comment thread olive/evaluator/olive_evaluator.py Fixed
jiafatom added a commit that referenced this pull request Jun 22, 2026
Extend the per-sample accuracy log (PR #2530) beyond index/prediction/
target to include the prompt and the vision/audio file name.

- Add a generic per-sample `extras` channel to OliveModelOutput and merge
  it into save_sample_log between index and prediction.
- Vision: add an `id_col` preprocessor param and emit {prompt, image} extras;
  file name falls back to the dataset row index when no id column is set.
- Audio: speech_transcription_pre_process now yields {audio, file_name};
  genai inference loops read it via _normalize_audio_batch, and non-genai
  ONNX paths revert to raw arrays via _unwrap_audio_input before format_data.
- Fix idx shadowing in VisionVQADataset.__getitem__ (rename inner answer
  index to answer_idx) so the file-name fallback uses the real row index.
- Add unit tests for extras merging, audio input helpers, and file-name
  sourcing for both vision and audio.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom

/azp run Olive CI

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jiafatom and others added 5 commits June 24, 2026 15:30
Add two new fields to the Metric config:
- sample_log_num: number of sample predictions to save (default 0, disabled)
- sample_log_dir: directory for the sample log file (defaults to CWD)

When sample_log_num > 0, a JSONL file ({metric_name}_samples.jsonl) is
saved with the first N predictions alongside their ground truth values.
This helps debug accuracy evaluation results by inspecting individual
sample predictions.

The feature is hooked into all four evaluator backends:
ONNX, PyTorch, OpenVINO, and QNN.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Drop four function-local `import json` statements now that json is
  imported at module level (fixes pylint W0404 reimported / W0621
  redefined-outer-name).
- Wrap the long VisionVQADataset return for ruff-format.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom force-pushed the jiafa/add-sample-log-evaluator branch from 7646b0e to 2e018f5 on June 24, 2026 15:30
@jiafatom jiafatom merged commit 8e45835 into main Jun 24, 2026
13 checks passed
@jiafatom jiafatom deleted the jiafa/add-sample-log-evaluator branch June 24, 2026 19:26