Add LMMSEvaluator (integrate lmms-eval toolkit) for multimodal (vision+audio) evaluation#2531

Draft
DelwinKim wants to merge 6 commits into microsoft:main from DelwinKim:t-delwinkim/lmms-ort-evaluator

Conversation

@DelwinKim

Describe your changes

This PR adds multimodal (vision + audio) evaluation to Olive by integrating the lmms-eval harness as an Olive evaluator, plus the supporting pass needed to make quantized multi-component ORT-GenAI packages evaluable.

What's added

1. LMMSEvaluator (olive/evaluator/olive_evaluator.py)
A new evaluator ("type": "LMMSEvaluator") that runs lmms-eval benchmarks from within a single olive run (it should also be possible to run the evaluation from a standalone script if needed). It supports two model handler types:

  • ONNXModelHandler pointing at an ORT-GenAI multimodal package (genai_config.json + quantized ONNX, e.g. from MobiusBuilder + OnnxKQuantQuantization) → dispatches to the new ortgenai_mm adapter.
  • HfModelHandler for HuggingFace PyTorch multimodal models → dispatches to lmms-eval's native wrapper, auto-detected from the HF model_type (overridable via model_class).

Benchmarks are selected purely by config ("tasks": ["ocrbench", "docvqa_val_lite", "librispeech_test_clean", "fleurs_en", ...]), and the harness's official scorers are used (OCRBench scoring, DocVQA ANLS, WER, etc.), so results are easier to compare with other lmms-eval runs.
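To make the dispatch rule above concrete, here is a minimal sketch. The two dataclasses are hypothetical stand-ins for Olive's handler classes, `select_lmms_model` is an illustrative name, and the assumption that the native wrapper name equals the HF model_type is a simplification; the real mapping in the PR may differ.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for Olive's handler classes (illustration only).
@dataclass
class ONNXModelHandler:
    model_path: str


@dataclass
class HfModelHandler:
    model_path: str
    model_type: str  # assumed to mirror the HF config's model_type


def select_lmms_model(handler, model_class: Optional[str] = None) -> str:
    """Pick the lmms-eval model name for a handler (sketch, not the PR's code)."""
    if isinstance(handler, ONNXModelHandler):
        # ORT-GenAI packages always go through the new adapter.
        return "ortgenai_mm"
    if isinstance(handler, HfModelHandler):
        # An explicit model_class overrides auto-detection from model_type.
        return model_class or handler.model_type
    raise TypeError(f"Unsupported handler type: {type(handler).__name__}")
```

Keeping the override as a plain optional argument means a recipe only has to set model_class when auto-detection picks the wrong wrapper.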

2. ortgenai_mm adapter (olive/evaluator/lmms_ort.py)
An lmms-eval model adapter that drives ORT-GenAI multimodal generation. It implements generate_until and loglikelihood, and handles image and audio inputs (og.Images / og.Audios), per-model-type processor argument shapes (Phi-4-MM vs Whisper), and EOS/stop handling (including Whisper's BOS==EOS collision).
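The EOS/stop handling can be illustrated with a small sketch. This is not the adapter's code: `trim_at_eos` and `truncate_at_stop` are hypothetical helpers, and treating a leading run of the shared token as a start marker is only one plausible way to handle a BOS==EOS collision.

```python
from typing import List, Sequence


def trim_at_eos(token_ids: Sequence[int], eos_id: int, bos_id: int) -> List[int]:
    """Cut generated ids at the first EOS, tolerating a shared BOS/EOS id."""
    start = 0
    if bos_id == eos_id:
        # Assumption: leading copies of the shared id mark the start of the
        # sequence, so they must not be read as "generation finished".
        while start < len(token_ids) and token_ids[start] == eos_id:
            start += 1
    kept: List[int] = []
    for tok in token_ids[start:]:
        if tok == eos_id:
            break
        kept.append(tok)
    return kept


def truncate_at_stop(text: str, until: Sequence[str]) -> str:
    """Cut decoded text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in until:
        if stop:  # ignore empty stop strings
            idx = text.find(stop)
            if idx != -1:
                cut = min(cut, idx)
    return text[:cut]
```

For example, `trim_at_eos([7, 1, 2, 7, 3], eos_id=7, bos_id=7)` keeps `[1, 2]`: the leading 7 is skipped and the second 7 ends the sequence.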

3. CompositeToOnnxPackage pass (olive/passes/onnx/composite_to_onnx_package.py)
MobiusBuilder emits a multi-component CompositeModelHandler (decoder / vision / audio / embedding in subdirectories). A composite can't be evaluated directly — it has no single session, and the nested layout defeats ORT-GenAI's genai_config.json auto-detection. This pass flattens the components into a single-directory ORT-GenAI package (rewriting external-data references and genai_config.json) and returns a runnable ONNXModelHandler, so the evaluator can load it.
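A rough sketch of the flattening idea, under stated assumptions: genai_config.json sits at the root of the composite output, component entries live under a top-level "model" key with a "filename" field, and file names do not collide across components. `flatten_package` is an illustrative name, and the ONNX external-data rewrite that the real pass performs is deliberately left out here because it needs the onnx package.

```python
import json
import shutil
from pathlib import Path


def flatten_package(src: Path, dst: Path) -> None:
    """Copy component files into one directory and fix config file names.

    Sketch only: dst must not be inside src, and external-data references
    inside the ONNX files are not rewritten.
    """
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.rglob("*")):
        if path.is_file() and path.name != "genai_config.json":
            target = dst / path.name
            if target.exists():
                raise FileExistsError(f"Name collision while flattening: {path.name}")
            shutil.copy2(path, target)

    # Assumed layout: {"model": {"decoder": {"filename": "decoder/model.onnx"}, ...}}
    config = json.loads((src / "genai_config.json").read_text())
    for component in config.get("model", {}).values():
        if isinstance(component, dict) and "filename" in component:
            # Drop the subdirectory prefix so the name resolves in the flat layout.
            component["filename"] = Path(component["filename"]).name
    (dst / "genai_config.json").write_text(json.dumps(config, indent=2))
```

Failing loudly on a name collision is a deliberate choice in this sketch: silently overwriting one component's weights with another's would produce a package that loads but evaluates incorrectly.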

4. Registration

  • olive/olive_config.json: registers the new evaluator and pass.

Example recipe

A complete example recipe — build + quantize Gemma‑4 E2B, flatten to an ORT‑GenAI package, then evaluate vision and audio benchmarks:

{
  "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "accelerators": [
        { "device": "gpu", "execution_providers": ["CUDAExecutionProvider"] }
      ]
    }
  },
  "passes": {
    "mobius_build":  { "type": "MobiusBuilder", "precision": "fp16" },
    "int4_quantize": { "type": "OnnxKQuantQuantization", "bits": 4, "block_size": 32, "save_as_external_data": true },
    "flatten":       { "type": "CompositeToOnnxPackage" }
  },
  "evaluators": {
    "evaluator": {
      "type": "LMMSEvaluator",
      "tasks": ["ocrbench", "docvqa_val_lite", "librispeech_test_clean", "fleurs_en"],
      "batch_size": 1,
      "max_new_tokens": 256,
      "max_length": 4096,
      "limit": 10,
      "log_samples": false,
      "image_token_format": "<|image|>",
      "audio_token_format": "<|audio|>",
      "output_path": "results/gemma4_vision_audio.json"
    }
  },
  "evaluator": "evaluator",
  "evaluate_input_model": false,
  "target": "local_system",
  "output_dir": "models/gemma4_vision_audio",
  "cache_dir": "cache/gemma4_vision_audio",
  "no_artifacts": true
}

LMMSEvaluator config reference

Each entry gives the field, its default, which handler types it applies to, and a description.

  • tasks (required; applies to both): List of lmms-eval task names, e.g. ocrbench, docvqa_val_lite, chartqa_lite, ai2d_lite, mmmu_val, textvqa_val, librispeech_test_clean, librispeech_test_other, fleurs_en.
  • limit (default None, i.e. the full set; applies to both): Cap on the number of samples per task.
  • batch_size (default 1; applies to both): Only 1 is currently supported (see limitations).
  • max_new_tokens (default 256; applies to both): Maximum tokens generated per generate_until request.
  • max_length (default 32768; applies to both): Total sequence budget (prompt + multimodal embeds + completion). Image/audio embeds can be 1000+ tokens, so keep this generous; lower it (e.g. 4096) to bound memory.
  • image_token_format (default "<|image_{index}|>"; applies to ONNX/genai): Model-specific image placeholder. The adapter pre-renders media tokens into the prompt itself, then calls apply_chat_template only for the system/user turn scaffolding, so this must match the model's expected token. The default is Phi-4-MM style (<|image_1|>); Gemma-4 requires <|image|>. With a wrong value the processor can't bind image tokens to images, and the run fails.
  • audio_token_format (default "<|audio_{index}|>"; applies to ONNX/genai): Model-specific audio placeholder; same mechanism and caveats as image_token_format. Gemma-4 requires <|audio|>.
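As a small illustration of how the two placeholder styles differ, the sketch below renders media tokens with str.format. The helper names, the 1-based indexing, and placing media tokens before the user text are assumptions for illustration; the adapter's actual prompt builder may order things differently.

```python
from typing import List


def render_media_tokens(token_format: str, count: int) -> List[str]:
    """Render one placeholder per media item, using 1-based indices."""
    # A format without "{index}" (Gemma-4 style) is returned unchanged.
    return [token_format.format(index=i) for i in range(1, count + 1)]


def build_user_content(text: str, image_format: str, num_images: int) -> str:
    """Prefix the user text with the rendered image placeholders."""
    return "".join(render_media_tokens(image_format, num_images)) + text


# Phi-4-MM style placeholders are numbered per image.
print(build_user_content("What is shown?", "<|image_{index}|>", 2))
# Gemma-4 style placeholders are identical for every image.
print(build_user_content("What is shown?", "<|image|>", 2))
```

The same helper works for audio_token_format, since both fields use the same "{index}" convention.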

Vision-only or audio-only runs are the same recipe with the corresponding subset of tasks. Audio tasks (librispeech, fleurs, covost) additionally require FFmpeg available at runtime (HF datasets decodes audio via torchcodec).
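For instance, a vision-only or audio-only variant can be derived from the full recipe by filtering its task list. This is a sketch: `subset_recipe` is a hypothetical helper, and classifying audio tasks by the name prefixes listed above is an assumption that may not hold for every lmms-eval task.

```python
import copy

# Prefixes taken from the audio task families named above (assumed exhaustive here).
AUDIO_PREFIXES = ("librispeech", "fleurs", "covost")


def subset_recipe(recipe: dict, keep_audio: bool) -> dict:
    """Return a copy of the recipe with only audio (or only non-audio) tasks."""
    out = copy.deepcopy(recipe)
    # The top-level "evaluator" key names the entry under "evaluators".
    evaluator = out["evaluators"][out["evaluator"]]
    evaluator["tasks"] = [
        task for task in evaluator["tasks"]
        if task.startswith(AUDIO_PREFIXES) == keep_audio
    ]
    return out
```

Working on a deep copy keeps the full recipe intact, so the same base dictionary can produce both variants.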

Tests (not very complete or thorough)

  • test/evaluator/test_lmms_ort.py — adapter unit tests (generation, loglikelihood, image/audio handling, processor branching, EOS logic).
  • test/passes/onnx/test_composite_to_onnx_package.py — flatten pass (external-data rewrite, genai_config rewrite, entry-point component selection).

Notes / limitations

  • batch_size is currently fixed at 1: ORT-GenAI's per-request Generator + set_inputs flow is single-sequence, and batching ragged, per-request multimodal (image/audio) inputs isn't currently supported by the runtime.
  • lmms-eval is imported lazily, so non-multimodal Olive usage is unaffected if it isn't installed. Running the LMMSEvaluator (and the full adapter test suite) requires pip install lmms-eval.
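The lazy-import behaviour described in the last note follows a common pattern, sketched below. `require_module` is a hypothetical helper, not a function from this PR; it defers the import until the evaluator actually runs and turns a missing dependency into an actionable error.

```python
import importlib
from types import ModuleType


def require_module(import_name: str, pip_name: str) -> ModuleType:
    """Import a module on first use, with an install hint if it is missing."""
    try:
        return importlib.import_module(import_name)
    except ImportError as exc:
        raise ImportError(
            f"{import_name} is required for this evaluator; "
            f"install it with: pip install {pip_name}"
        ) from exc
```

A call such as `require_module("lmms_eval", "lmms-eval")` would then sit inside the evaluator's run path rather than at module import time, so importing Olive itself never touches the optional dependency.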

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass. (Adapter tests require lmms-eval installed; mocked tests pass without it, but the tests that instantiate the real adapter need the dependency.)
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
    • Release note: Add LMMSEvaluator for multimodal (vision + audio) evaluation via the lmms-eval harness, and CompositeToOnnxPackage to flatten multi-component ORT-GenAI packages for evaluation.

(Optional) Issue link

DelwinKim and others added 5 commits June 19, 2026 18:04
Adds LMMSEvaluator (olive/evaluator/olive_evaluator.py) and an
ORT-GenAI multimodal adapter (olive/evaluator/lmms_ort.py) for
evaluating multimodal ONNX models via lmms-eval.

Build on top of the LMMSEvaluator + ORT-GenAI multimodal adapter foundation:

- LMMSEvaluator now dispatches HfModelHandler inputs to lmms-eval's native
  per-architecture wrappers (phi4_multimodal, qwen2_5_vl, whisper, ...),
  with auto-detection from HF model_type and a forwarded-kwargs filter
  that only passes args the target wrapper actually declares (handles
  wrappers like qwen2_5_vl which assert kwargs == {}). Enables
  FP-vs-quantized comparison in a single recipe via evaluate_input_model.

- lmms_ort.py adapter: tolerant audio/image disambiguation (audio dicts
  with "path" no longer get mis-routed to PIL.Image.open), Whisper-specific
  prompt + EOS-collision handling so ASR works end-to-end through
  ortgenai_mm without the Phi-4-MM chat-template scaffolding interfering.

- New CompositeToOnnxPackage pass: flattens nested CompositeModel ORT-GenAI
  packages (subdir-per-component or root-level) into the flat layout
  LMMSEvaluator expects. Tolerates extensionless component filenames
  produced by some upstream quant passes.

- Tests: 32 in test_lmms_ort.py (entry-point/registry, HF dispatch,
  kwargs filter, prompt builder, score_continuation, partition_visuals,
  run_generation), 9 in test_composite_to_onnx_package.py (flatten +
  external-data rewrites + fallback entry-point).

Validated end-to-end:
- whisper-large-v3 via HfModel -> ModelBuilder fp16 -> KQuant int8 ->
  CompositeToOnnxPackage -> ortgenai_mm eval on LibriSpeech.
  FP HF WER 1.52/2.26 (clean/other), INT8 ONNX WER 1.68/2.36.
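
The forwarded-kwargs filter mentioned in this commit can be sketched with inspect.signature. This is an illustration under assumptions, not the PR's implementation: `filter_declared_kwargs` and `FakeWrapper` are hypothetical names, and the sketch deliberately ignores a catch-all **kwargs parameter so that wrappers asserting kwargs == {} still construct cleanly.

```python
import inspect
from typing import Any, Callable, Dict


def filter_declared_kwargs(target: Callable[..., Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keyword arguments that target declares by name."""
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    declared = {
        name
        for name, param in inspect.signature(target).parameters.items()
        if param.kind in named_kinds  # **kwargs is intentionally not counted
    }
    return {key: value for key, value in kwargs.items() if key in declared}


class FakeWrapper:
    """Hypothetical wrapper that rejects unknown kwargs, like the case above."""

    def __init__(self, pretrained: str, batch_size: int = 1, **kwargs: Any) -> None:
        assert kwargs == {}, f"Unexpected kwargs: {kwargs}"
        self.pretrained = pretrained
        self.batch_size = batch_size


requested = {"pretrained": "some/model", "batch_size": 1, "image_token_format": "<|image|>"}
wrapper = FakeWrapper(**filter_declared_kwargs(FakeWrapper, requested))
```
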
…ocessor args

- MobiusBuilder: add `mobius_ep_override` config knob. Lets a workflow force
  the mobius execution_provider (e.g. "default") independent of the Olive
  accelerator EP. Needed because mobius's cuda-EP attention fusions
  (PackedMultiHeadAttention for Qwen2.5-VL vision, GQA for Gemma-4 decoder)
  produce graphs the ORT-GenAI fused-attention kernels reject. "default" EP
  skips those fusions; the resulting INT4 graph is numerically equivalent.
- lmms_ort: support torchcodec.AudioDecoder visuals (HF datasets 5.x audio
  feature) in _normalize_audio via duck-typed get_all_samples().
- lmms_ort: branch processor-arg shape on model type - Phi-4-MM needs a bare
  string, Whisper needs [prompt]. Passing a list to Phi-4-MM raised
  "Number of image tokens does not match the number of images".

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the setup.py lmms_eval.models entry point, the _model_manifest
factory, and its registration tests. The Olive LMMSEvaluator path imports
LMMSORTGenAIEvaluator directly, so the entry point only affected the
standalone lmms-eval CLI; dropping it keeps setup.py out of this change.
@microsoft-github-policy-service

Contributor

@DelwinKim please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

Contribution License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
and conveys certain license rights to Microsoft Corporation and its affiliates (“Microsoft”) for Your
contributions to Microsoft open source projects. This Agreement is effective as of the latest signature
date below.

  1. Definitions.
    “Code” means the computer software code, whether in human-readable or machine-executable form,
    that is delivered by You to Microsoft under this Agreement.
    “Project” means any of the projects owned or managed by Microsoft and offered under a license
    approved by the Open Source Initiative (www.opensource.org).
    “Submit” is the act of uploading, submitting, transmitting, or distributing code or other content to any
    Project, including but not limited to communication on electronic mailing lists, source code control
    systems, and issue tracking systems that are managed by, or on behalf of, the Project for the purpose of
    discussing and improving that Project, but excluding communication that is conspicuously marked or
    otherwise designated in writing by You as “Not a Submission.”
    “Submission” means the Code and any other copyrightable material Submitted by You, including any
    associated comments and documentation.
  2. Your Submission. You must agree to the terms of this Agreement before making a Submission to any
    Project. This Agreement covers any and all Submissions that You, now or in the future (except as
    described in Section 4 below), Submit to any Project.
  3. Originality of Work. You represent that each of Your Submissions is entirely Your original work.
    Should You wish to Submit materials that are not Your original work, You may Submit them separately
    to the Project if You (a) retain all copyright and license information that was in the materials as You
    received them, (b) in the description accompanying Your Submission, include the phrase “Submission
    containing materials of a third party:” followed by the names of the third party and any licenses or other
    restrictions of which You are aware, and (c) follow any other instructions in the Project’s written
    guidelines concerning Submissions.
  4. Your Employer. References to “employer” in this Agreement include Your employer or anyone else
    for whom You are acting in making Your Submission, e.g. as a contractor, vendor, or agent. If Your
    Submission is made in the course of Your work for an employer or Your employer has intellectual
    property rights in Your Submission by contract or applicable law, You must secure permission from Your
    employer to make the Submission before signing this Agreement. In that case, the term “You” in this
    Agreement will refer to You and the employer collectively. If You change employers in the future and
    desire to Submit additional Submissions for the new employer, then You agree to sign a new Agreement
    and secure permission from the new employer before Submitting those Submissions.
  5. Licenses.
  • Copyright License. You grant Microsoft, and those who receive the Submission directly or
    indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license in the
    Submission to reproduce, prepare derivative works of, publicly display, publicly perform, and distribute
    the Submission and such derivative works, and to sublicense any or all of the foregoing rights to third
    parties.
  • Patent License. You grant Microsoft, and those who receive the Submission directly or
    indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license under
    Your patent claims that are necessarily infringed by the Submission or the combination of the
    Submission with the Project to which it was Submitted to make, have made, use, offer to sell, sell and
    import or otherwise dispose of the Submission alone or with the Project.
  • Other Rights Reserved. Each party reserves all rights not expressly granted in this Agreement.
    No additional licenses or rights whatsoever (including, without limitation, any implied licenses) are
    granted by implication, exhaustion, estoppel or otherwise.
  6. Representations and Warranties. You represent that You are legally entitled to grant the above
    licenses. You represent that each of Your Submissions is entirely Your original work (except as You may
    have disclosed under Section 3). You represent that You have secured permission from Your employer to
    make the Submission in cases where Your Submission is made in the course of Your work for Your
    employer or Your employer has intellectual property rights in Your Submission by contract or applicable
    law. If You are signing this Agreement on behalf of Your employer, You represent and warrant that You
    have the necessary authority to bind the listed employer to the obligations contained in this Agreement.
    You are not expected to provide support for Your Submission, unless You choose to do so. UNLESS
    REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, AND EXCEPT FOR THE WARRANTIES
    EXPRESSLY STATED IN SECTIONS 3, 4, AND 6, THE SUBMISSION PROVIDED UNDER THIS AGREEMENT IS
    PROVIDED WITHOUT WARRANTY OF ANY KIND, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTY OF
    NONINFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
  7. Notice to Microsoft. You agree to notify Microsoft in writing of any facts or circumstances of which
    You later become aware that would make Your representations in this Agreement inaccurate in any
    respect.
  8. Information about Submissions. You agree that contributions to Projects and information about
    contributions may be maintained indefinitely and disclosed publicly, including Your name and other
    information that You submit with Your Submission.
  9. Governing Law/Jurisdiction. This Agreement is governed by the laws of the State of Washington, and
    the parties consent to exclusive jurisdiction and venue in the federal courts sitting in King County,
    Washington, unless no federal subject matter jurisdiction exists, in which case the parties consent to
    exclusive jurisdiction and venue in the Superior Court of King County, Washington. The parties waive all
    defenses of lack of personal jurisdiction and forum non-conveniens.
  10. Entire Agreement/Assignment. This Agreement is the entire agreement between the parties, and
    supersedes any and all prior agreements, understandings or communications, written or oral, between
    the parties relating to the subject matter hereof. This Agreement may be assigned by Microsoft.

Comment on lines +198 to +205
return prompt_template.format(
    system_prompt=system_prompt,
    user_content=user_content,
    text=user_text,
    image_tokens=image_tokens,
    audio_tokens=audio_tokens,
    model_type=model_type,
)

Avoids constructing a real HfModelHandler (which would require a real HF
model on disk) while still exercising the dispatch logic.
"""
import olive.evaluator.olive_evaluator as oe
2 participants