Add LMMSEvaluator (integrate lmms-eval toolkit) for multimodal (vision+audio) evaluation #2531
Draft
DelwinKim wants to merge 6 commits into
Adds LMMSEvaluator (olive/evaluator/olive_evaluator.py) and an ORT-GenAI multimodal adapter (olive/evaluator/lmms_ort.py) for evaluating multimodal ONNX models via lmms-eval.
Build on top of the LMMSEvaluator + ORT-GenAI multimodal adapter foundation:
- LMMSEvaluator now dispatches HfModelHandler inputs to lmms-eval's native
per-architecture wrappers (phi4_multimodal, qwen2_5_vl, whisper, ...),
with auto-detection from HF model_type and a forwarded-kwargs filter
that only passes args the target wrapper actually declares (handles
wrappers like qwen2_5_vl which assert kwargs == {}). Enables
FP-vs-quantized comparison in a single recipe via evaluate_input_model.
- lmms_ort.py adapter: tolerant audio/image disambiguation (audio dicts
with "path" no longer get mis-routed to PIL.Image.open), Whisper-specific
prompt + EOS-collision handling so ASR works end-to-end through
ortgenai_mm without the Phi-4-MM chat-template scaffolding interfering.
- New CompositeToOnnxPackage pass: flattens nested CompositeModel ORT-GenAI
packages (subdir-per-component or root-level) into the flat layout
LMMSEvaluator expects. Tolerates extensionless component filenames
produced by some upstream quant passes.
- Tests: 32 in test_lmms_ort.py (entry-point/registry, HF dispatch,
kwargs filter, prompt builder, score_continuation, partition_visuals,
run_generation), 9 in test_composite_to_onnx_package.py (flatten +
external-data rewrites + fallback entry-point).
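The forwarded-kwargs filter described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: `filter_kwargs` and `StrictWrapper` are hypothetical names, and the sketch assumes the filter is based on the wrapper's `__init__` signature.

```python
import inspect


def filter_kwargs(wrapper_cls, candidate_kwargs):
    """Keep only the kwargs that wrapper_cls.__init__ names explicitly."""
    params = inspect.signature(wrapper_cls.__init__).parameters
    declared = {
        name
        for name, p in params.items()
        if name != "self"
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    # A **kwargs catch-all is deliberately not counted as "declared".
    return {k: v for k, v in candidate_kwargs.items() if k in declared}


class StrictWrapper:
    """Hypothetical stand-in for a wrapper that asserts kwargs == {}."""

    def __init__(self, pretrained, batch_size=1, **kwargs):
        assert kwargs == {}, f"Unexpected kwargs: {kwargs}"
        self.pretrained = pretrained
        self.batch_size = batch_size


requested = {"pretrained": "some/model", "batch_size": 1, "max_length": 4096}
accepted = filter_kwargs(StrictWrapper, requested)
model = StrictWrapper(**accepted)  # max_length was dropped, so the assert holds
```

Passing `requested` unfiltered would trip the wrapper's assertion, which is the failure mode the filter is meant to avoid.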
Validated end-to-end:
- whisper-large-v3 via HfModel -> ModelBuilder fp16 -> KQuant int8 ->
CompositeToOnnxPackage -> ortgenai_mm eval on LibriSpeech.
FP HF WER 1.52/2.26 (clean/other), INT8 ONNX WER 1.68/2.36.
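For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below is a plain illustration of that definition; `word_error_rate` is a hypothetical helper, and the harness's own scorer may normalize text before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


exact = word_error_rate("a b c", "a b c")
one_substitution = word_error_rate("a b c d", "a x c d")
one_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```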
…ocessor args
- MobiusBuilder: add `mobius_ep_override` config knob. Lets a workflow force the mobius execution_provider (e.g. "default") independent of the Olive accelerator EP. Needed because mobius's cuda-EP attention fusions (PackedMultiHeadAttention for Qwen2.5-VL vision, GQA for Gemma-4 decoder) produce graphs the ORT-GenAI fused-attention kernels reject. "default" EP skips those fusions; the resulting INT4 graph is numerically equivalent.
- lmms_ort: support torchcodec.AudioDecoder visuals (HF datasets 5.x audio feature) in _normalize_audio via duck-typed get_all_samples().
- lmms_ort: branch processor-arg shape on model type: Phi-4-MM needs a bare string, Whisper needs [prompt]. Passing a list to Phi-4-MM raised "Number of image tokens does not match the number of images".
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
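A rough sketch of the duck-typed audio handling mentioned above, which also covers the earlier fix for audio dicts carrying a "path". Everything here is illustrative: `normalize_audio`, `FakeDecoder`, and `FakeSamples` are hypothetical names, and the sketch assumes the decoder's return value exposes `data` and `sample_rate` attributes.

```python
from dataclasses import dataclass


@dataclass
class FakeSamples:
    """Hypothetical stand-in for what a decoder's get_all_samples() returns."""
    data: list
    sample_rate: int


class FakeDecoder:
    """Hypothetical stand-in for a decoder object exposing get_all_samples()."""

    def get_all_samples(self):
        return FakeSamples(data=[0.0, 0.1, -0.1], sample_rate=16000)


def normalize_audio(visual):
    """Classify an lmms-eval 'visual' as audio, or return None if it is not audio."""
    # Duck typing: anything with a callable get_all_samples() is treated as a decoder.
    if callable(getattr(visual, "get_all_samples", None)):
        samples = visual.get_all_samples()
        return ("samples", samples.data, samples.sample_rate)
    if isinstance(visual, dict):
        # HF datasets audio dicts: prefer decoded samples when present.
        if visual.get("array") is not None and "sampling_rate" in visual:
            return ("samples", visual["array"], visual["sampling_rate"])
        # A dict with only a "path" is still audio, not an image to open with PIL.
        if "path" in visual:
            return ("path", visual["path"])
    return None


from_decoder = normalize_audio(FakeDecoder())
from_path = normalize_audio({"path": "clip.flac"})
not_audio = normalize_audio("a plain string")
```

Checking for the method rather than the class keeps the adapter free of a hard torchcodec import.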
Remove the setup.py lmms_eval.models entry point, the _model_manifest factory, and its registration tests. The Olive LMMSEvaluator path imports LMMSORTGenAIEvaluator directly, so the entry point only affected the standalone lmms-eval CLI; dropping it keeps setup.py out of this change.
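To make the flattening step of CompositeToOnnxPackage concrete, here is a minimal sketch of collapsing a subdir-per-component layout into one directory. It is an assumption-laden illustration, not the pass itself: `flatten_package` is a hypothetical helper, and the external-data and genai_config.json rewrites the pass performs are omitted.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_package(src_root: Path, dst_root: Path) -> list[str]:
    """Copy every file under src_root into dst_root, dropping subdirectories."""
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src_root.rglob("*")):
        if not path.is_file():
            continue
        target = dst_root / path.name
        if target.exists():
            # Two components shipping the same filename cannot share one directory.
            raise FileExistsError(f"name collision while flattening: {path.name}")
        shutil.copy2(path, target)
        copied.append(path.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "nested"
    # Placeholder files standing in for real model components.
    for rel in ("decoder/decoder.onnx", "vision/vision.onnx", "genai_config.json"):
        file_path = src / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text("placeholder")
    flat_names = flatten_package(src, Path(tmp) / "flat")
```

Raising on a name collision is a design choice of this sketch; a real pass would more likely rename the files and update references to them.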
Comment on lines +198 to +205

    return prompt_template.format(
        system_prompt=system_prompt,
        user_content=user_content,
        text=user_text,
        image_tokens=image_tokens,
        audio_tokens=audio_tokens,
        model_type=model_type,
    )
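The call above relies on `str.format` ignoring keyword arguments that the template does not reference, so one call can serve templates that use different subsets of placeholders. The sketch below combines that with the `image_token_format` option from the PR description. `build_media_tokens` and the template string are hypothetical; 1-based indexing follows the `<|image_1|>` default, and joining tokens without a separator is an assumption.

```python
def build_media_tokens(token_format: str, count: int) -> str:
    """Expand a token format once per media item, with 1-based indices."""
    # Formats without an {index} placeholder (e.g. "<|image|>") simply repeat.
    return "".join(token_format.format(index=i) for i in range(1, count + 1))


phi_style = build_media_tokens("<|image_{index}|>", 2)
gemma_style = build_media_tokens("<|image|>", 2)

# Hypothetical template that uses only some of the available fields.
prompt_template = "{system_prompt}\n{image_tokens}{audio_tokens}{text}"
prompt = prompt_template.format(
    system_prompt="You are a helpful assistant.",
    user_content="unused by this template",  # extra kwargs are ignored by str.format
    text="What is in the image?",
    image_tokens=gemma_style,
    audio_tokens="",
    model_type="gemma",  # also unused here
)
```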
    Avoids constructing a real HfModelHandler (which would require a real HF
    model on disk) while still exercising the dispatch logic.
    """
    import olive.evaluator.olive_evaluator as oe
Describe your changes
This PR adds multimodal (vision + audio) evaluation to Olive by integrating the lmms-eval harness as an Olive evaluator, plus the supporting pass needed to make quantized multi-component ORT-GenAI packages evaluable.
What's added
1. LMMSEvaluator (olive/evaluator/olive_evaluator.py)

A new evaluator ("type": "LMMSEvaluator") that runs lmms-eval benchmarks from within a single olive run (it should also be usable from a standalone script if needed). It supports two model handler types:

- ONNXModelHandler pointing at an ORT-GenAI multimodal package (genai_config.json + quantized ONNX, e.g. from MobiusBuilder + OnnxKQuantQuantization) → dispatches to the new ortgenai_mm adapter.
- HfModelHandler for HuggingFace PyTorch multimodal models → dispatches to lmms-eval's native wrapper, auto-detected from the HF model_type (overridable via model_class).

Benchmarks are selected purely by config ("tasks": ["ocrbench", "docvqa_val_lite", "librispeech_test_clean", "fleurs_en", ...]), and the harness's official scorers are used (OCRBench scoring, DocVQA ANLS, WER, etc.), so results are easier to compare.

2. ortgenai_mm adapter (olive/evaluator/lmms_ort.py)

An lmms-eval model adapter that drives ORT-GenAI multimodal generation. Implements generate_until and loglikelihood, handles image and audio inputs (og.Images / og.Audios), per-model-type processor argument shapes (Phi-4-MM vs Whisper), and EOS/stop handling (including Whisper's BOS==EOS collision).

3. CompositeToOnnxPackage pass (olive/passes/onnx/composite_to_onnx_package.py)

MobiusBuilder emits a multi-component CompositeModelHandler (decoder / vision / audio / embedding in subdirectories). A composite can't be evaluated directly: it has no single session, and the nested layout defeats ORT-GenAI's genai_config.json auto-detection. This pass flattens the components into a single-directory ORT-GenAI package (rewriting external-data references and genai_config.json) and returns a runnable ONNXModelHandler, so the evaluator can load it.

4. Registration

olive/olive_config.json: registers the new evaluator and pass.

Example recipe
A complete example recipe (build + quantize Gemma-4 E2B, flatten to an ORT-GenAI package, then evaluate vision and audio benchmarks):

    {
      "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
      "systems": {
        "local_system": {
          "type": "LocalSystem",
          "accelerators": [
            { "device": "gpu", "execution_providers": ["CUDAExecutionProvider"] }
          ]
        }
      },
      "passes": {
        "mobius_build": { "type": "MobiusBuilder", "precision": "fp16" },
        "int4_quantize": {
          "type": "OnnxKQuantQuantization",
          "bits": 4,
          "block_size": 32,
          "save_as_external_data": true
        },
        "flatten": { "type": "CompositeToOnnxPackage" }
      },
      "evaluators": {
        "evaluator": {
          "type": "LMMSEvaluator",
          "tasks": ["ocrbench", "docvqa_val_lite", "librispeech_test_clean", "fleurs_en"],
          "batch_size": 1,
          "max_new_tokens": 256,
          "max_length": 4096,
          "limit": 10,
          "log_samples": false,
          "image_token_format": "<|image|>",
          "audio_token_format": "<|audio|>",
          "output_path": "results/gemma4_vision_audio.json"
        }
      },
      "evaluator": "evaluator",
      "evaluate_input_model": false,
      "target": "local_system",
      "output_dir": "models/gemma4_vision_audio",
      "cache_dir": "cache/gemma4_vision_audio",
      "no_artifacts": true
    }

LMMSEvaluator config reference

- tasks: benchmarks to run, e.g. ocrbench, docvqa_val_lite, chartqa_lite, ai2d_lite, mmmu_val, textvqa_val, librispeech_test_clean, librispeech_test_other, fleurs_en.
- limit (default None): None runs the full set.
- batch_size (default 1): only 1 is supported (see limitations).
- max_new_tokens (default 256): applies to each generate_until request.
- max_length (default 32768): can be lowered (e.g. 4096) to bound memory.
- image_token_format (default "<|image_{index}|>"): apply_chat_template is used only for the system/user turn scaffolding, so this must match the model's expected token. Default is Phi-4-MM style (<|image_1|>); Gemma-4 requires <|image|>. With a wrong value the processor can't bind image tokens to images and the run fails.
- audio_token_format (default "<|audio_{index}|>"): the audio counterpart of image_token_format. Gemma-4 requires <|audio|>.

Tests (not very complete or thorough)
- test/evaluator/test_lmms_ort.py: adapter unit tests (generation, loglikelihood, image/audio handling, processor branching, EOS logic).
- test/passes/onnx/test_composite_to_onnx_package.py: flatten pass (external-data rewrite, genai_config rewrite, entry-point component selection).

Notes / limitations
- batch_size is currently fixed at 1: ORT-GenAI's per-request Generator + set_inputs flow is single-sequence, and batching ragged, per-request multimodal (image/audio) inputs isn't currently supported by the runtime.
- lmms-eval is imported lazily, so non-multimodal Olive usage is unaffected if it isn't installed. Running the LMMSEvaluator (and the full adapter test suite) requires pip install lmms-eval.

Checklist before requesting a review
- (… lmms-eval installed; mocked tests pass without it, but the tests that instantiate the real adapter need the dependency.)
- … lintrunner -a
- … LMMSEvaluator for multimodal (vision + audio) evaluation via the lmms-eval harness, and CompositeToOnnxPackage to flatten multi-component ORT-GenAI packages for evaluation.

(Optional) Issue link