Wave 7b integration by debashishc · Pull Request #4841 · embeddings-benchmark/mteb

debashishc · 2026-06-20T18:26:28Z


Vendors https://github.com/TCL606/WAVE at external/WAVE so the WAVE-7B embedding path (Qwen2.5-Omni modeling + BEATs dual audio encoder + hierarchical fusion) is available locally for a faithful MTEB model wrapper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Faithfully integrate WAVE-7B, a Qwen2.5-Omni-Thinker fine-tune (text/audio/video, prompt-aware, embed_dim 3584) whose embedding path requires WAVE's own modeling code (BEATs dual audio encoder + all-layer hierarchical fusion), not stock transformers.

- Wave7BWrapper in mteb/models/model_implementations/wave_models.py imports WAVE's code from the external/WAVE submodule and reproduces its --pred_embeds path (model(**inputs, pred_embeds=True) -> outputs.mllm_embeds, L2-normalized).
- ModelMeta wave_7b (auto-discovered), modalities text/audio/video, apache-2.0, extra_requirements_groups=["wave"], pinned revision.
- pyproject: new `wave` optional-deps group (mirrors upstream requirements.txt) + uv conflict entry (pins transformers==4.51.3).
- MTEB-WAVE-7B.md documents the process, the external BEATs_iter3_plus.pt requirement, and the GPU pilot recipe.

No datasets/evaluators/tasks needed (MTEB already covers the audio/video/audio-visual retrieval tasks).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
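The L2 normalization mentioned for the embedding output can be illustrated in plain Python. This is a minimal sketch, not the wrapper's code: `l2_normalize` and `dot` are hypothetical helpers, and the actual wrapper presumably works on tensors rather than lists.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query = l2_normalize([3.0, 4.0])
doc = l2_normalize([6.0, 8.0])
# For unit-length vectors the dot product equals the cosine similarity.
similarity = dot(query, doc)
```

With normalized embeddings, a retrieval score can be computed with a dot product alone, which is one common reason to normalize at the end of an embedding path.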

Ran the pilot end-to-end on an A100-40GB (torch 2.6.0+cu124, flash_attn 2.7.4): WAVE-7B loads (BEATs dual encoder + all-layer fusion active) and ClothoT2ARetrieval scores hit_rate@5=0.317 / ndcg@10=0.260 (exceptions=[]), confirming the wrapper's --pred_embeds path produces meaningful embeddings.

- pyproject `wave` extra now pins the self-consistent torch 2.6 stack (torch/torchaudio/torchcodec), sentence-transformers<5 (ST>=5 hard-imports a torchcodec built for a newer torch ABI), setuptools (triton), + ffmpeg-python.
- MTEB-WAVE-7B.md: replaced the "not run" caveat with the verified grid recipe, results table, and env gotchas (BEATs from Bencr/beats-checkpoints, flash-attn wheel, ffmpeg module).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
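For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute hit_rate@k and nDCG@k for a single query with binary relevance. It is an illustration under that assumption only; MTEB's own evaluator may define or aggregate these differently, and the ids are made up.

```python
import math


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["a3", "a1", "a7", "a2"]  # hypothetical ranked audio ids for one text query
relevant = {"a1"}                   # hypothetical ground truth
hit = hit_rate_at_k(ranking, relevant, k=5)
ndcg = ndcg_at_k(ranking, relevant, k=10)
```

Benchmark-level numbers such as those quoted above would be averages of per-query values like `hit` and `ndcg`.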

Audit against WAVE upstream found the wrapper embedded text through the media branch (all-layer fusion + classify_linear). WAVE embeds text via its label path: bare `text + <|im_end|>`, last token of the final layer, no head (label_ids / all_ids branches). Fixing this raised ClothoT2ARetrieval hit_rate@5 from 0.32 to 0.42.

- Wave7BWrapper: text-only items now use _encode_text (label path); media items include the item's own text as the prompt; use_audio_in_video + seconds_per_chunk now mirror data_qwen._get_item and the flag is forwarded to the model call (synchronized AV wiring).
- ModelMeta: n_parameters=9_410_651_007 (from checkpoint index), memory_usage_mb=17949, citation corrected (Changli Tang et al.).
- pyproject `wave` extra moved to the torch 2.7.1 stack: datasets>=4 (Video -> torchcodec VideoDecoder, required by MTEB's VideoCollator) forces torchcodec>=0.4 => torch 2.7.1. Clotho score identical across stacks.
- Validated on A100: ClothoT2ARetrieval 0.42 hit_rate@5; MSVDT2VRetrieval 0.80 ndcg@10 (R@1 0.63) - confirms the video frame-layout permute and the end-to-end video path.
- MTEB-WAVE-7B.md: audit findings, verified results, env matrix, remaining work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
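The "last token of the final layer" pooling described for the text path can be sketched as follows. This is a toy illustration on nested lists, not WAVE's code: `last_token_embedding` is a hypothetical helper, and it assumes right padding with at least one real token per sequence.

```python
def last_token_embedding(hidden_states, attention_mask):
    """Pick, for each sequence, the final-layer state of its last non-padding token.

    hidden_states: [batch][seq_len][dim] nested lists (final layer only).
    attention_mask: [batch][seq_len] with 1 for real tokens, 0 for padding.
    Assumes right padding and at least one real token per sequence.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1  # index of the last real token under right padding
        pooled.append(states[last_index])
    return pooled


hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # 2 real tokens, 1 pad
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # 3 real tokens
]
mask = [[1, 1, 0], [1, 1, 1]]
embeddings = last_token_embedding(hidden, mask)
```

Under the description above, the selected position would correspond to the appended `<|im_end|>` token, and no projection head is applied afterwards.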

…b-integration

Faithfulness (all wrapper paths now exercised on A100, exceptions=[]):

- AV-joint path validated: AudioCapsAVVA2TRetrieval 0.54 ndcg@10 (R@1 0.31) - synchronized use_audio_in_video interleave + BEATs token doubling work.
- Video fps timing now replicates WAVE exactly: _DurationVideoCollator records each video's duration so video_second_per_grid uses the actual sampled rate (fps = frames/duration) instead of the nominal 2.0; falls back when metadata is missing. Audio truncation is now opt-in (WAVE chunks 300 s natively).
- Regressions unchanged after the fixes: Clotho 0.42, MSVD 0.80.
- Paper comparison (R@1, ours vs arXiv:2509.21990): MSVD 63.5/56.3, MSR-VTT 50.9/54.7, DiDeMo 54.8/69.3, Clotho 19.4/25.6 - faithful-in-kind, protocol-divergent (candidate sets / query construction differ from the MMEB-v2 harness); documented in MTEB-WAVE-7B.md.

Portability:

- scripts/setup_wave_env.sh reconstructs everything on a new internet-connected cluster: venv from the wave extra, auto-detected flash-attn prebuilt wheel, BEATs checkpoint, optional model/dataset prefetch, preflight checks. Proven by a clean-room run on a fresh workspace; fresh-env GPU smoke reproduces Clotho 0.42 exactly.
- flash-attn removed from the wave extra (cold installs attempted a source build); the script installs the matching wheel instead.
- Artifact manifest (sizes/sources/reuse shortcuts) added to MTEB-WAVE-7B.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
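The fps rule described above (frames/duration with a fallback to the nominal 2.0) can be sketched like this. The function names are hypothetical, and `temporal_patch_size=2` is an assumption about how a per-grid duration might be derived from the rate; neither is taken from the wrapper.

```python
NOMINAL_FPS = 2.0  # nominal rate named in the description


def sampled_fps(num_frames, duration_s, nominal_fps=NOMINAL_FPS):
    """Return frames/duration, falling back to the nominal rate without usable metadata."""
    if not duration_s or duration_s <= 0 or num_frames <= 0:
        return nominal_fps
    return num_frames / duration_s


def seconds_per_grid(fps, temporal_patch_size=2):
    """Seconds covered by one temporal grid step (temporal_patch_size is an assumption)."""
    return temporal_patch_size / fps


fps = sampled_fps(num_frames=32, duration_s=10.0)       # actual sampled rate
fallback = sampled_fps(num_frames=32, duration_s=None)  # metadata missing
grid_seconds = seconds_per_grid(fps)
```

The point of using the measured rate is that a 10 s clip sampled to 32 frames is not at 2 fps, so timing derived from the nominal value would misplace the frames in time.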

datasets>=4 lazy Columns reject numpy integer keys; _undersample_data_indices built its index list with np.arange, crashing multilabel tasks with TypeError("Wrong key type ... numpy.int64"). Use range() instead. Verified: FSD2019Kaggle (MAEB) failed with the TypeError before, scores 0.446 mAP after the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
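The type difference behind this change can be shown in isolation. In the sketch below, `strict_getitem` is a hypothetical stand-in for any container that accepts only built-in `int` keys; it is not the `datasets` implementation, and it makes no claim about which `datasets` versions behave this way.

```python
import numpy as np

n = 3
numpy_idxs = list(np.arange(n))    # elements are numpy integer scalars
plain_idxs = list(range(n))        # elements are built-in ints
converted = np.arange(n).tolist()  # .tolist() also yields built-in ints


def strict_getitem(column, key):
    """Hypothetical stand-in for a container that only accepts built-in int keys."""
    if not isinstance(key, int):
        raise TypeError(f"Wrong key type: {type(key).__name__}")
    return column[key]


column = ["a", "b", "c"]
first = strict_getitem(column, plain_idxs[0])
try:
    strict_getitem(column, numpy_idxs[0])
    rejected = False
except TypeError:
    rejected = True
```

Both `list(range(n))` and `np.arange(n).tolist()` avoid the numpy scalar type, so either would sidestep a strict key check of this kind.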

Proves the wrapper reproduces upstream WAVE per modality: text/audio/video/AV construction embeddings are bit-identical (cosine 1.0), and video/AV end-to-end deltas (~0.997) are the documented MTEB default frame-sampler effect. Results recorded in MTEB-WAVE-7B.md.
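A parity check of this kind can be sketched as below. The helper names are hypothetical and the vectors are toy values; the sketch also shows why exact equality is a stricter test than a cosine of 1.0, since two vectors pointing the same way but differing in scale still have cosine 1.0.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def parity_report(wrapper_emb, upstream_emb):
    """Compare two embeddings: exact equality is stricter than cosine ~ 1.0."""
    return {
        "identical": wrapper_emb == upstream_emb,
        "cosine": cosine(wrapper_emb, upstream_emb),
    }


same = parity_report([0.6, 0.8], [0.6, 0.8])
scaled = parity_report([0.6, 0.8], [1.2, 1.6])  # same direction, different values
```

Reporting both fields makes it clear whether a match is exact or only directional.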

Correct stale MTEB benchmark runbook (MVEB now exists, model counts, Python 3.13.4, removed MAEB extended/+ names, sidecars consolidated) and record 3 deterministically-failing MVEB/MAEB tasks in MTEB-WAVE-7B.md. Wrapper faithfulness is unaffected.

Samoed · 2026-06-20T18:44:31Z

I don't think we need to add it as a submodule

Samoed · 2026-06-20T18:46:46Z

        if idxs is None:
-            idxs = list(np.arange(len(y)))
+            # plain ints: datasets>=4 lazy Columns reject numpy integer keys
+            idxs = list(range(len(y)))


I don't have problems here with datasets v4

Samoed · 2026-06-20T18:47:34Z

Can you keep changes only to wave implementation and pyproject?

Samoed · 2026-06-20T18:48:03Z

+logger = logging.getLogger(__name__)
+
+# Location of the vendored WAVE upstream code (git submodule).
+_WAVE_REPO_PATH = Path(__file__).resolve().parents[3] / "external" / "WAVE"


We should run the model without additional git submodules or git cloning
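For context on the quoted line, this sketch shows how `parents[3]` resolves for a file at mteb/models/model_implementations/wave_models.py. The `/repo` prefix is a hypothetical checkout location; only the relative layout comes from the PR description.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the relative layout comes from the description.
module_file = PurePosixPath(
    "/repo/mteb/models/model_implementations/wave_models.py"
)

# parents[0] is the containing directory; each further index climbs one level.
levels = [str(module_file.parents[i]) for i in range(4)]
wave_repo_path = module_file.parents[3] / "external" / "WAVE"
```

Because the path is computed relative to the repository root, it exists only in a source checkout with the submodule initialized, which is the dependency the comment above objects to.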

Samoed · 2026-06-20T18:48:34Z

+_SAMPLING_RATE = 16000
+
+
+class _DurationVideoCollator(VideoCollator):


Why do you need this?

Samoed · 2026-06-20T18:49:29Z

+        # Heavy / WAVE-specific imports are deferred so the registry can be built without
+        # WAVE's dependencies installed.
+        from qwenvl.data.data_qwen import LazySupervisedDataset
+        from qwenvl.data.processing_qwen2_5_omni import Qwen2_5OmniProcessor
+        from qwenvl.model.qwen2_5_omni.configuration_qwen2_5_omni import (
+            Qwen2_5OmniThinkerConfig,
+        )
+        from qwenvl.model.qwen2_5_omni.modeling_qwen2_5_omni import (
+            Qwen2_5OmniThinkerForConditionalGeneration,
+        )


Are these not the same as those in transformers?
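The deferred-import pattern that the quoted code comment describes can be sketched generically. `LazyWrapper` is a hypothetical class, and the standard-library `json` module stands in for a heavy optional dependency; nothing here is taken from the wrapper itself.

```python
import importlib


class LazyWrapper:
    """Sketch of deferring a dependency import until the wrapper is instantiated."""

    def __init__(self, module_name="json"):  # "json" stands in for a heavy dependency
        # The import runs only here, so defining or registering the class
        # does not require the dependency to be installed.
        self._backend = importlib.import_module(module_name)

    def encode(self, payload):
        return self._backend.dumps(payload)


# Building a registry only references the class; no backend import happens yet.
registry = {"lazy_wrapper": LazyWrapper}
model = registry["lazy_wrapper"]()
encoded = model.encode({"text": "hello"})
```

A missing dependency then surfaces as an `ImportError` at instantiation time rather than when the registry module is imported.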

Samoed · 2026-06-20T18:49:58Z

+
+    @staticmethod
+    def _apply_liger_kernel() -> None:
+        """Patch WAVE's modeling module with Liger kernels, as WAVE's eval entrypoint does."""


Is this required?

Samoed · 2026-06-20T18:50:27Z

+        processor.max_pixels = self._ds.data_args.image_max_frame_pixels
+        processor.min_pixels = self._ds.data_args.image_min_frame_pixels
+        processor.size["longest_edge"] = processor.max_pixels
+        processor.size["shortest_edge"] = processor.min_pixels


Why not set this in init?

Samoed · 2026-06-20T18:50:49Z

+        if width < 28 or height < 28:
+            pad_width = max(0, 28 - width)
+            pad_height = max(0, 28 - height)
+            left, top = pad_width // 2, pad_height // 2
+            image = ImageOps.expand(
+                image,
+                border=(left, top, pad_width - left, pad_height - top),
+                fill=(0, 0, 0),
+            )


Can't transformers processor do this?
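The border arithmetic in the quoted snippet can be checked without an image library. `min_size_border` is a hypothetical helper that reproduces only the padding computation; the tuple order (left, top, right, bottom) matches what the snippet passes as `border`.

```python
def min_size_border(width, height, min_side=28):
    """Return (left, top, right, bottom) padding that lifts both sides to min_side."""
    pad_width = max(0, min_side - width)
    pad_height = max(0, min_side - height)
    left, top = pad_width // 2, pad_height // 2
    # Any odd remainder goes to the right/bottom edge.
    return (left, top, pad_width - left, pad_height - top)


border = min_size_border(20, 31)
padded_size = (20 + border[0] + border[2], 31 + border[1] + border[3])
odd_border = min_size_border(23, 23)
```

An image that already meets the minimum on both sides gets a zero border, so the operation is a no-op for normal inputs.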

Samoed · 2026-06-20T18:51:14Z

+    training_datasets=None,
+    adapted_from="Qwen/Qwen2.5-Omni-7B",
+    superseded_by=None,
+    modalities=["text", "audio", "video"],


Suggested change

-    modalities=["text", "audio", "video"],
+    modalities=["text", "audio", "video", "image"],

debashishc · 2026-06-20T21:21:09Z

Thank you @Samoed for your prompt review. Apologies for this rushed PR; I did not mean to open it prematurely. It was meant for a local repository. I will work through your review comments one at a time and update the relevant sections.

debashishc · 2026-06-20T23:50:37Z

I have a quick question on this one @Samoed.

The Tsinghua lab's release has WAVE weights with a vanilla config only: there is no modeling code and no auto_map. Also, as far as I can tell, WAVE's architecture (BEATs audio encoder with the fusion head) isn't in transformers. So trust_remote_code against the authors' repo doesn't work as is. Unless you have a preferred fix, the code has to live in an HF repo somewhere.

I think there are two ways to go about hosting this. Which do you prefer?

1. Ask the WAVE authors to add the modeling code + auto_map (I can open a PR). This keeps the original repo the single source of truth, but depends on if/when the PR gets merged.
2. We host it on an HF repo we control. Either a full mirror or a code-only repo where the wrapper still pulls weights from tsinghua-ee/WAVE-7B. I can get that up and going.

Let me know; I'm happy to go whichever way you prefer.
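For reference, an `auto_map` entry in a Hugging Face `config.json` maps an Auto class name to a `module.ClassName` string that points at code shipped in the model repo. The sketch below builds such a fragment in Python; the file names, class names, and `model_type` value are hypothetical and do not describe any existing WAVE repository.

```python
import json

# Hypothetical file and class names; only the auto_map mechanism itself is standard.
config_fragment = {
    "model_type": "wave",  # illustrative value
    "auto_map": {
        "AutoConfig": "configuration_wave.WaveConfig",
        "AutoModel": "modeling_wave.WaveModel",
    },
}

config_json = json.dumps(config_fragment, indent=2)


def remote_class_for(config, auto_class):
    """Split an auto_map entry into (module file stem, class name)."""
    module, class_name = config["auto_map"][auto_class].rsplit(".", 1)
    return module, class_name


module, class_name = remote_class_for(json.loads(config_json), "AutoModel")
```

With an entry like this and the referenced Python files present in the repo, loading with `trust_remote_code=True` can resolve the custom classes, which is what either hosting option would need to provide.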

Samoed · 2026-06-21T04:04:15Z

I think you can try option 1, and if the authors don't respond, do option 2

debashishc and others added 12 commits June 10, 2026 02:38

Merge branch 'embeddings-benchmark:main' into wave-7b-integration

16d9976

Merge remote-tracking branch 'origin/wave-7b-integration' into wave-7…

e9f2c5d

…b-integration

Merge upstream/main into wave-7b-integration

af9bd73

Merge upstream/main into wave-7b-integration

459cc2f

Samoed added the new model Questions related to adding a new model to the benchmark label Jun 20, 2026

Samoed reviewed Jun 20, 2026


debashishc marked this pull request as draft June 20, 2026 22:40

debashishc mentioned this pull request Jun 25, 2026

Add trust_remote_code loading for WAVE-7B TCL606/WAVE#5

Open

Merge upstream/main into wave-7b-integration

eb6c998

