Wave 7b integration #4841
Vendors https://github.com/TCL606/WAVE at external/WAVE so the WAVE-7B embedding path (Qwen2.5-Omni modeling + BEATs dual audio encoder + hierarchical fusion) is available locally for a faithful MTEB model wrapper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Faithfully integrate WAVE-7B, a Qwen2.5-Omni-Thinker fine-tune (text/audio/video, prompt-aware, embed_dim 3584) whose embedding path requires WAVE's own modeling code (BEATs dual audio encoder + all-layer hierarchical fusion), not stock transformers.
- Wave7BWrapper in mteb/models/model_implementations/wave_models.py imports WAVE's code from the external/WAVE submodule and reproduces its --pred_embeds path (model(**inputs, pred_embeds=True) -> outputs.mllm_embeds, L2-normalized).
- ModelMeta wave_7b (auto-discovered), modalities text/audio/video, apache-2.0, extra_requirements_groups=["wave"], pinned revision.
- pyproject: new `wave` optional-deps group (mirrors upstream requirements.txt) + uv conflict entry (pins transformers==4.51.3).
- MTEB-WAVE-7B.md documents the process, the external BEATs_iter3_plus.pt requirement, and the GPU pilot recipe.
No datasets/evaluators/tasks needed (MTEB already covers the audio/video/audio-visual retrieval tasks).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ran the pilot end-to-end on an A100-40GB (torch 2.6.0+cu124, flash_attn 2.7.4): WAVE-7B loads (BEATs dual encoder + all-layer fusion active) and ClothoT2ARetrieval scores hit_rate@5=0.317 / ndcg@10=0.260 (exceptions=[]), confirming the wrapper's --pred_embeds path produces meaningful embeddings.
- pyproject `wave` extra now pins the self-consistent torch 2.6 stack (torch/torchaudio/torchcodec), sentence-transformers<5 (ST>=5 hard-imports a torchcodec built for a newer torch ABI), setuptools (triton), + ffmpeg-python.
- MTEB-WAVE-7B.md: replaced the "not run" caveat with the verified grid recipe, results table, and env gotchas (BEATs from Bencr/beats-checkpoints, flash-attn wheel, ffmpeg module).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit against WAVE upstream found the wrapper embedded text through the media branch (all-layer fusion + classify_linear). WAVE embeds text via its label path: bare `text + <|im_end|>`, last token of the final layer, no head (label_ids / all_ids branches). Fixing this raised ClothoT2ARetrieval hit_rate@5 from 0.32 to 0.42.
- Wave7BWrapper: text-only items now use _encode_text (label path); media items include the item's own text as the prompt; use_audio_in_video + seconds_per_chunk now mirror data_qwen._get_item and the flag is forwarded to the model call (synchronized AV wiring).
- ModelMeta: n_parameters=9_410_651_007 (from checkpoint index), memory_usage_mb=17949, citation corrected (Changli Tang et al.).
- pyproject `wave` extra moved to the torch 2.7.1 stack: datasets>=4 (Video -> torchcodec VideoDecoder, required by MTEB's VideoCollator) forces torchcodec>=0.4 => torch 2.7.1. Clotho score identical across stacks.
- Validated on A100: ClothoT2ARetrieval 0.42 hit_rate@5; MSVDT2VRetrieval 0.80 ndcg@10 (R@1 0.63) - confirms the video frame-layout permute and the end-to-end video path.
- MTEB-WAVE-7B.md: audit findings, verified results, env matrix, remaining work.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
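The label path described in this commit (take the last token of the final layer, then L2-normalize) can be illustrated with a minimal, dependency-free sketch. The helper names `last_token_pool` and `l2_normalize`, the right-padding assumption, and the toy hidden states are illustrative assumptions; this is not WAVE's or the wrapper's actual code.

```python
import math
from typing import List, Sequence


def last_token_pool(
    hidden: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Return the hidden state of the last attended token.

    Assumption: `hidden` holds one vector per token from the final layer,
    and `attention_mask` marks real tokens with 1 and padding with 0.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden[last])


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy final-layer states for three tokens; the third one is padding.
hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden, mask))
print(embedding)
```

The pooled vector here is the second token's state, `[3.0, 4.0]`, since the third position is masked out as padding.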
Faithfulness (all wrapper paths now exercised on A100, exceptions=[]):
- AV-joint path validated: AudioCapsAVVA2TRetrieval 0.54 ndcg@10 (R@1 0.31) - synchronized use_audio_in_video interleave + BEATs token doubling work.
- Video fps timing now replicates WAVE exactly: _DurationVideoCollator records each video's duration so video_second_per_grid uses the actual sampled rate (fps = frames/duration) instead of the nominal 2.0; falls back when metadata is missing. Audio truncation is now opt-in (WAVE chunks 300 s natively).
- Regressions unchanged after the fixes: Clotho 0.42, MSVD 0.80.
- Paper comparison (R@1, ours vs arXiv:2509.21990): MSVD 63.5/56.3, MSR-VTT 50.9/54.7, DiDeMo 54.8/69.3, Clotho 19.4/25.6 - faithful-in-kind, protocol-divergent (candidate sets / query construction differ from the MMEB-v2 harness); documented in MTEB-WAVE-7B.md.
Portability:
- scripts/setup_wave_env.sh reconstructs everything on a new internet-connected cluster: venv from the wave extra, auto-detected flash-attn prebuilt wheel, BEATs checkpoint, optional model/dataset prefetch, preflight checks. Proven by a clean-room run on a fresh workspace; fresh-env GPU smoke reproduces Clotho 0.42 exactly.
- flash-attn removed from the wave extra (cold installs attempted a source build); the script installs the matching wheel instead.
- Artifact manifest (sizes/sources/reuse shortcuts) added to MTEB-WAVE-7B.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
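The fps rule in this commit (fps = frames/duration, falling back to the nominal 2.0 when metadata is missing) can be sketched as below. The function names, the `temporal_patch_size=2` default, and the formula `temporal_patch_size / fps` for the per-grid duration are assumptions made for illustration; they are not taken from the wrapper or from WAVE.

```python
from typing import Optional

NOMINAL_FPS = 2.0  # nominal sampling rate named in the commit message


def effective_fps(
    num_frames: int, duration_s: Optional[float], nominal: float = NOMINAL_FPS
) -> float:
    """Actual sampled rate, or the nominal rate when duration metadata is unusable."""
    if duration_s is None or duration_s <= 0 or num_frames <= 0:
        return nominal
    return num_frames / duration_s


def seconds_per_grid(fps: float, temporal_patch_size: int = 2) -> float:
    """Assumed relation: one temporal grid step spans `temporal_patch_size` frames."""
    return temporal_patch_size / fps


# 16 frames sampled from a 10 s clip: the real rate differs from the nominal 2.0.
fps = effective_fps(num_frames=16, duration_s=10.0)
print(fps, seconds_per_grid(fps))

# Missing duration metadata: fall back to the nominal rate.
print(effective_fps(num_frames=16, duration_s=None))
```

Under these assumptions, a clip sampled more sparsely than the nominal rate gets a longer per-grid duration, which is the effect the collator change is meant to capture.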
datasets>=4 lazy Columns reject numpy integer keys; _undersample_data_indices
built its index list with np.arange, crashing multilabel tasks with
TypeError("Wrong key type ... numpy.int64"). Use range() instead.
Verified: FSD2019Kaggle (MAEB) failed with the TypeError before, scores
0.446 mAP after the fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
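The difference between integer-like keys and built-in `int` keys that this commit describes can be shown without any third-party dependency. In the sketch below, `FakeNumpyInt` and `StrictColumn` are hypothetical stand-ins (for a NumPy integer scalar and for a container that only accepts built-in ints); neither reproduces the real `numpy` or `datasets` implementation. `operator.index` is the standard-library way to convert any integer-like object into a plain `int`.

```python
import operator
from typing import Iterable, List


def to_plain_indices(indices: Iterable) -> List[int]:
    """Convert integer-like objects (anything with __index__) to built-in ints."""
    return [operator.index(i) for i in indices]


class FakeNumpyInt:
    """Hypothetical stand-in for a NumPy integer: integer-like, but not an int."""

    def __init__(self, value: int) -> None:
        self._value = value

    def __index__(self) -> int:
        return self._value


class StrictColumn:
    """Hypothetical container that rejects keys that are not built-in ints."""

    def __init__(self, values) -> None:
        self._values = list(values)

    def __getitem__(self, key):
        if type(key) is not int:
            raise TypeError(f"Wrong key type: {type(key).__name__}")
        return self._values[key]


column = StrictColumn(["a", "b", "c"])
raw_keys = [FakeNumpyInt(0), FakeNumpyInt(2)]

try:
    column[raw_keys[0]]
except TypeError as err:
    print("rejected:", err)

picked = [column[i] for i in to_plain_indices(raw_keys)]
print(picked)
```

Building the index list with `range()` sidesteps the conversion entirely, because `range` already yields built-in ints.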
Proves the wrapper reproduces upstream WAVE per modality: text/audio/video/AV construction embeddings are bit-identical (cosine 1.0), and video/AV end-to-end deltas (~0.997) are the documented MTEB default frame-sampler effect. Results recorded in MTEB-WAVE-7B.md.
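A parity check of this kind can be sketched as a per-item cosine comparison plus an exact-equality flag. The helper names `cosine` and `parity_report` and the toy vectors are hypothetical; the actual comparison procedure is recorded in MTEB-WAVE-7B.md and is not reproduced here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parity_report(
    ours: List[List[float]], reference: List[List[float]]
) -> Dict[str, object]:
    """Summarize how closely two embedding sets agree, item by item."""
    sims = [cosine(o, r) for o, r in zip(ours, reference)]
    return {
        "min_cosine": min(sims),
        "mean_cosine": sum(sims) / len(sims),
        # Exact equality is a stricter check than a cosine of 1.0.
        "exact_match": ours == reference,
    }


# Toy embeddings: the second pair differs slightly in direction.
ours = [[1.0, 0.0], [0.6, 0.8]]
reference = [[1.0, 0.0], [0.8, 0.6]]

report = parity_report(ours, reference)
print(report)
```

Reporting the minimum alongside the mean keeps a single divergent item from being averaged away.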
Correct stale MTEB benchmark runbook (MVEB now exists, model counts, Python 3.13.4, removed MAEB extended/+ names, sidecars consolidated) and record 3 deterministically-failing MVEB/MAEB tasks in MTEB-WAVE-7B.md. Wrapper faithfulness is unaffected.
I don't think we need to add it as a submodule
| if idxs is None: | ||
| idxs = list(np.arange(len(y))) | ||
| # plain ints: datasets>=4 lazy Columns reject numpy integer keys | ||
| idxs = list(range(len(y))) |
I don't have problems here with datasets v4
Can you keep changes only to wave implementation and pyproject?
| logger = logging.getLogger(__name__) | ||
|
|
||
| # Location of the vendored WAVE upstream code (git submodule). | ||
| _WAVE_REPO_PATH = Path(__file__).resolve().parents[3] / "external" / "WAVE" |
We should be able to run the model without additional git submodules or git cloning
| _SAMPLING_RATE = 16000 | ||
|
|
||
|
|
||
| class _DurationVideoCollator(VideoCollator): |
| # Heavy / WAVE-specific imports are deferred so the registry can be built without | ||
| # WAVE's dependencies installed. | ||
| from qwenvl.data.data_qwen import LazySupervisedDataset | ||
| from qwenvl.data.processing_qwen2_5_omni import Qwen2_5OmniProcessor | ||
| from qwenvl.model.qwen2_5_omni.configuration_qwen2_5_omni import ( | ||
| Qwen2_5OmniThinkerConfig, | ||
| ) | ||
| from qwenvl.model.qwen2_5_omni.modeling_qwen2_5_omni import ( | ||
| Qwen2_5OmniThinkerForConditionalGeneration, | ||
| ) |
Are these not the same as those in transformers?
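The deferred-import idea in the diff comment above (keep heavy imports out of module import time so the registry can be built without them) is commonly paired with a clear error when the optional dependency is absent. The sketch below is a generic illustration: `require_module` is a hypothetical helper, the standard-library `json` module stands in for a heavy optional package, and the `wave` extra name is taken from the PR description.

```python
import importlib
import importlib.util
from types import ModuleType


def require_module(name: str, extra: str) -> ModuleType:
    """Import a top-level module on demand, or fail with an actionable message.

    Assumption: `name` is a top-level module name (no dots), so that
    `find_spec` returns None instead of raising when it is missing.
    """
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"'{name}' is required for this model; "
            f"install the '{extra}' optional dependencies."
        )
    return importlib.import_module(name)


def load_backend() -> ModuleType:
    """Called only when the model is actually instantiated, not at import time."""
    return require_module("json", extra="wave")  # `json` is a stand-in here


backend = load_backend()
print(backend.dumps({"loaded": True}))
```

Because the check runs inside a function, importing the module that defines the wrapper stays cheap even when the optional dependency is not installed.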
|
|
||
| @staticmethod | ||
| def _apply_liger_kernel() -> None: | ||
| """Patch WAVE's modeling module with Liger kernels, as WAVE's eval entrypoint does.""" |
| processor.max_pixels = self._ds.data_args.image_max_frame_pixels | ||
| processor.min_pixels = self._ds.data_args.image_min_frame_pixels | ||
| processor.size["longest_edge"] = processor.max_pixels | ||
| processor.size["shortest_edge"] = processor.min_pixels |
| if width < 28 or height < 28: | ||
| pad_width = max(0, 28 - width) | ||
| pad_height = max(0, 28 - height) | ||
| left, top = pad_width // 2, pad_height // 2 | ||
| image = ImageOps.expand( | ||
| image, | ||
| border=(left, top, pad_width - left, pad_height - top), | ||
| fill=(0, 0, 0), | ||
| ) |
Can't the transformers processor do this?
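The border arithmetic in the padding snippet above can be isolated into a small pure function, which makes the centering behavior easy to check without loading an image. The name `pad_borders` is hypothetical; the 28-pixel minimum and the (left, top, right, bottom) ordering follow the diff.

```python
from typing import Tuple


def pad_borders(width: int, height: int, min_side: int = 28) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding so both sides reach `min_side`.

    Padding is split as evenly as possible; any odd pixel goes to the
    right/bottom edge, matching the arithmetic in the diff.
    """
    pad_width = max(0, min_side - width)
    pad_height = max(0, min_side - height)
    left, top = pad_width // 2, pad_height // 2
    return left, top, pad_width - left, pad_height - top


# A 20x30 image only needs horizontal padding.
print(pad_borders(20, 30))

# A 27x10 image needs 1 px horizontally (odd, goes right) and 18 px vertically.
print(pad_borders(27, 10))
```

The returned tuple has the same shape as the `border` argument passed to `ImageOps.expand` in the diff.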
| training_datasets=None, | ||
| adapted_from="Qwen/Qwen2.5-Omni-7B", | ||
| superseded_by=None, | ||
| modalities=["text", "audio", "video"], |
| modalities=["text", "audio", "video"], | |
| modalities=["text", "audio", "video", "image"], |
|
Thank you @Samoed for your prompt review. Apologies for this rushed PR; I did not mean to open it prematurely. It was meant for a local repository. However, let me work through your review comments one at a time, and I will update the relevant sections.
|
I have a quick question on this one, @Samoed. The Tsinghua lab has WAVE weights with a vanilla config only; there's no modeling code and no ... I think there are two ways to go about hosting this. Which do you prefer?
Let me know; I'm happy to go whichever way you prefer.
|
I think you can try to do 1, and if the authors don't respond, do 2.
Close #4613