Wave 7b integration#4841

Draft
debashishc wants to merge 13 commits into embeddings-benchmark:main from JSALT2026-OmniEnc:wave-7b-integration

@debashishc debashishc commented Jun 20, 2026

If you add a model or a dataset, please add the corresponding checklist:

Close #4613

debashishc and others added 12 commits June 10, 2026 02:38
Vendors https://github.com/TCL606/WAVE at external/WAVE so the WAVE-7B
embedding path (Qwen2.5-Omni modeling + BEATs dual audio encoder +
hierarchical fusion) is available locally for a faithful MTEB model wrapper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Faithfully integrate WAVE-7B, a Qwen2.5-Omni-Thinker fine-tune (text/audio/video,
prompt-aware, embed_dim 3584) whose embedding path requires WAVE's own modeling code
(BEATs dual audio encoder + all-layer hierarchical fusion), not stock transformers.

- Wave7BWrapper in mteb/models/model_implementations/wave_models.py imports WAVE's code
  from the external/WAVE submodule and reproduces its --pred_embeds path
  (model(**inputs, pred_embeds=True) -> outputs.mllm_embeds, L2-normalized).
- ModelMeta wave_7b (auto-discovered), modalities text/audio/video, apache-2.0,
  extra_requirements_groups=["wave"], pinned revision.
- pyproject: new `wave` optional-deps group (mirrors upstream requirements.txt) + uv
  conflict entry (pins transformers==4.51.3).
- MTEB-WAVE-7B.md documents the process, the external BEATs_iter3_plus.pt requirement,
  and the GPU pilot recipe. No datasets/evaluators/tasks needed (MTEB already covers the
  audio/video/audio-visual retrieval tasks).
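
The `--pred_embeds` path described above ends by L2-normalizing `outputs.mllm_embeds`. A minimal sketch of that final step only, assuming the embeddings arrive as a 2-D array; the helper name `l2_normalize` and the `eps` guard are illustrative and not taken from the wrapper:

```python
import numpy as np


def l2_normalize(embeds: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    norms = np.linalg.norm(embeds, axis=-1, keepdims=True)
    return embeds / np.maximum(norms, eps)


# Toy stand-in for a batch of model embeddings (the real ones are 3584-dim).
batch = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(batch)
```

With unit-norm rows, a dot product between two embeddings equals their cosine similarity, which is what retrieval scoring typically relies on.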

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ran the pilot end-to-end on an A100-40GB (torch 2.6.0+cu124, flash_attn 2.7.4):
WAVE-7B loads (BEATs dual encoder + all-layer fusion active) and
ClothoT2ARetrieval scores hit_rate@5=0.317 / ndcg@10=0.260 (exceptions=[]),
confirming the wrapper's --pred_embeds path produces meaningful embeddings.

- pyproject `wave` extra now pins the self-consistent torch 2.6 stack
  (torch/torchaudio/torchcodec), sentence-transformers<5 (ST>=5 hard-imports a
  torchcodec built for a newer torch ABI), setuptools (triton), + ffmpeg-python.
- MTEB-WAVE-7B.md: replaced the "not run" caveat with the verified grid recipe,
  results table, and env gotchas (BEATs from Bencr/beats-checkpoints, flash-attn
  wheel, ffmpeg module).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit against WAVE upstream found the wrapper embedded text through the media
branch (all-layer fusion + classify_linear). WAVE embeds text via its label
path: bare `text + <|im_end|>`, last token of the final layer, no head
(label_ids / all_ids branches). Fixing this raised ClothoT2ARetrieval
hit_rate@5 from 0.32 to 0.42.
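
The label path above takes the last token of the final layer as the text embedding. A rough sketch of last-token pooling in isolation, assuming right-padded batches and a 0/1 attention mask; `last_token_pool` is a hypothetical helper, not WAVE's or the wrapper's code:

```python
import numpy as np


def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Return the hidden state of the last non-padded token of each sequence.

    hidden: (batch, seq_len, dim) final-layer states.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right padding.
    """
    last_idx = attention_mask.sum(axis=1) - 1  # position of the last real token
    return hidden[np.arange(hidden.shape[0]), last_idx]


# Two sequences of length 3; the second has one padding position.
hidden = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = last_token_pool(hidden, mask)
```

With left padding the last real token is always at position -1, so the index arithmetic here would not apply.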

- Wave7BWrapper: text-only items now use _encode_text (label path); media
  items include the item's own text as the prompt; use_audio_in_video +
  seconds_per_chunk now mirror data_qwen._get_item and the flag is forwarded
  to the model call (synchronized AV wiring).
- ModelMeta: n_parameters=9_410_651_007 (from checkpoint index),
  memory_usage_mb=17949, citation corrected (Changli Tang et al.).
- pyproject `wave` extra moved to the torch 2.7.1 stack: datasets>=4 (Video ->
  torchcodec VideoDecoder, required by MTEB's VideoCollator) forces
  torchcodec>=0.4 => torch 2.7.1. Clotho score identical across stacks.
- Validated on A100: ClothoT2ARetrieval 0.42 hit_rate@5;
  MSVDT2VRetrieval 0.80 ndcg@10 (R@1 0.63) - confirms the video frame-layout
  permute and the end-to-end video path.
- MTEB-WAVE-7B.md: audit findings, verified results, env matrix, remaining work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Faithfulness (all wrapper paths now exercised on A100, exceptions=[]):
- AV-joint path validated: AudioCapsAVVA2TRetrieval 0.54 ndcg@10 (R@1 0.31) -
  synchronized use_audio_in_video interleave + BEATs token doubling work.
- Video fps timing now replicates WAVE exactly: _DurationVideoCollator records
  each video's duration so video_second_per_grid uses the actual sampled rate
  (fps = frames/duration) instead of the nominal 2.0; falls back when metadata
  is missing. Audio truncation is now opt-in (WAVE chunks 300 s natively).
- Regressions unchanged after the fixes: Clotho 0.42, MSVD 0.80.
- Paper comparison (R@1, ours vs arXiv:2509.21990): MSVD 63.5/56.3,
  MSR-VTT 50.9/54.7, DiDeMo 54.8/69.3, Clotho 19.4/25.6 - faithful-in-kind,
  protocol-divergent (candidate sets / query construction differ from the
  MMEB-v2 harness); documented in MTEB-WAVE-7B.md.
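
The fps rule above (actual sampled rate `frames/duration`, falling back to the nominal 2.0 when metadata is missing) can be sketched as follows. The function name and the treatment of non-positive durations as "missing" are assumptions for illustration, not the collator's actual code:

```python
from typing import Optional

NOMINAL_FPS = 2.0  # nominal rate named in the commit message


def sampled_fps(
    num_frames: int, duration_s: Optional[float], nominal: float = NOMINAL_FPS
) -> float:
    """Return frames/duration, or the nominal rate when duration is unusable."""
    if duration_s is None or duration_s <= 0:
        return nominal
    return num_frames / duration_s
```

For example, 16 frames sampled from a 4-second clip give 4.0 fps rather than the nominal 2.0, which changes any timing value derived from the rate.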

Portability:
- scripts/setup_wave_env.sh reconstructs everything on a new internet-connected
  cluster: venv from the wave extra, auto-detected flash-attn prebuilt wheel,
  BEATs checkpoint, optional model/dataset prefetch, preflight checks. Proven
  by a clean-room run on a fresh workspace; fresh-env GPU smoke reproduces
  Clotho 0.42 exactly.
- flash-attn removed from the wave extra (cold installs attempted a source
  build); the script installs the matching wheel instead.
- Artifact manifest (sizes/sources/reuse shortcuts) added to MTEB-WAVE-7B.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
datasets>=4 lazy Columns reject numpy integer keys; _undersample_data_indices
built its index list with np.arange, crashing multilabel tasks with
TypeError("Wrong key type ... numpy.int64"). Use range() instead.

Verified: FSD2019Kaggle (MAEB) failed with the TypeError before, scores
0.446 mAP after the fix.
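
To illustrate the type difference behind this fix: `np.arange` yields NumPy integer scalars, while `range` yields built-in `int`s. The sketch below only shows the element types; it does not reproduce the `datasets` error itself:

```python
import numpy as np

y = ["a", "b", "c"]

numpy_idxs = list(np.arange(len(y)))  # elements are NumPy integer scalars
plain_idxs = list(range(len(y)))      # elements are built-in ints

# If an index array already exists, .tolist() converts it to built-in ints.
converted = np.arange(len(y)).tolist()
```

Any container that checks keys with `isinstance(key, int)` will accept the second and third lists but not the first, since NumPy integer scalars do not subclass `int`.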

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Proves the wrapper reproduces upstream WAVE per modality: text/audio/
video/AV construction embeddings are bit-identical (cosine 1.0), and
video/AV end-to-end deltas (~0.997) are the documented MTEB default
frame-sampler effect. Results recorded in MTEB-WAVE-7B.md.
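
A parity check of this kind can be sketched as below, assuming both the wrapper and upstream embeddings are available as 1-D arrays. The helper is illustrative and not the script used for the reported numbers. Note that a cosine of 1.0 only shows the vectors point the same way; `np.array_equal` is the stricter test for exact equality:

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for wrapper and upstream embeddings.
wrapper_vec = np.array([1.0, 2.0, 3.0])
upstream_vec = np.array([1.0, 2.0, 3.0])

same_direction = np.isclose(cosine(wrapper_vec, upstream_vec), 1.0)
exactly_equal = np.array_equal(wrapper_vec, upstream_vec)
```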
Correct stale MTEB benchmark runbook (MVEB now exists, model counts,
Python 3.13.4, removed MAEB extended/+ names, sidecars consolidated) and
record 3 deterministically-failing MVEB/MAEB tasks in MTEB-WAVE-7B.md.
Wrapper faithfulness is unaffected.
@Samoed added the new model label Jun 20, 2026
Comment thread .gitmodules

Member

I don't think we need to add it as a submodule

  if idxs is None:
-     idxs = list(np.arange(len(y)))
+     # plain ints: datasets>=4 lazy Columns reject numpy integer keys
+     idxs = list(range(len(y)))

Member

I don't have problems here with datasets v4

Member

Can you keep changes only to wave implementation and pyproject?

logger = logging.getLogger(__name__)

# Location of the vendored WAVE upstream code (git submodule).
_WAVE_REPO_PATH = Path(__file__).resolve().parents[3] / "external" / "WAVE"

Member

We should be able to run the model without additional git submodules or git cloning.

_SAMPLING_RATE = 16000


class _DurationVideoCollator(VideoCollator):

Member

Why do you need this?

Comment on lines +140 to +149
# Heavy / WAVE-specific imports are deferred so the registry can be built without
# WAVE's dependencies installed.
from qwenvl.data.data_qwen import LazySupervisedDataset
from qwenvl.data.processing_qwen2_5_omni import Qwen2_5OmniProcessor
from qwenvl.model.qwen2_5_omni.configuration_qwen2_5_omni import (
    Qwen2_5OmniThinkerConfig,
)
from qwenvl.model.qwen2_5_omni.modeling_qwen2_5_omni import (
    Qwen2_5OmniThinkerForConditionalGeneration,
)

Member

Are these not the same as those in transformers?
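
For context on the deferred-import comment in the snippet above: one common way to keep heavy, optional dependencies out of module import time is to resolve them only when the model is actually loaded. A generic sketch using only the standard library; `load_optional` and the error text are hypothetical and not part of the PR:

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import a module on demand, pointing at the optional extra if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this model; "
            f"install the '{extra}' optional dependencies."
        ) from exc


# A stdlib module resolves normally; a missing one raises a descriptive error.
json_mod = load_optional("json", extra="wave")
```

Because nothing is imported until the function is called, a model registry can be built on machines where the extra is not installed.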


@staticmethod
def _apply_liger_kernel() -> None:
    """Patch WAVE's modeling module with Liger kernels, as WAVE's eval entrypoint does."""

Member

Is this required?

Comment on lines +225 to +228
processor.max_pixels = self._ds.data_args.image_max_frame_pixels
processor.min_pixels = self._ds.data_args.image_min_frame_pixels
processor.size["longest_edge"] = processor.max_pixels
processor.size["shortest_edge"] = processor.min_pixels

Member

Why not set this in __init__?

Comment on lines +232 to +240
if width < 28 or height < 28:
    pad_width = max(0, 28 - width)
    pad_height = max(0, 28 - height)
    left, top = pad_width // 2, pad_height // 2
    image = ImageOps.expand(
        image,
        border=(left, top, pad_width - left, pad_height - top),
        fill=(0, 0, 0),
    )

Member

Can't transformers processor do this?
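
The border arithmetic in the padding snippet above can be checked in isolation. A small sketch, assuming the same 28-pixel minimum side; `centered_border` is a hypothetical helper that returns the `(left, top, right, bottom)` tuple that would be passed to `ImageOps.expand`:

```python
def centered_border(
    width: int, height: int, min_side: int = 28
) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding that centers a too-small image."""
    pad_w = max(0, min_side - width)
    pad_h = max(0, min_side - height)
    left, top = pad_w // 2, pad_h // 2
    # Any odd leftover pixel goes to the right/bottom edge.
    return (left, top, pad_w - left, pad_h - top)
```

For a 25x27 image this gives `(1, 0, 2, 1)`, so the padded result is exactly 28x28; images already at least 28 pixels on both sides get a zero border.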

training_datasets=None,
adapted_from="Qwen/Qwen2.5-Omni-7B",
superseded_by=None,
modalities=["text", "audio", "video"],

Member

Suggested change
- modalities=["text", "audio", "video"],
+ modalities=["text", "audio", "video", "image"],

@debashishc

Author

Thank you @Samoed for your prompt review. Apologies for this rushed PR; I did not mean to open it prematurely, as it was meant for a local repository. I will work through your review comments one at a time and update the relevant sections.

@debashishc debashishc marked this pull request as draft June 20, 2026 22:40
@debashishc

Author

I have a quick question on this one @Samoed.

The Tsinghua lab's repo ships the WAVE weights with a vanilla config only: there is no modeling code and no auto_map. As far as I can tell, WAVE's architecture (BEATs audio encoder plus the fusion head) isn't in transformers either, so trust_remote_code against the authors' repo doesn't work as is. Unless you have a preferred fix, the code has to live in an HF repo somewhere.

I see two ways to host this. Which do you prefer?

  1. Ask the WAVE authors to add the modeling code + auto_map (I can open a PR). This keeps the original repo as the single source of truth, but depends on if and when the PR gets merged.
  2. Host it on an HF repo we control: either a full mirror, or a code-only repo where the wrapper still pulls weights from tsinghua-ee/WAVE-7B. I can get that up and running.

Let me know; I'm happy to go whichever way you prefer.
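
For reference, the `auto_map` entry mentioned above is a mapping in `config.json` from Auto classes to `module.ClassName` strings that `trust_remote_code=True` resolves inside the model repo. A sketch of what such an entry could look like; the module and class names below are hypothetical placeholders, not files that exist in any WAVE repo:

```python
import json

# Hypothetical module and class names, used only to show the shape of the entry.
auto_map = {
    "AutoConfig": "configuration_wave.WaveConfig",
    "AutoModel": "modeling_wave.WaveModel",
}

# Fragment that would be merged into the repo's config.json.
config_patch = {"auto_map": auto_map}
print(json.dumps(config_patch, indent=2))
```

Under either hosting option, the referenced Python files would need to sit in the same HF repo as the config that carries the `auto_map`.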

@Samoed

Samoed commented Jun 21, 2026

Member

I think you can try option 1 first, and if the authors don't respond, go with option 2.

Labels

new model: Questions related to adding a new model to the benchmark

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add model: WAVE

2 participants