Resources and code for training small audio embedding models, based on experiments conducted at Jina AI.
This repo contains the training code, configs, evaluation scripts, and research report from our audio embedding project. The goal: train audio embedding models under 1.2B parameters that match or beat existing CLAP models on audio-text retrieval.
├── report.pdf                  # Full project report (ACL format)
├── report.tex                  # LaTeX source
├── audio_model.py              # Audio embedding model implementation
├── audio_omni.py               # Qwen2.5-Omni wrapper for audio embeddings
├── eval_mteb.py                # MTEB evaluation script
├── configs/                    # Training configurations
│   ├── full_3b.yaml            # Baseline Qwen2.5-Omni-3B
│   ├── full_7b.yaml            # Baseline Qwen2.5-Omni-7B
│   ├── prune/                  # Layer pruning configs
│   ├── modality_transfer/      # Text-only training configs
│   └── module_combination/     # Module mix-and-match configs
└── scripts/                    # Dataset loading scripts
    ├── load_clotho.py
    ├── load_audiosetstrong.py
    ├── load_fsd50k.py
    ├── load_macs.py
    └── load_usd8k.py
| Model | Params | AudioCaps T2A cvR@5 | Clotho T2A cvR@5 |
|---|---|---|---|
| CLAP (2023) | 250M | 42.0 | 39.0 |
| Layer pruning 20L | 5.8B | 63.2 | 39.2 |
| Layer pruning 10L | 3.5B | 58.2 | 36.5 |
| Module combo M1 | 1.1B | 49.7 | 28.8 |
| Module combo M4 | 3.8B | 64.0 | 38.8 |
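
The table reports text-to-audio (T2A) retrieval scores at a cutoff of 5. As a rough illustration of what a recall-at-k number measures, here is a minimal sketch of plain Recall@k over toy embeddings. It is not the MTEB implementation, and the exact definition behind the cvR@5 column is not spelled out in this README. The function names (`cosine`, `recall_at_k`) and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(text_embs, audio_embs, relevant, k=5):
    """Percentage of text queries with at least one relevant audio clip in the top k.

    relevant[i] is the set of audio indices that match text query i.
    """
    hits = 0
    for query_idx, query in enumerate(text_embs):
        # Rank all audio clips by similarity to the text query, best first.
        ranked = sorted(
            range(len(audio_embs)),
            key=lambda j: cosine(query, audio_embs[j]),
            reverse=True,
        )
        if relevant[query_idx] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(text_embs)


# Toy data (hypothetical): two text queries, three audio clips.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_audio = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
toy_relevant = [{0}, {2}]

score_k1 = recall_at_k(toy_text, toy_audio, toy_relevant, k=1)
score_k3 = recall_at_k(toy_text, toy_audio, toy_relevant, k=3)
```

With real models, `text_embs` and `audio_embs` would come from the text and audio encoders, and `eval_mteb.py` remains the reference for the reported numbers.
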
- Direct finetuning: Finetune Qwen2.5-Omni on audio-text pairs
- Layer pruning: Remove transformer layers to reduce model size (from 7B to 2.3B-5.8B parameters)
- Text-only transfer: Train on NLI text data, rely on cross-modal alignment
- Module combination: Mix audio encoders + small LLMs from different training stages
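
To make the layer pruning idea concrete, here is a minimal sketch that picks which layers to keep and drops the rest. This README does not state which layers are removed, so the evenly spaced selection is an assumption chosen for illustration, not the rule used in `configs/prune/`. The helper names and the depth of 28 are likewise hypothetical.

```python
def evenly_spaced_keep(num_layers, num_keep):
    """Return sorted indices of layers to keep, always including the first and last.

    Assumption: layers are kept at evenly spaced depths. The project's actual
    selection rule may differ.
    """
    if not 1 <= num_keep <= num_layers:
        raise ValueError("num_keep must be between 1 and num_layers")
    if num_keep == 1:
        return [0]
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * (num_layers - 1)) // (num_keep - 1) for i in range(num_keep)]


def prune_layers(layers, keep_indices):
    """Build a shorter layer stack containing only the kept layers, in order."""
    return [layers[i] for i in keep_indices]


# Stand-in for a transformer layer stack; 28 is an illustrative depth only.
toy_layers = [f"layer_{i}" for i in range(28)]

keep_10 = evenly_spaced_keep(len(toy_layers), 10)
pruned_10 = prune_layers(toy_layers, keep_10)
```

In a PyTorch model the same index list could be used to rebuild the decoder's layer container, followed by finetuning so the remaining layers adapt to the shorter stack.
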
181K audio-text pairs total:
- Clotho (19K), AudioSetStrong (108K), FSD50K (41K), MACS (4K), UrbanSound8K (9K)
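
As a quick check on the arithmetic, the short sketch below sums the rounded per-dataset counts listed above and derives each dataset's share of the training mix. The counts are the rounded figures from this README (in thousands), and the variable names are hypothetical.

```python
# Rounded dataset sizes from the list above, in thousands of audio-text pairs.
sizes_k = {
    "Clotho": 19,
    "AudioSetStrong": 108,
    "FSD50K": 41,
    "MACS": 4,
    "UrbanSound8K": 9,
}

total_k = sum(sizes_k.values())

# Fraction of the training mix contributed by each dataset.
shares = {name: count / total_k for name, count in sizes_k.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} {share:6.1%}")
```

AudioSetStrong accounts for more than half of the pairs, which is worth keeping in mind when reading per-benchmark results.
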
- Training framework: jina-ai/multimodal-large-scale-training (feat-audio branch)
- Evaluation: MTEB Audio benchmarks
- Datasets: Clotho, AudioSetStrong/WavCaps, FSD50K
- CLAP (2022, 2023): Contrastive Language-Audio Pretraining
- Tevatron 2.0: Unified document retrieval with Qwen2.5-Omni
- ColQwen-Omni: Zero-shot audio from vision-document training