jina-ai/audio-embedding-kickstarter


Audio Embedding Kickstarter

Resources and code for training small audio embedding models, based on experiments conducted at Jina AI.

Overview

This repo contains the training code, configs, evaluation scripts, and research report from our audio embedding project. The goal: train audio embedding models under 1.2B parameters that match or beat existing CLAP models on audio-text retrieval.

What's Here

├── report.pdf          # Full project report (ACL format)
├── report.tex          # LaTeX source
├── audio_model.py      # Audio embedding model implementation
├── audio_omni.py       # Qwen2.5-Omni wrapper for audio embeddings
├── eval_mteb.py        # MTEB evaluation script
├── configs/            # Training configurations
│   ├── full_3b.yaml           # Baseline Qwen2.5-Omni-3B
│   ├── full_7b.yaml           # Baseline Qwen2.5-Omni-7B
│   ├── prune/                 # Layer pruning configs
│   ├── modality_transfer/     # Text-only training configs
│   └── module_combination/    # Module mix-and-match configs
└── scripts/            # Dataset loading scripts
    ├── load_clotho.py
    ├── load_audiosetstrong.py
    ├── load_fsd50k.py
    ├── load_macs.py
    └── load_usd8k.py

Key Results

Model               Params   AudioCaps T2A cvR@5   Clotho T2A cvR@5
CLAP (2023)         250M     42.0                  39.0
Layer pruning 20L   5.8B     63.2                  39.2
Layer pruning 10L   3.5B     58.2                  36.5
Module combo M1     1.1B     49.7                  28.8
Module combo M4     3.8B     64.0                  38.8
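
For readers unfamiliar with the metric, here is a minimal sketch of how a text-to-audio (T2A) recall@5 score could be computed. It assumes that cvR@5 is the "computer vision style" recall, where a query counts as a hit if any relevant clip appears in the top 5; that reading is an assumption, and `eval_mteb.py` may compute it differently. The function names and the toy clip IDs are hypothetical.

```python
def cv_recall_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1.0 if any relevant item appears in the top-k ranking, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def mean_cv_recall_at_k(rankings, qrels, k=5):
    """Average the per-query hit indicator and report it as a percentage."""
    hits = [cv_recall_at_k(rankings[q], qrels[q], k) for q in qrels]
    return 100.0 * sum(hits) / len(hits)


# Toy example: each text query maps to audio clips ranked by similarity.
rankings = {
    "a dog barking": ["clip_7", "clip_2", "clip_9", "clip_1", "clip_4", "clip_3"],
    "rain on a roof": ["clip_5", "clip_8", "clip_6", "clip_0", "clip_7", "clip_3"],
}
# Ground truth: the set of relevant clips for each query.
qrels = {"a dog barking": {"clip_1"}, "rain on a roof": {"clip_3"}}

score = mean_cv_recall_at_k(rankings, qrels, k=5)
print(score)  # 50.0: the first query hits at rank 4, the second misses the top 5
```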

Approaches

  1. Direct finetuning: Finetune Qwen2.5-Omni on audio-text pairs
  2. Layer pruning: Remove transformer layers to reduce model size (7B -> 2.3B-5.8B parameters)
  3. Text-only transfer: Train on NLI text data, rely on cross-modal alignment
  4. Module combination: Mix audio encoders + small LLMs from different training stages
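
As a rough illustration of approach 2, the sketch below truncates a layer stack and fits a back-of-envelope size model to the two pruning rows in Key Results (20 layers at 5.8B, 10 layers at 3.5B). Keeping the first layers is an assumed strategy; the configs under `configs/prune/` may select layers differently. The helper names and the 28-layer stand-in stack are hypothetical.

```python
def prune_layers(layers, keep):
    """Keep the first `keep` layers (assumed strategy) and drop the rest."""
    if not 0 < keep <= len(layers):
        raise ValueError(f"keep must be in 1..{len(layers)}, got {keep}")
    return list(layers[:keep])


def fit_size_model(point_a, point_b):
    """Fit params = fixed + n_layers * per_layer through two (layers, params) points."""
    (n_a, p_a), (n_b, p_b) = point_a, point_b
    per_layer = (p_a - p_b) / (n_a - n_b)
    fixed = p_a - n_a * per_layer
    return per_layer, fixed


# Stand-in for a decoder stack; the depth of 28 is illustrative only.
layers = [f"layer_{i}" for i in range(28)]
pruned = prune_layers(layers, keep=20)

# Parameter counts in billions, taken from the Key Results table.
per_layer, fixed = fit_size_model((20, 5.8), (10, 3.5))
print(len(pruned), round(per_layer, 2), round(fixed, 2))
```

Under this linear fit, each removed layer saves roughly 0.23B parameters and about 1.2B parameters remain outside the pruned stack. This is an estimate derived from two data points, not a figure reported by the project.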

Training Data

181K audio-text pairs total:

  • Clotho (19K), AudioSetStrong (108K), FSD50K (41K), MACS (4K), UrbanSound8K (9K)
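
Audio-text pairs like these are commonly trained with a contrastive objective in the style of CLAP. This README does not state which loss the project uses, so the following is only a sketch of a symmetric InfoNCE loss under that assumption; the function name, the temperature of 0.07, and the toy embeddings are all illustrative choices.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where audio_embs[i] pairs with text_embs[i]."""
    a = [_normalize(v) for v in audio_embs]
    t = [_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature (assumed value).
    logits = [[_dot(ai, tj) / temperature for tj in t] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


# Toy batch: two audio embeddings and their captions' embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.1], [0.1, 1.0]]
text_swapped = [[0.1, 1.0], [1.0, 0.1]]

loss_matched = info_nce(audio, text_matched)
loss_swapped = info_nce(audio, text_swapped)
print(loss_matched < loss_swapped)  # True: aligned pairs give the lower loss
```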

References

  • CLAP (2022, 2023): Contrastive Language-Audio Pretraining
  • Tevatron 2.0: Unified document retrieval with Qwen2.5-Omni
  • ColQwen-Omni: Zero-shot audio from vision-document training
