Emilia dataset selective downloader tool#455

Open: 3manifold wants to merge 2 commits into open-mmlab:main from 3manifold:data_downloader

3manifold (Contributor) commented Jul 14, 2025

✨ Description

The Emilia and Emilia-YODAS datasets are quite large, and users often need only specific parts of the data. This PR adds a data downloader tool that selectively downloads files from the Emilia dataset to a specified destination.

It supports:

  • A data path pattern as input, e.g. datasets/amphion/Emilia-Dataset/Emilia-YODAS/JA/*.tar
  • Resuming the download after an interruption
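
To illustrate the two features above, here is a minimal sketch of pattern-based selection and file-level resume. It is not the code in this PR: the helper names (`select_files`, `local_path_for`, `pending_downloads`), the local directory layout, and the idea of comparing local file sizes against expected sizes are all assumptions made for illustration. The actual tool may resume at a finer granularity.

```python
import tempfile
from pathlib import Path, PurePosixPath

# Prefix taken from the example pattern; treating it as the repo root is an assumption.
REPO_PREFIX = "datasets/amphion/Emilia-Dataset/"


def select_files(remote_paths, pattern):
    """Keep only the paths matching the glob pattern ('*' does not cross '/')."""
    return sorted(p for p in remote_paths if PurePosixPath(p).match(pattern))


def local_path_for(remote_path, output_dir):
    """Hypothetical layout: mirror the path below the repo prefix under output_dir."""
    if remote_path.startswith(REPO_PREFIX):
        remote_path = remote_path[len(REPO_PREFIX):]
    return Path(output_dir) / remote_path


def pending_downloads(selected, expected_sizes, output_dir):
    """Return the files still to fetch: missing locally or with an unexpected size."""
    pending = []
    for remote in selected:
        local = local_path_for(remote, output_dir)
        if local.is_file() and local.stat().st_size == expected_sizes[remote]:
            continue  # already complete, skip on resume
        pending.append(remote)
    return pending


if __name__ == "__main__":
    # Made-up listing with made-up sizes, for demonstration only.
    listing = {
        REPO_PREFIX + "Emilia-YODAS/JA/JA-B000000.tar": 4,
        REPO_PREFIX + "Emilia-YODAS/EN/EN-B000000.tar": 5,
    }
    selected = select_files(listing, REPO_PREFIX + "Emilia-YODAS/JA/*.tar")
    with tempfile.TemporaryDirectory() as out:
        print("Number of files to download:", len(pending_downloads(selected, listing, out)))
```

Re-running after an interruption would then only fetch the files returned by `pending_downloads`, since completed files are skipped.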

Usage

python3 preprocessors/Emilia/utils/data_downloader.py \
  --output_data_path "/mnt/Emilia-YODAS/data" \
  --emilia_token hf_xx \
  --data_path_pattern "datasets/amphion/Emilia-Dataset/Emilia-YODAS/JA/*.tar"
Number of files to download: 30

JA-B000000.tar: 100%|██████████████████████████████████████████████████████| 1.07G/1.07G [02:56<00:00, 6.08MB/s]
20xx-07-xx 10:11:42.724579 downloaded file: Emilia-YODAS/data/Emilia-YODAS/JA/JA-B000000.tar
...
...
downloading dataset complete
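
For readers who want to see how the three flags in the usage example could be parsed, here is a hedged sketch using `argparse`. Only the flag names come from the usage above. The `build_parser` function, the help texts, and the choice to make every flag required are assumptions, not the PR's actual implementation.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Selectively download files from the Emilia dataset."
    )
    # Whether these flags are required or have defaults is an assumption.
    parser.add_argument("--output_data_path", required=True,
                        help="Local directory that receives the downloaded files.")
    parser.add_argument("--emilia_token", required=True,
                        help="Access token for the dataset (hf_xx in the example).")
    parser.add_argument("--data_path_pattern", required=True,
                        help="Glob pattern selecting the remote files to download.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.output_data_path, args.data_path_pattern)
```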

👨‍💻 Changes Proposed

  • Add Emilia data downloader tool

✅ Checklist

  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

3manifold changed the title from "Emilia data downloader" to "Emilia dataset selective downloader" on Jul 14, 2025
3manifold changed the title from "Emilia dataset selective downloader" to "Emilia dataset selective downloader tool" on Jul 14, 2025