Question: Manual Dataset Sharding per GPU Rank with Accelerate + DistributedSampler (Avoid Double DataLoader Length Split)

Hi, I'm training on a very large dataset with LeRobot (which uses Hugging Face Accelerate) across multiple GPUs. 

To minimize cross-process memory usage, I want to manually shard the full dataset by GPU rank **BEFORE** creating the DataLoader (following this approach: [https://github.com/huggingface/datasets/issues/8217](https://github.com/huggingface/datasets/issues/8217)).

However, I'm running into an issue where the DataLoader length gets split twice: once by my manual DistributedSampler and again by Accelerate (tracked here: [https://github.com/huggingface/accelerate/issues/3520](https://github.com/huggingface/accelerate/issues/3520)).
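
To make the symptom concrete, here is a rough arithmetic sketch of what a double split does to the reported length. It uses a simplified model and is not Accelerate's actual length computation: it assumes each split divides by the world size with ceiling rounding, and the helper name `batches_per_rank` is hypothetical.

```python
import math


def batches_per_rank(num_samples: int, world_size: int, batch_size: int, splits: int) -> int:
    """Estimate batches seen by one rank under a simplified splitting model.

    Assumption: the first split divides the samples across ranks (as a
    per-rank sampler would), and every additional split divides the
    resulting batch count by world_size again.
    """
    per_rank_samples = math.ceil(num_samples / world_size)
    batches = math.ceil(per_rank_samples / batch_size)
    for _ in range(splits - 1):
        batches = math.ceil(batches / world_size)
    return batches


if __name__ == "__main__":
    # Illustrative numbers only: 1000 samples, 4 GPUs, batch size 10.
    print("single split:", batches_per_rank(1000, 4, 10, splits=1))
    print("double split:", batches_per_rank(1000, 4, 10, splits=2))
```

Under this model each rank should see about `N / (world_size * batch_size)` batches, while a double split shrinks that by roughly another factor of `world_size`.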

My Goal:

- Shard the full large dataset per GPU rank upfront (before DataLoader initialization) to reduce per-process memory footprint
- Prevent Accelerate from automatically splitting the DataLoader/dataset a second time
- Maintain correct multi-GPU training behavior
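
The goals above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a confirmed LeRobot or Accelerate recipe: `shard_indices` is a hypothetical helper, the strided split is one of several possible sharding schemes, and the tail is dropped so that every rank gets the same number of samples (uneven shards can stall synchronized training steps).

```python
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int,
                  drop_tail: bool = True) -> List[int]:
    """Return the sample indices owned by `rank` using a strided split.

    With drop_tail=True the last (num_samples % world_size) samples are
    discarded so that all ranks receive shards of equal length.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    usable = num_samples - (num_samples % world_size) if drop_tail else num_samples
    return list(range(rank, usable, world_size))


if __name__ == "__main__":
    # Hypothetical wiring (shown as comments because it needs torch/accelerate):
    #   rank = accelerator.process_index
    #   world_size = accelerator.num_processes
    #   shard = torch.utils.data.Subset(full_dataset,
    #                                   shard_indices(len(full_dataset), rank, world_size))
    #   loader = torch.utils.data.DataLoader(shard, batch_size=..., shuffle=True)
    # Assumption: `loader` is then NOT passed to accelerator.prepare(), so
    # Accelerate has no opportunity to shard it a second time; batches would
    # need to be moved to accelerator.device manually.
    for r in range(4):
        print(r, shard_indices(10, r, 4))
```

With this layout the per-rank DataLoader length already reflects the shard size, so no DistributedSampler is needed on top of it.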

Questions:

- What’s the correct way to manually shard a dataset by GPU rank when using Accelerate?
- How can I disable Accelerate’s automatic DataLoader splitting to avoid the double-shard issue?
- Are there specific settings (e.g., Accelerator kwargs, DataLoader/DistributedSampler flags) required for this workflow?


Any suggestions would be greatly appreciated!
