Hi, I'm training on a very large dataset with LeRobot (which uses Hugging Face Accelerate) across multiple GPUs.
To minimize cross-process memory usage, I want to manually shard the full dataset by GPU rank BEFORE creating the DataLoader (following this approach: https://github.com/huggingface/datasets/issues/8217).
However, the DataLoader length ends up split twice: once by my manual DistributedSampler and again by Accelerate (tracked here: https://github.com/huggingface/accelerate/issues/3520).
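To make the double split concrete, here is a minimal sketch in plain Python. It does not call Accelerate or PyTorch. `shard_by_rank` is a hypothetical helper, the sizes are made up, and the strided split is only my assumption about how each stage divides the data:

```python
def shard_by_rank(indices, rank, world_size):
    """Hypothetical helper: keep every world_size-th index, starting at rank."""
    return indices[rank::world_size]


num_samples = 1000  # illustrative dataset size
world_size = 4      # illustrative number of GPUs
rank = 0

all_indices = list(range(num_samples))

# First split: the manual per-rank shard, done before building the DataLoader.
manual_shard = shard_by_rank(all_indices, rank, world_size)

# Second split: the already-sharded data gets divided by rank again.
double_shard = shard_by_rank(manual_shard, rank, world_size)

print(len(manual_shard))  # 250, the per-rank length I want
print(len(double_shard))  # 63, the length I get after a second split
```

In other words, each rank should see roughly num_samples / world_size items, but after the second split it sees roughly num_samples / world_size**2.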
My goals:
- Shard the full large dataset per GPU rank upfront (before DataLoader initialization) to reduce per-process memory footprint
- Prevent Accelerate from automatically splitting the DataLoader/dataset a second time
- Maintain correct multi-GPU training behavior
Questions:
- What’s the correct way to manually shard a dataset by GPU rank when using Accelerate?
- How can I disable Accelerate’s automatic DataLoader splitting to avoid the double-shard issue?
- Are there specific settings (e.g., Accelerator kwargs, DataLoader/DistributedSampler flags) required for this workflow?
Any suggestions would be greatly appreciated!