Batch prefetching #1461
Adds a prefetch_next_batch option that loads the next batch on a background thread, overlapping disk reads with the current training step. Most beneficial when caching is enabled. Renames dataloader_threads to caching_threads to better reflect its purpose. The UI places Prefetch Next Batch above Clear cache before training.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g loop
Tensor uploads to the GPU in OutputPipelineModule were enqueued on the default CUDA stream, so each H2D transfer had to wait for the current training step's GPU work to finish before it could start. Running the producer under its own stream lets uploads proceed independently, allowing the prefetch queue to stay ahead of the training loop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude: Heads up — the
Not data-destroying since 2 is a sane default, but worth either updating those 21 preset files to |
This PR renamed dataloader_threads to caching_threads in TrainConfig,
but built-in presets are loaded with migrate=False, so the old key in
these preset files was silently dropped, leaving caching_threads at its
default of 2. That conflicts with the offloading guard in create.py
("layer offloading can not be activated if caching_threads > 1") for
any preset combining offloading with the old key.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
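For illustration, the key rename that the presets missed can be sketched as a small dictionary migration. This is a minimal sketch, not the project's actual migration code: the function name `migrate_preset_keys` is hypothetical, and a preset is assumed to be a flat dict parsed from its JSON file.

```python
def migrate_preset_keys(preset: dict) -> dict:
    """Return a copy of `preset` with the legacy key renamed.

    Hypothetical helper: assumes a preset is a flat dict loaded from JSON.
    """
    migrated = dict(preset)  # shallow copy, so the caller's dict is untouched
    if "dataloader_threads" in migrated:
        legacy_value = migrated.pop("dataloader_threads")
        # An explicitly set new key wins over the legacy one.
        migrated.setdefault("caching_threads", legacy_value)
    return migrated


old_preset = {"dataloader_threads": 1, "layer_offload": True}
new_preset = migrate_preset_keys(old_preset)
```

Without a step like this, a preset that sets the old key to 1 falls back to the default of 2, which is what trips the offloading guard.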
… key
These presets were added after PR Nerogar#1461 renamed dataloader_threads to caching_threads, but were copied from an already-stale template. None of the anima/ideogram/lens model branches have Nerogar#1461 merged, so this fixes them directly on preview.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#   modules/dataLoader/ErnieBaseDataLoader.py
#   modules/dataLoader/Flux2BaseDataLoader.py
#   modules/dataLoader/ZImageBaseDataLoader.py
#   training_presets/#flux2 Finetune 16GB.json
#   training_presets/#flux2 Finetune 24GB.json
#   training_presets/#flux2 LoRA 16GB.json
#   training_presets/#flux2 LoRA 8GB.json
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
"Dataloader Threads" is a misnomer. It sounds as if multiple threads are used to load data, but they are not:
It is the number of threads that are used to build the cache.
Loading from the cache is currently done sequentially in the training loop: load batch 1, train batch 1, load batch 2, train batch 2, ...
This can have a major performance impact if the cache lives on an HDD, because each training step has to wait for its batch to be read.
This PR renames "Dataloader Threads" to "Caching Threads" and introduces batch prefetching:
while batch 1 is being trained, batch 2 is loaded from disk.
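The overlap described above can be sketched with a background thread and a bounded queue. This is a minimal illustration under stated assumptions, not the code in this PR: `prefetch` and `load_batches` are hypothetical names, and `time.sleep` stands in for a slow disk read.

```python
import queue
import threading
import time


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread loads ahead.

    The queue holds at most `depth` loaded batches; the producer may hold
    one more while it waits for a free slot.
    """
    buffer = queue.Queue(maxsize=depth)
    done = object()  # marks the end of the stream

    def producer():
        try:
            for batch in batches:
                buffer.put((batch, None))  # blocks while the queue is full
            buffer.put((done, None))
        except Exception as exc:
            buffer.put((done, exc))  # forward loader errors to the consumer

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item, error = buffer.get()
        if error is not None:
            raise error
        if item is done:
            return
        yield item


def load_batches(count):
    """Hypothetical loader: each batch takes a moment to 'read from disk'."""
    for index in range(count):
        time.sleep(0.01)
        yield index


trained = []
for batch in prefetch(load_batches(3)):
    trained.append(batch)  # the next batch is already loading during this step
```

Batches still arrive in their original order. Only the waiting moves: the read for batch 2 runs while batch 1 is being consumed.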
@Calamdor