Fix AttributeError when iterating IterableDatasetShard without set_epoch #4098
Open
vineethsaivs wants to merge 1 commit into
IterableDatasetShard.__init__ never initialized self.epoch, but __iter__ reads it to seed the underlying dataset's generator whenever the dataset has a torch.Generator attribute and no set_epoch method, which is exactly the dataset shape that branch was written for. Iterating a fresh shard without first calling set_epoch therefore raised AttributeError: 'IterableDatasetShard' object has no attribute 'epoch'. Initialize epoch to 0, matching SeedableRandomSampler in the same module, so direct iteration seeds with the default epoch and set_epoch semantics stay unchanged.
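The failure mode and the fix can be sketched with simplified stand-ins. This is an illustration under stated assumptions, not Accelerate's code: FakeGenerator, ShuffledDataset, ShardBefore and ShardAfter are hypothetical names, random.Random stands in for torch.Generator, and the real __iter__ does more than the single branch modeled here.

```python
import random


class FakeGenerator:
    """Hypothetical stand-in for torch.Generator; only manual_seed is modeled."""

    def __init__(self):
        self._rng = random.Random()

    def manual_seed(self, seed):
        self._rng.seed(seed)

    def shuffled(self, items):
        items = list(items)
        self._rng.shuffle(items)
        return items


class ShuffledDataset:
    """Iterable dataset that carries a generator and defines no set_epoch."""

    def __init__(self, n):
        self.n = n
        self.generator = FakeGenerator()

    def __iter__(self):
        return iter(self.generator.shuffled(range(self.n)))


class ShardBefore:
    """Simplified shard: epoch is assigned only in set_epoch (the bug)."""

    def __init__(self, dataset):
        self.dataset = dataset

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        # Branch for datasets with a generator attribute and no set_epoch method.
        if not hasattr(self.dataset, "set_epoch") and hasattr(self.dataset, "generator"):
            # Raises AttributeError here if set_epoch was never called.
            self.dataset.generator.manual_seed(self.epoch)
        yield from self.dataset


class ShardAfter(ShardBefore):
    """Same shard with the fix applied: epoch defaults to 0 in __init__."""

    def __init__(self, dataset):
        super().__init__(dataset)
        self.epoch = 0


# Direct iteration without set_epoch: fails before the fix, works after.
try:
    list(ShardBefore(ShuffledDataset(8)))
    raised = False
except AttributeError:
    raised = True

first = list(ShardAfter(ShuffledDataset(8)))
second = list(ShardAfter(ShuffledDataset(8)))
```

In this sketch, raised ends up True for the unpatched shard, while the patched shard yields the same order on two fresh iterations because both seed with the default epoch 0. Calling set_epoch before iterating behaves identically in both versions.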
What does this PR do?
Iterating an IterableDatasetShard that wraps a dataset carrying a torch.Generator (and no set_epoch method) raises AttributeError: 'IterableDatasetShard' object has no attribute 'epoch' unless the caller happens to call set_epoch first.
Root cause: __init__ never initializes self.epoch; the only assignment lives in set_epoch. But the first branch of __iter__ reads self.epoch to seed the generator for precisely this dataset shape (hasattr(dataset, "generator") and no set_epoch). Accelerate's own managed path calls set_epoch before iterating, which is why the crash only bites direct users of the class, a usage pattern the existing tests treat as supported (check_iterable_dataset_shards iterates shards directly; it only survives because its test dataset has no generator attribute).
Fix: initialize self.epoch = 0 in __init__, matching SeedableRandomSampler in the same module (and transformers' IterableDatasetShard). Direct iteration now seeds with the default epoch 0; set_epoch semantics are unchanged.
Test: test_iterable_dataset_shard_without_set_epoch reproduces the exact shape (generator-carrying dataset, no set_epoch call): it fails before the fix with the AttributeError at data_loader.py:346 and passes after; it also asserts that the default seeding is deterministic. Full tests/test_data_loader.py: 25 passed, 13 skipped. ruff check and ruff format --check are clean on both files.
Before submitting