Supporting automicrobatching on FSDP2 #3866

Draft

rithwik-db wants to merge 4 commits into main from hookhandles-fsdp2

Conversation

@rithwik-db
Contributor

@rithwik-db rithwik-db commented May 29, 2025

Draft PR for supporting automicrobatching on FSDP2

This isn't merged yet because we ran into some hiccups with how FSDP2 handles state transitions. FSDP2 is stateful and expects the program to stop when training hits an OOM. Since automicrobatching instead retries the batch with a reduced microbatch size, torch.distributed.fsdp._fully_shard._fsdp_common.TrainingState can be left in an illegal state, which can cause hangs or unexpected errors. Until there is a clear API for resetting this state, it would be finicky to work around it directly.
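To make the failure mode concrete, here is a minimal sketch of the kind of OOM-retry loop automicrobatching relies on. All names (`find_microbatch_size`, `fake_train_step`) are illustrative, not Composer's actual API; the point is that the retry in the `except` branch assumes the failed step left no state behind, which holds for FSDP1 but not for FSDP2's internal `TrainingState`.

```python
# Illustrative sketch, not Composer's real implementation.
def find_microbatch_size(train_step, initial_size, min_size=1):
    """Halve the microbatch size until a training step succeeds.

    With FSDP1 this retry is safe: the aborted step leaves no
    cross-step state. With FSDP2, the aborted step can leave the
    module's internal TrainingState mid-transition, so the retry
    itself may hang or error.
    """
    size = initial_size
    while size >= min_size:
        try:
            train_step(size)  # may raise an OOM-like RuntimeError
            return size
        except RuntimeError:
            size //= 2  # retry the same batch with a smaller microbatch
    raise RuntimeError("even the minimum microbatch size OOMs")


# Toy stand-in: pretend any microbatch larger than 4 runs out of memory.
def fake_train_step(size):
    if size > 4:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_microbatch_size(fake_train_step, initial_size=32))  # → 4
```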

fixed test issues

formatted

gated non-wrapped to FSDP1

updated for FSDP2

propagated changes to trainer

added minor test fix

formatted

formatted once more

addressed comments

formatted

minor fix
