Supporting automicrobatching on FSDP2 #3866

Draft

rithwik-db wants to merge 4 commits into main from hookhandles-fsdp2

Conversation

@rithwik-db
Contributor

@rithwik-db rithwik-db commented May 29, 2025

Draft PR for supporting automicrobatching on FSDP2

This isn't merged yet because we ran into some hiccups with how FSDP2 handles state transitions. FSDP2 is stateful and expects the program to stop when training hits an OOM. Since automicrobatching instead retries the batch with a reduced microbatch size, torch.distributed.fsdp._fully_shard._fsdp_common.TrainingState can be left in an illegal state, which can cause hangs or unexpected errors. Until there is a clear API for resetting this state, it would be finicky to work around it directly.
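To make the failure mode concrete, here is a minimal sketch of the kind of OOM-retry loop automicrobatching relies on. All names (`find_microbatch_size`, `fake_train_step`) are illustrative, not Composer's actual API; the point is that the retry in the `except` branch assumes the failed step left no state behind, which holds for FSDP1 but not for FSDP2's internal `TrainingState`.

```python
# Illustrative sketch, not Composer's real implementation.
def find_microbatch_size(train_step, initial_size, min_size=1):
    """Halve the microbatch size until a training step succeeds.

    With FSDP1 this retry is safe: the aborted step leaves no
    cross-step state. With FSDP2, the aborted step can leave the
    module's internal TrainingState mid-transition, so the retry
    itself may hang or error.
    """
    size = initial_size
    while size >= min_size:
        try:
            train_step(size)  # may raise an OOM-like RuntimeError
            return size
        except RuntimeError:
            size //= 2  # retry the same batch with a smaller microbatch
    raise RuntimeError("even the minimum microbatch size OOMs")


# Toy stand-in: pretend any microbatch larger than 4 runs out of memory.
def fake_train_step(size):
    if size > 4:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_microbatch_size(fake_train_step, initial_size=32))  # → 4
```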

fixed test issues

formatted

gated non-wrapped to FSDP1

updated for FSDP2

propagated changes to trainer

added minor test fix

formatted

formatted once more

addressed comments

formatted

minor fix
