Add Automicrobatching for Non-Powers-of-2 + Fixes to FSDP deadlocks using Adaptive Sync Hooks #3503
JackZ-db wants to merge 23 commits into mosaicml:main from
Conversation
mvpatel2000
left a comment
Re-request once the test passes!
mvpatel2000
left a comment
First pass: the design looks right, but the code needs some cleanup.
@no_type_check
def unshard(self):
    """Run the unshard logic.

    This is an unpatched method from PyTorch, meant to be reverted to
    whenever automicrobatching turns off its hooks for increased throughput.

    This includes all-gathering the flat parameter and switching to using
    the unsharded flat parameter. If the handle does not need unsharding,
    then this only switches to using the unsharded flat parameter. For
    ``NO_SHARD``, this is a no-op.

    If FSDP is in :meth:`summon_full_params` and the handle uses parameter
This should probably be in the if torch 2.3.1 section
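For illustration, a minimal sketch of what gating this behind the torch-version check could look like; the helper names (patch_unshard, restore_unshard, patched_unshard) are hypothetical and not the PR's actual API:

# Sketch only, assuming torch >= 2.1 where FlatParamHandle lives in
# torch.distributed.fsdp._flat_param; names below are illustrative.
from packaging import version

import torch
from torch.distributed.fsdp._flat_param import FlatParamHandle

# Keep a reference to the stock method so automicrobatching can revert to it
# (for throughput) once a stable microbatch size has been found.
_ORIGINAL_UNSHARD = FlatParamHandle.unshard

def patch_unshard(patched_unshard) -> None:
    """Install the automicrobatching-aware unshard, gated on the torch version."""
    if version.parse(torch.__version__) >= version.parse('2.3.1'):
        FlatParamHandle.unshard = patched_unshard

def restore_unshard() -> None:
    """Revert to the unpatched unshard when the sync hooks are turned off."""
    FlatParamHandle.unshard = _ORIGINAL_UNSHARD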
if auto_microbatching:
can you add a comment on what this is doing?
def _double_device_train_microbatch_size(state: State):
    """Double device_train_microbatch_size when automicrobatching searches upward for a higher non-OOM microbatch size.
should this go into the automicrobatching utils folder?
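As an aside, a minimal sketch of the doubling step under discussion (the upper_bound argument here is an assumption; the PR may bound the upward search differently):

from composer.core import State

def _double_device_train_microbatch_size(state: State, upper_bound: int) -> None:
    """Double the microbatch size during the upward phase of the search.

    Sketch only: `upper_bound` (e.g. the per-device train batch size) is assumed
    to be tracked by the caller so the search never overshoots the full batch.
    """
    state.device_train_microbatch_size = min(state.device_train_microbatch_size * 2, upper_bound)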
    num_consecutive_thrashes = 0
    return num_consecutive_thrashes


def _handle_downward_search_in_automicrobatching(state: State, lowest_oom_microbatch_size: int, highest_non_oom_microbatch_size: int, lower_bound_microbatch_size: int, num_search_steps: int, max_search_steps: int):
same comment on moving to utils?
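For reference, the signature above reads like a bounded binary search between the highest size known to fit and the lowest size known to OOM; a rough sketch of that step (illustrative only, not the PR's exact logic):

def _next_microbatch_size_to_try(
    lowest_oom_microbatch_size: int,
    highest_non_oom_microbatch_size: int,
    num_search_steps: int,
    max_search_steps: int,
) -> int:
    """Pick the next candidate size, midway between the known bounds.

    Sketch only: once the step budget is exhausted or the bounds meet,
    settle on the highest size that did not OOM.
    """
    if num_search_steps >= max_search_steps or highest_non_oom_microbatch_size + 1 >= lowest_oom_microbatch_size:
        return highest_non_oom_microbatch_size
    return (lowest_oom_microbatch_size + highest_non_oom_microbatch_size) // 2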
if parallelism_config is not None:
    # Patch PyTorch to fix distributed bugs
    patch_pytorch()
    patch_unshard_for_automicrobatching(self.auto_microbatch_size_found)
this should just be part of patch_pytorch to simplify the interface
we need to pass in a boolean variable telling it how to patch this one specific method though - i feel like it would be less readable if we passed self.auto_microbatch_size_found directly into patch_pytorch
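To make the trade-off concrete, a sketch of the folded-in interface being discussed (names are illustrative, not composer's actual API):

def _patch_unshard(auto_microbatch_size_found: bool) -> None:
    """Placeholder for the unshard patch/restore logic (see the earlier sketch)."""

def patch_pytorch(auto_microbatch_size_found: bool = False) -> None:
    """Apply the version-gated distributed patches in one place."""
    # ... existing patches would run here ...
    _patch_unshard(auto_microbatch_size_found)

# The trainer call site would then collapse to a single call:
# patch_pytorch(auto_microbatch_size_found=self.auto_microbatch_size_found)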
# Sync for OOMs
found_cuda_oom = _found_ooms_across_ranks(self.state, found_cuda_oom)
this block is really complicated. let's move it to a helper fn
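For context, agreeing on an OOM across ranks is usually done with an all-reduce of a flag tensor; a hedged sketch of what a helper like this could look like (not necessarily the PR's implementation, which takes the trainer state instead):

import torch
import torch.distributed as dist

def _found_ooms_across_ranks(found_cuda_oom_local: bool, device: torch.device) -> bool:
    """Return True on every rank if any rank hit a CUDA OOM this microbatch.

    Sketch only: every rank must call this, otherwise the collective hangs,
    which is exactly the deadlock the adaptive sync hooks are meant to avoid.
    """
    flag = torch.tensor([1 if found_cuda_oom_local else 0], dtype=torch.int64, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())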
with torch.no_grad(), model_eval_mode(self.state.model):
    if self.state.fsdp_enabled and self.first_batch_complete:
        print("readd hooks for eval")
    extract_hparams,
)
from composer.utils.automicrobatching import (
    # _create_sync_hook,

    'validate_credentials',
    'build_remote_backend',
    'RemoteFilesExistingCheckStatus',
    # '_create_sync_hook',