Adding torch accelerator to FSDP2 example #36

Open
dggaytan wants to merge 1 commit into main from dggaytan/distributed_FSDP2

@dggaytan dggaytan commented Jul 8, 2025

Collaborator

Updating the train.py FSDP2 script to use torch.accelerator

@jafraustro

Owner

Add the example to the CI.

You can do something similar to this: https://github.com/pytorch/examples/pull/1364/files#diff-ea329c10af35989ab732dc10a3dc98101f71c7fc96f1d8e186cba9088f33a216

@dggaytan dggaytan force-pushed the dggaytan/distributed_FSDP2 branch from b43765f to 959f8bf Compare July 11, 2025 21:21
@dggaytan

Collaborator Author

@jafraustro done. Can you please take a look before I open the PR upstream?

Comment thread distributed/FSDP2/run_example.sh Outdated
Comment thread distributed/FSDP2/run_example.sh
@dggaytan dggaytan force-pushed the dggaytan/distributed_FSDP2 branch 2 times, most recently from 240e295 to 68972e8 Compare July 14, 2025 19:54
@dggaytan dggaytan requested a review from jafraustro July 14, 2025 19:56
Comment thread distributed/FSDP2/requirements.txt Outdated
Comment thread distributed/FSDP2/run_example.sh
Comment thread run_distributed_examples.sh Outdated
@dggaytan dggaytan force-pushed the dggaytan/distributed_FSDP2 branch from 68972e8 to 97165f0 Compare July 15, 2025 19:49
@dggaytan dggaytan requested a review from jafraustro July 15, 2025 19:55

@jafraustro jafraustro left a comment

Owner

Please address these minor changes.

Comment thread distributed/FSDP2/example.py Outdated
parser.add_argument("--mixed-precision", action="store_true", default=False)
parser.add_argument("--dcp-api", action="store_true", default=False)
args = parser.parse_args()
_min_gpu_count = 2

Owner

This would be better placed at the beginning of the main function.

Comment thread distributed/FSDP2/run_example.sh Outdated
Comment on lines +9 to +10
echo "Launching ${1:-example.py} with ${2:-4} gpus"
torchrun --nnodes=2 --nproc_per_node=${2:-4} ${1:-example.py}

Owner

Suggested change
- echo "Launching ${1:-example.py} with ${2:-4} gpus"
- torchrun --nnodes=2 --nproc_per_node=${2:-4} ${1:-example.py}
+ echo "Launching ${1:-example.py} with ${2:-2} gpus"
+ torchrun --nproc_per_node=${2:-2} ${1:-example.py}
  • The verify_min_gpu_count function asks for 2 GPUs, so the default here should be 2 as well.
  • Remove --nnodes=2: that flag is for multi-node training, which is not possible with the current CI.
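
The launch lines above rely on shell default-value expansion, which is easy to mistype. A minimal sketch for checking the defaults without GPUs or torchrun installed, assuming a hypothetical helper `build_launch_cmd` (not part of this PR) that only prints the command instead of running it:

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of the PR): prints the torchrun
# command that run_example.sh would launch, without executing it.
build_launch_cmd() {
    # ${1:-example.py} -> first argument, or example.py when unset or empty
    # ${2:-2}          -> second argument, or 2 when unset or empty
    echo "torchrun --nproc_per_node=${2:-2} ${1:-example.py}"
}

build_launch_cmd              # prints: torchrun --nproc_per_node=2 example.py
build_launch_cmd train.py 4   # prints: torchrun --nproc_per_node=4 train.py
```

The `-` after the colon matters: in Bash, `${2:2}` is substring expansion (the characters of `$2` from offset 2 onward), not a default, so it yields an empty string when the second argument is missing or shorter than three characters.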

@dggaytan dggaytan force-pushed the dggaytan/distributed_FSDP2 branch 2 times, most recently from d452326 to 3d54e15 Compare July 21, 2025 19:47
Signed-off-by: dggaytan <diana.gaytan.munoz@intel.com>
@dggaytan dggaytan force-pushed the dggaytan/distributed_FSDP2 branch from 1f0d7d3 to 5e960d8 Compare July 24, 2025 17:27
2 participants