Commit 326edb0

add flight recorder tutorial (#3814)
Summary: Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
1 parent cdc645a commit 326edb0

3 files changed

Lines changed: 508 additions & 0 deletions

distributed.rst

Lines changed: 1 addition & 0 deletions
@@ -211,6 +211,7 @@ Custom Extensions
    intermediate/rpc_param_server_tutorial
    intermediate/rpc_async_execution
    intermediate/monarch_distributed_tutorial
+   intermediate/debug_hangs_with_flight_recorder
    advanced/rpc_ddp_tutorial
    advanced/generic_join
    beginner/distributed_training_with_ray_tutorial

index.rst

Lines changed: 7 additions & 0 deletions
@@ -724,6 +724,13 @@ Welcome to PyTorch Tutorials
    :link: intermediate/monarch_distributed_tutorial.html
    :tags: Parallel-and-Distributed-Training
 
+.. customcarditem::
+   :header: Debugging Hangs with Flight Recorder Using TorchComms and Debug Server
+   :card_description: Diagnose hangs using the TorchComms Flight Recorder and Debug Server periodic dumps.
+   :image: _static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: intermediate/debug_hangs_with_flight_recorder.html
+   :tags: Parallel-and-Distributed-Training,Debugging
+
 .. Edge
 
 .. customcarditem::
