Commit 326edb0
add flight recorder tutorial (#3814)
Summary:
Add a tutorial for debugging single-rank hangs in distributed PyTorch
jobs using the TorchComms Flight Recorder, covering both aggregated text
dump analysis and per-rank pickle-based CLI detection workflows.
Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
1 parent cdc645a commit 326edb0
3 files changed
Lines changed: 508 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 211 | 211 | |
| 212 | 212 | |
| 213 | 213 | |
| | 214 | + |
| 214 | 215 | |
| 215 | 216 | |
| 216 | 217 | |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 724 | 724 | |
| 725 | 725 | |
| 726 | 726 | |
| | 727 | + |
| | 728 | + |
| | 729 | + |
| | 730 | + |
| | 731 | + |
| | 732 | + |
| | 733 | + |
| 727 | 734 | |
| 728 | 735 | |
| 729 | 736 | |