
Commit 5b4758a

yushangdi and claude committed
Move main() execution guard for Sphinx Gallery
Move `if __name__ == "__main__": main()` to immediately after the main() function definition (line ~404) so it executes during the Sphinx Gallery build process.

Sphinx Gallery requires the execution guard to be positioned right after the function definition, not at the end of the file, to properly capture and execute the tutorial code during documentation generation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent b7cc171 commit 5b4758a

10 files changed

Lines changed: 81 additions & 3 deletions

CUDA_GRAPH_TUTORIAL_README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+# CUDA Graph Annotations Tutorial
+
+## Overview
+
+A new tutorial has been added to the PyTorch tutorials repository that demonstrates how to use CUDA graph kernel annotations for enhanced profiling and visualization.
+
+## File Location
+
+- **Tutorial file**: `advanced_source/cuda_graph_annotations_tutorial.py`
+- **Added to index**: `index.rst` (line ~518)
+- **Added to deep-dive**: `deep-dive.rst` (profiling section)
+
+## Tutorial Content
+
+The tutorial covers:
+
+1. **Introduction to CUDA Graph Annotations**: Why they're useful for profiling complex graph executions
+2. **Building an Example Model**: A simple transformer block with multiple annotated regions
+3. **Using `mark_kernels()`**: The key API for annotating kernel regions with semantic labels
+4. **Graph Capture with Annotations**: How to enable annotations during CUDA graph capture
+5. **Profiling**: Recording execution traces of graph replays
+6. **Post-Processing**: Merging annotations back into traces for custom visualization lanes
+7. **Visualization**: How to view annotated traces in chrome://tracing
+8. **Troubleshooting**: Common issues and solutions
+
+## Key Features
+
+- **Complete end-to-end workflow**: From model definition to visualization
+- **Practical example**: Uses a realistic transformer block
+- **Custom stream assignments**: Shows how to organize kernels into semantic lanes
+- **Before/after comparison**: Demonstrates the value of annotations
+- **Comprehensive documentation**: Includes requirements, troubleshooting, and advanced usage
+
+## Tutorial Style
+
+The tutorial follows PyTorch tutorials conventions:
+- Uses Sphinx Gallery format with docstring sections
+- Includes grid cards for learning objectives and prerequisites
+- Code is well-commented and organized into logical sections
+- Provides practical, runnable examples
+- Explains both the "how" and "why"
+
+## Integration
+
+The tutorial has been integrated into:
+1. Main index (`index.rst`) under Model Optimization/Profiling
+2. Deep Dive section (`deep-dive.rst`) alongside other profiling tutorials
+3. Tagged appropriately: `Model-Optimization`, `Best-Practice`, `Profiling`, `CUDA`
+
+## Building
+
+To build just this tutorial:
+```bash
+cd tutorials
+GALLERY_PATTERN="cuda_graph_annotations_tutorial.py" make html
+```
+
+To build all tutorials:
+```bash
+cd tutorials
+make docs
+```
+
+## Requirements
+
+- PyTorch 2.0+
+- CUDA-capable GPU
+- Driver/CUDA-compat >= 13.1 for full annotation support
+- cuda-python package (`pip install cuda-python`)
+
+Note: The tutorial gracefully handles older drivers by detecting annotation support and providing appropriate messages.
+
+## Next Steps
+
+1. Test the full build in CI
+2. Add a thumbnail image to `_static/img/thumbnails/cropped/`
+3. Verify the tutorial renders correctly in the documentation site
+4. Consider adding more advanced examples (multi-GPU, distributed training)
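
The README's Requirements note says the tutorial detects annotation support on older drivers, but this commit does not show how. The following is only a sketch of one way such a check could look, assuming the driver version arrives in the integer encoding that CUDA's `cudaDriverGetVersion` reports (1000 × major + 10 × minor). The helper names `decode_cuda_version` and `supports_annotations` are hypothetical and are not taken from the tutorial; the 13.1 threshold is the one stated in the README.

```python
# Hypothetical helpers; the tutorial's real detection logic is not shown in this commit.
MIN_ANNOTATION_VERSION = (13, 1)  # threshold quoted in the README's Requirements list


def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split CUDA's integer version (1000*major + 10*minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def supports_annotations(encoded_driver_version: int) -> bool:
    """Return True when the decoded driver version meets the README threshold."""
    return decode_cuda_version(encoded_driver_version) >= MIN_ANNOTATION_VERSION


def describe_support(encoded_driver_version: int) -> str:
    """Build the kind of message a tutorial could print instead of failing."""
    major, minor = decode_cuda_version(encoded_driver_version)
    if supports_annotations(encoded_driver_version):
        return f"Driver {major}.{minor}: annotations available."
    return f"Driver {major}.{minor}: annotations unavailable, skipping that step."


if __name__ == "__main__":
    for encoded in (12040, 13000, 13010):
        print(describe_support(encoded))
```

Comparing `(major, minor)` tuples rather than raw integers keeps the threshold readable and avoids mistakes with the encoding's unused digits.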
130 KB
312 KB

advanced_source/cuda_graph_annotations_tutorial.py

Lines changed: 3 additions & 3 deletions
@@ -402,6 +402,9 @@ def main():
     print("the semantic kernel lanes.")
     print("="*60)
 
+if __name__ == "__main__":
+    main()
+
 ###############################################################################
 # Visualizing Results
 # -------------------
@@ -490,6 +493,3 @@ def main():
 # This technique is especially valuable for large models with many components,
 # distributed training setups, or any scenario where understanding the
 # execution structure is critical for performance optimization.
-
-if __name__ == "__main__":
-    main()
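
The layout this diff produces can be sketched as a stripped-down script. This is an illustration only, not the tutorial's code: the title and the body of `main()` are placeholders. Sphinx-Gallery splits a script into code and text blocks at the `#` separator lines, so in this layout the guard falls in the code block that ends just before the "Visualizing Results" text block.

```python
"""
Hypothetical Tutorial Title
===========================

Opening docstring: Sphinx-Gallery renders this as the page introduction.
"""


def main():
    # Placeholder for the tutorial body (capture, profiling, post-processing).
    message = "tutorial body ran"
    print(message)
    return message


# Guard placed directly after main(), as in this commit, so the call sits in
# the same code block as the function definition.
if __name__ == "__main__":
    main()

###############################################################################
# Visualizing Results
# -------------------
# A line of '#' characters starts a new text block; the comment lines that
# follow are rendered as prose rather than executed.
```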
246 Bytes
Binary file not shown.
4.13 KB
Binary file not shown.
4.23 KB
Binary file not shown.
246 Bytes
Binary file not shown.

traces/trace_annotated.json.gz

4.11 KB
Binary file not shown.

traces/trace_raw.json.gz

4.22 KB
Binary file not shown.
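
The two trace archives are binary, so their contents are not visible in this view. Assuming they hold Chrome trace-event JSON (the format chrome://tracing reads), a quick way to compare the raw and annotated traces is to total the kernel time per lane. The sketch below uses a small synthetic trace as a stand-in; the event names, lane ids, and helper functions are made up for illustration and are not taken from the committed files.

```python
import gzip
import io
import json
from collections import defaultdict


def load_trace_events(raw_bytes: bytes) -> list:
    """Decompress a gzipped Chrome trace and return its list of events."""
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as fh:
        payload = json.load(fh)
    # A trace is either a bare event list or an object with a "traceEvents" key.
    return payload["traceEvents"] if isinstance(payload, dict) else payload


def busy_time_per_lane(events: list) -> dict:
    """Sum the durations of complete ("X") events for each (pid, tid) lane."""
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":
            totals[(event.get("pid"), event.get("tid"))] += event.get("dur", 0)
    return dict(totals)


# Synthetic stand-in for an annotated trace; names and lane ids are hypothetical.
sample = {
    "traceEvents": [
        {"name": "attention_kernel", "ph": "X", "ts": 0, "dur": 40, "pid": 0, "tid": 101},
        {"name": "mlp_kernel", "ph": "X", "ts": 40, "dur": 25, "pid": 0, "tid": 102},
        {"name": "attention_kernel", "ph": "X", "ts": 65, "dur": 10, "pid": 0, "tid": 101},
        {"name": "process_name", "ph": "M", "pid": 0, "args": {"name": "GPU 0"}},
    ]
}
compressed = gzip.compress(json.dumps(sample).encode("utf-8"))
lanes = busy_time_per_lane(load_trace_events(compressed))
print(lanes)
```

To try it on a real file, read the archive in binary mode and pass the bytes to `load_trace_events`, for example with the contents of `traces/trace_annotated.json.gz`.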
