|
| 1 | +# CUDA Graph Annotations Tutorial |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +A new tutorial has been added to the PyTorch tutorials repository. It demonstrates how to use CUDA graph kernel annotations to label kernels in profiler traces of graph replays and to visualize them in custom lanes. |
| 6 | + |
| 7 | +## File Location |
| 8 | + |
| 9 | +- **Tutorial file**: `advanced_source/cuda_graph_annotations_tutorial.py` |
| 10 | +- **Added to index**: `index.rst` (line ~518) |
| 11 | +- **Added to deep-dive**: `deep-dive.rst` (profiling section) |
| 12 | + |
| 13 | +## Tutorial Content |
| 14 | + |
| 15 | +The tutorial covers: |
| 16 | + |
| 17 | +1. **Introduction to CUDA Graph Annotations**: Why they're useful for profiling complex graph executions |
| 18 | +2. **Building an Example Model**: A simple transformer block with multiple annotated regions |
| 19 | +3. **Using `mark_kernels()`**: The key API for annotating kernel regions with semantic labels |
| 20 | +4. **Graph Capture with Annotations**: How to enable annotations during CUDA graph capture |
| 21 | +5. **Profiling**: Recording execution traces of graph replays |
| 22 | +6. **Post-Processing**: Merging annotations back into traces for custom visualization lanes |
| 23 | +7. **Visualization**: How to view annotated traces in chrome://tracing |
| 24 | +8. **Troubleshooting**: Common issues and solutions |
| 25 | + |
| 26 | +## Key Features |
| 27 | + |
| 28 | +- **Complete end-to-end workflow**: From model definition to visualization |
| 29 | +- **Practical example**: Uses a realistic transformer block |
| 30 | +- **Custom stream assignments**: Shows how to organize kernels into semantic lanes |
| 31 | +- **Before/after comparison**: Demonstrates the value of annotations |
| 32 | +- **Comprehensive documentation**: Includes requirements, troubleshooting, and advanced usage |
| 33 | + |
| 34 | +## Tutorial Style |
| 35 | + |
| 36 | +The tutorial follows PyTorch tutorials conventions: |
| 37 | +- Uses Sphinx Gallery format with docstring sections |
| 38 | +- Includes grid cards for learning objectives and prerequisites |
| 39 | +- Code is well-commented and organized into logical sections |
| 40 | +- Provides practical, runnable examples |
| 41 | +- Explains both the "how" and "why" |
| 42 | + |
| 43 | +## Integration |
| 44 | + |
| 45 | +The tutorial is integrated into the site as follows: |
| 46 | +1. Listed in the main index (`index.rst`) under Model Optimization/Profiling |
| 47 | +2. Listed in the Deep Dive section (`deep-dive.rst`) alongside other profiling tutorials |
| 48 | +3. Tagged `Model-Optimization`, `Best-Practice`, `Profiling`, `CUDA` |
| 49 | + |
| 50 | +## Building |
| 51 | + |
| 52 | +To build just this tutorial: |
| 53 | +```bash |
| 54 | +cd tutorials |
| 55 | +GALLERY_PATTERN="cuda_graph_annotations_tutorial.py" make html |
| 56 | +``` |
| 57 | + |
| 58 | +To build all tutorials: |
| 59 | +```bash |
| 60 | +cd tutorials |
| 61 | +make docs |
| 62 | +``` |
| 63 | + |
| 64 | +## Requirements |
| 65 | + |
| 66 | +- PyTorch 2.0+ |
| 67 | +- CUDA-capable GPU |
| 68 | +- A driver (or `cuda-compat` package) that supports CUDA 13.1 or later, for full annotation support |
| 69 | +- cuda-python package (`pip install cuda-python`) |
| 70 | + |
| 71 | +Note: On older drivers, the tutorial detects that annotation support is unavailable and reports this with a message instead of failing. |
| 72 | + |
| 73 | +## Next Steps |
| 74 | + |
| 75 | +1. Test the full build in CI |
| 76 | +2. Add a thumbnail image to `_static/img/thumbnails/cropped/` |
| 77 | +3. Verify the tutorial renders correctly in the documentation site |
| 78 | +4. Consider adding more advanced examples (multi-GPU, distributed training) |
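
The post-processing step listed under Tutorial Content (merging annotations back into a trace to create custom visualization lanes) can be illustrated with a minimal sketch. This is not the tutorial's code and uses no PyTorch API: the function name `assign_annotation_lanes`, the `annotations` mapping from kernel name to label, the kernel names, and the lane numbering are all hypothetical. It assumes only the Chrome trace event format, where a complete event (`"ph": "X"`) carries `name`, `pid`, `tid`, `ts`, and `dur`, and a `thread_name` metadata event (`"ph": "M"`) names a lane.

```python
import json


def assign_annotation_lanes(trace, annotations, first_lane_tid=1000):
    """Return a copy of a Chrome trace with annotated kernels moved to per-label lanes.

    trace: dict with a "traceEvents" list (Chrome trace event format).
    annotations: hypothetical mapping {kernel name: semantic label}.
    first_lane_tid: hypothetical starting tid for the new lanes.
    """
    lanes = {}  # (pid, label) -> tid of the lane created for that label
    out_events = []
    for event in trace.get("traceEvents", []):
        event = dict(event)  # shallow copy so the input trace is not modified
        label = annotations.get(event.get("name"))
        if event.get("ph") == "X" and label is not None:
            key = (event.get("pid"), label)
            if key not in lanes:
                lanes[key] = first_lane_tid + len(lanes)
            event["tid"] = lanes[key]
            # Build a new args dict rather than mutating the shared one.
            event["args"] = {**event.get("args", {}), "annotation": label}
        out_events.append(event)

    # Name each new lane so the trace viewer shows the label instead of a number.
    for (pid, label), tid in lanes.items():
        out_events.append({
            "ph": "M", "name": "thread_name",
            "pid": pid, "tid": tid, "args": {"name": label},
        })

    merged = {k: v for k, v in trace.items() if k != "traceEvents"}
    merged["traceEvents"] = out_events
    return merged


# Hypothetical input: three events recorded on one stream (tid 7).
trace = {"traceEvents": [
    {"ph": "X", "name": "gemm_kernel", "pid": 0, "tid": 7, "ts": 0, "dur": 5},
    {"ph": "X", "name": "softmax_kernel", "pid": 0, "tid": 7, "ts": 5, "dur": 2},
    {"ph": "X", "name": "memcpy", "pid": 0, "tid": 7, "ts": 7, "dur": 1},
]}
annotations = {"gemm_kernel": "attention", "softmax_kernel": "attention"}

merged = assign_annotation_lanes(trace, annotations)
print(json.dumps(merged, indent=2))
```

In this sketch, unannotated events stay on their original lane, so the annotated and unannotated views can be compared side by side once the resulting JSON is loaded in a trace viewer.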