This document provides guidelines for writing PR descriptions in the AOTriton project based on analysis of historical PRs merged into upstream/main.
A PR description should follow this structure:
- Overview Section - High-level summary of what the PR does and why
- Major Changes Section - Detailed list of significant changes
- Minor Changes Section (optional) - List of smaller changes
- Known Issues/Problems Section (optional) - Document any limitations or issues
Start with a header # Overview followed by:
- What: A clear, concise explanation of what this PR does (1-3 paragraphs)
- Why: The motivation or context for the change
- Impact: Performance improvements, new features, or API changes if applicable
# Overview
This PR enables kernel pipelining and XCD (compute die) remapping
optimizations for gfx950 GPUs (MI350 series), significantly improving
performance of Flash Attention kernels.
This update delivers 904 TFLOPS performance on MI355X with mainline Triton JIT,
improved from 753 TFLOPS with the same Triton JIT process without pipelining.# Overview
This PR changes the API for varlen inputs and expects compact logsumexp
(LSE) tensor, following Tri's FlashAttention API.
Previously for varlen inputs with `B` sequences, `H` heads, AOTriton
expects a regular-sized LSE Tensor: `(B*H, Max_sequence_length)`.
However this approach requires padding at last dimension and wasting
memory when the sequence lengths change drastically within a batch.Use header ## Major Changes or # Major Changes followed by a bulleted list.
-
Use
* [category] Descriptionformat where category indicates the affected component:[kernel]- Triton kernel changes[api]- API changes (mark as**BREAKING**if breaking)[shim]- C++ shim layer changes[build]- Build system changes[test]- Test infrastructure changes[db]- Database changes[codegen]- Code generation changes[rules]- Kernel selection rules[tritonsrc]- Triton source code changes[ci]- CI/CD changes[binding]- Python binding changes[compiler]- Compiler changes[tune]- Tuning system changes
-
Use sub-bullets with
+for details under each major point -
Use inline code formatting for:
- Function/class names:
`attn_fwd` - File names:
`config.h` - Env vars:
`AOTRITON_SKIP_LUT_CHECK` - Options/flags:
`num_stages=2` - Constants:
`VarlenType::PaddedVarlen`
- Function/class names:
## Major Changes
* [api] **BREAKING** Use compact LSE for varlen inputs
* [kernel] Use compact LSE and Delta Tensor
* [kernel] `head_dim` argument in all SDPA Triton kernels is replaced by
`hdim_qk` and `hdim_vo` arguments to support this feature.
* [shim] The dispatcher will use the last dimensions of `Q` and `V` tensor as
`hdim_qk` and `hdim_vo`
* [test] Add test cases to test `hdim_qk != hdim_vo`. New cases are added to
+ `test_fast` case (`FOR_RELEASE=0`)
+ `test_hdim_qk_ne_vo` (`FOR_RELEASE=3`)Use header ## Minor Changes or # Minor Changes.
Follow similar categorization and formatting as Major Changes, but for:
- Refactoring that doesn't change functionality
- Documentation updates
- Small bug fixes
- Tool improvements
- Code cleanup
## Minor Changes
* [test] Adjust unit tests for varlen compact LSE tensor
* [docs] Update comments about input dimensions in V2 header file
* [build] Replace `git log -1 --format=%H` with `git rev-parse HEAD`Use header ## Known Issues, ## Known Problems, or # Known Issues.
Document:
- Features not yet implemented
- Database updates pending
- Platform-specific issues
- Performance concerns
- Test coverage gaps
- Temporary workarounds needed
## Known Issues
* [db] The operator tuning database is not updated.
* [test] The number of functionals supported by FWD AITER ASM kernels are
fairly limited.
* [aiter] The following AITER ASM functions need more investigation:
- "Group mode" AITER ASM kernels to support varlen
- MQA/GQA supportIf needed, add a ## Notes section for important clarifications:
## Notes
* The code uses `hdim_vo` instead of `hdim_v` for better alignment with
`hdim_qk`, and emphasizes the output tensor (`o`) should follow Tensor `V`'s
head dimension.- Mark as
**BREAKING**in the item description - Clearly explain the old vs new behavior
- Provide migration guidance if applicable
- Include concrete numbers when available
- Specify test conditions/commands used to measure
- Use TFLOPS or other relevant metrics
- Always note if database is updated or pending update
- Mention affected architectures
- Ignore v2src and v2python: These directories are deprecated
- Do not document changes to deprecated components unless they affect non-deprecated code
Use triple backticks for:
- Command examples
- Configuration snippets
- Code samples
- Link to related issues:
This fixes #98 - Reference external docs when relevant
- Be concise but complete
- Use technical terminology correctly
- Focus on what changed and why it matters
- Avoid redundant information
- Use imperative mood for changes ("Add support", not "Added support")
- Group related changes together under the same bullet
Before submitting a PR description:
- Overview explains the "what" and "why"
- Major changes are properly categorized
- Breaking changes are clearly marked
- Performance impacts are quantified when available
- Known issues are documented
- Code/file names use inline code formatting
- No changes to v2src/v2python are documented (deprecated)