Skip to content

Add conv_bwd_weight illegal-memory-access debugging example (#227)#227

Closed
FindHao wants to merge 1 commit into
mainfrom
export-D108450044
Closed

Add conv_bwd_weight illegal-memory-access debugging example (#227)#227
FindHao wants to merge 1 commit into
mainfrom
export-D108450044

Conversation

@FindHao

@FindHao FindHao commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary:

Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a
GPU illegal memory access down to a single SASS instruction.

The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton
kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on
Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even
though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace
captures the per-lane out-of-bounds addresses at the faulting load, and
reg_trace traces that address back to its source registers, pinning the bug on
ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a
predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1
on the same PTX yields 0 sanitizer errors.

Contents: a standalone torch+triton reproducer plus sample mem_addr_trace /
reg_trace / PTX-vs-SASS excerpts and a README.

Authored with the assistance of an AI coding agent.

Reviewed By: warrendeng

Differential Revision: D108450044

Copilot AI review requested due to automatic review settings June 12, 2026 19:31
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 12, 2026
@meta-codesync

meta-codesync Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@FindHao has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108450044.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new CUTracer case-study example demonstrating how to localize a Blackwell (sm100) convolution backward-weight illegal memory access down to a single SASS instruction, using CUTracer traces and excerpts.

Changes:

  • Add examples/conv_bwd_weight_ima/ with a standalone torch+triton reproducer kernel for the failing autotune config.
  • Add a walkthrough README plus sample mem_addr_trace / reg_trace / PTX-vs-SASS excerpts used to pinpoint the root cause.
  • Include captured trace excerpts that illustrate the per-lane OOB addresses and their register provenance.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
examples/conv_bwd_weight_ima/README.md Walkthrough and reproduction steps for the IMA debugging workflow
examples/conv_bwd_weight_ima/mini_repro_bwd_weight.py Standalone torch+triton reproducer kernel
examples/conv_bwd_weight_ima/example_mem_addr_excerpt.txt Sample mem_addr_trace excerpt showing per-lane addresses at the faulting PC
examples/conv_bwd_weight_ima/example_reg_trace_excerpt.txt Sample reg_trace excerpt tracing the bad address back to registers
examples/conv_bwd_weight_ima/example_ptx_vs_sass.txt PTX masking vs miscompiled SASS excerpt illustrating the suspected ptxas issue

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +68 to +84
```bash
buck2 run //triton/tools/CUTracer:cutracer -c fbcode.nvcc_arch=b200a -- trace \
--instrument mem_addr_trace \
--kernel-filters triton_convolution2d_bwd_weight \
--output-dir /tmp/ct_mem \
-- python mini_repro_bwd_weight.py
```

Trace the bad address back to its source registers (`reg_trace`):

```bash
buck2 run //triton/tools/CUTracer:cutracer -c fbcode.nvcc_arch=b200a -- trace \
--instrument reg_trace \
--kernel-filters triton_convolution2d_bwd_weight \
--output-dir /tmp/ct_reg \
-- python mini_repro_bwd_weight.py
```
dY = torch.randn(2, 4, 18, 18, device=dev) # grad_output
out = torch.zeros(4, 3, 1, 1, device=dev) # grad_weight (OUT_C, IN_C, 1, 1)

grid = (1, 1, 1) # ceil(3/512), ceil(4/16), GROUPS
@meta-codesync meta-codesync Bot changed the title Add conv_bwd_weight illegal-memory-access debugging example Add conv_bwd_weight illegal-memory-access debugging example (#227) Jun 12, 2026
meta-codesync Bot pushed a commit that referenced this pull request Jun 12, 2026
Summary:

Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a
GPU illegal memory access down to a single SASS instruction.

The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton
kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on
Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even
though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace
captures the per-lane out-of-bounds addresses at the faulting load, and
reg_trace traces that address back to its source registers, pinning the bug on
ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a
predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1
on the same PTX yields 0 sanitizer errors.

Contents: a standalone torch+triton reproducer plus sample mem_addr_trace /
reg_trace / PTX-vs-SASS excerpts and a README.

Authored with the assistance of an AI coding agent.

Differential Revision: D108450044
@meta-codesync meta-codesync Bot force-pushed the export-D108450044 branch from a28109f to 3707753 Compare June 12, 2026 20:37
Summary:

Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a
GPU illegal memory access down to a single SASS instruction.

The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton
kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on
Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even
though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace
captures the per-lane out-of-bounds addresses at the faulting load, and
reg_trace traces that address back to its source registers, pinning the bug on
ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a
predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1
on the same PTX yields 0 sanitizer errors.

Contents: a standalone torch+triton reproducer plus sample mem_addr_trace /
reg_trace / PTX-vs-SASS excerpts and a README.

Authored with the assistance of an AI coding agent.

Reviewed By: warrendeng

Differential Revision: D108450044
@meta-codesync meta-codesync Bot force-pushed the export-D108450044 branch from 3707753 to 9341d9d Compare June 15, 2026 16:16
@meta-codesync meta-codesync Bot closed this in 59a36a9 Jun 15, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 15, 2026
@meta-codesync

meta-codesync Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

This pull request has been merged in 59a36a9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. Merged meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants