Add conv_bwd_weight illegal-memory-access debugging example (#227) by FindHao · Pull Request #227 · facebookexperimental/CUTracer

FindHao · 2026-06-12T19:31:39Z

Summary:

Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a
GPU illegal memory access down to a single SASS instruction.

The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton
kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on
Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even
though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace
captures the per-lane out-of-bounds addresses at the faulting load, and
reg_trace traces that address back to its source registers, pinning the bug on
ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a
predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1
on the same PTX yields 0 sanitizer errors.

Contents: a standalone torch+triton reproducer plus sample mem_addr_trace /
reg_trace / PTX-vs-SASS excerpts and a README.

Authored with the assistance of an AI coding agent.

Reviewed By: warrendeng

Differential Revision: D108450044

meta-codesync · 2026-06-12T19:31:48Z

@FindHao has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108450044.

Copilot

Pull request overview

Adds a new CUTracer case-study example demonstrating how to localize a Blackwell (sm100) convolution backward-weight illegal memory access down to a single SASS instruction, using CUTracer traces and excerpts.

Changes:

Add examples/conv_bwd_weight_ima/ with a standalone torch+triton reproducer kernel for the failing autotune config.
Add a walkthrough README plus sample mem_addr_trace / reg_trace / PTX-vs-SASS excerpts used to pinpoint the root cause.
Include captured trace excerpts that illustrate the per-lane OOB addresses and their register provenance.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
examples/conv_bwd_weight_ima/README.md	Walkthrough and reproduction steps for the IMA debugging workflow
examples/conv_bwd_weight_ima/mini_repro_bwd_weight.py	Standalone torch+triton reproducer kernel
examples/conv_bwd_weight_ima/example_mem_addr_excerpt.txt	Sample `mem_addr_trace` excerpt showing per-lane addresses at the faulting PC
examples/conv_bwd_weight_ima/example_reg_trace_excerpt.txt	Sample `reg_trace` excerpt tracing the bad address back to registers
examples/conv_bwd_weight_ima/example_ptx_vs_sass.txt	PTX masking vs miscompiled SASS excerpt illustrating the suspected `ptxas` issue

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+```bash
+buck2 run //triton/tools/CUTracer:cutracer -c fbcode.nvcc_arch=b200a -- trace \
+    --instrument mem_addr_trace \
+    --kernel-filters triton_convolution2d_bwd_weight \
+    --output-dir /tmp/ct_mem \
+    -- python mini_repro_bwd_weight.py
+```
+
+Trace the bad address back to its source registers (`reg_trace`):
+
+```bash
+buck2 run //triton/tools/CUTracer:cutracer -c fbcode.nvcc_arch=b200a -- trace \
+    --instrument reg_trace \
+    --kernel-filters triton_convolution2d_bwd_weight \
+    --output-dir /tmp/ct_reg \
+    -- python mini_repro_bwd_weight.py
+```


+    dY = torch.randn(2, 4, 18, 18, device=dev)  # grad_output
+    out = torch.zeros(4, 3, 1, 1, device=dev)  # grad_weight (OUT_C, IN_C, 1, 1)
+
+    grid = (1, 1, 1)  # ceil(3/512), ceil(4/16), GROUPS


Summary: Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a GPU illegal memory access down to a single SASS instruction. The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace captures the per-lane out-of-bounds addresses at the faulting load, and reg_trace traces that address back to its source registers, pinning the bug on ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1 on the same PTX yields 0 sanitizer errors. Contents: a standalone torch+triton reproducer plus sample mem_addr_trace / reg_trace / PTX-vs-SASS excerpts and a README. Authored with the assistance of an AI coding agent. Differential Revision: D108450044

Summary: Adds examples/conv_bwd_weight_ima/, a CUTracer case study for root-causing a GPU illegal memory access down to a single SASS instruction. The example distills pytorch/pytorch#187081: a TorchInductor-generated Triton kernel (triton_convolution2d_bwd_weight) reads ~17 GB out of bounds on Blackwell (sm100). It demonstrates a CUTracer workflow that is useful even though CUTracer forces managed memory and so does not hard-crash: mem_addr_trace captures the per-lane out-of-bounds addresses at the faulting load, and reg_trace traces that address back to its source registers, pinning the bug on ptxas (the PTX masks the cp.async load via src-size=0; ptxas lowers it to a predicated LDGSTS that still dereferences the OOB address). DISABLE_PTXAS_OPT=1 on the same PTX yields 0 sanitizer errors. Contents: a standalone torch+triton reproducer plus sample mem_addr_trace / reg_trace / PTX-vs-SASS excerpts and a README. Authored with the assistance of an AI coding agent. Reviewed By: warrendeng Differential Revision: D108450044

meta-codesync · 2026-06-15T17:30:39Z

This pull request has been merged in 59a36a9.

Copilot AI review requested due to automatic review settings June 12, 2026 19:31

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 12, 2026

meta-codesync Bot added the meta-exported label Jun 12, 2026

Copilot started reviewing on behalf of FindHao June 12, 2026 19:32 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

meta-codesync Bot changed the title ~~Add conv_bwd_weight illegal-memory-access debugging example~~ Add conv_bwd_weight illegal-memory-access debugging example (#227) Jun 12, 2026

meta-codesync Bot force-pushed the export-D108450044 branch from a28109f to 3707753 Compare June 12, 2026 20:37

meta-codesync Bot force-pushed the export-D108450044 branch from 3707753 to 9341d9d Compare June 15, 2026 16:16

meta-codesync Bot closed this in 59a36a9 Jun 15, 2026

meta-codesync Bot added the Merged label Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add conv_bwd_weight illegal-memory-access debugging example (#227)#227

Add conv_bwd_weight illegal-memory-access debugging example (#227)#227
FindHao wants to merge 1 commit into
mainfrom
export-D108450044

FindHao commented Jun 12, 2026 •

edited by meta-codesync Bot

Loading

Uh oh!

meta-codesync Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

meta-codesync Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FindHao commented Jun 12, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meta-codesync Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

meta-codesync Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FindHao commented Jun 12, 2026 •

edited by meta-codesync Bot

Loading