gdb.rocm/names.exp: build test program with optimizations by simark · Pull Request #169 · ROCm/ROCgdb

simark · 2026-06-11T20:20:04Z

When running on gfx90a, the test program for gdb.rocm/names.exp hangs when running all kernels:

$ GPU_MAX_HW_QUEUES=32 testsuite/outputs/gdb.rocm/names/names all
*hangs*

When debugging with GDB, I see 4 running waves waiting in function wait_all_kernels, and kernels_seen is 4, meaning that the other kernels didn't start. I noticed that if I compiled the test program at -O2, it ran fine. Claude identified this as a resource issue:

● I extracted the kernel metadata from both binaries to compare. Here's the cause:

  names-0 (-O0), per kernel:
  - vgpr_count: 43, sgpr_count: 42
  - private_segment_fixed_size: 72 (or 88/104)
  - uses_dynamic_stack: true

  names-2 (-O2), per kernel:
  - vgpr_count: 3, sgpr_count: 9
  - private_segment_fixed_size: 0
  - uses_dynamic_stack: false

  At -O2 each kernel uses 3 VGPRs and no scratch, so the GPU happily
  runs all 9 waves concurrently and they unblock each other. At -O0 every
  kernel needs a scratch frame (private_segment_fixed_size > 0,
  uses_dynamic_stack: true) — HIP/ROCm has to allocate a scratch buffer
  per queue, sized for the max possible occupancy. On gfx90a with 9
  streams all demanding scratch (plus the higher VGPR/SGPR footprint), the
  runtime can only get 4 waves resident at a time. The thread state you
  posted confirms exactly that: 4 waves spinning at wait_all_kernels, the
  other 5 never get launched, kernels_seen plateaus at 4, deadlock.

  Why -O0 brought scratch in: with optimization off, the small
  locals in wait_all_kernels plus the call frame for the non-inlined call
  don't fit in registers, so the compiler spills to private memory and the
  kernel becomes a "real" call frame (dynamic stack) instead of a leaf
  with everything in registers.

Build with -O2 so all kernels can be in flight concurrently.

At -O2, the argument values appear to be optimized out, so update the expected frame function string. I don't think that it matters for this test that we are losing the actual argument values. I just hope that they will indeed always be optimized out.
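The hang described above can be sketched with a toy model. It assumes, going only by the description in this thread and not by the test source, that each kernel bumps `kernels_seen` and then spins in `wait_all_kernels` until the counter reaches the number of kernels launched. The function name `simulate_barrier` and the `max_resident` parameter are hypothetical, and the model says nothing about why the hardware would stop at a given number of resident kernels.

```python
def simulate_barrier(num_kernels: int, max_resident: int) -> tuple[bool, int]:
    """Toy model: each kernel increments a counter, then spins until it equals num_kernels.

    A spinning kernel never gives up its slot, so at most max_resident kernels
    ever start. Returns (completed, kernels_seen).
    """
    kernels_seen = 0
    resident = 0
    pending = num_kernels
    # Start kernels while a slot is free; none can exit before the barrier opens.
    while pending > 0 and resident < max_resident:
        pending -= 1
        resident += 1
        kernels_seen += 1
    # The barrier only opens if every kernel managed to start.
    completed = kernels_seen == num_kernels
    return completed, kernels_seen


print(simulate_barrier(9, 4))  # (False, 4): 4 kernels spin, 5 never start
print(simulate_barrier(9, 9))  # (True, 9): all kernels are in flight
```

In this model the program only finishes when every kernel can be resident at once, which matches the observed state of 4 waves spinning with `kernels_seen` stuck at 4.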


amd-bfilipov

Looks good

simark · 2026-06-12T19:27:18Z

@lancesix do you have suggestions for the commit message, so that I don't spread out lies?

lancesix · 2026-06-15T09:23:55Z

> @lancesix do you have suggestions for the commit message, so that I don't spread out lies?

Hi,

First, I am not an expert on workload placement on the GPU, but a few things stand out as approximations at best in what Claude produced:

- "the runtime can only get 4 waves resident at a time":
  - The runtime is not responsible for creating waves; this is eventually handled by the GPU hardware. There is logic on the device to track used and free resources, and I expect that block is the one deciding it can't schedule more work, not the runtime (but yes, the runtime is involved in scratch management, so there might be something there).
  - I also expect that "waves" is not the right unit here. I am quite confident that on your device you could schedule more than 4 waves at a time, each using the maximum *GPR or scratch allocation. If you run "bit_extract.exp", how many waves do you see concurrently? At the very least, starting new work has to be done at workgroup granularity (all waves of a workgroup are guaranteed to exist at the same time). On an architecture with wave64, where I expect there can be up to 1024 work-items per workgroup, you must be able to run at least 16 waves concurrently, regardless of the *GPR or scratch requirements. There might be (obviously are) some subtleties with concurrent dispatches, but waves is not the right unit here.
- Just a nit, but I think on this architecture VGPRs are allocated in blocks of 4, so seeing 3 or 43 VGPRs seems like an off-by-1 count issue. Similarly, I thought the SGPRs were allocated in blocks of 16 (with at least 1 block), so seeing 0 and 42 seems off. But it might be the difference between the descriptor saying "I need at least X SGPRs" and what is actually allocated on HW.

Anyway, all of that being said, I think the test is still valid even if built at -O2, so the change seems good, but some aspects of the AI output seem off (even if it is onto something).

Something else: this test is not really new, what was added recently is running the program outside of the debugger. Did you start to see failures recently?

When running on gfx90a, the test program for gdb.rocm/names.exp hangs when running all kernels:

$ GPU_MAX_HW_QUEUES=32 testsuite/outputs/gdb.rocm/names/names all
*hangs*

When debugging with GDB, I see 4 running waves waiting in function wait_all_kernels, and kernels_seen is 4, meaning that the other kernels didn't start. I noticed that if I compiled the test program at -O2, it ran fine. Claude identified this as a resource issue:

● I extracted the kernel metadata from both binaries to compare. Here's the cause:

  names-0 (-O0), per kernel:
  - vgpr_count: 43, sgpr_count: 42
  - private_segment_fixed_size: 72 (or 88/104)
  - uses_dynamic_stack: true

  names-2 (-O2), per kernel:
  - vgpr_count: 3, sgpr_count: 9
  - private_segment_fixed_size: 0
  - uses_dynamic_stack: false

Presumably, at -O0, each wave (or queue?) consumes more resources and we can't get all 9 on the hardware at the same time. But I am not knowledgeable enough to say if that's correct.

Fix it by building with -O2 so all kernels can be in flight concurrently.

At -O2, the argument values appear to be optimized out, so update the expected frame function string. I don't think that it matters for this test that we are losing the actual argument values. I just hope that they will indeed always be optimized out so that we won't get FAILs because of that (worst case we can expect .* for the argument values).

Change-Id: Ida07faf8b07bf9e0b21ce16bb6af7b019f01588b

simark · 2026-06-15T15:57:08Z

> Hi,
>
> First, I am not an expert on workload placement on the GPU, but a few things stand out as approximations at best in what Claude produced:
>
> * "the runtime can only get 4 waves resident at a time":
>   * the runtime is not responsible for creating waves, this is eventually handled by the GPU hardware. There is logic on the device to track used and free resources, I expect that block is the one deciding it can't schedule more work, not the runtime (but yes, the runtime is involved in scratch management, there might be something there).

Yeah, the important point is that (presumably) each wave or queue or whatever consumes too many resources, and we can't get more than 4 on the hardware (but yeah, that seems awfully low).

>   * I also expect that "waves" is not the right unit here either. I am quite confident that on your device you could schedule more than 4 waves at a time, each using the maximum *GPR or scratch allocation. If you run "bit_extract.exp", how many waves do you see concurrently?

I see 2048, but I think that number is bounded by what the test case needs:

  const unsigned blocks = 512;
  const unsigned threadsPerBlock = 256;

I think that requires 4 waves per block, times 512 -> 2048.
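As a quick check of that arithmetic, here is a minimal sketch. It assumes a wave size of 64 work-items (the "wave64" mentioned earlier in the thread); the helper names are hypothetical and not part of any ROCm API.

```python
def waves_per_workgroup(work_items: int, wave_size: int = 64) -> int:
    """Waves needed to cover a workgroup, rounding up for a partial wave."""
    return (work_items + wave_size - 1) // wave_size


def total_waves(blocks: int, threads_per_block: int, wave_size: int = 64) -> int:
    """Total waves for a dispatch of `blocks` workgroups."""
    return blocks * waves_per_workgroup(threads_per_block, wave_size)


print(waves_per_workgroup(256))   # 4 waves per block
print(total_waves(512, 256))      # 2048 waves for the bit_extract dispatch
print(waves_per_workgroup(1024))  # 16 waves for a 1024 work-item workgroup
```

The last line reproduces the "at least 16 waves" figure for a 1024 work-item workgroup under the same wave64 assumption.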

> At the very least, starting new work has to be done at workgroup granularity (all waves of a workgroup are guaranteed to exist at the same time). On an architecture with wave64, where I expect there can be up to 1024 work-items per workgroup, you must be able to run at least 16 waves concurrently, regardless of the *GPR or scratch requirements. There might be (obviously are) some subtleties with concurrent dispatches, but waves is not the right unit here.

Makes sense. The requirements for the bit_extract kernel are not much different than for "names" at -O0:

    .private_segment_fixed_size: 224
    .sgpr_count:     42
    .uses_dynamic_stack: true
    .vgpr_count:     41

So yeah, it's perhaps more related to the fact that we try to submit work through 9 queues simultaneously... I really don't know.

> * Just a nit, but I think on this architecture VGPRs are allocated in blocks of 4, so seeing 3 or 43 VGPRs seems like an off-by-1 count issue. Similarly, I thought the SGPRs were allocated in blocks of 16 (with at least 1 block), so seeing 0 and 42 seems off. But it might be the difference between the descriptor saying "I need at least X SGPRs" and what is actually allocated on HW.

Yeah, this is from the NT_AMDGPU_METADATA note in the ELF.
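To illustrate the gap between the counts in the metadata and what the hardware might actually reserve, here is a minimal sketch. The granules (4 for VGPRs, 16 for SGPRs, at least one block) are the values recalled in the quoted comment and are not verified for gfx90a; `allocated_registers` is a hypothetical helper.

```python
def allocated_registers(requested: int, granule: int) -> int:
    """Round a requested register count up to whole blocks, with at least one block."""
    blocks = max(1, -(-requested // granule))  # ceiling division
    return blocks * granule


# Counts taken from the metadata quoted in the PR description.
for name, vgprs, sgprs in (("-O0", 43, 42), ("-O2", 3, 9)):
    print(name, allocated_registers(vgprs, 4), allocated_registers(sgprs, 16))
```

Under those assumed granules, the -O0 kernels would occupy 44 VGPRs and 48 SGPRs, and the -O2 kernels 4 VGPRs and 16 SGPRs.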

> Anyway, all of that being said, I think the test is still valid even if built at -O2, so the change seems good, but some aspects of the AI output seem off (even if it is onto something).

Rather than spread AI lies, I changed the commit message to avoid sounding authoritative and just say approximately: "it sounds like a resource limitation problem but I am not sure". I force-pushed with that change.

Do you think it's worth digging into why on this particular card we can only get 4 such waves concurrently, like could it reveal some bug that could be addressed?

> Something else: this test is not really new, what was added recently is running the program outside of the debugger. Did you start to see failures recently?

It's very possible that it failed before and I didn't have the mental headroom to start looking into it.

palves · 2026-06-15T16:53:04Z

> Do you think it's worth digging into why on this particular card we can only get 4 such waves concurrently, like could it reveal some bug that could be addressed?

I absolutely do. Please don't merge this, we should understand this better. This very same testcase has exposed firmware problems before on other gfx cards (since fixed).

4 is the default for GPU_MAX_HW_QUEUES, so your "get 4 such waves concurrently" sounds very suspect.

This has also exposed similar problems on Windows. The last time I investigated the problems there, I saw something similar to what you're seeing, with optimization making the testcase go further along (though not fully fix). So the test hangs on Windows on gfx1201, and passes cleanly on Linux on the same gfx1201, strongly suggesting that this is a software problem somewhere in the stack.

From all the experience with problems this testcase exposed so far, this seems like one more such case.

The testcase itself is about kernel names, yes, so this GPU_MAX_HW_QUEUES thing ends up tangential, but then again, I think there is good value in that part of the test, and if we moved the launch-more-than-4-kernels-in-parallel part to a separate testcase, we'd see the same problems there. So I think "not important for the test" shouldn't carry weight here. There's probably a real problem somewhere that should be investigated.

simark · 2026-06-15T18:07:12Z

Moved back to draft so we don't consider merging this yet. I opened an internal ticket to ask if the behavior I see with -O0 is expected.

simark requested a review from a team as a code owner June 11, 2026 20:20

simark requested a review from palves June 11, 2026 20:20

amd-bfilipov approved these changes Jun 12, 2026

simark force-pushed the build-names-with-opt branch from b48c2f8 to 8980c20 June 15, 2026 15:32

simark marked this pull request as draft June 15, 2026 18:06

simark wants to merge 1 commit into ROCm:amd-staging from simark:build-names-with-opt

4 participants
