gdb.rocm/names.exp: build test program with optimizations #169

Draft
simark wants to merge 1 commit into ROCm:amd-staging from simark:build-names-with-opt

simark commented Jun 11, 2026

Contributor

When running on gfx90a, the test program for gdb.rocm/names.exp hangs when running all kernels:

$ GPU_MAX_HW_QUEUES=32 testsuite/outputs/gdb.rocm/names/names all
*hangs*

When debugging with GDB, I see 4 running waves waiting in function wait_all_kernels, and kernels_seen is 4, meaning that the other kernels didn't start. I noticed that if I compiled the test program at -O2, it ran fine. Claude identified this as a resource issue:

● I extracted the kernel metadata from both binaries to compare. Here's the cause:

  names-0 (-O0), per kernel:
  - vgpr_count: 43, sgpr_count: 42
  - private_segment_fixed_size: 72 (or 88/104)
  - uses_dynamic_stack: true

  names-2 (-O2), per kernel:
  - vgpr_count: 3, sgpr_count: 9
  - private_segment_fixed_size: 0
  - uses_dynamic_stack: false

  At -O2 each kernel uses 3 VGPRs and no scratch, so the GPU happily
  runs all 9 waves concurrently and they unblock each other. At -O0 every
  kernel needs a scratch frame (private_segment_fixed_size > 0,
  uses_dynamic_stack: true) — HIP/ROCm has to allocate a scratch buffer
  per queue, sized for the max possible occupancy. On gfx90a with 9
  streams all demanding scratch (plus the higher VGPR/SGPR footprint), the
  runtime can only get 4 waves resident at a time. The thread state you
  posted confirms exactly that: 4 waves spinning at wait_all_kernels, the
  other 5 never get launched, kernels_seen plateaus at 4, deadlock.

  Why -O0 brought scratch in: with optimization off, the small
  locals in wait_all_kernels plus the call frame for the non-inlined call
  don't fit in registers, so the compiler spills to private memory and the
  kernel becomes a "real" call frame (dynamic stack) instead of a leaf
  with everything in registers.

Build with -O2 so all kernels can be in flight concurrently.

At -O2, the argument values appear to be optimized out, so update the expected frame function string. I don't think that it matters for this test that we are losing the actual argument values. I just hope that they will indeed always be optimized out.

simark requested a review from a team as a code owner June 11, 2026 20:20
simark requested a review from palves June 11, 2026 20:20

amd-bfilipov left a comment

Contributor

Looks good

simark commented Jun 12, 2026

Contributor Author

@lancesix do you have suggestions for the commit message, so that I don't spread out lies?

lancesix commented

Collaborator

> @lancesix do you have suggestions for the commit message, so that I don't spread out lies?

Hi,

First, I am not an expert on workload placement on the GPU, but a few things stand out as approximations at best in what Claude produced:

  • "the runtime can only get 4 waves resident at a time":
    • the runtime is not responsible for creating waves; this is eventually handled by the GPU hardware. There is logic on the device to track used and free resources, and I expect that block is the one deciding it can't schedule more work, not the runtime (but yes, the runtime is involved in scratch management, so there might be something there).
    • I also expect that "waves" is not the right unit here either. I am quite confident that on your device you could schedule more than 4 waves at a time, each using the maximum *GPR or scratch allocation. If you run "bit_extract.exp", how many waves do you see concurrently? At the very least, starting new work has to be done at workgroup granularity (all waves of a workgroup are guaranteed to exist at the same time). On an architecture with wave64, where I expect there can be up to 1024 work-items per workgroup, you must be able to run at least 16 waves concurrently, regardless of the *GPR or scratch requirements. There might be (obviously are) some subtleties with concurrent dispatches, but waves is not the right unit here.
  • Just a nit, but I think on this architecture VGPRs are allocated in blocks of 4, so seeing 3 or 43 VGPRs seems like an off-by-1 count issue. Similarly, I thought the SGPRs were allocated in blocks of 16 (with at least 1 block), so seeing 0 and 42 seems off. But it might be the difference between the descriptor saying "I need at least X SGPRs" and what is actually allocated on HW.
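
The arithmetic behind the last two points can be sketched as follows. This is only an illustration: the wave size of 64, the 1024 work-items per workgroup, and the granules of 4 VGPRs and 16 SGPRs are the figures recalled in this comment, not values checked against hardware documentation, and the helper names are hypothetical.

```python
def waves_per_workgroup(work_items, wave_size=64):
    """Number of waves needed to cover one workgroup (ceiling division)."""
    return -(-work_items // wave_size)


def round_up_to_granule(count, granule):
    """Round a register count up to a whole number of allocation blocks,
    with a minimum of one block."""
    blocks = max(1, -(-count // granule))
    return blocks * granule


# Assumed granules, as recalled in the comment above.
VGPR_GRANULE = 4
SGPR_GRANULE = 16

print(waves_per_workgroup(1024))               # 16
print(round_up_to_granule(43, VGPR_GRANULE))   # 44
print(round_up_to_granule(3, VGPR_GRANULE))    # 4
print(round_up_to_granule(42, SGPR_GRANULE))   # 48
print(round_up_to_granule(9, SGPR_GRANULE))    # 16
```

Under those assumptions, the metadata counts (43/3 VGPRs, 42/9 SGPRs) would be requests that the hardware rounds up, which is consistent with the "descriptor vs. actual allocation" reading.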

Anyway, all of that being said, I think the test is still valid even if built at -O2, so the change seems good, but some aspects of the AI output seem off (even if it is onto something).

Something else: this test is not really new, what was added recently is running the program outside of the debugger. Did you start to see failures recently?

When running on gfx90a, the test program for gdb.rocm/names.exp hangs
when running all kernels:

    $ GPU_MAX_HW_QUEUES=32 testsuite/outputs/gdb.rocm/names/names all
    *hangs*

When debugging with GDB, I see 4 running waves waiting in
function wait_all_kernels, and kernels_seen is 4, meaning that the other
kernels didn't start.  I noticed that if I compiled the test program at
-O2, it ran fine.  Claude identified this as a resource issue:

    ● I extracted the kernel metadata from both binaries to compare. Here's the cause:

      names-0 (-O0), per kernel:
      - vgpr_count: 43, sgpr_count: 42
      - private_segment_fixed_size: 72 (or 88/104)
      - uses_dynamic_stack: true

      names-2 (-O2), per kernel:
      - vgpr_count: 3, sgpr_count: 9
      - private_segment_fixed_size: 0
      - uses_dynamic_stack: false

Presumably, at -O0, each wave (or queue?) consumes more resources and we
can't get all 9 on the hardware at the same time.  But I am not
knowledgeable enough to say if that's correct.

Fix it by building with -O2 so all kernels can be in flight
concurrently.

At -O2, the argument values appear to be optimized out, so update the
expected frame function string.  I don't think that it matters for this
test that we are losing the actual argument values.  I just hope that
they will indeed always be optimized out so that we won't get FAILs
because of that (worst case we can expect .* for the argument values).

Change-Id: Ida07faf8b07bf9e0b21ce16bb6af7b019f01588b

simark force-pushed the build-names-with-opt branch from b48c2f8 to 8980c20 June 15, 2026 15:32

simark commented Jun 15, 2026

Contributor Author

> Hi,
>
> First, I am not an expert on workload placement on the GPU, but a few things stand out as approximations at best in what Claude produced:
>
> * "the runtime can only get 4 waves resident at a time":
>
>   * the runtime is not responsible for creating waves; this is eventually handled by the GPU hardware. There is logic on the device to track used and free resources, and I expect that block is the one deciding it can't schedule more work, not the runtime (but yes, the runtime is involved in scratch management, so there might be something there).

Yeah, the important point is that (presumably) each wave or queue or whatever consumes too many resources, and we can't get more than 4 on the hardware (but yeah, that seems awfully low).

>   * I also expect that "waves" is not the right unit here either. I am quite confident that on your device you could schedule more than 4 waves at a time, each using the maximum *GPR or scratch allocation. If you run "bit_extract.exp", how many waves do you see concurrently?

I see 2048, but I think that number is bounded by what the test case needs:

  const unsigned blocks = 512;
  const unsigned threadsPerBlock = 256;

I think that requires 4 waves per block, times 512 -> 2048.
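
As a quick check of that estimate, here is the same calculation spelled out. The wave size of 64 is an assumption taken from the wave64 remark elsewhere in this thread; the other two values are the constants quoted above.

```python
blocks = 512              # from the test case
threads_per_block = 256   # from the test case
wave_size = 64            # assumption: wave64

# Ceiling division, in case the block size is not a multiple of the wave size.
waves_per_block = -(-threads_per_block // wave_size)
total_waves = waves_per_block * blocks

print(waves_per_block, total_waves)  # 4 2048
```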

> At the very least, starting new work has to be done at workgroup granularity (all waves of a workgroup are guaranteed to exist at the same time). On an architecture with wave64, where I expect there can be up to 1024 work-items per workgroup, you must be able to run at least 16 waves concurrently, regardless of the *GPR or scratch requirements. There might be (obviously is) some subtleties with concurrent dispatches, but waves is not the right unit here.

Makes sense. The requirements for the bit_extract kernel are not much different than for "names" at -O0:

    .private_segment_fixed_size: 224
    .sgpr_count:     42
    .uses_dynamic_stack: true
    .vgpr_count:     41

So yeah, it's perhaps more related to the fact that we try to submit work through 9 queues simultaneously... I really don't know.
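
To put the three sets of figures quoted in this thread side by side, a small comparison sketch follows. The dictionaries only restate numbers posted above (for "names" at -O0, 72 is used for the private segment size although 88 and 104 were also reported); the `diff` helper is hypothetical.

```python
names_o0 = {
    "private_segment_fixed_size": 72,  # 88 and 104 were also reported
    "sgpr_count": 42,
    "vgpr_count": 43,
    "uses_dynamic_stack": True,
}
names_o2 = {
    "private_segment_fixed_size": 0,
    "sgpr_count": 9,
    "vgpr_count": 3,
    "uses_dynamic_stack": False,
}
bit_extract = {
    "private_segment_fixed_size": 224,
    "sgpr_count": 42,
    "vgpr_count": 41,
    "uses_dynamic_stack": True,
}


def diff(a, b):
    """Return the fields whose values differ, as {field: (a_value, b_value)}."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


print(diff(names_o0, bit_extract))
# {'private_segment_fixed_size': (72, 224), 'vgpr_count': (43, 41)}
print(len(diff(names_o0, names_o2)))  # 4
```

The bit_extract kernel asks for more scratch than "names" at -O0 and nearly the same registers, yet 2048 of its waves were seen concurrently, so per-kernel footprint alone does not seem to account for a limit of 4.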

> * Just a nit, but I think on this architecture VGPRs are allocated in blocks of 4, so seeing 3 or 43 VGPRs seems like an off-by-1 count issue. Similarly, I thought the SGPRs were allocated in blocks of 16 (with at least 1 block), so seeing 0 and 42 seems off. But it might be the difference between the descriptor saying "I need at least X SGPRs" and what is actually allocated on HW.

Yeah, this is from the NT_AMDGPU_METADATA note in the ELF.

> Anyway, all of that being said, I think the test is still valid even if built at -O2, so the change seems good, but some aspects of the AI output seem off (even if it is onto something).

Rather than spread AI lies, I changed the commit message to avoid sounding authoritative and just say approximately: "it sounds like a resource limitation problem but I am not sure". I force-pushed with that change.

Do you think it's worth digging into why on this particular card we can only get 4 such waves concurrently, like could it reveal some bug that could be addressed?

> Something else: this test is not really new, what was added recently is running the program outside of the debugger. Did you start to see failures recently?

It's very possible that it failed before and I didn't have the mental headroom to start looking into it.

palves commented Jun 15, 2026

Collaborator

> Do you think it's worth digging into why on this particular card we can only get 4 such waves concurrently, like could it reveal some bug that could be addressed?

I absolutely do. Please don't merge this, we should understand this better. This very same testcase has exposed firmware problems before on other gfx cards (since fixed).

4 is the default for GPU_MAX_HW_QUEUES, so your "get 4 such waves concurrently" sounds very suspect.
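
For readers unfamiliar with the test, the hang pattern reported in this thread can be reproduced with a toy model. It assumes, based on the description above, that each kernel increments kernels_seen when it starts and then spins in wait_all_kernels until every kernel has been seen, keeping its slot while it spins. The function `simulate` and the parameter `resident_slots` are hypothetical, and the model says nothing about which layer of the stack imposes the limit.

```python
def simulate(total_kernels, resident_slots):
    """Toy model of the test program, not the real thing.

    Returns (kernels_seen, outcome), where outcome is "completed" or "deadlock".
    """
    kernels_seen = 0
    occupied = 0
    not_started = total_kernels

    # Start kernels while a slot is free. A started kernel spins and keeps
    # its slot until kernels_seen reaches total_kernels.
    while not_started > 0 and occupied < resident_slots:
        not_started -= 1
        occupied += 1
        kernels_seen += 1

    if kernels_seen == total_kernels:
        return kernels_seen, "completed"
    # Every slot is held by a spinning kernel, so nothing else can start.
    return kernels_seen, "deadlock"


print(simulate(9, 4))    # (4, 'deadlock')
print(simulate(9, 32))   # (9, 'completed')
```

With 9 kernels and only 4 able to run at once, the model stalls at kernels_seen == 4, which matches the observed state. The run above was made with GPU_MAX_HW_QUEUES=32, so the open question is why the effective limit still appears to be 4.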

This has also exposed similar problems on Windows. The last time I investigated the problems there, I saw something similar to what you're seeing, with optimization making the testcase go further along (though not fully fix). So the test hangs on Windows on gfx1201, and passes cleanly on Linux on the same gfx1201, strongly suggesting that this is a software problem somewhere in the stack.

From all the experience with problems this testcase exposed so far, this seems like one more such case.

The testcase itself is about kernel names, yes, so this GPU_MAX_HW_QUEUES thing ends up tangential, but then again, I think there is good value in that part of the test, and if we moved the launch-more-than-4-kernels-in-parallel part to a separate testcase, we'd see the same problems there. So I think "not important for the test" shouldn't carry weight here. There's probably a real problem somewhere that should be investigated.

simark marked this pull request as draft June 15, 2026 18:06

simark commented Jun 15, 2026

Contributor Author

Moved back to draft so we don't consider merging this yet. I opened an internal ticket to ask if the behavior I see with -O0 is expected.
