gdb.rocm/names.exp: build test program with optimizations #169
@lancesix do you have suggestions for the commit message, so that I don't spread lies?
Hi. First, I am not an expert on workload placement on the GPU, but a few things stand out as approximations at best in what Claude produced:
Anyway, all of that being said, I think the test is still valid even if built as
Something else: this test is not really new; what was added recently is running the program outside of the debugger. Did you start to see failures recently?
When running on gfx90a, the test program for gdb.rocm/names.exp hangs
when running all kernels:
$ GPU_MAX_HW_QUEUES=32 testsuite/outputs/gdb.rocm/names/names all
*hangs*
When debugging with GDB, I see 4 running waves waiting in
function wait_all_kernels, and kernels_seen is 4, meaning that the other
kernels didn't start. I noticed that if I compiled the test program at
-O2, it ran fine. Claude identified this as a resource issue:
● I extracted the kernel metadata from both binaries to compare. Here's the cause:
names-0 (-O0), per kernel:
- vgpr_count: 43, sgpr_count: 42
- private_segment_fixed_size: 72 (or 88/104)
- uses_dynamic_stack: true
names-2 (-O2), per kernel:
- vgpr_count: 3, sgpr_count: 9
- private_segment_fixed_size: 0
- uses_dynamic_stack: false
Presumably, at -O0, each wave (or queue?) consumes more resources and we
can't get all 9 on the hardware at the same time. But I am not
knowledgeable enough to say if that's correct.
Fix it by building with -O2 so all kernels can be in flight
concurrently.
At -O2, the argument values appear to be optimized out, so update the
expected frame function string. I don't think that it matters for this
test that we are losing the actual argument values. I just hope that
they will indeed always be optimized out so that we won't get FAILs
because of that (worst case we can expect .* for the argument values).
Change-Id: Ida07faf8b07bf9e0b21ce16bb6af7b019f01588b
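The fallback mentioned in the commit message (expecting `.*` for the argument values) can be illustrated with a small Python sketch. The frame strings, the kernel name `some_kernel` and the argument names are hypothetical placeholders, not taken from names.exp; the only detail assumed from GDB itself is the `<optimized out>` marker it prints for values it cannot recover.

```python
import re

# Hypothetical frame strings; kernel and argument names are placeholders.
FRAME_O0 = "some_kernel (out=0x7f0000001000, n=9)"
FRAME_O2 = "some_kernel (out=<optimized out>, n=<optimized out>)"


def strict_pattern(func, arg_names):
    """Match a frame only when every argument is shown as <optimized out>."""
    args = ", ".join(f"{re.escape(a)}=<optimized out>" for a in arg_names)
    return re.compile(rf"{re.escape(func)} \({args}\)")


def tolerant_pattern(func, arg_names):
    """Match a frame whatever the argument values are (the '.*' fallback)."""
    args = ", ".join(f"{re.escape(a)}=.*" for a in arg_names)
    return re.compile(rf"{re.escape(func)} \({args}\)")


if __name__ == "__main__":
    strict = strict_pattern("some_kernel", ["out", "n"])
    tolerant = tolerant_pattern("some_kernel", ["out", "n"])
    for frame in (FRAME_O0, FRAME_O2):
        print(frame, bool(strict.search(frame)), bool(tolerant.search(frame)))
```

The strict form would start failing if a future compiler kept an argument live at -O2, while the tolerant form accepts both outputs at the cost of no longer checking the values.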
Force-pushed from b48c2f8 to 8980c20.
Yeah, the important point is that (presumably) each wave or queue or whatever consumes too many resources, and we can't get more than 4 on the hardware (but yeah, that seems awfully low).
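To make the difference between the two builds easier to see, here is a minimal Python sketch that puts the per-kernel numbers quoted in the commit message side by side. The dictionary layout and the `metadata_diff` helper are illustrative assumptions; the values are the ones quoted above, with 72 used for the -O0 private segment size (88 and 104 were also reported).

```python
# Per-kernel values as quoted in the commit message.
NAMES_O0 = {
    "vgpr_count": 43,
    "sgpr_count": 42,
    "private_segment_fixed_size": 72,  # 88 and 104 were also reported
    "uses_dynamic_stack": True,
}
NAMES_O2 = {
    "vgpr_count": 3,
    "sgpr_count": 9,
    "private_segment_fixed_size": 0,
    "uses_dynamic_stack": False,
}


def metadata_diff(before, after):
    """Return {field: (before, after)} for every field whose value changed."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


if __name__ == "__main__":
    for field, (o0, o2) in metadata_diff(NAMES_O0, NAMES_O2).items():
        print(f"{field}: -O0={o0} -O2={o2}")
```

This only lists what changed between the builds; it does not show whether any of these fields is what limits the number of kernels in flight.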
I see 2048, but I think that number is bounded by what the test case needs: I think that requires 4 waves per block, times 512 -> 2048.
Makes sense. The requirements for the bit_extract kernel are not much different than for "names" at -O0: So yeah, it's perhaps more related to the fact that we try to submit work through 9 queues simultaneously... I really don't know.
Yeah, this is from the
Rather than spread AI lies, I changed the commit message to avoid sounding authoritative and just say approximately: "it sounds like a resource limitation problem but I am not sure". I force-pushed with that change. Do you think it's worth digging into why, on this particular card, we can only get 4 such waves concurrently? Could it reveal some bug that could be addressed?
It's very possible that it failed before and I didn't have the mental headroom to start looking into it.
I absolutely do. Please don't merge this; we should understand this better. This very same testcase has exposed firmware problems before on other gfx cards (since fixed). 4 is the default for GPU_MAX_HW_QUEUES, so your "get 4 such waves concurrently" sounds very suspect. This has also exposed similar problems on Windows. The last time I investigated the problems there, I saw something similar to what you're seeing, with optimization making the testcase go further along (though not fully fixing it). So the test hangs on Windows on gfx1201, and passes cleanly on Linux on the same gfx1201, strongly suggesting that this is a software problem somewhere in the stack. From all the experience with problems this testcase has exposed so far, this seems like one more such case. The testcase itself is about kernel names, yes, so this GPU_MAX_HW_QUEUES thing ends up tangential, but then again, I think there is good value in that part of the test, and if we moved the launch-more-than-4-kernels-in-parallel part to a separate testcase, we'd see the same problems there. So I think "not important for the test" shouldn't carry weight here. There's probably a real problem somewhere that should be investigated.
Moved back to draft so we don't consider merging this yet. I opened an internal ticket to ask if the behavior I see with -O0 is expected.
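For anyone trying to reproduce the hang outside the debugger, a minimal Python sketch of a run with a time limit follows. The binary path and the GPU_MAX_HW_QUEUES=32 setting are the ones from the commit message; the `run_with_timeout` helper and the 60-second limit are assumptions made for illustration.

```python
import os
import subprocess


def run_with_timeout(cmd, env_overrides, timeout_s):
    """Run cmd with extra environment variables.

    Returns the exit code, or None if the program was still running after
    timeout_s seconds (subprocess.run kills the child in that case).
    """
    env = {**os.environ, **env_overrides}
    try:
        return subprocess.run(cmd, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Command and variable taken from the commit message; the 60 s limit
    # is an arbitrary assumption.
    binary = "testsuite/outputs/gdb.rocm/names/names"
    rc = run_with_timeout([binary, "all"], {"GPU_MAX_HW_QUEUES": "32"}, 60)
    print("hang (timed out)" if rc is None else f"exited with {rc}")
```

Running it once against the -O0 build and once against the -O2 build would give a quick pass/hang comparison without GDB attached.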