Skip to content

gdb/infrun.c: fix handle_step_into_function lane divergence#181

Open
lancesix wants to merge 1 commit into
amd-stagingfrom
users/lancesix/fix-lane-divergence-stepping
Open

gdb/infrun.c: fix handle_step_into_function lane divergence#181
lancesix wants to merge 1 commit into
amd-stagingfrom
users/lancesix/fix-lane-divergence-stepping

Conversation

@lancesix

Copy link
Copy Markdown
Collaborator

Pending compiler changes (current tip of amd-staging of the compiler[1]) shows failures in gdb.rocm/lane-execution.exp:

Running .../gdb/testsuite/gdb.rocm/lane-execution.exp ...
FAIL: gdb.rocm/lane-execution.exp: step: else: step to " if_3_cond " (timeout)
FAIL: gdb.rocm/lane-execution.exp: step: else: step to " if_3_else " (timeout)
FAIL: gdb.rocm/lane-execution.exp: step: else: step to "atomicAdd " (timeout)

In this test, we have:

101 if (gid % 2) /* if_2_cond /
102 elem = foo (const_array[gid]); /
if_2_then /
103 else
104 elem = const_array[gid]; /
if_2_else /
105 /
if_2_end /
106
107 /
This condition is always false. /
108 if (gid == -1) /
if_3_cond */

The program stops at L104 (active lane is 2), set scheduler-locking on and issues a "step" command. Because the compiler emits the "else" branch before the "then" branch, the "step" command will single step through L104, then will change the active lanes and single step through L102. The expectation is that we single step through L102 until we reach L108, which will be the first line with lane 2 active we will execute next.

However, something goes wrong when stepping L102, and the wave will run to completion. Because we have scheduler-locking on, once the wave has terminated, no further events will reach GDB, hanging it, causing the timeout.

When stepping with lane divergence support enabled, GDB tries to single step until the current lane becomes active again and the PC of the wave reaches a new line. If we follow this process, here is what happens:

First GDB steps through L104 as expected, and does not change as it is still on the starting line. We then go to the code changing the EXEC mask (mapped to L101, but not marked as statement so skipped), and end-up on L102. When executing L102, lane 2 is inactive, so we continue single stepping instructions, which includes calling the function foo.

In the function foo, we have:

Dump of assembler code for function _Z3fooi:
.../gdb/testsuite/gdb.rocm/lane-execution.cpp:
68	{
   0x00007ffff5f29dc0 <+0>:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
   0x00007ffff5f29dc4 <+4>:    s_mov_b32 s25, s33
   0x00007ffff5f29dc8 <+8>:    s_mov_b32 s33, s32
   0x00007ffff5f29dcc <+12>:   s_xor_saveexec_b32 s16, -1
   0x00007ffff5f29dd0 <+16>:   buffer_store_dword v3, off, s[0:3], s33 offset:4
   0x00007ffff5f29dd8 <+24>:   s_mov_b32 exec_lo, s16
   [...]

When single stepping the first 5 instructions, because the current lane is inactive, we just ignore the stop and continue single stepping directly. However, the 5th instruction (at offset foo+12) activates all the lanes of the wave for spilling using a full VGPR, so lane 2 is active when executing the buffer_store_dword instruction. As a consequence, GDB cannot dismiss this instruction as the others, and needs to consider it fully.

At this point, in process_event_stop_test, GDB will realize that the current frame changed compared to when we started the step command. The process from this point is quite simple, we just stepped inside a new function, we are almost done. Last thing to do is to skip the function prologue, and continue execution. This is done by handle_step_into_function: we set a breakpoint after the prologue (at foo+104: 0x00007ffff7fa9f28), resume execution and hope to be done.

We resume execution, which executes the buffer_store, then restores the EXEC mask (disabling lane 2) and execute until the end of the prologue. Because lane 2 is now inactive, we have to continue stepping, so we execute all of the foo function. This includes a call to the bar function which never touches the EXEC mask, so nothing important happening there.

After we return from bar, we need to execute the epilogue of function foo which looks like:

   0x00007ffff5f29e80 <+192>:  s_xor_saveexec_b32 s4, -1
   0x00007ffff5f29e84 <+196>:  buffer_load_dword v3, off, s[0:3], s33 offset:4
   0x00007ffff5f29e8c <+204>:  s_mov_b32 exec_lo, s4
   0x00007ffff5f29e90 <+208>:  s_mov_b32 s33, s25
   0x00007ffff5f29e94 <+212>:  s_waitcnt vmcnt(0)
   0x00007ffff5f29e98 <+216>:  s_setpc_b64 s[30:31]
End of assembler dump.

Between offset +192 and +204, we need to restore the state that was spilled during the prologue, so we do the same thing: activate all the lanes, do the load and restore lanes. When stopping on the buffer_load instruction, GDB sees that lane 2 is active again, and gets back to "normal" processing. We see that we are on a frame different from the one we had when starting the step, so are about to stop and try to skip the prologue. However, this goes quite wrong, as shown by the infrun logs:

[infrun] handle_signal_stop: stop_pc=0x7ffff5f29e84
[infrun] process_event_stop_test: stepped into subroutine
[infrun] insert_step_resume_breakpoint_at_sal_1: inserting step-resume breakpoint at 0x7ffff5f29e34
[infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [108550.1.1] at 0x7ffff5f29e84

We insert the breakpoint at the end of the function prologue and resume execution. But the PC is currently in the epilogue, not in the prologue! Because of this, we resume execution without single stepping, with a breakpoint inserted at an instruction we will not execute, which allows the wave to run to completion.

I noted earlier that those failures are regressions with a new compiler, so why didn't we see this issue before? With the previous version of the compiler, the bar function (called by foo) also had some spill / restore logic, changing slightly the sequence. Because bar was also spilling, it has an instruction where lane 2 was active, causing GDB to consider the stop without the lane divergence bypass.

When evaluating if we frame changed, we executed this condition:

if ((get_stack_frame_id (frame)
     != ecs->event_thread->control.step_stack_frame_id)
    && get_frame_type (frame) != SIGTRAMP_FRAME
    && ((frame_unwind_caller_id (frame)
         == ecs->event_thread->control.step_stack_frame_id)
        && ((ecs->event_thread->control.step_stack_frame_id
             != outer_frame_id)
            || !ecs->event_thread->control.in_step_start_function (frame))))

Here, the first two predicates hold (current frame is not the same as when we started stepping, and it is not a SIGTRAMP frame), but things go south when checking

frame_unwind_caller_id (frame) == ecs->event_thread->control.step_stack_frame_id)

This one is not true. Here frame is bar, so the caller is foo, and the step_stack_frame we compare it to is lane_pc_test. As a consequence, this check fails, which gets us to fallback to another case: stepping got us out of the original step_range we set, but the current instruction is not a statement either. So we reset the stepping range information (set_step_info) and continue single stepping through the current range. This gets us all the way through bar (let's ignore the spill-restore bit), then in foo where we also do the spill-restore. Because the same check is done, and the current frame (foo) is not called by where we started (bar), we continue stepping, and eventually finish the operation normally.

So now the question is: how do we fix this? I see 2 approaches for this. One is to reset the "step range" when we see lane 2 active in foo the first time. This would avoid us the incorrect detection of a new frame when doing the spill-restore. However, there is something even more broken exposed with this: why do we allow resuming the wave to skip the function prologue if we are not in the function prologue?

Because the resume seems universally wrong, this patch proposes to fix this first. When doing the resume, check that the thread's PC is not past the end of the prologue before we use
insert_step_resume_breakpoint_at_sal_1. With this simple adjustment, the gdb.rocm/lane-execution.exp testcase passes correctly.

I will explore reseting the stepping range in followup work.

Tested on amdgpu GFX1031.

[1] ROCm/llvm-project@a6fa68a

Change-Id: I7f57c51a430e9a87bddc4877ebddf5deec8a0de8
Bug: LCOMPILER-2337

Pending compiler changes (current tip of `amd-staging` of the
compiler[1]) shows failures in gdb.rocm/lane-execution.exp:

    Running .../gdb/testsuite/gdb.rocm/lane-execution.exp ...
    FAIL: gdb.rocm/lane-execution.exp: step: else: step to " if_3_cond " (timeout)
    FAIL: gdb.rocm/lane-execution.exp: step: else: step to " if_3_else " (timeout)
    FAIL: gdb.rocm/lane-execution.exp: step: else: step to "atomicAdd " (timeout)

In this test, we have:

101  if (gid % 2)					/* if_2_cond */
102    elem = foo (const_array[gid]);		/* if_2_then */
103  else
104    elem = const_array[gid];			/* if_2_else */
105						/* if_2_end  */
106
107  /* This condition is always false.  */
108  if (gid == -1)				/* if_3_cond */

The program stops at L104 (active lane is 2), set scheduler-locking on
and issues a "step" command.  Because the compiler emits the "else"
branch before the "then" branch, the "step" command will single step
through L104, then will change the active lanes and single step through
L102.  The expectation is that we single step through L102 until we
reach L108, which will be the first line with lane 2 active we will
execute next.

However, something goes wrong when stepping L102, and the wave will run
to completion.  Because we have scheduler-locking on, once the wave has
terminated, no further events will reach GDB, hanging it, causing the
timeout.

When stepping with lane divergence support enabled, GDB tries to single
step until the current lane becomes active again and the PC of the wave
reaches a new line.  If we follow this process, here is what happens:

First GDB steps through L104 as expected, and does not change as it is
still on the starting line.  We then go to the code changing the EXEC
mask (mapped to L101, but not marked as statement so skipped), and
end-up on L102.  When executing L102, lane 2 is inactive, so we continue
single stepping instructions, which includes calling the function foo.

In the function foo, we have:

    Dump of assembler code for function _Z3fooi:
    .../gdb/testsuite/gdb.rocm/lane-execution.cpp:
    68	{
       0x00007ffff5f29dc0 <+0>:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
       0x00007ffff5f29dc4 <+4>:    s_mov_b32 s25, s33
       0x00007ffff5f29dc8 <+8>:    s_mov_b32 s33, s32
       0x00007ffff5f29dcc <+12>:   s_xor_saveexec_b32 s16, -1
       0x00007ffff5f29dd0 <+16>:   buffer_store_dword v3, off, s[0:3], s33 offset:4
       0x00007ffff5f29dd8 <+24>:   s_mov_b32 exec_lo, s16
       [...]

When single stepping the first 5 instructions, because the current lane
is inactive, we just ignore the stop and continue single stepping
directly.  However, the 5th instruction (at offset foo+12) activates all
the lanes of the wave for spilling using a full VGPR, so lane 2 is
active when executing the buffer_store_dword instruction.  As a
consequence, GDB cannot dismiss this instruction as the others, and
needs to consider it fully.

At this point, in process_event_stop_test, GDB will realize that the
current frame changed compared to when we started the step command.
The process from this point is quite simple, we just stepped inside a
new function, we are almost done.  Last thing to do is to skip the
function prologue, and continue execution.  This is done by
handle_step_into_function: we set a breakpoint after the prologue (at
foo+104: 0x00007ffff7fa9f28), resume execution and hope to be done.

We resume execution, which executes the buffer_store, then restores the
EXEC mask (disabling lane 2) and execute until the end of the prologue.
Because lane 2 is now inactive, we have to continue stepping, so we
execute all of the foo function.  This includes a call to the bar
function which never touches the EXEC mask, so nothing important
happening there.

After we return from bar, we need to execute the epilogue of function
foo which looks like:

       0x00007ffff5f29e80 <+192>:  s_xor_saveexec_b32 s4, -1
       0x00007ffff5f29e84 <+196>:  buffer_load_dword v3, off, s[0:3], s33 offset:4
       0x00007ffff5f29e8c <+204>:  s_mov_b32 exec_lo, s4
       0x00007ffff5f29e90 <+208>:  s_mov_b32 s33, s25
       0x00007ffff5f29e94 <+212>:  s_waitcnt vmcnt(0)
       0x00007ffff5f29e98 <+216>:  s_setpc_b64 s[30:31]
    End of assembler dump.

Between offset +192 and +204, we need to restore the state that was
spilled during the prologue, so we do the same thing: activate all the
lanes, do the load and restore lanes.  When stopping on the buffer_load
instruction, GDB sees that lane 2 is active again, and gets back to
"normal" processing.  We see that we are on a frame different from the
one we had when starting the step, so are about to stop and try to skip
the prologue.  However, this goes quite wrong, as shown by the infrun
logs:

    [infrun] handle_signal_stop: stop_pc=0x7ffff5f29e84
    [infrun] process_event_stop_test: stepped into subroutine
    [infrun] insert_step_resume_breakpoint_at_sal_1: inserting step-resume breakpoint at 0x7ffff5f29e34
    [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [108550.1.1] at 0x7ffff5f29e84

We insert the breakpoint at the end of the function prologue and resume
execution.  But the PC is currently in the epilogue, not in the
prologue!  Because of this, we resume execution without single stepping,
with a breakpoint inserted at an instruction we will not execute, which
allows the wave to run to completion.

I noted earlier that those failures are regressions with a new compiler,
so why didn't we see this issue before?  With the previous version of
the compiler, the bar function (called by foo) also had some spill /
restore logic, changing slightly the sequence.  Because bar was also
spilling, it has an instruction where lane 2 was active, causing GDB to
consider the stop without the lane divergence bypass.

When evaluating if we frame changed, we executed this condition:

    if ((get_stack_frame_id (frame)
         != ecs->event_thread->control.step_stack_frame_id)
        && get_frame_type (frame) != SIGTRAMP_FRAME
        && ((frame_unwind_caller_id (frame)
             == ecs->event_thread->control.step_stack_frame_id)
            && ((ecs->event_thread->control.step_stack_frame_id
                 != outer_frame_id)
                || !ecs->event_thread->control.in_step_start_function (frame))))

Here, the first two predicates hold (current frame is not the same as
when we started stepping, and it is not a SIGTRAMP frame), but things go
south when checking

    frame_unwind_caller_id (frame) == ecs->event_thread->control.step_stack_frame_id)

This one is not true.  Here frame is bar, so the caller is foo, and the
step_stack_frame we compare it to is lane_pc_test.  As a consequence,
this check fails, which gets us to fallback to another case: stepping
got us out of the original step_range we set, but the current
instruction is not a statement either.  So we reset the stepping range
information (set_step_info) and continue single stepping through the
current range.  This gets us all the way through bar (let's ignore the
spill-restore bit), then in foo where we also do the spill-restore.
Because the same check is done, and the current frame (foo) is not
called by where we started (bar), we continue stepping, and eventually
finish the operation normally.

So now the question is: how do we fix this?  I see 2 approaches for
this.  One is to reset the "step range" when we see lane 2 active in foo
the first time.  This would avoid us the incorrect detection of a new
frame when doing the spill-restore.  However, there is something even
more broken exposed with this: why do we allow resuming the wave to skip
the function prologue if we are not in the function prologue?

Because the resume seems universally wrong, this patch proposes to fix
this first.  When doing the resume, check that the thread's PC is not
past the end of the prologue before we use
insert_step_resume_breakpoint_at_sal_1.  With this simple adjustment,
the gdb.rocm/lane-execution.exp testcase passes correctly.

I will explore reseting the stepping range in followup work.

Tested on amdgpu GFX1031.

[1] ROCm/llvm-project@a6fa68a

Change-Id: I7f57c51a430e9a87bddc4877ebddf5deec8a0de8
Bug: LCOMPILER-2337
@lancesix lancesix requested a review from a team as a code owner June 22, 2026 16:26
@aktemur

aktemur commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Function foo is called in an execution path where lane 2 is inactive. The lane becomes active temporarily for spilling/restoring, but that sounds like an implementation detail. Can we not hide these temporary state changes, by noticing that the frame for foo is deeper than the frame in which we started stepping?

@lancesix

Copy link
Copy Markdown
Collaborator Author

Will probably need few discussions for this but:

  • "where stepping started" is kind of a moving target. When we get out of the current stepping range, we end-up resetting the information (every time we end-up calling set_step_info, which can happen multiple times during a step command)
  • We could try to extent the "detect new frame" part by realizing that the current lane was inactive on the call site, and it not expected to meaninfully become active again until we return. Then we should treat that step as a next and continue until we get back to the caller.

That second bit would probably be the most efficient approach to address stepping with divergent lanes. I can look into this.

@aktemur

aktemur commented Jun 23, 2026

Copy link
Copy Markdown
Contributor
  • We could try to extent the "detect new frame" part by realizing that the current lane was inactive on the call site, and it not expected to meaninfully become active again until we return. Then we should treat that step as a next and continue until we get back to the caller.

This makes sense to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants