gdb/target: revert target_async change #176

Open
simark wants to merge 1 commit into ROCm:amd-staging from simark:revert-target-async

@simark simark commented Jun 17, 2026

While working on a multi-inferior fixes series [1], I tried to apply said series on the master branch, but the test hits an internal error when trying to start a second inferior:

/home/smarchi/src/binutils-gdb/gdb/amd-dbgapi-target.c:466: internal-error: async_event_handler_clear: Assertion `amd_dbgapi_async_event_handler != nullptr' failed.

This made me realize that we have this diff between master and downstream, in the target_async function (- is master, + is amd-staging), which would be needed to avoid the internal error.

@@ -4329,7 +4424,17 @@ target_async (bool enable)
      async mode is possible for this target.  */
   gdb_assert (!enable || target_can_async_p ());
   infrun_async (enable);
-  current_inferior ()->top_target ()->async (enable);
+
+  process_stratum_target *proc_target = current_inferior ()->process_target ();
+  scoped_restore_current_thread restore_thread;
+
+  for (inferior *inf : all_inferiors (proc_target))
+    {
+      if (current_inferior () != inf)
+       switch_to_inferior_no_thread (inf);
+
+      inf->top_target ()->async (enable);
+    }
 }

This change was introduced in the downstream-only commit ff546ec ("gdb/rocm: fix vfork handling"), but now I am not sure it makes sense.

It was meant to handle the following situation. Imagine we have two inferiors, currently stopped, with the following targets pushed:

  • inferior 1, rocm target + linux nat target
  • inferior 2, linux nat target

or, visually:

    inf 1         inf 2
+------------+
| amd-dbgapi |
+--------------------------+
|        linux-nat         |
+--------------------------+

When you resume only inf 2, then GDB calls target_async(true) on inf 2, which has the effect of making the linux-nat target async-enabled. The amd-dbgapi target remains async-disabled.

Once an event happens, do_target_wait randomly calls wait on all inferiors, and it may happen that it calls wait on inf 1. This call reaches amd_dbgapi_target::wait, specifically this code:

/* Flush the async handler first.  */
if (target_is_async_p ())
  async_event_handler_clear ();

The amd-dbgapi target does not implement target_ops::is_async_p() (which is probably a mistake), so the target_is_async_p() call hits linux-nat's implementation (which is in fact inf_ptrace_target::is_async_p()). That returns true, so we call async_event_handler_clear(). That function checks:

gdb_assert (amd_dbgapi_async_event_handler != nullptr);

which is a synonym for "I am currently async-enabled". And that is false, because the amd-dbgapi target remained async-disabled, as mentioned above.
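To make the mismatch concrete, here is a minimal standalone sketch of the delegation described above. It is a toy model, not GDB code: `toy_target`, `toy_native_target`, `toy_gpu_target` and `query_disagrees_with_own_state` are hypothetical names, and the only behavior taken from the description is that a target which does not override `is_async_p` forwards the query to the target beneath it.

```cpp
// Toy model of a GDB-like target stack.  All names are illustrative
// and do not match GDB's real classes.
struct toy_target
{
  explicit toy_target (toy_target *beneath = nullptr) : m_beneath (beneath) {}
  virtual ~toy_target () = default;

  // Default behavior: forward the query to the target beneath.
  virtual bool is_async_p () const
  { return m_beneath != nullptr && m_beneath->is_async_p (); }

  toy_target *m_beneath;
};

// Stands in for linux-nat: it answers from its own async state.
struct toy_native_target : toy_target
{
  bool is_async_p () const override { return m_async; }
  bool m_async = false;
};

// Stands in for amd-dbgapi: it has its own async state (a handler
// pointer) but does not override is_async_p.
struct toy_gpu_target : toy_target
{
  using toy_target::toy_target;
  const void *m_event_handler = nullptr;  // nullptr means async-disabled.
};

// Returns true when the stack-wide query says "async" while the top
// target itself is async-disabled, i.e. the state that trips the assert.
inline bool
query_disagrees_with_own_state (const toy_gpu_target &gpu)
{
  bool stack_says_async = gpu.is_async_p ();
  bool gpu_is_async = gpu.m_event_handler != nullptr;
  return stack_says_async && !gpu_is_async;
}
```

In this model, enabling async only on the native layer is enough to make the query on the GPU layer return true while its handler pointer is still null.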

The target_async change we have downstream makes it so that when target_async(true) is called with inf 2 as the current inferior, we look at all the inferiors sharing its process target (linux-nat) and make them async-enabled as well. This avoids the assert, because it makes the amd-dbgapi target async-enabled. But that doesn't seem like a good fix to me today: inferior 1 is not running, so there is no good reason to make the amd-dbgapi target async-enabled. That fix forced the amd-dbgapi target to be async-enabled before calling wait on it, but that's not a good reason. There is nothing wrong with calling wait on a target that isn't currently async-enabled.
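The difference between the two versions of target_async can be sketched with a toy model. Everything here is hypothetical (`toy_layer`, `toy_inferior`, and the two `toy_target_async_*` functions are not GDB names), and the model assumes that enabling async on a layer also enables it on the layer beneath.

```cpp
#include <vector>

// One layer of a target stack.  Illustrative only.
struct toy_layer
{
  toy_layer *beneath = nullptr;
  bool async_enabled = false;

  // Assumption of this sketch: toggling async on a layer also toggles
  // it on whatever sits beneath.
  void async (bool enable)
  {
    if (beneath != nullptr)
      beneath->async (enable);
    async_enabled = enable;
  }
};

// An inferior is reduced to a pointer to the top of its stack.
struct toy_inferior
{
  toy_layer *top;
};

// Upstream-like behavior: only the current inferior's stack is touched.
inline void
toy_target_async_upstream (toy_inferior &current, bool enable)
{
  current.top->async (enable);
}

// Downstream-like behavior: every inferior sharing the process target
// has the top of its stack toggled.
inline void
toy_target_async_downstream (std::vector<toy_inferior> &sharing, bool enable)
{
  for (toy_inferior &inf : sharing)
    inf.top->async (enable);
}
```

With inf 1 stacked as GPU layer over native layer and inf 2 using the native layer alone, the upstream-like call on inf 2 leaves the GPU layer async-disabled, while the downstream-like call enables it even though inf 1 was not resumed.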

This patch restores target_async to what it is upstream (which alone would introduce some test failures) and then tries to fix the problem in a more focused way.

I think the most obvious problem is that the target_is_async_p() call returns true while the amd-dbgapi target is not currently async-enabled, which leads us down the wrong path.

As a start, this patch changes the condition that gates the calls to async_event_handler_mark and async_event_handler_clear to (abstracted in a helper function):

amd_dbgapi_async_event_handler != nullptr

instead of

target_is_async_p ()

I think this reflects what we want to know at this very moment: is the amd-dbgapi target, in isolation, async-enabled?
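As a rough illustration of that gating, here is a standalone sketch. It does not reproduce the patch: the helper name `toy_is_async_enabled`, the `toy_event_handler` type and the other `toy_*` names are hypothetical stand-ins, and only the null-pointer check itself comes from the description above.

```cpp
#include <cassert>

// Stand-in for an async event handler.  Illustrative only.
struct toy_event_handler
{
  bool marked = false;
};

// Non-null while the (toy) target is async-enabled.
toy_event_handler *toy_async_event_handler = nullptr;

// Hypothetical helper: asks about this target's own state only,
// without consulting the rest of the stack.
inline bool
toy_is_async_enabled ()
{
  return toy_async_event_handler != nullptr;
}

// Mirrors the assertion from the report: clearing requires a handler.
inline void
toy_handler_clear ()
{
  assert (toy_async_event_handler != nullptr);
  toy_async_event_handler->marked = false;
}

// The start of a wait-like function: flush the handler only when this
// target is itself async-enabled.  Returns whether a flush happened.
inline bool
toy_flush_before_wait ()
{
  if (!toy_is_async_enabled ())
    return false;

  toy_handler_clear ();
  return true;
}
```

Because the guard and the assertion test the same pointer, the assertion cannot fire from this path, whatever the async state of the targets beneath.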

Further questions:

  • Should we give amd_dbgapi_target its own target_ops::is_async_p implementation, and if so, what should it be? If the amd-dbgapi target is async-disabled and the underlying target is async-enabled, what should it return?

  • Is there maybe a reason why we don't want to be in a state where a target in a stack is async-enabled and another is async-disabled?

  • Looking at the uses of target_is_async_p(), I saw one in gdb_readline_wrapper_cleanup that could be problematic with target stacks with async half enabled / half disabled. gdb_readline_wrapper_cleanup records the result of target_is_async_p() on entry, then calls target_async(false), then (if async was enabled) restores it on exit with target_async(true). If we were in a state where amd-dbgapi is async-disabled and linux-nat is async-enabled, I could see how the "restored" state might not match the original state.

    I don't know why gdb_readline_wrapper_cleanup does that, so I can't judge the impact of either failing to disable async here or restoring a state that differs from the original one.
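The last concern can be illustrated with a small model of the save / disable / restore sequence. This is only a sketch under assumptions: the `toy_*` names are hypothetical, the query is modeled as answering from the native layer (as described for the missing is_async_p override), and toggling async is assumed to affect both layers of the stack.

```cpp
// Async state of a two-layer stack.  Illustrative only.
struct toy_stack_state
{
  bool gpu_async;     // amd-dbgapi-like layer
  bool native_async;  // linux-nat-like layer
};

// The query is answered by the native layer, since the GPU layer is
// modeled as not overriding it.
inline bool
toy_is_async_p (const toy_stack_state &s)
{
  return s.native_async;
}

// Assumption of this sketch: toggling async affects both layers.
inline void
toy_async (toy_stack_state &s, bool enable)
{
  s.native_async = enable;
  s.gpu_async = enable;
}

// Record the query result, disable async, then re-enable it if it was
// recorded as enabled.  Returns the resulting state.
inline toy_stack_state
toy_save_disable_restore (toy_stack_state s)
{
  bool was_async = toy_is_async_p (s);
  toy_async (s, false);
  if (was_async)
    toy_async (s, true);
  return s;
}
```

Starting from a stack where the GPU layer is async-disabled and the native layer is async-enabled, the sequence ends with both layers enabled, so the "restored" state differs from the original one in this model.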

[1] #106

@simark simark requested review from lancesix and palves June 17, 2026 20:44
@simark simark requested a review from a team as a code owner June 17, 2026 20:44
simark commented Jun 17, 2026

This PR is a way to discuss the downstream change we have in function target_async. The outcome will be either:

  • this downstream change is good and we should send it upstream, but then we need a good justification to say why it's needed and why it's correct
  • this downstream change doesn't make sense and we should seek an alternative (this patch provides one)

@lancesix

Hi, I have not really looked much in the details, but I expect there will be (minor) conflicts with #175. Might need to watch for this.

simark commented Jun 17, 2026

> Hi, I have not really looked much in the details, but I expect there will be (minor) conflicts with #175. Might need to watch for this.

It rebased cleanly on that PR, built fine and ran a smoke test fine.

@lancesix

> > Hi, I have not really looked much in the details, but I expect there will be (minor) conflicts with #175. Might need to watch for this.
>
> It rebased cleanly on that PR, built fine and ran a smoke test fine.

Indeed, my bad.
