gdb/target: revert target_async change#176
Open
simark wants to merge 1 commit into
Open
Conversation
While working a multi-inferior fixes series [1], I tried to apply said
series on the master branch, but the test hits an internal error when
trying to start a second inferior:
/home/smarchi/src/binutils-gdb/gdb/amd-dbgapi-target.c:466: internal-error: async_event_handler_clear: Assertion `amd_dbgapi_async_event_handler != nullptr' failed.
This made me realize that we have this diff between master and
downstream, in the `target_async` function (- is master, + is
amd-staging), which would be needed to avoid the internal error.
@@ -4329,7 +4424,17 @@ target_async (bool enable)
async mode is possible for this target. */
gdb_assert (!enable || target_can_async_p ());
infrun_async (enable);
- current_inferior ()->top_target ()->async (enable);
+
+ process_stratum_target *proc_target = current_inferior ()->process_target ();
+ scoped_restore_current_thread restore_thread;
+
+ for (inferior *inf : all_inferiors (proc_target))
+ {
+ if (current_inferior () != inf)
+ switch_to_inferior_no_thread (inf);
+
+ inf->top_target ()->async (enable);
+ }
}
This change was introduced in the downstream-only commit ff546ec
("gdb/rocm: fix vfork handling"), but now I am not sure it makes sense.
It was meant to handle the following situation. Imagine we have two
inferiors, currently stopped, with the following targets pushed:
- inferior 1, rocm target + linux nat target
- inferior 2, linux nat target
or, visually:
inf 1 inf 2
+------------+
| amd-dbgapi |
+--------------------------+
| linux-nat |
+--------------------------+
When you resume only inf 2, then GDB calls `target_async(true)` on inf
2, which has the effect of making the linux-nat target async-enabled. The
amd-dbgapi target remains async-disabled.
Once an event happens, do_target_wait randomly calls wait on all
inferiors, and it happens that it calls wait on inf 1. This call will
reach amd_dbgapi_target::wait, more particularly this code:
/* Flush the async handler first. */ │
if (target_is_async_p ()) │
async_event_handler_clear ();
The amd-dbgapi target does not implement `target_ops::is_async_p()`
(which is probably a mistake), so the `target_is_async_p()` call hits
linux-nat's implementation (which is in fact `inf_ptrace_target::is_async_p()`).
That returns true, so we call `async_event_handler_clear()`. That
function checks:
gdb_assert (amd_dbgapi_async_event_handler != nullptr);
which is a synonym for "I am currently async-enabled". And that is
false, because the amd-dbgapi remained async-disabled, as mentioned
above.
The `target_async` change we have downstream makes it so that when
calling `target_async(true)` with inf 2 as the current inferior, then we'll
look at all the inferiors sharing its process target (linux-nat) and
make them async-enabled as well. This will avoid the assert, because
it will make the amd-dbgapi async-enabled. But that doesn't seem like a
good fix to today me, because inferior 2 is not running, there is no
good reason to make the amd-dbgapi target async-enabled. That fix
forced the amd-dbgapi target to be async-enabled before calling wait on
it, but that's not a good reason. There is nothing wrong with calling
wait on a target that isn't currently async-enabled.
This patch restores `target_async` to what it is upstream (which alone
would introduce some test failures) and then tries to fix the problem in
a more focused way.
I think the most obvious problem is that the `target_is_async_p()` call
returns true while the amd-dbgapi target is not currently async-enabled,
which leads us on a wrong path.
As a start, this patch changes the conditions that gate the calls to
`async_event_handler_mark` and `async_event_handler_clear` with
(abstracted in a helper function):
amd_dbgapi_async_event_handler != nullptr
instead of
target_is_async_p ()
I think this reflects what we want to know at this very moment: is the
amd-dbgapi target, in isolation, async-enabled.
Further questions:
- Should we give amd_dbgapi_target its own target_ops::is_async_p
implementation, and if so, what should it be? If the amd-dbgapi
target is async-disabled and the underlying target is
async-enabled, what should it return?
- Is there maybe a reason why we don't want to be in a state where
a target in a stack is async-enabled and another is async-disabled?
- Looking at the uses of `target_is_async_p()`, I saw one in
gdb_readline_wrapper_cleanup that could be problematic with target
stacks with async half enabled / half disabled.
gdb_readline_wrapper_cleanup records the result of
`target_is_async_p()` on entry, then calls `target_async(false)`,
then (if async was enabled) restores it on exit with
`target_async(true)`. If we were in a state where amd-dbgapi is
async-disabled and linux-nat is async-enabled, I could see how the
"restored" state might not match the original state.
I don't know why gdb_readline_wrapper_cleanup does that, so I can't
judge what impact it would have to either fail to disable async here,
or if the restored state differs from the original state.
[1] ROCm#106
Contributor
Author
|
This PR is a way to discuss the downstream change we have in function
|
Collaborator
|
Hi, I have not really looked much in the details, but I expect there will be (minor) conflicts with #175. Might need to watch for this. |
Contributor
Author
It rebased cleanly on that PR, built fine and ran a smoke test fine. |
Collaborator
Indeed, my bad. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
While working a multi-inferior fixes series [1], I tried to apply said series on the master branch, but the test hits an internal error when trying to start a second inferior:
This made me realize that we have this diff between master and downstream, in the
target_asyncfunction (- is master, + is amd-staging), which would be needed to avoid the internal error.This change was introduced in the downstream-only commit ff546ec ("gdb/rocm: fix vfork handling"), but now I am not sure it makes sense.
It was meant to handle the following situation. Imagine we have two inferiors, currently stopped, with the following targets pushed:
or, visually:
When you resume only inf 2, then GDB calls
target_async(true)on inf 2, which has the effect of making the linux-nat target async-enabled. The amd-dbgapi target remains async-disabled.Once an event happens, do_target_wait randomly calls wait on all inferiors, and it happens that it calls wait on inf 1. This call will reach amd_dbgapi_target::wait, more particularly this code:
The amd-dbgapi target does not implement
target_ops::is_async_p()(which is probably a mistake), so thetarget_is_async_p()call hits linux-nat's implementation (which is in factinf_ptrace_target::is_async_p()). That returns true, so we callasync_event_handler_clear(). That function checks:which is a synonym for "I am currently async-enabled". And that is false, because the amd-dbgapi remained async-disabled, as mentioned above.
The
target_asyncchange we have downstream makes it so that when callingtarget_async(true)with inf 2 as the current inferior, then we'll look at all the inferiors sharing its process target (linux-nat) and make them async-enabled as well. This will avoid the assert, because it will make the amd-dbgapi async-enabled. But that doesn't seem like a good fix to today me, because inferior 2 is not running, there is no good reason to make the amd-dbgapi target async-enabled. That fix forced the amd-dbgapi target to be async-enabled before calling wait on it, but that's not a good reason. There is nothing wrong with calling wait on a target that isn't currently async-enabled.This patch restores
target_asyncto what it is upstream (which alone would introduce some test failures) and then tries to fix the problem in a more focused way.I think the most obvious problem is that the
target_is_async_p()call returns true while the amd-dbgapi target is not currently async-enabled, which leads us on a wrong path.As a start, this patch changes the conditions that gate the calls to
async_event_handler_markandasync_event_handler_clearwith (abstracted in a helper function):instead of
I think this reflects what we want to know at this very moment: is the amd-dbgapi target, in isolation, async-enabled.
Further questions:
Should we give amd_dbgapi_target its own target_ops::is_async_p implementation, and if so, what should it be? If the amd-dbgapi target is async-disabled and the underlying target is async-enabled, what should it return?
Is there maybe a reason why we don't want to be in a state where a target in a stack is async-enabled and another is async-disabled?
Looking at the uses of
target_is_async_p(), I saw one in gdb_readline_wrapper_cleanup that could be problematic with target stacks with async half enabled / half disabled. gdb_readline_wrapper_cleanup records the result oftarget_is_async_p()on entry, then callstarget_async(false), then (if async was enabled) restores it on exit withtarget_async(true). If we were in a state where amd-dbgapi is async-disabled and linux-nat is async-enabled, I could see how the "restored" state might not match the original state.I don't know why gdb_readline_wrapper_cleanup does that, so I can't judge what impact it would have to either fail to disable async here, or if the restored state differs from the original state.
[1] #106