Skip to content

gdb/testsuite/gdb.rocm: extract reusable multi-inferior driver helpers#166

Open
spatrang wants to merge 1 commit into
amd-stagingfrom
users/spatrang/multi-inferior-test-helper
Open

gdb/testsuite/gdb.rocm: extract reusable multi-inferior driver helpers#166
spatrang wants to merge 1 commit into
amd-stagingfrom
users/spatrang/multi-inferior-test-helper

Conversation

@spatrang

@spatrang spatrang commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Why this PR

This is preparatory refactoring split out of #131 at reviewer request.
While reviewing #131 (which adds a new multi-inferior stress test), it
was noted that the new test shares most of its driver logic with the
existing gdb.rocm/multi-inferior-gpu.exp. Rather than duplicate that
logic, the common parts are extracted here into shared helpers first, so
that #131 can reuse them and its diff reduces to just what is genuinely
new.

The tests are kept separate (only the driver logic is shared).

Dependent PR

Summary

Extract the shared non-stop multi-inferior driver logic out of
gdb.rocm/multi-inferior-gpu.exp into two reusable helper procs in
gdb/testsuite/lib/rocm.exp, and convert the existing test to use them.

Helpers added (lib/rocm.exp)

  • rocm_multi_inferior_run_to_kernels {args_list expected} — load the
    program, enable non-stop with detach-on-fork off / follow-fork parent, plant the breakpoints, run the parent to its pre-fork
    breakpoint, resume in the background, and collect one kernel
    breakpoint stop per child inferior. Returns the list of stopped GPU
    thread ids. The child count can be passed explicitly or discovered at
    runtime.
  • rocm_multi_inferior_drain {threads} — continue each stopped GPU
    inferior to a clean exit, wait for the parent to reach its
    post-waitpid breakpoint, and run the parent to completion.

Behavior note

The extracted driver is intentionally stricter than the inlined
original: it deduplicates GPU stops by inferior, uses literal-matched
regexes, and fails loudly on timeout or a non-zero child exit instead of
hanging. Coverage of the converted test is otherwise unchanged.

Files changed

  • gdb/testsuite/lib/rocm.exp — add the two helpers.
  • gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp — convert to use them.

@spatrang spatrang requested a review from a team as a code owner June 10, 2026 17:06

@lancesix lancesix left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not really thought deeply about it, but the thing which tickles me here is that this helper implicitly relies on properties of the source file (the markers where to insert breakpoints).

If the functions built around those source assumptions are common, I'd expect the source file to be common as well. If we get to a point where we have multiple tests using those helpers, it will get harder to keep the source / tcl bits in sync.

It really feels like the .cpp of multi-inferior-gpu should also be made generic if we go this way.

Comment thread gdb/testsuite/lib/rocm.exp Outdated
@spatrang

Copy link
Copy Markdown
Contributor Author

It really feels like the .cpp of multi-inferior-gpu should also be made generic if we go this way.

Done. Rather than add a separate program, I generalized multi-inferior-gpu.cpp in place so the breakpoint markers the helpers rely on live in a single source: the child count now comes from argv when given and otherwise defaults to the detected GPU device count, and each child re-execs through a child argv dispatch. The follow-up stress test (#131) will point standard_testfile at this same source instead of carrying its own copy, so the .cpp and the .exp helpers stay in sync.

Comment thread gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp Outdated
Comment thread gdb/testsuite/lib/rocm.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp Outdated
@spatrang spatrang force-pushed the users/spatrang/multi-inferior-test-helper branch from f0c9558 to 5a0b205 Compare June 22, 2026 04:42
@spatrang

Copy link
Copy Markdown
Contributor Author

Rebased on the latest amd-staging and force-pushed (f0c95585a0b205). No functional changes; this only updates the base.

Factor the shared non-stop multi-inferior driver logic into two
helper procs and a single driver program, so a follow-up stress
test can reuse them instead of duplicating the logic.

Add multi-inferior.exp.tcl with rocm_multi_inferior_run_to_kernels
(set up the session, run the parent to the pre-fork breakpoint,
resume, and collect one kernel stop per child) and
rocm_multi_inferior_drain (continue each child to a clean exit and
run the parent to completion).  Tests source it via
$srcdir/$subdir/multi-inferior.exp.tcl.

Generalize multi-inferior-gpu.cpp into the shared driver program:
the child count is taken from argv when given and otherwise
defaults to the number of GPU devices found at runtime, and each
child re-execs itself through a "child" argv dispatch.

Convert multi-inferior-gpu.exp to source the helpers and use the
shared program.
@spatrang spatrang force-pushed the users/spatrang/multi-inferior-test-helper branch from 5a0b205 to 4e2f05a Compare June 22, 2026 09:40
@spatrang

Copy link
Copy Markdown
Contributor Author

Updated to address the review and squashed the branch into a single commit (force-pushed).

Changes since the last push:

  • Moved the two driver helpers out of lib/rocm.exp into a new gdb.rocm/multi-inferior.exp.tcl, sourced via $srcdir/$subdir/multi-inferior.exp.tcl; lib/rocm.exp is now unchanged versus the base.
  • Fixed the post-join -> post-waitpid wording in the program header comment.
  • Regrouped the includes (sys/ first, then hip/, then the rest).

gdb.rocm/multi-inferior-gpu.exp still passes (26/0).

@spatrang spatrang assigned lumachad and unassigned czidev-amd Jun 22, 2026
@lumachad

Copy link
Copy Markdown
Collaborator

Any reason why this wasn't kept with @czidev-amd?

@spatrang

Copy link
Copy Markdown
Contributor Author

Any reason why this wasn't kept with @czidev-amd?

Since you’ve already reviewed this PR and have the context, I assumed it would be easier for you to continue as the reviewer.
Happy to switch to auto-assignment if that works better.

@palves

palves commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

I'm worried that this is being pushed as general reusable infrastructure for testing multi-inferior things when we shouldn't be relying on fork for that. See:

#131 (comment)

Usage of fork in tests should be restricted to testing fork-specific things.

@spatrang

Copy link
Copy Markdown
Contributor Author

@palves @lumachad — thanks, that's a fair concern and I'd rather settle the mechanism before building more on top of it.

To make sure I take the right direction, which of these do you prefer for the multi-inferior driver?

  1. add-inferior within a single GDB — no fork in the test program; portable to Windows once GDB there supports multi-process. The helper would set up N inferiors with add-inferior / file / run instead of forking.
  2. One GDB process per inferior, coordinated with spawn_ids — also fork-free and portable, and maps naturally onto the "many independent GPU processes" stress goal.
  3. Both — exercise (1) and (2), if the extra coverage is worth the complexity.

My read of the thread is that (1) is the minimal portable replacement and (2) gives the most realistic stress coverage; happy to start with (1) only if (2) is too much for a first cut.

A couple of things I'd like your steer on:

I'll hold further changes until we agree on the approach.

@lumachad

lumachad commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

I still think this should be sourced as a separate generic/template .tcl file instead of things living in lib/*.exp.

I'm worried that this is being pushed as general reusable infrastructure for testing multi-inferior things when we shouldn't be relying on fork for that. See:

#131 (comment)

Usage of fork in tests should be restricted to testing fork-specific things.

The goal is to expand test coverage for multi-inferior debugging. Right now these tests rely on fork. So it is merely refactoring that code. I think the Windows testsuite run (if/when they run for this case) will still skip these tests due to the use of fork, right?

We should document this is fork-specific/*nix-specific though, if there is a worry someone might misunderstand the restriction.

@lumachad

Copy link
Copy Markdown
Collaborator

If we do want to refactor these fork-based tests into something else not using fork, I think that's fair. But then we need to check that against our extra test coverage efforts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants