Skip to content

[pjrt] ensure client destruction on process exit#1999

Merged
pilkicTT merged 2 commits into
mainfrom
pilkic/workaround-destroy
Nov 6, 2025
Merged

[pjrt] ensure client destruction on process exit#1999
pilkicTT merged 2 commits into
mainfrom
pilkic/workaround-destroy

Conversation

@pilkicTT

@pilkicTT pilkicTT commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

torch_xla doesn't call PJRT_ClientDestroy properly. This means that we are not closing the devices properly.

Recently, this started causing hangs on n300 boards on subsequent execution of tests.

This PR introduces a global singleton object which will ensure that we are properly destroying the client instance on process shutdown. The singleton serves as a fallback mechanism if the framework doesn't call PJRT_ClientDestroy - like in the case of torch_xla.

Additionally, optimizer sub-meshes are now closed after each compilation; this previously was needed to avoid hangs, but now it causes them. Leaving the mechanism of persisting optimizer submesh in the code base, so that we can play with it if needed. Obviously, we need to dig deep into these issues to fix them properly.

NOTE: this does not solve the case when the process terminates abruptly, e.g. in case of SIGSEGV (segmentation fault). For this, ideally we would want a fix on tt-metal side.

Closes #1824

@jameszianxuTT

Copy link
Copy Markdown
Contributor

cc @kmabeeTT

@hshahTT

hshahTT commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

Have we filed an issue with the torch-xla folks about them not calling PJRT_ClientDestroy properly? It's fine to add this WA for now, but it'd be nice to eventually fix the underlying issue within torch-xla itself

@pilkicTT pilkicTT force-pushed the pilkic/workaround-destroy branch from 3257cff to fe091fd Compare November 5, 2025 18:32
@jameszianxuTT

Copy link
Copy Markdown
Contributor

@hshahTT torch-xla folks are not a fan of implementing proper destructor logic / calling PJRT Client Destroy pytorch/xla#9675 - so I don't expect to see a fix being upstreamed anytime soon.

Several parties have encountered this issue, not just us.

@github-actions

github-actions Bot commented Nov 5, 2025

Copy link
Copy Markdown
Contributor
TestsPassed ✅Skipped ⚠️Failed
TT-XLA Tests179 ran159 passed20 skipped0 failed
TestResult
No test annotations available

Comment thread pjrt_implementation/inc/api/client_instance.h Outdated
Comment thread pjrt_implementation/src/api/client_instance.cc
@pilkicTT pilkicTT force-pushed the pilkic/workaround-destroy branch from d7c948d to 19cd3a7 Compare November 5, 2025 19:47
@kmabeeTT

kmabeeTT commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

Thank you! Works good for me, am able to run back to back pytest cmds on n300-llmbox now without hangs (and needing to call tt-smi -r between tests). I still get DRAM leak between tests in same pytest cmd, but that's a separate issue to figure out.

`torch_xla` doesn't call `PJRT_ClientDestroy` properly. This means that
we are not closing the devices properly.

Recently, this started causing hangs on `n300` boards on subsequent
execution of tests.

This PR introduces a global singleton object which will ensure that we
are properly destroying the client instance on process shutdown. The
singleton serves as a fallback mechanism if the framework doesn't call
`PJRT_ClientDestroy` - like in the case of `torch_xla`.

Additionally, optimizer sub-meshes are now closed after each
compilation; this previously was needed to avoid hangs, but now it
causes them. Leaving the mechanism of persisting optimizer submesh in
the code base, so that we can play with it if needed. Obviously, we need
to dig deep into these issues to fix them properly.

NOTE: this does not solve the case when the process terminates
abruptly, e.g. in case of `SIGSEGV` (segmentation fault). For this,
ideally we would want a fix on `tt-metal` side.

Closes #1824
@pilkicTT pilkicTT force-pushed the pilkic/workaround-destroy branch from 19cd3a7 to aa67e43 Compare November 6, 2025 12:14
@pilkicTT pilkicTT merged commit 409f079 into main Nov 6, 2025
38 checks passed
@pilkicTT pilkicTT deleted the pilkic/workaround-destroy branch November 6, 2025 14:16
amilovanovicTT pushed a commit that referenced this pull request Dec 2, 2025
`torch_xla` doesn't call `PJRT_ClientDestroy` properly. This means that
we are not closing the devices properly.

Recently, this started causing hangs on `n300` boards on subsequent
execution of tests.

This PR introduces a global singleton object which will ensure that we
are properly destroying the client instance on process shutdown. The
singleton serves as a fallback mechanism if the framework doesn't call
`PJRT_ClientDestroy` - like in the case of `torch_xla`.

Additionally, optimizer sub-meshes are now closed after each
compilation; this previously was needed to avoid hangs, but now it
causes them. Leaving the mechanism of persisting optimizer submesh in
the code base, so that we can play with it if needed. Obviously, we need
to dig deep into these issues to fix them properly.

NOTE: this does not solve the case when the process terminates abruptly,
e.g. in case of `SIGSEGV` (segmentation fault). For this, ideally we
would want a fix on `tt-metal` side.

Closes #1824
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch_xla ClientInstance destructor doesn't get called leading to leaked devices and mesh

7 participants