Skip to content

Commit 7a03f7c

Browse files
authored
fix(python): keep private-loop worker off Python during interpreter exit (#2008)
## What Fixes the remaining red on main after #2007: the `python.yml` **Examples** job crashed with SIGABRT (core dumped) in `langgraph_async_tool.py` at process exit — flaky, ~25% of runs (it passed on the #2007 PR run, failed on the main push run). ## Why Core-dump analysis: the `bashkit-py-loop` private-loop worker thread wakes from `recv()` when the callback engine is gc'd — which commonly happens inside `Py_Finalize`'s GC pass — and called `Python::attach` to close its asyncio event loop. Attaching a fresh thread state during interpreter finalization fatals CPython: ``` Fatal Python error: PyGILState_Release: thread state ... must be current when releasing Python runtime state: finalizing ``` `Python::try_attach` was tried first and does not help: its finalization check is compiled only for Python ≥ 3.13, and `Py_IsInitialized()` still returns 1 during `Py_FinalizeEx`'s GC on older versions (confirmed with a second core dump showing the abort inside `try_attach` itself). This race predates #2007 — it shipped with the private-loop worker redesign (#1918) — but was masked because every `python.yml` run on main since June 6 was cancelled by subsequent pushes. ## How The worker's exit path no longer touches Python at all: the loop's `Py` ref is dropped unattached (pyo3 safely defers the decref) and the loop is closed by asyncio's `BaseEventLoop.__del__` when the deferred decref runs, or reclaimed by the OS at process exit. Documented as TM-PY-030 variant (3) in `specs/threat-model.md`. ## Tests - Before: `langgraph_async_tool.py` aborted 6/30 runs (and 2/5, 1/10 in other rounds). After: **0/80 across two 40-run stress rounds**. - Full bashkit-python pytest suite: 700 passed, 1 skipped. - `cargo fmt --check` / `cargo clippy -p bashkit-python --all-targets` clean. - The Examples CI job itself is the ongoing regression signal for this exit-time race (it runs the crashing example on every PR). --- _Generated by [Claude Code](https://claude.ai/code/session_01DtAsttyBKbB8jQFwzxTw1F)_
1 parent 61080c6 commit 7a03f7c

2 files changed

Lines changed: 20 additions & 6 deletions

File tree

crates/bashkit-python/src/lib.rs

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2249,9 +2249,16 @@ impl PyPrivateAsyncLoop {
22492249
let _ = item.result_tx.send(result);
22502250
}
22512251

2252-
Python::attach(|py| {
2253-
let _ = event_loop.bind(py).call_method0("close");
2254-
});
2252+
// THREAT[TM-PY-030]: do NOT touch Python on the exit path.
2253+
// The worker wakes here because the engine was gc'd, and that
2254+
// gc commonly runs inside Py_Finalize — attaching then crashes
2255+
// CPython (PyGILState_Release fatal: SIGABRT at interpreter
2256+
// exit). Even Python::try_attach cannot detect finalization
2257+
// before 3.13. Dropping `event_loop` without attaching is safe
2258+
// (pyo3 defers the decref); the loop is closed by asyncio's
2259+
// BaseEventLoop.__del__ when the deferred decref runs, or
2260+
// reclaimed by the OS at process exit.
2261+
drop(event_loop);
22552262
})
22562263
.map_err(|e| {
22572264
PyRuntimeError::new_err(format!("failed to spawn private loop thread: {e}"))

specs/threat-model.md

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1745,7 +1745,7 @@ caller's GIL hold.
17451745

17461746
| TM-PY-026 | reset() discards security config | `BashTool.reset()` creates new `Bash` with bare builder, dropping all configured limits | `PyBash::reset` and `BashTool::reset` rebuild via `replace_live_bash_with_builder` + `build_live_builder`, which preserves the original limits, env, and registered builtins | **MITIGATED** |
17471747
| TM-PY-027 | Unbounded recursion in JSON conversion | `py_to_json`/`json_to_py` recurse without depth limit on nested dicts/lists | `json_to_py_inner`, `py_to_json_inner`, and the MontyObject converters all carry a `depth` arg; depth > `MAX_NESTING_DEPTH = 64` raises `ValueError("… nesting depth exceeds maximum of 64")` | **MITIGATED** |
1748-
| TM-PY-030 | GIL deadlock via async-callback private loop | Private-loop dispatch blocked on a rendezvous channel while attached (GIL held), and pyclass dealloc joined in-flight blocking tasks that must re-attach to finish — either froze the whole process (observed as a 6 h CI hang) | Dispatch detaches around both the send and the receive; `PyRuntime` drop shuts the tokio runtime down with `shutdown_background()` instead of a blocking join | **MITIGATED** |
1748+
| TM-PY-030 | GIL deadlock / exit crash via async-callback private loop | Private-loop dispatch blocked on a rendezvous channel while attached (GIL held); pyclass dealloc joined in-flight blocking tasks that must re-attach to finish (froze the whole process, observed as a 6 h CI hang); worker thread attached during interpreter finalization to close its loop (SIGABRT at process exit) | Dispatch detaches around both the send and the receive; `PyRuntime` drop shuts the tokio runtime down with `shutdown_background()` instead of a blocking join; worker exit path never touches Python (loop closed via `BaseEventLoop.__del__`) | **MITIGATED** |
17491749

17501750
**TM-PY-026** (mitigated): `PyBash::reset` and `BashTool::reset` (`crates/bashkit-python/src/lib.rs`)
17511751
rebuild the inner `Bash` via `replace_live_bash_with_builder` + `build_live_builder`, which
@@ -1767,9 +1767,16 @@ and receive now both run inside `py.detach(...)`. (2) Pyclass dealloc runs attac
17671767
and dropped the last `Arc<Runtime>`; tokio's default `Runtime::drop` joins in-flight
17681768
blocking tasks, and an abandoned (timed-out) callback task must re-attach to finish —
17691769
freezing the entire interpreter. The `PyRuntime` handle now shuts the runtime down
1770-
with `shutdown_background()` on last drop. Regression tests:
1770+
with `shutdown_background()` on last drop. (3) The private-loop worker thread called
1771+
`Python::attach` on its exit path to close its asyncio loop; the worker usually wakes
1772+
because the engine was gc'd, and that gc commonly runs inside `Py_Finalize`
1773+
attaching during finalization fatals CPython (`PyGILState_Release`, SIGABRT at
1774+
interpreter exit; `Python::try_attach` cannot detect finalization before 3.13). The
1775+
worker exit path no longer touches Python: the loop's `Py` ref is dropped unattached
1776+
(deferred decref) and the loop is closed by `BaseEventLoop.__del__`. Regression tests:
17711777
`tests/test_async_callbacks.py::test_async_callback_execute_sync_honors_timeout`,
1772-
`…::test_dealloc_during_inflight_callback_does_not_deadlock`.
1778+
`…::test_dealloc_during_inflight_callback_does_not_deadlock`; variant (3) is covered
1779+
by the `langgraph_async_tool.py` example run in the Python CI Examples job.
17731780

17741781
| TM-PY-029 | Host clock information disclosure | `datetime.date.today()` / `datetime.datetime.now()` expose host system time and timezone | Intentional — required for correct datetime semantics | **ACCEPTED** |
17751782

0 commit comments

Comments
 (0)