Skip to content

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL#20267

Open
GregoryComer wants to merge 1 commit into
pytorch:mainfrom
GregoryComer:export-D108226589
Open

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL#20267
GregoryComer wants to merge 1 commit into
pytorch:mainfrom
GregoryComer:export-D108226589

Conversation

@GregoryComer

Copy link
Copy Markdown
Member

Summary:
Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:

(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in portable-api.c reach into peer threads' ranges via threads[(num_threads + thread_number - tid) % num_threads]. When the pool is created with max_num_threads (e.g. hardware concurrency) but a parallelize runs with a smaller num_threads selected via pthreadpool_set_num_threads_to_use (as caffe2/ATen does for mobile inference), every physical worker with thread_number >= num_threads is aliased by the % num_threads indexing onto an active thread's range, and in the tid == 0 branch re-reads that range's range_start and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own threads[thread_number] range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when thread_number >= num_threads, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.

(2) Executor-borrowed threads leak num_active_threads_mutex and deadlock the pool. In condvar builds (PTHREADPOOL_USE_FUTEX=0, the Linux/Android/wasm configuration), wait_on_num_active_threads locks num_active_threads_mutex while the pool is idle, then an executor-borrowed worker returns PTHREADPOOL_NUM_ACTIVE_THREADS_DONE from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's signal_num_active_threads (inside pthreadpool_parallelize) and every other worker entering wait_on_num_active_threads, hanging the pool. This only affects pools created via pthreadpool_create_v2 with a real executor; the classic pthreadpool_create path never takes the branch. Fix: handle executor-borrowed threads in a noinline cold helper (return_thread_to_executor) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.

(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. cpu_flash_attention (op_sdpa_impl.h) parallelizes over query blocks via the threadpool, and each block calls cpublas::gemm -> sgemm_. When the optimized BLAS is OpenMP MKL (the libblas variant, fbsource//third-party/mkl:mkl_lp64_omp), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global __kmp_forkjoin_lock while reaping hidden-helper condvars -- surfacing as sgr_llm_tests LlmTest.TestTextPrefill intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with -DET_CPUBLAS_MKL_OMP exactly when it links OpenMP MKL (lib_defs.bzl), gating a SingleThreadedGemmGuard -- a thread-local mkl_set_num_threads_local(1) for its scope -- constructed at the top of cpu_flash_attention's per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (op_bmm/op_mm/op_linear) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; LlmTest.TestTextPrefill passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).

Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.

Reviewed By: jessiezheng123, shoumikhin

Differential Revision: D108226589

…orch SDPA under OpenMP MKL

Summary:
Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:

(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in `portable-api.c` reach into peer threads' ranges via `threads[(num_threads + thread_number - tid) % num_threads]`. When the pool is created with `max_num_threads` (e.g. hardware concurrency) but a parallelize runs with a smaller `num_threads` selected via `pthreadpool_set_num_threads_to_use` (as caffe2/ATen does for mobile inference), every physical worker with `thread_number >= num_threads` is aliased by the `% num_threads` indexing onto an active thread's range, and in the `tid == 0` branch re-reads that range's `range_start` and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own `threads[thread_number]` range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when `thread_number >= num_threads`, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.

(2) Executor-borrowed threads leak `num_active_threads_mutex` and deadlock the pool. In condvar builds (`PTHREADPOOL_USE_FUTEX=0`, the Linux/Android/wasm configuration), `wait_on_num_active_threads` locks `num_active_threads_mutex` while the pool is idle, then an executor-borrowed worker returns `PTHREADPOOL_NUM_ACTIVE_THREADS_DONE` from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's `signal_num_active_threads` (inside `pthreadpool_parallelize`) and every other worker entering `wait_on_num_active_threads`, hanging the pool. This only affects pools created via `pthreadpool_create_v2` with a real executor; the classic `pthreadpool_create` path never takes the branch. Fix: handle executor-borrowed threads in a `noinline cold` helper (`return_thread_to_executor`) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.

(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. `cpu_flash_attention` (`op_sdpa_impl.h`) parallelizes over query blocks via the threadpool, and each block calls `cpublas::gemm` -> `sgemm_`. When the optimized BLAS is OpenMP MKL (the `libblas` variant, `fbsource//third-party/mkl:mkl_lp64_omp`), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global `__kmp_forkjoin_lock` while reaping hidden-helper condvars -- surfacing as `sgr_llm_tests` `LlmTest.TestTextPrefill` intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with `-DET_CPUBLAS_MKL_OMP` exactly when it links OpenMP MKL (`lib_defs.bzl`), gating a `SingleThreadedGemmGuard` -- a thread-local `mkl_set_num_threads_local(1)` for its scope -- constructed at the top of `cpu_flash_attention`'s per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (`op_bmm`/`op_mm`/`op_linear`) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; `LlmTest.TestTextPrefill` passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).

Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.

Reviewed By: jessiezheng123, shoumikhin

Differential Revision: D108226589
@pytorch-bot

pytorch-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20267

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 24 New Failures

As of commit 0f67e68 with merge base 96a64ec (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 13, 2026
@meta-codesync

meta-codesync Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108226589.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant