Commit 0f67e68
Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL
Summary:
Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:
(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in `portable-api.c` reach into peer threads' ranges via `threads[(num_threads + thread_number - tid) % num_threads]`. When the pool is created with `max_num_threads` (e.g. hardware concurrency) but a parallelize runs with a smaller `num_threads` selected via `pthreadpool_set_num_threads_to_use` (as caffe2/ATen does for mobile inference), every physical worker with `thread_number >= num_threads` is aliased by the `% num_threads` indexing onto an active thread's range, and in the `tid == 0` branch re-reads that range's `range_start` and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own `threads[thread_number]` range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when `thread_number >= num_threads`, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.
(2) Executor-borrowed threads leak `num_active_threads_mutex` and deadlock the pool. In condvar builds (`PTHREADPOOL_USE_FUTEX=0`, the Linux/Android/wasm configuration), `wait_on_num_active_threads` locks `num_active_threads_mutex` while the pool is idle, then an executor-borrowed worker returns `PTHREADPOOL_NUM_ACTIVE_THREADS_DONE` from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's `signal_num_active_threads` (inside `pthreadpool_parallelize`) and every other worker entering `wait_on_num_active_threads`, hanging the pool. This only affects pools created via `pthreadpool_create_v2` with a real executor; the classic `pthreadpool_create` path never takes the branch. Fix: handle executor-borrowed threads in a `noinline cold` helper (`return_thread_to_executor`) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.
(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. `cpu_flash_attention` (`op_sdpa_impl.h`) parallelizes over query blocks via the threadpool, and each block calls `cpublas::gemm` -> `sgemm_`. When the optimized BLAS is OpenMP MKL (the `libblas` variant, `fbsource//third-party/mkl:mkl_lp64_omp`), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global `__kmp_forkjoin_lock` while reaping hidden-helper condvars -- surfacing as `sgr_llm_tests` `LlmTest.TestTextPrefill` intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with `-DET_CPUBLAS_MKL_OMP` exactly when it links OpenMP MKL (`lib_defs.bzl`), gating a `SingleThreadedGemmGuard` -- a thread-local `mkl_set_num_threads_local(1)` for its scope -- constructed at the top of `cpu_flash_attention`'s per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (`op_bmm`/`op_mm`/`op_linear`) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; `LlmTest.TestTextPrefill` passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).
Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.
Reviewed By: jessiezheng123, shoumikhin
Differential Revision: D1082265891 parent 96a64ec commit 0f67e68
4 files changed
Lines changed: 45 additions & 2 deletions
File tree
- extension/llm/custom_ops
- kernels/optimized
- blas
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
805 | 805 | | |
806 | 806 | | |
807 | 807 | | |
| 808 | + | |
| 809 | + | |
| 810 | + | |
808 | 811 | | |
809 | 812 | | |
810 | 813 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
26 | 33 | | |
27 | 34 | | |
28 | 35 | | |
29 | 36 | | |
30 | 37 | | |
31 | 38 | | |
32 | 39 | | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
33 | 53 | | |
34 | 54 | | |
35 | 55 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
26 | 43 | | |
27 | 44 | | |
28 | 45 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
175 | 175 | | |
176 | 176 | | |
177 | 177 | | |
178 | | - | |
| 178 | + | |
179 | 179 | | |
180 | 180 | | |
181 | 181 | | |
| |||
217 | 217 | | |
218 | 218 | | |
219 | 219 | | |
220 | | - | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
221 | 224 | | |
222 | 225 | | |
223 | 226 | | |
| |||
0 commit comments