bugfix: resolve cross-NUMA spawn worker isolation issues #1776
asr-sheep1 wants to merge 3 commits into
Code Review
This pull request introduces a new global flag spawn_worker_dir to configure the directory of the spawn_worker binary, and adds reset_inherited_numa_binding() to reset inherited CPU affinity and NUMA memory policies before rebinding. It also ensures PCI bus IDs are converted to lowercase for correct sysfs path matching. Feedback highlights three key issues: a potential failure in sched_setaffinity if numa_num_possible_cpus() returns a non-positive value, a style guide violation where static_cast<char> should be used for the std::tolower conversion, and a path mismatch where the directory spawn_worker_dir is assigned directly to options.spawn_worker_path instead of appending the executable name.
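The lowercase conversion mentioned in the summary can be sketched as follows. This is only an illustration, not the code in the PR: the helper names `to_lower_bus_id` and `numa_node_path` are hypothetical, and the example bus ID is made up. It assumes the standard Linux sysfs layout `/sys/bus/pci/devices/<bus_id>/numa_node`, where device directory names use lowercase hex digits.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lowercase a PCI bus ID such as "0000:3B:00.0".
// The character is taken as unsigned char before std::tolower (passing a
// negative char is undefined behavior), and the int result is converted
// back with an explicit static_cast<char>.
std::string to_lower_bus_id(std::string bus_id) {
  std::transform(bus_id.begin(), bus_id.end(), bus_id.begin(),
                 [](unsigned char c) {
                   return static_cast<char>(std::tolower(c));
                 });
  return bus_id;
}

// Hypothetical helper: build the sysfs file that reports the NUMA node
// of a PCI device. Sysfs directory names are lowercase, so an uppercase
// bus ID would not match without the conversion.
std::string numa_node_path(const std::string& bus_id) {
  return "/sys/bus/pci/devices/" + to_lower_bus_id(bus_id) + "/numa_node";
}
```

With these helpers, `numa_node_path("0000:3B:00.0")` yields `/sys/bus/pci/devices/0000:3b:00.0/numa_node`.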
  options.draft_devices(draft_devices);
}

options.spawn_worker_path(execution_config.spawn_worker_dir());
There is a mismatch between spawn_worker_dir (which is defined as a directory) and options.spawn_worker_path (which expects the full path to the executable binary). If spawn_worker_dir is non-empty, we should append /spawn_worker to form the correct executable path; otherwise, passing a directory path directly will cause the spawn operation to fail.
if (execution_config.spawn_worker_dir().empty()) {
  options.spawn_worker_path("");
} else {
  options.spawn_worker_path(execution_config.spawn_worker_dir() + "/spawn_worker");
}
options.spawn_worker_path() is treated as a directory in the subsequent code (see worker_server.cpp:285), where the /spawn_worker suffix is appended automatically.
Although the name is somewhat misleading, its intended semantics are "a directory" rather than "the full executable path". Appending the suffix when assigning the value would cause it to be appended twice, resulting in a startup failure.
Renaming it would require additional changes to the existing variable naming and related code. To minimize the scope of this PR, the existing naming is preserved.
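To make the double-append concern concrete, here is a small sketch. It does not reproduce worker_server.cpp; `resolve_spawn_worker_binary` is a hypothetical stand-in for whatever the consumer does with the value, and the directory `/opt/xllm/bin` is an invented example.

```cpp
#include <string>

// Hypothetical stand-in for the consumer side: it receives a DIRECTORY
// and appends the executable name itself.
std::string resolve_spawn_worker_binary(const std::string& spawn_worker_dir) {
  return spawn_worker_dir + "/spawn_worker";
}

// Caller passes the directory unchanged: the suffix is appended once.
std::string correct_path() {
  return resolve_spawn_worker_binary("/opt/xllm/bin");
}

// Caller appends the suffix as well: the suffix ends up in the path twice.
std::string doubled_path() {
  return resolve_spawn_worker_binary(std::string("/opt/xllm/bin") + "/spawn_worker");
}
```

Under these assumptions, `correct_path()` returns `/opt/xllm/bin/spawn_worker`, while `doubled_path()` returns `/opt/xllm/bin/spawn_worker/spawn_worker`, which is the startup failure described in the reply.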
  options.draft_devices(draft_devices);
}

options.spawn_worker_path(execution_config.spawn_worker_dir());
The spawn worker should only be used in offline inference scenarios (the Python interface) and should not appear in xllm.cpp, which is the entry point for online inference. I don't see why this gflag was added.
3cc0890 to 641e131
Description
This PR fixes several issues in the existing cross-NUMA spawn worker implementation on CUDA/MLU/DCU, including:
Fix PCI bus_id case mismatch for NUMA detection (observed on CUDA).
Initialize spawn_worker_dir for the xllm server binary and add the --spawn_worker_dir option.
Reset inherited CPU affinity and NUMA memory policy before binding the spawned worker to the target NUMA node.
(This change saves the original CPU affinity before engine NUMA binding, clears the memory policy inherited by spawned workers, restores the saved affinity in each spawned worker, and then binds the worker to its target NUMA node.)
These fixes improve the correctness of spawn worker initialization and NUMA binding. While no fatal issues were observed in small-scale model testing, they eliminate several potential issues in cross-NUMA deployments.
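The save, clear, and restore sequence described above might look roughly like the sketch below. It is not the PR's `reset_inherited_numa_binding()`: the function names are hypothetical, it uses raw Linux calls (`sched_getaffinity`, `sched_setaffinity`, and the `set_mempolicy` system call) instead of libnuma, and it leaves out the final step of binding to the target NUMA node. It is Linux-only and assumes glibc with `_GNU_SOURCE` (the default for g++).

```cpp
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: record the calling thread's CPU affinity so a
// spawned worker can restore it later.
bool save_affinity(cpu_set_t* saved) {
  CPU_ZERO(saved);
  return sched_getaffinity(0, sizeof(cpu_set_t), saved) == 0;
}

// Hypothetical helper: restore a previously saved affinity mask.
// An empty mask is rejected up front, because sched_setaffinity fails
// when the mask contains no usable CPU.
bool restore_affinity(const cpu_set_t& saved) {
  if (CPU_COUNT(&saved) <= 0) {
    return false;
  }
  return sched_setaffinity(0, sizeof(cpu_set_t), &saved) == 0;
}

// Hypothetical helper: drop any inherited NUMA memory policy by
// switching the calling thread back to the default policy.
// Mode 0 is MPOL_DEFAULT; this call may be refused in sandboxed
// environments, so callers should check the result.
bool reset_memory_policy() {
  return syscall(SYS_set_mempolicy, 0, nullptr, 0UL) == 0;
}
```

A spawned worker would call `reset_memory_policy()` and `restore_affinity(saved)` before applying its own NUMA binding. The empty-mask guard plays the same role as checking for a non-positive CPU count before building the mask, which is the concern raised in the review summary.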
Related Issues
Change Type
Pull Request Checklist
Thank you for contributing to xLLM. Before requesting review, please make sure the following items are complete.
PR Title and Commit Messages
<type>: <subject>.
Pre-commit Checks
pre-commit by running pip install pre-commit or an equivalent command.
pre-commit install.
pre-commit run --all-files and fixed any reported issues.
Self Review
.agents/skills/code-review/references/custom-code-style.md, especially code written or assisted by AI.
main branch.
Build and Test Coverage
python setup.py build test has passed on a CUDA machine.
python setup.py build test has passed on an NPU machine.
python setup.py build test has passed on an MLU machine.
Reviewer Notes