Skip to content

bugfix: resolve cross-NUMA spawn worker isolation issues#1776

Open
asr-sheep1 wants to merge 3 commits into
jd-opensource:mainfrom
asr-sheep1:bugfix/numa-spawn-worker
Open

bugfix: resolve cross-NUMA spawn worker isolation issues#1776
asr-sheep1 wants to merge 3 commits into
jd-opensource:mainfrom
asr-sheep1:bugfix/numa-spawn-worker

Conversation

@asr-sheep1

@asr-sheep1 asr-sheep1 commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR fixes several issues in the existing cross-NUMA spawn worker implementation on CUDA/MLU/DCU, including:

  • Fix PCI bus_id case mismatch for NUMA detection (observed on CUDA).

  • Initialize spawn_worker_dir for the xllm server binary and add the --spawn_worker_dir option.

  • Reset inherited CPU affinity and NUMA memory policy before binding the spawned worker to the target NUMA node.
    (This change saves the original CPU affinity before engine NUMA binding, clears the memory policy inherited by spawned workers, restores the saved affinity in each spawned worker, and then binds the worker to its target NUMA node.)

These fixes improve the correctness of spawn worker initialization and NUMA binding. While no fatal issues were observed in small-scale model testing, they eliminate several potential issues in cross-NUMA deployments.

Related Issues

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactor
  • Documentation
  • Test
  • Build or CI

Pull Request Checklist

Thank you for contributing to xLLM. Before requesting review, please make sure the following items are complete.

PR Title and Commit Messages

  • The PR title and each commit message follow the xLLM commit format: <type>: <subject>.

Allowed types: feat, bugfix, docs, test, refactor, chore, style, revert, perf, model, build, release.
The subject should use clear English, start with a verb, include at least 4 words, and end with ..

Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit or an equivalent command.
  • I have installed the hooks with pre-commit install.
  • I have run pre-commit run --all-files and fixed any reported issues.

If you are unsure how to set up pre-commit, see the pre-commit documentation.

Self Review

  • I have self-reviewed the code according to .agents/skills/code-review/references/custom-code-style.md, especially code written or assisted by AI.
  • I have rebased this PR onto the latest main branch.

Build and Test Coverage

  • Tests have been added or updated as needed.
  • CUDA: python setup.py build test has passed on a CUDA machine.
  • NPU: python setup.py build test has passed on an NPU machine.
  • MLU: python setup.py build test has passed on an MLU machine.

Reviewer Notes

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new global flag spawn_worker_dir to configure the directory of the spawn_worker binary, and adds reset_inherited_numa_binding() to reset inherited CPU affinity and NUMA memory policies before rebinding. It also ensures PCI bus IDs are converted to lowercase for correct sysfs path matching. Feedback highlights three key issues: a potential failure in sched_setaffinity if numa_num_possible_cpus() returns a non-positive value, a style guide violation where static_cast<char> should be used for the std::tolower conversion, and a path mismatch where the directory spawn_worker_dir is assigned directly to options.spawn_worker_path instead of appending the executable name.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread xllm/core/platform/numa_utils.cpp Outdated
Comment thread xllm/core/platform/numa_utils.cpp Outdated
Comment thread xllm/xllm.cpp
options.draft_devices(draft_devices);
}

options.spawn_worker_path(execution_config.spawn_worker_dir());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a mismatch between spawn_worker_dir (which is defined as a directory) and options.spawn_worker_path (which expects the full path to the executable binary). If spawn_worker_dir is non-empty, we should append /spawn_worker to form the correct executable path; otherwise, passing a directory path directly will cause the spawn operation to fail.

  if (execution_config.spawn_worker_dir().empty()) {
    options.spawn_worker_path("");
  } else {
    options.spawn_worker_path(execution_config.spawn_worker_dir() + "/spawn_worker");
  }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options.spawn_worker_path() is treated as a directory in the subsequent code (see worker_server.cpp:285), where the /spawn_worker suffix is appended automatically.

Although the name is somewhat misleading, its intended semantics are "a directory" rather than "the full executable path". Appending the suffix when assigning the value would cause it to be appended twice, resulting in a startup failure.

Renaming it would require additional changes to the existing variable naming and related code. To minimize the scope of this PR, the existing naming is preserved.

@asr-sheep1 asr-sheep1 changed the title fix: resolve cross-NUMA spawn worker isolation issues bugfix: resolve cross-NUMA spawn worker isolation issues Jun 18, 2026
Comment thread xllm/xllm.cpp
options.draft_devices(draft_devices);
}

options.spawn_worker_path(execution_config.spawn_worker_dir());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spawn worker should only be used in offline inference scenarios (Python interface) and should not appear in xllm.cpp (which is the entry point for online inference). I don't know why this gflags is added.

@asr-sheep1 asr-sheep1 force-pushed the bugfix/numa-spawn-worker branch from 3cc0890 to 641e131 Compare June 23, 2026 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants