bugfix: resolve cross-NUMA spawn worker isolation issues by asr-sheep1 · Pull Request #1776 · jd-opensource/xllm

asr-sheep1 · 2026-06-18T03:58:37Z

Description

This PR fixes several issues in the existing cross-NUMA spawn worker implementation on CUDA/MLU/DCU, including:

Fix PCI bus_id case mismatch for NUMA detection (observed on CUDA).
Initialize spawn_worker_dir for the xllm server binary and add the --spawn_worker_dir option.
Reset inherited CPU affinity and NUMA memory policy before binding the spawned worker to the target NUMA node.
(This change saves the original CPU affinity before engine NUMA binding, clears the memory policy inherited by spawned workers, restores the saved affinity in each spawned worker, and then binds the worker to its target NUMA node.)

These fixes improve the correctness of spawn worker initialization and NUMA binding. While no fatal issues were observed in small-scale model testing, they eliminate several potential issues in cross-NUMA deployments.

Related Issues

Change Type

Pull Request Checklist

Thank you for contributing to xLLM. Before requesting review, please make sure the following items are complete.

PR Title and Commit Messages

The PR title and each commit message follow the xLLM commit format: <type>: <subject>.

Allowed types: feat, bugfix, docs, test, refactor, chore, style, revert, perf, model, build, release.
The subject should use clear English, start with a verb, include at least 4 words, and end with ..

Pre-commit Checks

I have installed pre-commit by running pip install pre-commit or an equivalent command.
I have installed the hooks with pre-commit install.
I have run pre-commit run --all-files and fixed any reported issues.

If you are unsure how to set up pre-commit, see the pre-commit documentation.

Self Review

I have self-reviewed the code according to .agents/skills/code-review/references/custom-code-style.md, especially code written or assisted by AI.
I have rebased this PR onto the latest main branch.

Build and Test Coverage

Tests have been added or updated as needed.
CUDA: python setup.py build test has passed on a CUDA machine.
NPU: python setup.py build test has passed on an NPU machine.
MLU: python setup.py build test has passed on an MLU machine.

Reviewer Notes

gemini-code-assist

Code Review

This pull request introduces a new global flag spawn_worker_dir to configure the directory of the spawn_worker binary, and adds reset_inherited_numa_binding() to reset inherited CPU affinity and NUMA memory policies before rebinding. It also ensures PCI bus IDs are converted to lowercase for correct sysfs path matching. Feedback highlights three key issues: a potential failure in sched_setaffinity if numa_num_possible_cpus() returns a non-positive value, a style guide violation where static_cast<char> should be used for the std::tolower conversion, and a path mismatch where the directory spawn_worker_dir is assigned directly to options.spawn_worker_path instead of appending the executable name.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-18T04:00:14Z

    options.draft_devices(draft_devices);
  }

+  options.spawn_worker_path(execution_config.spawn_worker_dir());


There is a mismatch between spawn_worker_dir (which is defined as a directory) and options.spawn_worker_path (which expects the full path to the executable binary). If spawn_worker_dir is non-empty, we should append /spawn_worker to form the correct executable path; otherwise, passing a directory path directly will cause the spawn operation to fail.

if (execution_config.spawn_worker_dir().empty()) { options.spawn_worker_path(""); } else { options.spawn_worker_path(execution_config.spawn_worker_dir() + "/spawn_worker"); }

options.spawn_worker_path() is treated as a directory in the subsequent code (see worker_server.cpp:285), where the /spawn_worker suffix is appended automatically.

Although the name is somewhat misleading, its intended semantics are "a directory" rather than "the full executable path". Appending the suffix when assigning the value would cause it to be appended twice, resulting in a startup failure.

Renaming it would require additional changes to the existing variable naming and related code. To minimize the scope of this PR, the existing naming is preserved.

weizhehuang0827 · 2026-06-22T01:50:32Z

    options.draft_devices(draft_devices);
  }

+  options.spawn_worker_path(execution_config.spawn_worker_dir());


The spawn worker should only be used in offline inference scenarios (Python interface) and should not appear in xllm.cpp (which is the entry point for online inference). I don't know why this gflags is added.

asr-sheep1 requested review from Clement-Wang26, DongheJin, DragonFive, JimHsiung, Kang-Meng, RobbieLeung, XuZhang99, liujinguang0125, liutongxuan, walsonyang, xiao-yu-chen, yingxudeng, yq33victor and zhang-minchao as code owners June 18, 2026 03:58

gemini-code-assist Bot reviewed Jun 18, 2026

View reviewed changes

asr-sheep1 changed the title ~~fix: resolve cross-NUMA spawn worker isolation issues~~ bugfix: resolve cross-NUMA spawn worker isolation issues Jun 18, 2026

asr-sheep1 mentioned this pull request Jun 18, 2026

bugfix: enable spawn worker on dcu device. #1780

Merged

17 tasks

weizhehuang0827 reviewed Jun 22, 2026

View reviewed changes

asr-sheep1 added 3 commits June 23, 2026 10:49

fix: enable cross-NUMA spawn worker isolation.

a71f229

style: cast std::tolower result to char in numa_utils.

1798d6a

bugfix: preserve spawn worker cpu affinity before numa binding

641e131

asr-sheep1 force-pushed the bugfix/numa-spawn-worker branch from 3cc0890 to 641e131 Compare June 23, 2026 02:50

asr-sheep1 requested a review from weizhehuang0827 June 23, 2026 03:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bugfix: resolve cross-NUMA spawn worker isolation issues#1776

bugfix: resolve cross-NUMA spawn worker isolation issues#1776
asr-sheep1 wants to merge 3 commits into
jd-opensource:mainfrom
asr-sheep1:bugfix/numa-spawn-worker

asr-sheep1 commented Jun 18, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

asr-sheep1 Jun 18, 2026

Uh oh!

weizhehuang0827 Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

asr-sheep1 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Change Type

Pull Request Checklist

PR Title and Commit Messages

Pre-commit Checks

Self Review

Build and Test Coverage

Reviewer Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

asr-sheep1 Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

weizhehuang0827 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

asr-sheep1 commented Jun 18, 2026 •

edited

Loading