
refactor(agent): Pin container memory to NUMA-local node via CpusetMems#11222

Open
rapsealk wants to merge 6 commits into main from perf/11217-numa-cpuset-mems

Conversation

Member

@rapsealk rapsealk commented Apr 22, 2026

Closes #11217
Refs #11216

Summary

  • Set HostConfig.CpusetMems to the NUMA node ID when all allocated CPU cores are on the same node.
  • Omit CpusetMems for multi-node allocations so the kernel's default NUMA memory placement policy applies.
  • Short-circuit on libnuma.num_nodes() <= 1 so non-NUMA Linux hosts and macOS / WSL / Docker Desktop don't get CpusetMems="0" pinned on every container.
  • Use per-allocated-core libnuma.node_of_cpu() instead of a host-wide list_devices() scan.

Why

Before this PR, the NUMA memory-pinning code was commented out, so CPU-only workloads on NUMA hosts could pay 2-3× remote-memory access latency. Part of the agent-runtime performance epic #11216.

Test plan

  • pants check src/ai/backend/agent/docker:: and pants test tests/unit/agent:: pass
  • On a NUMA host, verify container HostConfig.CpusetMems reflects the allocation node
  • On a single-socket / non-NUMA host, verify CpusetMems is absent
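The two host checks above can be done with `docker inspect` (the container reference below is a placeholder), cross-checked against `numactl`:

```shell
# Empty output means CpusetMems was omitted (multi-node allocation or non-NUMA host);
# a node ID like "0" means container memory is pinned NUMA-local.
docker inspect --format '{{.HostConfig.CpusetMems}}' <container-id>

# Cross-check which NUMA node each allocated core belongs to.
numactl --hardware
```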

Set `CpusetMems` in Docker HostConfig when all allocated CPU cores reside
on a single NUMA node, so container memory is pinned to the same node as
its CPUs. The key is omitted when the allocation spans multiple nodes or
when NUMA info is unavailable on the host.

Closes #11217

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk added this to the 26.5 milestone Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 04:25
@github-actions github-actions Bot added the size:S 10~30 LoC label Apr 22, 2026
@github-actions github-actions Bot added the comp:agent Related to Agent component label Apr 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds NUMA-aware memory pinning for CPU-only Docker containers by setting HostConfig.CpusetMems when the allocated CPU cores are all on the same NUMA node, and omitting it otherwise.

Changes:

  • Build HostConfig in CPUPlugin.generate_docker_args() and conditionally set CpusetMems when allocation is NUMA-local.
  • Add a changelog entry for the enhancement.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/docker/intrinsic.py | Detects allocated cores' NUMA node(s) and sets HostConfig.CpusetMems only for single-node allocations. |
| changes/11217.enhance.md | Documents the NUMA-local CpusetMems behavior change. |


Comment on lines +440 to +444
```python
devices = await self.list_devices()
core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
allocated_nodes: set[int] = set()
for core in cores:
    node = core_to_node.get(core)
```

Copilot AI Apr 22, 2026


generate_docker_args() now calls await self.list_devices() just to map allocated core IDs to NUMA nodes. list_devices() iterates all available cores and calls libnuma.node_of_cpu() for each, so this adds per-container startup overhead that scales with host CPU count. Consider computing the NUMA node only for the allocated cores (e.g., call libnuma.node_of_cpu(core) inside the loop), or cache a core_id -> numa_node mapping on the plugin instance and reuse it.

Suggested change

```diff
-devices = await self.list_devices()
-core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
 allocated_nodes: set[int] = set()
 for core in cores:
-    node = core_to_node.get(core)
+    node = libnuma.node_of_cpu(core)
```

Comment on lines +449 to +450
```python
# Pin memory to the NUMA node only when the allocation is fully node-local;
# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.
```

Copilot AI Apr 22, 2026


The comment says Docker does not expose a multi-node cpuset.mems equivalent via HostConfig, but HostConfig.CpusetMems accepts the same cpuset list/range syntax as cpuset.mems (e.g., "0,1"). If the intent is to only pin when node-local, reword this to reflect that it’s a deliberate policy choice (and why), not a Docker API limitation.

Suggested change

```diff
-# Pin memory to the NUMA node only when the allocation is fully node-local;
-# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.
+# Pin memory only when the CPU allocation is fully node-local.
+# For multi-node CPU allocations, intentionally leave CpusetMems unset
+# so Docker/kernel default NUMA memory placement policy can apply.
```

Comment on lines +436 to 455
```diff
+host_config: dict[str, Any] = {
+    "Cpus": len(cores),
+    "CpusetCpus": ",".join(sorted_core_ids),
+}
+devices = await self.list_devices()
+core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
+allocated_nodes: set[int] = set()
+for core in cores:
+    node = core_to_node.get(core)
+    if node is None or node < 0:
+        allocated_nodes.clear()
+        break
+    allocated_nodes.add(node)
+# Pin memory to the NUMA node only when the allocation is fully node-local;
+# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.
+if len(allocated_nodes) == 1:
+    host_config["CpusetMems"] = str(next(iter(allocated_nodes)))
 return {
-    "HostConfig": {
-        "Cpus": len(cores),
-        "CpusetCpus": ",".join(sorted_core_ids),
-        # 'CpusetMems': f'{resource_spec.numa_node}',
-    },
+    "HostConfig": host_config,
 }
```

Copilot AI Apr 22, 2026


This introduces new branching behavior for Docker HostConfig generation (set/omit CpusetMems based on NUMA locality), but there are no unit tests covering it. Add tests that mock CPUPlugin.list_devices() to verify: (1) single-node allocation sets HostConfig.CpusetMems to the node ID, (2) multi-node allocation omits it, and (3) unknown/negative node IDs omit it.

@rapsealk rapsealk changed the title enhance(agent): pin container memory to NUMA-local node via CpusetMems refactor(agent): Pin container memory to NUMA-local node via CpusetMems Apr 22, 2026
rapsealk and others added 2 commits April 22, 2026 14:11
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems
only when the CPU allocation is fully within a single NUMA node, and
omitting the key for multi-node or unknown/negative NUMA mappings.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:L 100~500 LoC and removed size:S 10~30 LoC labels Apr 22, 2026
rapsealk and others added 2 commits April 22, 2026 14:26
- Query libnuma.node_of_cpu() only for allocated cores instead of calling
  self.list_devices(), which iterates every core on the host on each
  container start. Drops per-start overhead from O(host_cores) to
  O(allocated_cores).
- Reword the CpusetMems comment to reflect that omission for multi-node
  allocations is a deliberate policy choice (defer to kernel default NUMA
  placement) rather than a Docker HostConfig API limitation.
- Update the NUMA-locality unit tests to patch libnuma.node_of_cpu
  directly, matching the new implementation seam.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbstone

Addresses review feedback on PR #11222:
- Skip CpusetMems entirely when libnuma reports NUMA unsupported or
  the host has a single node, so non-NUMA Linux hosts and macOS/WSL
  dev environments do not get CpusetMems="0" pinned on every container.
- Add a unit test covering the non-NUMA short-circuit branch.
- Remove the twin commented-out CpusetMems tombstone from
  src/ai/backend/agent/dummy/intrinsic.py.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk
Member Author

Review summary from an independent pass + resolutions pushed to this branch.

Verdict: ship with changes — addressed.

| Finding | Resolution |
| --- | --- |
| `libnuma.node_of_cpu` returns 0 on non-NUMA/non-Linux hosts, so every container would get `CpusetMems="0"` pinned | Short-circuit on `libnuma.num_nodes() > 1` before the per-core loop (commit 0eda461ec) |
| Perf: `list_devices()` iterates every host core just to build a lookup | Switched to per-core `libnuma.node_of_cpu(core)` (commit 64529b24a) |
| Comment claimed Docker HostConfig can't express multi-node `CpusetMems`, which is factually wrong | Reworded as deliberate policy: defer to kernel default NUMA placement for multi-node allocations (commit 64529b24a) |
| Twin tombstone in `dummy/intrinsic.py:148` | Removed (commit 0eda461ec) |
| No unit tests for the new branching | 4 tests added: single-node, multi-node, unknown/negative-node, non-NUMA short-circuit |

Scope stayed tight: only docker/intrinsic.py, dummy/intrinsic.py, and tests/unit/agent/test_docker_intrinsic.py touched. libnuma.node_of_cpu is a ctypes call into an in-memory lookup, so calling it synchronously inside generate_docker_args is fine per the agent's async rules. Scheduler default is AffinityPolicy.PREFER_SINGLE_NODE, so the common case hits the pin-to-node branch; multi-node correctly falls through to kernel default rather than forcing "0,1" across all nodes.

@rapsealk rapsealk requested review from a team and achimnol April 22, 2026 06:17
- Remove unused local_config + _docker mock from the NUMA-locality
  cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them.
- Replace the case_id parametrize column with pytest's `ids=` kwarg;
  pytest already labels failures via the parametrize id.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +440 to +456
```python
# Skip CpusetMems entirely when NUMA is unsupported (non-Linux hosts,
# Linux without libnuma.so, Docker Desktop, WSL, etc.) or when the host
# exposes a single node; libnuma.node_of_cpu would otherwise fall back
# to 0 and cause every container to be pinned to "CpusetMems": "0".
if libnuma.num_nodes() > 1:
    allocated_nodes: set[int] = set()
    for core in cores:
        node = libnuma.node_of_cpu(core)
        if node < 0:
            allocated_nodes.clear()
            break
        allocated_nodes.add(node)
    # Pin memory only when the CPU allocation is fully node-local.
    # For multi-node CPU allocations, intentionally leave CpusetMems unset
    # so Docker/kernel default NUMA memory placement policy can apply.
    if len(allocated_nodes) == 1:
        host_config["CpusetMems"] = str(next(iter(allocated_nodes)))
```
Member


How about extracting this logic into a separate private method?
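One possible shape for such a helper, as a sketch only (the method name and the `_LibnumaStub` below are hypothetical, added just to make the sketch self-contained; the real code would call the agent's `libnuma` wrapper directly):

```python
from typing import Optional, Sequence


class _LibnumaStub:
    """Stand-in for the agent's libnuma wrapper, only so this sketch runs standalone."""

    def num_nodes(self) -> int:
        return 2  # pretend the host exposes two NUMA nodes

    def node_of_cpu(self, core: int) -> int:
        return core // 4  # pretend each node holds four consecutive cores


libnuma = _LibnumaStub()


class CPUPluginSketch:
    def _numa_local_node_id(self, cores: Sequence[int]) -> Optional[str]:
        """Return the node ID to use as CpusetMems, or None to leave it unset."""
        if libnuma.num_nodes() <= 1:
            return None  # non-NUMA host: never pin
        nodes = {libnuma.node_of_cpu(core) for core in cores}
        if any(node < 0 for node in nodes) or len(nodes) != 1:
            return None  # unknown mapping or multi-node allocation: defer to kernel
        return str(nodes.pop())
```

`generate_docker_args()` would then shrink to a single conditional, e.g. `if (node := self._numa_local_node_id(cores)) is not None: host_config["CpusetMems"] = node`.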


```python
@staticmethod
@contextmanager
def _patch_node_of_cpu(
```
Member


Since this method is intended to reproduce specific test scenarios, it would be more readable to inject fixtures for each scenario and apply parametrization.
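A sketch of that shape with `ids=` and one tuple per scenario (the `cpuset_mems` stand-in is hypothetical, inlined only so the sketch is self-contained; the real tests would patch `libnuma.node_of_cpu` on the plugin's module):

```python
from typing import Callable, Optional

import pytest


def cpuset_mems(
    num_nodes: int,
    node_of_cpu: Callable[[int], int],
    cores: list[int],
) -> Optional[str]:
    """Stand-in for the branch under test: CpusetMems value, or None to omit it."""
    if num_nodes <= 1:
        return None
    nodes = {node_of_cpu(core) for core in cores}
    if any(node < 0 for node in nodes) or len(nodes) != 1:
        return None
    return str(nodes.pop())


@pytest.mark.parametrize(
    ("num_nodes", "node_map", "cores", "expected"),
    [
        (2, {0: 0, 1: 0}, [0, 1], "0"),   # fully node-local: pin
        (2, {0: 0, 1: 1}, [0, 1], None),  # spans two nodes: omit
        (2, {0: -1}, [0], None),          # unknown mapping: omit
        (1, {0: 0}, [0], None),           # non-NUMA host: short-circuit
    ],
    ids=["single_node", "multi_node", "unknown_node", "non_numa"],
)
def test_cpuset_mems(num_nodes, node_map, cores, expected):
    assert cpuset_mems(num_nodes, node_map.__getitem__, cores) == expected
```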


Labels

comp:agent Related to Agent component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable CpusetMems when CPU allocation is NUMA-local

3 participants