
KEP-5726: TopologyManager CPU-attached NUMA filter option #5992

Open
fanzhangio wants to merge 1 commit into kubernetes:master from fanzhangio:restrict-to-cpu-numa-nodes


@fanzhangio

fanzhangio commented Apr 2, 2026

What type of PR is this?

/kind documentation
/kind cleanup
/kind kep

What this PR does / why we need it:

Adds a draft KEP for an alpha TopologyManager policy option:

  • restrict-to-cpu-numa-nodes

This option is intended to help kubelet handle large NUMA systems that expose many CPU-less NUMA nodes, such as NVIDIA Grace Blackwell / Vera Rubin-class platforms.

When enabled, TopologyManager computes an effective NUMA-node set containing only NUMA nodes with CPUs attached and propagates that same view to topology-aware hint providers.
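A minimal Go sketch of the filtering step, under stated assumptions: the input map from NUMA node ID to attached CPU IDs and the helpers `effectiveNUMANodes` and `candidateAffinities` are hypothetical and are not kubelet APIs. It also hints at why propagating the filtered view to hint providers matters: providers typically consider combinations of NUMA nodes when generating hints, so dropping CPU-less nodes shrinks that search space exponentially.

```go
package main

import (
	"fmt"
	"sort"
)

// effectiveNUMANodes returns, in ascending order, the IDs of NUMA nodes that
// have at least one CPU attached. cpusPerNode is a hypothetical input mapping
// each NUMA node ID to the CPU IDs it hosts (empty for CPU-less nodes).
func effectiveNUMANodes(cpusPerNode map[int][]int) []int {
	var nodes []int
	for id, cpus := range cpusPerNode {
		if len(cpus) > 0 {
			nodes = append(nodes, id)
		}
	}
	sort.Ints(nodes)
	return nodes
}

// candidateAffinities enumerates every non-empty subset of nodes, i.e. the
// space a hint provider may search when building NUMA affinity hints.
// Filtering first shrinks it from 2^N-1 to 2^M-1 subsets (M effective nodes).
func candidateAffinities(nodes []int) [][]int {
	var out [][]int
	for mask := 1; mask < 1<<len(nodes); mask++ {
		var subset []int
		for i, id := range nodes {
			if mask&(1<<i) != 0 {
				subset = append(subset, id)
			}
		}
		out = append(out, subset)
	}
	return out
}

func main() {
	// Hypothetical topology: nodes 0 and 1 have CPUs; nodes 2-5 are CPU-less.
	topology := map[int][]int{
		0: {0, 1, 2, 3},
		1: {4, 5, 6, 7},
		2: {}, 3: {}, 4: {}, 5: {},
	}
	eff := effectiveNUMANodes(topology)
	fmt.Println("effective NUMA nodes:", eff)
	fmt.Println("candidate affinities (all nodes):", (1<<len(topology))-1)
	fmt.Println("candidate affinities (effective):", len(candidateAffinities(eff)))
}
```

With the example topology this prints the effective set `[0 1]`, and the number of candidate affinities drops from 63 to 3.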

The proposal is intended to address the root issue discussed in:

This proposal also reworks and supersedes the earlier code-only approach discussed in:

Which issue(s) this PR fixes:

Part of kubernetes/kubernetes#135541

Special notes for your reviewer:

Main points of the proposal:

  • behavior is unchanged by default
  • the option is alpha and opt-in
  • the effective NUMA-node set is shared across TopologyManager, CPUManager, MemoryManager, and DeviceManager
  • DeviceManager projects raw device NUMA topology onto the same effective CPU-attached NUMA-node set using NUMA distance information
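The DeviceManager projection in the last bullet can be sketched as follows, assuming one plausible rule: a device already on a CPU-attached node keeps that node, and a device on a CPU-less node is mapped to whichever CPU-attached nodes are nearest by NUMA distance, keeping ties. The rule, the helper `projectDeviceNUMA`, and the example distance matrix are illustrative assumptions; the KEP text defines the actual behavior. On Linux, such distances are commonly read from `/sys/devices/system/node/nodeN/distance`.

```go
package main

import (
	"fmt"
	"sort"
)

// projectDeviceNUMA (hypothetical helper) maps a device's raw NUMA node onto
// the effective, CPU-attached NUMA-node set. If the raw node is already
// effective it is returned unchanged; otherwise every effective node at the
// minimum distance from it is returned. distance[i][j] is an assumed
// node-to-node distance matrix.
func projectDeviceNUMA(raw int, effective []int, distance [][]int) []int {
	for _, n := range effective {
		if n == raw {
			return []int{raw}
		}
	}
	best := -1
	var nearest []int
	for _, n := range effective {
		d := distance[raw][n]
		switch {
		case best < 0 || d < best:
			best, nearest = d, []int{n}
		case d == best:
			nearest = append(nearest, n) // keep ties
		}
	}
	sort.Ints(nearest)
	return nearest
}

func main() {
	effective := []int{0, 1}
	// Hypothetical 4-node distance matrix: nodes 2 and 3 are CPU-less;
	// node 2 is closest to node 0 and node 3 is closest to node 1.
	distance := [][]int{
		{10, 20, 40, 80},
		{20, 10, 80, 40},
		{40, 80, 10, 80},
		{80, 40, 80, 10},
	}
	for raw := 0; raw < 4; raw++ {
		fmt.Printf("device on node %d -> %v\n", raw, projectDeviceNUMA(raw, effective, distance))
	}
}
```

Keeping ties rather than picking one node avoids biasing a device toward an arbitrary CPU-attached node when the firmware reports equal distances.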

There is also related broader enhancement discussion in:

This KEP is intentionally scoped more narrowly around a kubelet TopologyManager option.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot added the kind/documentation, kind/cleanup, kind/kep, and sig/node labels Apr 2, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fanzhangio
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @fanzhangio!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot added the needs-ok-to-test label Apr 2, 2026
@k8s-ci-robot
Contributor

Hi @fanzhangio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the size/L and cncf-cla: yes labels Apr 2, 2026
@fanzhangio
Author

fanzhangio commented Apr 2, 2026

@dims @klueska

@dims
Member

dims commented Apr 2, 2026

/ok-to-test

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Apr 2, 2026
Signed-off-by: Fan Zhang <fanzhang@nvidia.com>
fanzhangio force-pushed the restrict-to-cpu-numa-nodes branch from 855131c to 8b8723b April 2, 2026 15:20
@fanzhangio
Author

@ffromani Could we add this KEP to the SIG Node agenda for discussion? I'd be happy to present it in the meeting (or follow up asynchronously over email if that's easier). Thanks.

@ffromani
Contributor

ffromani commented Apr 7, 2026


Yes, it is 1.37 material, but the 1.37 cycle has not begun yet (we haven't finished 1.36). That said, you can present to SIG Node earlier if you wish.


4 participants