docs: RFC to introduce node replacement strategies during drift, starting with optionally not requiring replacements #2906

Open
vaietc wants to merge 5 commits into kubernetes-sigs:main from pinterest:rfcs/terminate-before-create-static-capacity

Conversation

@vaietc

@vaietc vaietc commented Mar 11, 2026

Fixes #2905

Description

This change adds a small design for supporting different node replacement strategies during drift resolution, starting with not requiring replacements at all. The design attempts to extend the API such that other replacement strategies can be introduced over time.

How was this change tested?

Docs update only so far, no code yet.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 11, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 11, 2026
@k8s-ci-robot
Contributor

Hi @vaietc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vaietc
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from tallaxes March 11, 2026 07:55
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 11, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 11, 2026
@vaietc
Author

vaietc commented Mar 11, 2026

@DerekFrank as discussed at the working group meeting, I moved the issue over from the AWS provider to the main Karpenter project. I've attached an early RFC for review and would appreciate any early feedback :)

@DerekFrank
Collaborator

Summarizing notes from working group:

  • How do we handle budgets for static and dynamic capacity?
  • Are there significant blockers to support dynamic capacity out of the gate?
  • Do we want to be able to set different replacement policies for different types of disruption? (ex: Drift vs Consolidation), and how does that API look?
  • Naming of the policy? Is it more accurately TerminateThenReplace or does DoNotCreateReplacement more accurately capture the setting?

@vaietc
Author

vaietc commented Apr 6, 2026

@DerekFrank Updated the proposal to help answer the questions raised, summarizing here:

How do we handle budgets for static and dynamic capacity?

I've proposed a small tweak to the current disruption budget calculation: Terminating and Initializing nodes are counted both in the NodePool's total node count and in its count of 'disrupting' nodes. I believe this change more accurately reflects the spirit of the disruption budget, which is to track all disrupted nodes, not just the ones disrupted by Karpenter.
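To make the proposed accounting concrete, here is a minimal sketch in Go. It is not Karpenter's actual code: the `NodeState` type, the `allowedDisruptions` function, and the choice to express the budget as a percentage that rounds up are all assumptions made for illustration. The only part taken from the proposal is that Initializing and Terminating nodes count toward both the total and the disrupting set.

```go
package main

import "fmt"

// NodeState is a hypothetical, simplified node lifecycle state.
type NodeState int

const (
	Ready NodeState = iota
	Initializing
	Terminating
)

// allowedDisruptions returns how many more nodes may be disrupted.
// Every node, whatever its state, counts toward the total; Initializing
// and Terminating nodes also count as already disrupting.
// Assumption: the budget is a percentage of the total, rounded up.
func allowedDisruptions(nodes []NodeState, budgetPercent int) int {
	total := len(nodes)
	disrupting := 0
	for _, s := range nodes {
		if s == Initializing || s == Terminating {
			disrupting++
		}
	}
	// Integer ceiling of total * budgetPercent / 100.
	limit := (total*budgetPercent + 99) / 100
	if allowed := limit - disrupting; allowed > 0 {
		return allowed
	}
	return 0
}

func main() {
	nodes := []NodeState{
		Ready, Ready, Ready, Ready, Ready, Ready, Ready, Ready,
		Initializing, Terminating,
	}
	// 10 nodes at 20% gives a limit of 2; both slots are already
	// taken by the Initializing and Terminating nodes.
	fmt.Println(allowedDisruptions(nodes, 20)) // prints 0
}
```

Under this sketch, a node terminated without a replacement keeps consuming budget while it is Terminating, and a node that joins later consumes budget while it is Initializing, so the budget still bounds how much capacity is unavailable at once.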

Are there significant blockers to support dynamic capacity out of the gate?

Did a quick read of the code and I think the answer is no. Since the disruption budget calculation is the same for both static and dynamic capacity, that change can be made in one common place.

Do we want to be able to set different replacement policies for different types of disruption? (ex: Drift vs Consolidation), and how does that API look?

I think it's a bit tricky to reason about for Consolidation, since the decision to consolidate is based on one of the following:

  1. Node is empty (no replacement needed)
  2. Node's pods can be housed in existing nodes (no replacement needed)
  3. A different, lower-cost node can house the pods on that node (or on more than one node, in the case of multi-node consolidation). In this case, it's probably always wise to spin up the replacement node first to avoid longer-term disruption for pods.

As a result, I propose we don't touch consolidation for now but we can control it later if we want with changes to the API. For now, I'd like to have spec.disruption.driftResolutionPolicy as described in the pull request and have it only apply to drift.
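The three consolidation cases above can be sketched as a small classification, which shows why a replacement policy has little to decide there. The type and function names below are hypothetical and do not come from the Karpenter codebase.

```go
package main

import "fmt"

// ConsolidationCase is a hypothetical label for why a node is consolidated.
type ConsolidationCase int

const (
	EmptyNode              ConsolidationCase = iota // case 1: nothing runs on the node
	PodsFitOnExistingNodes                          // case 2: pods fit on existing nodes
	CheaperReplacement                              // case 3: a lower-cost node can take the pods
)

// needsReplacementFirst reports whether a replacement node should be
// launched before the old node is terminated. Only case 3 involves a
// replacement at all, and there it is launched first to limit pod disruption.
func needsReplacementFirst(c ConsolidationCase) bool {
	return c == CheaperReplacement
}

func main() {
	cases := []ConsolidationCase{EmptyNode, PodsFitOnExistingNodes, CheaperReplacement}
	for _, c := range cases {
		fmt.Printf("case %d: replacement first = %v\n", c+1, needsReplacementFirst(c))
	}
}
```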

Naming of the policy? Is it more accurately TerminateThenReplace or does DoNotCreateReplacement more accurately capture the setting?

I renamed it to 'Terminate', since that is the only action the disruption controller takes and it accurately captures the setting.


Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support disruption policies that allow terminate-before-create for fixed capacity use cases

3 participants