docs: Balanced consolidation policy RFC#2942
jamesmt-aws wants to merge 3 commits into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jamesmt-aws. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Hi @jamesmt-aws. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed 7b4dff4 to 592c2bd
/easycla
designs/balanced-consolidation.md
Outdated
> This exists in all consolidation modes. The cost threshold concentrates the remaining moves onto higher-impact candidates. The system self-corrects: a nearly-empty replacement scores as a trivial DELETE next cycle. Cascades terminate because each round has strictly fewer displaced nodes.
> Configuring kube-scheduler with `MostAllocated` scoring reduces divergence. The [Workload-Aware Scheduling proposal](https://docs.google.com/document/d/1mPYqS4cFmsHPaVQDKyCz7-TKyWNJGjTaZQD3Umkvmgk) (Kepka, Feb 2026) addresses this more directly.
Btw this doc isn't accessible. Presumably it's private or possibly a bad link?
Let me ask around for an updated link that I can send you on Kubernetes Slack. I think there was a lot of chatter at KubeCon about the path forward, so I don't really need the link; that will come up in other ways.
Pull Request Test Coverage Report for Build 23920136107

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
> @@ -0,0 +1,442 @@
> # Balanced Consolidation: Scoring Moves by Savings and Disruption
I am super excited about this approach!
designs/balanced-consolidation.md
Outdated
```yaml
spec:
  disruption:
    consolidationPolicy: Balanced
    consolidationThreshold: 2.0
```
Thoughts on exposing different enum values that codify these numbers, rather than the number itself? Also, note that JSON cannot represent floats in a stable way.
Good call on the float; the formally motivated values are all integers. k=1 is break-even (deletes only, no replaces). k=2 is where within-family replaces become viable, with a maximum of 4 steps to stasis. k=3 adds 8 cross-family pairs with the same 4-step maximum. At k=4 churn chains appear, the maximum jumps to 9 steps, and the formal analysis starts arguing against k=4 (or any higher value). So the natural set is {1, 2, 3} and I'll restrict the input type.

I thought about named presets, but Karpenter doesn't have an ordinal enum pattern today, and picking names that age well is hard: "Conservative/Balanced/Aggressive" reuses "Balanced", which is already the policy name. I think the integer is simpler, but we can do whatever you and the rest of the community want to do here.
From an API design perspective, I am not sure that we need both knobs.
Right now, WhenEmpty maps to k=0 and WhenEmptyOrUnderutilized to k=INF.
I see two approaches:
- Expand the enum that aliases other K values
- Expose a new
consolidationThresholdthat works when consolidationPolicy:WhenEmptyOrUnderutilizedand simply changes the threshold.
cc: @jmdeal, @DerekFrank curious to your thoughts.
Yeah, I think your idea is better. The doc as written assumes (maybe defensively) that we can't justify k=2 uniquely for customers. If that's right, then we need k=3 and k=4, and then we might as well just make this parameter adjust the behavior of WhenEmptyOrUnderutilized. If we can uniquely justify k=2, then we can have a new enum.

I'm leaning towards making k a parameter of WhenEmptyOrUnderutilized based purely on this RFC. @jmdeal and @DerekFrank, I'm happy to do whatever you two think is sensible; I'll try to catch up with you two today.
> 1. **Pod deletion cost** ([`controller.kubernetes.io/pod-deletion-cost`](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost)), divided by 2^27, range -16 to +16. Default 0. The ReplicaSet controller uses this for scale-down ordering; Karpenter reuses it as a disruption signal.
> 2. **Pod priority**, divided by 2^25, range -64 to +30. Default 0. Higher-priority pods increase their node's disruption cost.
>
> With neither set, per-pod disruption cost is 1.0. `EvictionCost` clamps to [-10, 10]. The scoring path clamps negative values to 0 via `max(0, EvictionCost(pod))` in the per-node formula (see [NodePool Totals](#nodepool-totals)). Other consumers of `EvictionCost` (eviction ordering) still see negatives. Scoring range per pod: [0, 10].
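The per-pod cost above can be sketched as follows. This is an illustrative Python model, not the Go implementation, and the function names are hypothetical; the 2^27 and 2^25 divisors, the 1.0 default, the [-10, 10] clamp, and the `max(0, ...)` scoring clamp come from the excerpt.

```python
def eviction_cost(deletion_cost: int = 0, priority: int = 0) -> float:
    """Combined per-pod signal; 1.0 with neither signal set."""
    cost = 1.0
    cost += deletion_cost / 2**27   # pod-deletion-cost, scaled to [-16, +16]
    cost += priority / 2**25        # pod priority, scaled to [-64, +30]
    return max(-10.0, min(10.0, cost))  # EvictionCost clamps to [-10, 10]

def scoring_cost(deletion_cost: int = 0, priority: int = 0) -> float:
    """Scoring path: negatives clamp to 0, so the per-pod range is [0, 10]."""
    return max(0.0, eviction_cost(deletion_cost, priority))
```

With both signals at their defaults a pod contributes 1.0; a large negative pod-deletion-cost drives its scoring contribution to 0, which is the mechanism discussed in the thread below.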
How does a user communicate that disrupting a pod is free?
If users want their pods to have zero disruption cost, they can set pod-deletion-cost to a large negative value. That drives EvictionCost negative, and the disruption cost in this RFC clamps the result to 0, so the pod contributes nothing to the node's total disruption cost.

The node still has a disruption cost of 1: nodes have a disruption cost independent of their pods (cordoning, draining, API calls, replacement latency). We haven't modeled this precisely and cost=1 is a placeholder; I'll make that clearer in the design. For today it eliminates a divide-by-zero that comes up if node disruption cost is zero. It could be larger if we wanted, but we don't need that yet.

So a node where every pod has negative deletion cost scores the same as an empty node: cheap to disrupt, not free. If a user wants truly zero-friction disruption at a NodePool level, they want WhenEmptyOrUnderutilized (or k=+inf).
Got it -- so in our docs we would recommend setting it to -1 if you want to treat it as free.
Force-pushed 6e41cd8 to 63468c3
A new `consolidationPolicy: Balanced` that scores each consolidation move by comparing savings and disruption as fractions of NodePool totals. Moves where disruption outweighs savings are rejected.

- `consolidationThreshold` (integer, 1-3, default 2): a move passes when its disruption fraction is at most k times its savings fraction
- Per-node disruption cost of 1.0 eliminates division-by-zero edge cases
- Score-based ranking replaces disruption-only ranking when budget limits move count
- Exhaustive verification across c7i/m7i/r7i confirms k=2 is the smallest value that makes within-family REPLACEs viable
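The pass rule in the summary can be sketched as a small Python check. The function name and parameters are hypothetical; the fraction-of-totals comparison and the k semantics are as stated in the summary.

```python
def move_passes(savings: float, disruption: float,
                total_savings: float, total_disruption: float,
                k: int = 2) -> bool:
    """A move passes when its disruption, as a fraction of NodePool total
    disruption, is at most k times its savings as a fraction of NodePool
    total savings. k is consolidationThreshold (integer 1-3, default 2)."""
    savings_frac = savings / total_savings
    disruption_frac = disruption / total_disruption
    return disruption_frac <= k * savings_frac
```

Larger k admits moves whose disruption fraction exceeds their savings fraction by a wider margin, which is why k=INF behaves like unconditional consolidation and k=0 admits only zero-disruption moves.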
Force-pushed 63468c3 to d89e204
Squashed to fix EasyCLA (removed a Co-Authored-By trailer that the bot couldn't resolve). No content changes beyond what was already pushed.

/easycla
Source: kubernetes-sigs#2942 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated the RFC based on a prototype implementation against the karpenter codebase (12 files, 15 unit tests, all passing). The implementation branch is on my fork if anyone wants to look at it. Four substantive changes:

1. **Examples corrected:** all worked numbers now include the 1.0 per-node disruption cost base defined in the NodePool Totals section. Scores shift (3.33 becomes 2.81, 1.67 becomes 1.50, etc.) but all approve/reject decisions hold.
2. **`evicted_pods` clarified:** reschedulable pods only. DaemonSet, mirror, and node-owned pods are excluded since they are not evicted during consolidation.
3. **Event cardinality:** single-node moves emit on Node + NodeClaim. Multi-node moves emit one event on NodePool, not per-candidate.
4. **Totals caching:** guidance on avoiding O(N) API fan-out when computing NodePool totals at scale. The candidates-only approximation is safe (conservative direction).

Also: added a brief definition of cycling loops, reframed the churn chain analysis as a formula safety property (on-demand picks the cheapest so chains are length 1 in practice; spot has the 15-instance guard), clarified the consolidateAfter wording, and did an editing pass.
Force-pushed 14a675d to a403fee
Summary

A new `consolidationPolicy: Balanced` that scores each consolidation move by comparing savings and disruption as fractions of NodePool totals. Moves where disruption outweighs savings are rejected. `consolidationThreshold` (integer, 1-3, default 2): a move passes when its disruption fraction is at most k times its savings fraction.

Related issues
aws#8868, aws#8536, aws#6642, aws#7146, #2319, #1019, #735, #1851, #2705, #2883, #1440, #1686, #1430, aws#5218, aws#3577