
feat: Add feature-gated pod deletion cost management controller #2894

Draft
nathangeology wants to merge 1 commit into kubernetes-sigs:main from nathangeology:feat/pod-deletion-cost-management

Conversation

@nathangeology (Contributor) commented Mar 5, 2026

Experimental/Draft Do-not-merge

Introduce a new controller that manages pod-deletion-cost annotations on pods scheduled to Karpenter-managed nodes. This influences which pods the ReplicaSet controller prefers to delete during scale-down, enabling smarter disruption decisions during consolidation.

The controller is gated behind the PodDeletionCostManagement feature flag (default: disabled) and supports configurable ranking strategies:

  • Random: random cost assignment
  • LargestToSmallest: prefer evicting larger pods first
  • SmallestToLargest: prefer evicting smaller pods first
  • UnallocatedVCPUPerPodCost: rank by unallocated vCPU per pod
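As a rough illustration, a ranking strategy can be modeled as a pluggable function that orders pods from "evict first" to "evict last". The `podInfo` type and function names below are hypothetical simplifications, not the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// podInfo is a hypothetical, simplified view of a pod for ranking purposes.
type podInfo struct {
	Name string
	VCPU float64 // requested vCPU
}

// rankingStrategy orders pods from "evict first" to "evict last"; an earlier
// position in the result would map to a lower pod-deletion-cost.
type rankingStrategy func(pods []podInfo) []podInfo

// largestToSmallest prefers evicting larger pods first.
func largestToSmallest(pods []podInfo) []podInfo {
	out := append([]podInfo(nil), pods...)
	sort.Slice(out, func(i, j int) bool { return out[i].VCPU > out[j].VCPU })
	return out
}

// smallestToLargest prefers evicting smaller pods first.
func smallestToLargest(pods []podInfo) []podInfo {
	out := append([]podInfo(nil), pods...)
	sort.Slice(out, func(i, j int) bool { return out[i].VCPU < out[j].VCPU })
	return out
}

// randomOrder assigns an arbitrary eviction order.
func randomOrder(pods []podInfo) []podInfo {
	out := append([]podInfo(nil), pods...)
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	pods := []podInfo{{"a", 2}, {"b", 0.5}, {"c", 4}}
	var strategy rankingStrategy = largestToSmallest
	for _, p := range strategy(pods) { // prints c, a, b
		fmt.Println(p.Name)
	}
}
```

Selecting the strategy through a function type keeps the reconciler agnostic to how costs are derived, which is presumably what makes the strategies "pluggable" and selectable via a CLI flag.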

Key components:

  • controller.go: reconciliation loop watching Karpenter nodes
  • ranking.go: pluggable ranking strategies for cost calculation
  • annotation.go: pod annotation management with batch updates
  • changedetector.go: optimization to skip unchanged node states
  • events.go/metrics.go: observability for ranking operations
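The change-detection idea behind changedetector.go can be sketched as hashing a node's ranking-relevant inputs and skipping reconciliation when the hash is unchanged. This is a hypothetical sketch (hashing only pod names, with invented type names); the PR's actual detector may track different state:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// changeDetector remembers a per-node hash of ranking-relevant state so that
// reconciliation can be skipped when nothing that affects costs has changed.
type changeDetector struct {
	last map[string]uint64 // node name -> hash of last observed state
}

func newChangeDetector() *changeDetector {
	return &changeDetector{last: map[string]uint64{}}
}

// Changed hashes the (sorted) pod names on a node and reports whether the
// node's state differs from the previous observation, recording the new hash.
func (c *changeDetector) Changed(nodeName string, podNames []string) bool {
	sorted := append([]string(nil), podNames...)
	sort.Strings(sorted)
	h := fnv.New64a()
	for _, p := range sorted {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator so ["ab"] hashes differently from ["a","b"]
	}
	sum := h.Sum64()
	if c.last[nodeName] == sum {
		return false
	}
	c.last[nodeName] = sum
	return true
}

func main() {
	c := newChangeDetector()
	fmt.Println(c.Changed("node-1", []string{"web-1", "web-2"})) // true: first observation
	fmt.Println(c.Changed("node-1", []string{"web-2", "web-1"})) // false: same pod set
}
```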

Infrastructure changes:

  • Add PodDeletionCostManagement feature gate to options.go
  • Add ranking strategy and change detection CLI flags
  • Register controller conditionally in controllers.go
  • Add pod update/patch RBAC to kwok clusterrole
  • Add example env vars to kwok chart values

Description

This PR adds a new controller that manages controller.kubernetes.io/pod-deletion-cost annotations on pods running on Karpenter-managed nodes. Pod deletion cost is a Kubernetes-native mechanism that influences which pods the ReplicaSet controller prefers to delete during scale-down. By proactively setting these annotations, Karpenter can steer the ReplicaSet controller toward deleting pods whose removal minimizes disruption impact during consolidation.

The controller is gated behind a PodDeletionCostManagement feature flag (disabled by default) and supports four pluggable ranking strategies: Random, LargestToSmallest, SmallestToLargest, and UnallocatedVCPUPerPodCost (which ranks pods by how much unallocated CPU capacity exists on their node per pod). A change-detection optimization avoids unnecessary annotation updates when node state hasn't changed.

The controller watches Karpenter-managed nodes and reconciles pod annotations in batch. The PR also includes RBAC updates for pod update/patch permissions and CLI flags for strategy selection and change-detection toggling.
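For context, the annotation value is a string-encoded integer applied via a patch. Below is a minimal sketch of building the JSON merge patch payload; the helper name is hypothetical, and the real controller would presumably send this through client-go's Patch with types.MergePatchType rather than just constructing bytes:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// deletionCostPatch builds a JSON merge patch body that sets the
// controller.kubernetes.io/pod-deletion-cost annotation on a pod.
// Kubernetes expects the value as a string-encoded integer.
func deletionCostPatch(cost int32) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{
				"controller.kubernetes.io/pod-deletion-cost": strconv.Itoa(int(cost)),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	b, err := deletionCostPatch(-100)
	if err != nil {
		panic(err)
	}
	// {"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"-100"}}}
	fmt.Println(string(b))
}
```

Batching these patches per node, as the description suggests annotation.go does, keeps API-server write amplification bounded when many pods share the same node-derived cost.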

Motivation

When Karpenter consolidates underutilized nodes, it disrupts running pods. For customer workloads, that disruption has a real cost: in-flight requests may fail, warm caches are lost, and replacement pods must re-establish connections and reload state. Depending on the workload, this recovery process can take seconds to minutes (or even hours in some cases). Teams pay for this in terms of latency, availability, and engineering complexity to handle it gracefully. Karpenter already manages clusters toward lower compute cost, but it has limited ability to control how much disruption that right-sizing produces. Current disruption controls focus on preventing disruptions that exceed a given rate for a given nodepool or deployment regardless of the cost-savings merits of that action.

The root cause is a coordination gap between two independent controllers operating on the same cluster. The ReplicaSet controller decides which pods to delete during scale-down. Karpenter's consolidation controller decides which nodes to drain and remove. These two controllers share no information about each other's intent.

Without coordination, the ReplicaSet controller uses its default pod selection heuristic during scale-in: prefer pending over ready, respect pod-deletion-cost, spread terminations across nodes, prefer newer pods, then break ties randomly. With no pod-deletion-cost set, the spreading heuristic dominates, and terminations distribute roughly evenly across nodes. This increases the entropy of the cluster: most nodes end up packed at roughly the same density, and often no single node moves meaningfully closer to empty than other nodes. The result is that Karpenter's consolidation controller finds the same unfavorable state after a pod replica scale-in that it found before, with all nodes still occupied, none empty, and utilization distribution roughly unchanged but slightly lower everywhere.

This matters especially for ConsolidateWhenEmpty NodePools, where the consolidation policy requires a node to be completely free of pods before it can be removed. If the ReplicaSet controller never concentrates deletions on a single node, that condition is rarely met, and the cluster carries more nodes than necessary.

Why node-level ranking is the right signal

The pod-deletion-cost annotation is the only existing communication channel between Karpenter and the ReplicaSet controller. Karpenter doesn't consolidate pods; it consolidates nodes. When it evaluates a consolidation move, the atomic unit is a node: "can I drain this entire node and either delete it or replace it with something cheaper?" If we ranked pods independently by resource usage, age, or some pod-level heuristic, the ReplicaSet controller might delete a pod from Node A (because that pod scored low individually) while leaving all other pods on Node A intact. That doesn't help Karpenter. Node A still can't be easily consolidated because it still has pods. The "hint" was spent on a decision that doesn't move the system toward an easier-to-consolidate state.

When pods inherit their node's rank, all pods on the same node share the same deletion cost. The ReplicaSet controller's deletion probability becomes uniform within a node but ordered across nodes, exactly matching the structure of Karpenter's consolidation decisions. The practical consequence is a positive feedback loop:

  1. Karpenter ranks nodes by consolidation priority.
  2. Pods on those nodes get the lowest deletion costs.
  3. ReplicaSet scale-down removes pods from those nodes first.
  4. Those nodes become empty or closer to empty.
  5. Karpenter can consolidate them with less disruption (or they qualify for WhenEmpty consolidation).

This logic extends naturally to nodes that Karpenter cannot consolidate. Some pods carry the karpenter.sh/do-not-disrupt annotation, which tells Karpenter to leave their node alone entirely. If the ReplicaSet controller deletes pods from one of these nodes during scale-down, that deletion has zero consolidation value because Karpenter can't act on the node regardless. The ranking engine accounts for this by partitioning nodes into two groups before ranking: Group A contains nodes with no do-not-disrupt pods (normal, consolidable nodes), and Group B contains nodes with at least one do-not-disrupt pod (protected nodes). Group A always receives lower deletion costs than Group B, so the ReplicaSet controller removes pods from nodes Karpenter can actually consolidate.
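The two-group partitioning described above can be sketched as follows. The `node` type, function name, and cost spacing are hypothetical simplifications, and the ordering within each group (which the configured ranking strategy would determine) is omitted:

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a Karpenter-managed node.
type node struct {
	Name            string
	HasDoNotDisrupt bool // any pod on the node carries karpenter.sh/do-not-disrupt
}

// assignCosts partitions nodes into consolidable nodes (Group A) and protected
// nodes (Group B), then assigns deletion costs so every Group A node ranks
// strictly below every Group B node. All pods on a node inherit the node's
// cost; lower cost means the ReplicaSet controller deletes those pods first.
func assignCosts(nodes []node) map[string]int32 {
	var groupA, groupB []node
	for _, n := range nodes {
		if n.HasDoNotDisrupt {
			groupB = append(groupB, n)
		} else {
			groupA = append(groupA, n)
		}
	}
	costs := map[string]int32{}
	cost := int32(0)
	for _, n := range groupA { // lowest costs: scale-down drains these first
		costs[n.Name] = cost
		cost += 100
	}
	for _, n := range groupB { // strictly higher: spared while Group A has pods
		costs[n.Name] = cost
		cost += 100
	}
	return costs
}

func main() {
	costs := assignCosts([]node{
		{Name: "n1"},
		{Name: "n2", HasDoNotDisrupt: true},
		{Name: "n3"},
	})
	fmt.Println(costs["n2"] > costs["n1"], costs["n2"] > costs["n3"]) // true true
}
```

Because the partition is applied before any per-node ranking, a deletion "hint" is never spent on a node Karpenter cannot act on, regardless of which strategy orders the nodes within each group.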
How was this change tested?

Simulation, benchmarking, and make verify.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2026
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nathangeology
Once this PR has been reviewed and has the lgtm label, please assign jmdeal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 5, 2026
@k8s-ci-robot (Contributor)

Hi @nathangeology. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 5, 2026