feat: Add feature-gated pod deletion cost management controller #2894
nathangeology wants to merge 1 commit into kubernetes-sigs:main
Conversation
Introduce a new controller that manages pod-deletion-cost annotations on pods scheduled to Karpenter-managed nodes. This influences which pods the Kubernetes scheduler prefers to evict during consolidation, enabling smarter disruption decisions.

The controller is gated behind the PodDeletionCostManagement feature flag (default: disabled) and supports configurable ranking strategies:

- Random: random cost assignment
- LargestToSmallest: prefer evicting larger pods first
- SmallestToLargest: prefer evicting smaller pods first
- UnallocatedVCPUPerPodCost: rank by unallocated vCPU per pod

Key components:

- controller.go: reconciliation loop watching Karpenter nodes
- ranking.go: pluggable ranking strategies for cost calculation
- annotation.go: pod annotation management with batch updates
- changedetector.go: optimization to skip unchanged node states
- events.go/metrics.go: observability for ranking operations

Infrastructure changes:

- Add PodDeletionCostManagement feature gate to options.go
- Add ranking strategy and change detection CLI flags
- Register controller conditionally in controllers.go
- Add pod update/patch RBAC to kwok clusterrole
- Add example env vars to kwok chart values
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nathangeology. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing /approve in a comment.
Hi @nathangeology. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Tip: we noticed you've done this a few times! Consider joining the org to skip this step. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Experimental/Draft Do-not-merge
Description
This PR adds a new controller that manages controller.kubernetes.io/pod-deletion-cost annotations on pods running on Karpenter-managed nodes. Pod deletion cost is a Kubernetes-native mechanism that influences which pods the ReplicaSet controller prefers to delete during scale-down. By proactively setting these annotations, Karpenter can guide the ReplicaSet controller toward deleting pods whose removal minimizes disruption impact during consolidation.

The controller is gated behind a PodDeletionCostManagement feature flag (disabled by default) and supports four pluggable ranking strategies: Random, LargestToSmallest, SmallestToLargest, and UnallocatedVCPUPerPodCost (which ranks pods by how much unallocated CPU capacity exists on their node per pod). A change detection optimization avoids unnecessary annotation updates when node state hasn't changed.

The controller watches Karpenter-managed nodes and reconciles pod annotations in batch. It includes RBAC updates for pod update/patch permissions and CLI flags for strategy selection and change detection toggling.
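To make the UnallocatedVCPUPerPodCost strategy concrete, here is a minimal stdlib-only Go sketch of the idea described above: score each node by unallocated vCPU per pod, rank nodes from emptiest to densest, and map the rank onto a deletion cost. The `nodeStat` type and `rankNodes`/`unallocatedVCPUPerPod` names are hypothetical; the real controller derives these values from node allocatable resources and pod requests.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeStat is a hypothetical summary of a Karpenter-managed node; the real
// controller would derive it from node allocatable capacity and pod requests.
type nodeStat struct {
	Name           string
	AllocatableCPU float64 // vCPUs the node offers
	RequestedCPU   float64 // vCPUs requested by its pods
	PodCount       int
}

// unallocatedVCPUPerPod scores a node by how much unallocated CPU it has per
// pod: emptier nodes score higher and make better eviction targets.
func unallocatedVCPUPerPod(n nodeStat) float64 {
	if n.PodCount == 0 {
		return n.AllocatableCPU
	}
	return (n.AllocatableCPU - n.RequestedCPU) / float64(n.PodCount)
}

// rankNodes orders nodes from best eviction target to worst and maps each
// rank onto a pod-deletion cost: pods on the best target get the lowest
// cost, so the ReplicaSet controller deletes them first.
func rankNodes(nodes []nodeStat) map[string]int {
	sort.Slice(nodes, func(i, j int) bool {
		return unallocatedVCPUPerPod(nodes[i]) > unallocatedVCPUPerPod(nodes[j])
	})
	costs := make(map[string]int, len(nodes))
	for rank, n := range nodes {
		costs[n.Name] = rank // lower cost = deleted sooner
	}
	return costs
}

func main() {
	costs := rankNodes([]nodeStat{
		{Name: "node-a", AllocatableCPU: 4, RequestedCPU: 3.5, PodCount: 7}, // dense
		{Name: "node-b", AllocatableCPU: 4, RequestedCPU: 1, PodCount: 2},   // sparse
	})
	fmt.Println(costs["node-b"] < costs["node-a"]) // sparse node is the cheaper eviction target
}
```

The sparse node (1.5 unallocated vCPU per pod) outranks the dense one (roughly 0.07), so its pods receive the lower deletion cost and drain first, nudging that node toward empty.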
Motivation
When Karpenter consolidates underutilized nodes, it disrupts running pods. For customer workloads, that disruption has a real cost: in-flight requests may fail, warm caches are lost, and replacement pods must re-establish connections and reload state. Depending on the workload, this recovery can take seconds to minutes (or even hours in some cases). Teams pay for this in latency, availability, and the engineering complexity of handling it gracefully. Karpenter already manages clusters toward lower compute cost, but it has limited ability to control how much disruption that right-sizing produces. Current disruption controls focus on preventing disruptions that exceed a given rate for a given nodepool or deployment, regardless of the cost-savings merits of that action.
The root cause is a coordination gap between two independent controllers operating on the same cluster. The ReplicaSet controller decides which pods to delete during scale-down. Karpenter's consolidation controller decides which nodes to drain and remove. These two controllers share no information about each other's intent.

Without coordination, the ReplicaSet controller uses its default pod selection heuristic during scale-in: prefer pending over ready, respect pod-deletion-cost, spread terminations across nodes, prefer newer pods, then break ties randomly. With no pod-deletion-cost set, the spreading heuristic dominates, and terminations distribute roughly evenly across nodes. This increases the entropy of the cluster: most nodes end up packed at roughly the same density, and often no single node moves meaningfully closer to empty than the others. The result is that Karpenter's consolidation controller finds the same unfavorable state after a pod replica scale-in that it found before, with all nodes still occupied, none empty, and the utilization distribution roughly unchanged but slightly lower everywhere.

This matters especially for ConsolidateWhenEmpty NodePools, where the consolidation policy requires a node to be completely free of pods before it can be removed. If the ReplicaSet controller never concentrates deletions on a single node, that condition is rarely met, and the cluster carries more nodes than necessary.
Why node-level ranking is the right signal
The pod-deletion-cost annotation is the only existing communication channel between Karpenter and the ReplicaSet controller. Karpenter doesn't consolidate pods; it consolidates nodes. When it evaluates a consolidation move, the atomic unit is a node: "can I drain this entire node and either delete it or replace it with something cheaper?" If we ranked pods independently by resource usage, age, or some pod-level heuristic, the ReplicaSet controller might delete a pod from Node A (because that pod scored low individually) while leaving all other pods on Node A intact. That doesn't help Karpenter. Node A still can't be easily consolidated because it still has pods. The "hint" was spent on a decision that doesn't move the system toward an easier-to-consolidate state.
When pods inherit their node's rank, all pods on the same node share the same deletion cost. The ReplicaSet controller's deletion probability becomes uniform within a node but ordered across nodes, exactly matching the structure of Karpenter's consolidation decisions. The practical consequence is a positive feedback loop: scale-down deletions concentrate on the lowest-ranked node, that node drains toward empty, Karpenter consolidates it, and the remaining nodes are re-ranked.
This logic extends naturally to nodes that Karpenter cannot consolidate. Some pods carry the karpenter.sh/do-not-disrupt annotation, which tells Karpenter to leave their node alone entirely. If the ReplicaSet controller deletes pods from one of these nodes during scale-down, that deletion has zero consolidation value because Karpenter can't act on the node regardless. The ranking engine accounts for this by partitioning nodes into two groups before ranking: Group A contains nodes with no do-not-disrupt pods (normal, consolidable nodes), and Group B contains nodes with at least one do-not-disrupt pod (protected nodes). Group A always receives lower deletion costs than Group B, so the ReplicaSet controller removes pods from nodes Karpenter can actually consolidate.
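The partition-then-rank step described above can be sketched in a few lines. This is a hypothetical stdlib-only illustration (the `node` type and `assignCosts` name are invented for this sketch): nodes hosting any karpenter.sh/do-not-disrupt pod land in Group B, which is ranked strictly behind Group A, and every pod inherits its node's cost.

```go
package main

import "fmt"

// node is a hypothetical summary of a node's pods and whether any of them
// carry the karpenter.sh/do-not-disrupt annotation.
type node struct {
	Name            string
	Pods            []string
	HasDoNotDisrupt bool
}

// assignCosts gives consolidable nodes (Group A) lower deletion costs than
// protected nodes (Group B), then propagates each node's cost to its pods,
// so scale-downs land on nodes Karpenter can actually consolidate.
func assignCosts(nodes []node) map[string]int {
	var groupA, groupB []node
	for _, n := range nodes {
		if n.HasDoNotDisrupt {
			groupB = append(groupB, n) // protected: rank behind Group A
		} else {
			groupA = append(groupA, n) // consolidable: rank first
		}
	}
	podCost := make(map[string]int)
	cost := 0
	for _, n := range append(groupA, groupB...) {
		for _, p := range n.Pods {
			podCost[p] = cost // pods inherit their node's rank
		}
		cost++
	}
	return podCost
}

func main() {
	costs := assignCosts([]node{
		{Name: "protected", Pods: []string{"p1"}, HasDoNotDisrupt: true},
		{Name: "normal", Pods: []string{"p2", "p3"}},
	})
	fmt.Println(costs["p2"] < costs["p1"]) // consolidable node's pods are deleted first
}
```

Within each group, the configured ranking strategy would order the nodes; the sketch only shows the A-before-B guarantee.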
How was this change tested?
Simulation, benchmarking, and make verify
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.