docs: RFC for building a pod deletion cost controller to enable Karpenter to work better with the ReplicaSet Controller #2935
Hi @nathangeology. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test.
…n-aware scale-in (kp-1tj)
Add proposed updates document addressing review feedback on the pod deletion cost controller RFC:
1. Fill in the ReplicaSet Controller Strategy section (was a TODO), covering the short-term pod-deletion-cost approach and the long-term ConsolidatingScaleDown KEP (kubernetes/enhancements#5982).
2. Add a Placement and Spreading Constraints section addressing reviewer concerns about topology spread and scheduling constraints.
3. Add a risk entry for topology spread constraint interaction.
This is a review-only document for human review before pushing changes to the upstream PR.
…d improvements (kp-drd)
Proposes a feature-gated controller that ranks nodes by consolidation preference (with drifted nodes prioritized) and propagates that ranking to pods via pod-deletion-cost annotations. Three-tier ranking: drifted nodes → normal nodes → do-not-disrupt nodes, each sorted by pod count ascending to follow Karpenter's consolidation candidate sorting.
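The three-tier ordering described in this commit can be sketched roughly as follows. This is a minimal illustration, not the RFC's implementation: the `NodeInfo` type, its fields, and the `RankNodes` function are hypothetical names, and the choice that do-not-disrupt takes precedence over drift when a node is both is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeInfo is a hypothetical, simplified view of a node used only for ranking.
type NodeInfo struct {
	Name         string
	Drifted      bool
	DoNotDisrupt bool
	PodCount     int
}

// tier returns 0 for drifted nodes, 1 for normal nodes, and 2 for
// do-not-disrupt nodes. Assumption: do-not-disrupt wins over drifted.
func tier(n NodeInfo) int {
	switch {
	case n.DoNotDisrupt:
		return 2
	case n.Drifted:
		return 0
	default:
		return 1
	}
}

// RankNodes returns a sorted copy: by tier, then by pod count ascending,
// then by name so the order is deterministic across reconciles.
func RankNodes(nodes []NodeInfo) []NodeInfo {
	ranked := append([]NodeInfo(nil), nodes...)
	sort.SliceStable(ranked, func(i, j int) bool {
		ti, tj := tier(ranked[i]), tier(ranked[j])
		if ti != tj {
			return ti < tj
		}
		if ranked[i].PodCount != ranked[j].PodCount {
			return ranked[i].PodCount < ranked[j].PodCount
		}
		return ranked[i].Name < ranked[j].Name
	})
	return ranked
}

func main() {
	nodes := []NodeInfo{
		{Name: "node-a", PodCount: 5},
		{Name: "node-b", Drifted: true, PodCount: 7},
		{Name: "node-c", DoNotDisrupt: true, PodCount: 1},
		{Name: "node-d", PodCount: 2},
		{Name: "node-e", Drifted: true, PodCount: 3},
	}
	for rank, n := range RankNodes(nodes) {
		fmt.Printf("rank %d: %s\n", rank, n.Name)
	}
}
```

With this input, the drifted nodes come first (node-e, then node-b), followed by the normal nodes (node-d, then node-a), and the do-not-disrupt node last, even though it has the fewest pods.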
Description
Add an RFC design document for the Pod Deletion Cost Controller, a new feature-gated controller that reduces the voluntary disruption rate of replica pods during Karpenter consolidation.
Today, the ReplicaSet controller and Karpenter's consolidation controller operate independently. When a Deployment scales down, the ReplicaSet controller spreads pod deletions evenly across nodes. This keeps all nodes partially occupied and leaves Karpenter without good consolidation targets. The effect is especially painful for ConsolidateWhenEmpty NodePools, where a node must be completely free of pods before removal.
This RFC proposes a controller that bridges the gap by setting controller.kubernetes.io/pod-deletion-cost annotations on pods based on their node's consolidation priority. Pods on nodes Karpenter wants to consolidate first get the lowest deletion costs, so the ReplicaSet controller naturally concentrates scale-down deletions on those nodes. The result is a positive feedback loop: targeted nodes drain faster, become empty sooner, and Karpenter consolidates them with less (or zero) disruption.
Key design points:
Feature-gated behind PodDeletionCostManagement (defaults to false), no CRD changes
Nodes partitioned into consolidate-able (Group A) vs do-not-disrupt (Group B) before ranking
SHA-256 change detection skips reconcile cycles when cluster state is unchanged (zero API writes in steady state)
Two-annotation protocol (pod-deletion-cost + karpenter.sh/managed-deletion-cost) ensures customer-set annotations are never overwritten
Pluggable ranking strategies: pod count (recommended), unallocated vCPU per pod, node size, random
New Prometheus metrics and Kubernetes events for observability
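Two of the design points above, change detection and the two-annotation protocol, can be sketched as follows. This is a hedged illustration rather than the RFC's code: the `Fingerprint` and `CanManage` functions are hypothetical, hashing the desired pod-to-cost assignment is an assumption about what "cluster state" covers, and treating the mere presence of karpenter.sh/managed-deletion-cost as the ownership marker is also an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

const (
	DeletionCostAnnotation = "controller.kubernetes.io/pod-deletion-cost"
	ManagedAnnotation      = "karpenter.sh/managed-deletion-cost"
)

// Fingerprint hashes the desired pod -> cost assignment in sorted key order,
// so the same assignment always produces the same SHA-256 digest.
func Fingerprint(costs map[string]string) string {
	keys := make([]string, 0, len(costs))
	for k := range costs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		// A zero byte separates fields so "a"+"bc" and "ab"+"c" hash differently.
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(costs[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// CanManage reports whether the controller may write the deletion cost.
// A cost without the managed marker is assumed to be customer-set and is
// left alone (assumption: the marker's presence alone signals ownership).
func CanManage(annotations map[string]string) bool {
	_, hasCost := annotations[DeletionCostAnnotation]
	_, managed := annotations[ManagedAnnotation]
	return !hasCost || managed
}

func main() {
	desired := map[string]string{"web-1": "0", "web-2": "10"}
	last := Fingerprint(desired)

	// A later reconcile with the same desired state can skip all API writes.
	fmt.Println("skip reconcile:", Fingerprint(desired) == last)

	customerPod := map[string]string{DeletionCostAnnotation: "500"}
	fmt.Println("may overwrite customer pod:", CanManage(customerPod))
}
```

Comparing the digest against the one stored from the previous cycle is what allows a steady-state cluster to incur zero writes, and the ownership check is what keeps a customer-set pod-deletion-cost from being overwritten.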
Full RFC:
pod-deletion-cost-controller.md
Related Issues
#2123
https://github.com/aws/karpenter/issues/4356
aws/karpenter-provider-aws#3785
aws/karpenter-provider-aws#3927
Does this change impact any existing behavior?
No. This PR adds a design document only. The proposed controller is feature-gated off by default and requires explicit opt-in.
NOTE: I will be proposing changes to the ReplicaSet controller in parallel and will link that KEP here once it is submitted and ready. Only one of the two changes is needed to see the benefits: either the controller proposed here or the ReplicaSet controller change. Having both does no harm.
ReplicaSet controller issue: kubernetes/enhancements#5982