
docs: RFC for changing how Node Disruption Budget are defined and functions#2930

Open
GnatorX wants to merge 1 commit into kubernetes-sigs:main from GnatorX:garvinp-htb-disrupt-guarantee

Conversation

@GnatorX
Contributor

@GnatorX GnatorX commented Mar 23, 2026

Fixes #N/A

Description

  • Add design doc for HTB (Hierarchical Token Bucket) based disruption budget model
  • The current budget model is confusing: per reason budgets share a global disrupting
    counter, so they do not behave independently despite the configuration suggesting
    otherwise. "10% for drift" and "5% for consolidation" looks like a partition but
    behaves like a shared pool with soft hints.
  • HTB makes budgets mean what users write: each reason owns its configured budget
    independently, unused capacity flows to siblings via an excess pool, and the catch
    all budget acts as the parent cap.
  • Proposes a standalone DisruptionBudget CRD that NodePools reference, enabling
    cluster wide vs NodePool specific scoping and multi level budget hierarchies
    (cluster > NodePool > reason).
  • Includes walkthroughs for pure isolation (no excess pool) and borrowing (with
    excess pool) scenarios, mapping to existing API, implementation requirements,
    and backward compatibility analysis.

How was this change tested?
Design doc only, no code changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: GnatorX
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from mwielgus March 23, 2026 20:55
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 23, 2026
@k8s-ci-robot k8s-ci-robot requested a review from tallaxes March 23, 2026 20:55
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 23, 2026
@GnatorX GnatorX changed the title RFC for Node Disruption Budget V2 RFC for changing how Node Disruption Budget are defined and functions Mar 23, 2026
@GnatorX GnatorX changed the title RFC for changing how Node Disruption Budget are defined and functions docs: RFC for changing how Node Disruption Budget are defined and functions Mar 23, 2026
@GnatorX
Contributor Author

GnatorX commented Mar 26, 2026

#994

@nathangeology
Contributor

Hey Garvin, it was good chatting yesterday. This comment isn't on the specifics of the proposal, but it could maybe reframe and simplify this proposal a bit. I was thinking: what if we add an annotation to the NodePool like 'expected-time-to-drift' or 'drift-time-sla' or something like that? You could set 7 days or 30 minutes or whatever there. Then Karpenter does two things differently from today.

One, it rate limits the drift budget such that it will complete each new drift 'action' within the given SLA. There's some thought that needs to go into how to define and track that. We also need to think about what happens to drift if the label is raised or lowered mid-drift (I think we want it to speed up or slow down drift after the annotation update). Lastly, we want to think about the overall rate limit and how we handle cases where the disruption limits prevent achieving the drift SLA in the given time (probably a warning emitted, plus respect the limit and complete the drift as fast as possible?). This sort of controls and separates out the drift disruption from the other efficiency-based disruptions, and gives a (hopefully) clear interface for setting the amount of disruption due to drift that should be happening.

The second big change to how Karpenter handles this is that we should consider adjusting the candidate sorting for single/multi-node consolidation to put nodes that are pending a drift action first in the sorting. This makes it so we are more likely to 'knock two birds out with one stone' and boost both efficiency and ongoing drifts at the same time.

What do you think? Does this make sense as an interface plus actions, relative to the issues you deal with in your clusters? I like the idea of aligning the simplest-possible interface with the end goal (getting drifts done in a timely fashion without blocking cost-saving disruptions too much) instead of making budgets have a lot more nuance and settings to manage.
There are still some things we could do from your other proposal on fairness in conjunction with this to ensure that multi/single node disruption isn't totally starved by an aggressive drift too.

nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 1, 2026
…kp-ttx)

Analyze the HTB-based disruption budget model proposed in
kubernetes-sigs#2930 and compare it against an SLO-based
drift budget alternative. Covers root cause analysis of the shared
disrupting counter, evaluation of both approaches, and a phased
recommendation favoring the simpler SLO approach.

sim: kp-ttx
@GnatorX
Contributor Author

GnatorX commented Apr 2, 2026

Need some time to digest the full comment since I am currently out for the next 2 weeks.

My initial gut feeling on the 'expected-time-to-drift' annotation is that I am not a huge fan. I don't like the idea of using annotations to define behavior, and I would rather have a cleaner solution.

I am not sure of the current state of cross-disruption tracking, but given the current single-threaded disruption I feel like separating out drift might not do as much good until we have stronger concurrent disruption within Karpenter, even though I do like the idea of having separate categories of disruptions that might be able to act independently.

I still think #2927 has a lot of value: it solves a fundamental issue with how Karpenter works today without huge rework. I am very open to looking into what we want disruptions to look like.

Will comment once I read through and think through the comment more deeply.

@nathangeology
Contributor

> Need some time to digest the full comment since I am currently out for the next 2 weeks.
>
> My initial gut feeling on the 'expected-time-to-drift' annotation is that I am not a huge fan. I don't like the idea of using annotations to define behavior, and I would rather have a cleaner solution.
>
> I am not sure of the current state of cross-disruption tracking, but given the current single-threaded disruption I feel like separating out drift might not do as much good until we have stronger concurrent disruption within Karpenter, even though I do like the idea of having separate categories of disruptions that might be able to act independently.
>
> I still think #2927 has a lot of value: it solves a fundamental issue with how Karpenter works today without huge rework. I am very open to looking into what we want disruptions to look like.
>
> Will comment once I read through and think through the comment more deeply.

Thanks. As I've been thinking about this more deeply, one question that keeps coming up for me is why we aren't doing more to integrate drift with the efficiency consolidation. Consider the difference if we pushed non-static drift to the back of the consolidation stack, but at the same time changed the candidate sorting to put all the drifted nodes first in line to be considered. This would mean that the drifted nodes with the lowest pod counts would be first in line, followed by drifted nodes with higher pod counts, followed by everything else sorted as it is today. This would give you the chance to find 'two-birds-with-one-stone' consolidations first. Then, afterwards, you would still do the current non-static drift disruption logic to keep the ball rolling there and avoid starving out drifts.

I still think that saying "hey, I want my drifts to complete in 7 days or 1 hour (ASAP, basically)" ought to be an easy way to communicate the urgency of executing drifts, and it lets you sort of auto-magically parcel out the right amount of disruption budget to execute drifts on time based on time remaining, remaining work, and total available disruptions during the remaining time. I think combining that with some changes to help consolidation find consolidations among drift candidates first would go a long way towards avoiding one or the other getting starved.

Would we need this and the other PR on fairness in that world? Maybe! We'd have to start actually trying stuff and see how it goes, I think. I'll talk with Derek and Jason soon to see what they think about all this too, and maybe we can do a call soon. I'll be out all next week so it'll have to wait until the week of the 13th one way or another.

nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 2, 2026
…C (kp-d32)

Analyze the HTB-based disruption budget model proposed in
kubernetes-sigs#2930. Covers strengths, complexity concerns,
simpler alternatives, problem statement assessment, missing
tradeoffs, and specific improvement suggestions.

Key finding: the RFC correctly identifies the budget semantics
mismatch but overbuilds the solution. Simpler cooperative
approaches (drift-priority sorting + SLO annotation) should
be pursued first.

sim: kp-d32

@nathangeology nathangeology left a comment


Thanks for putting this together. I hope the feedback is useful to you. I'll be out next week so I might be slow to respond, but I'm happy to collaborate a bit on this when I get back the week of the 13th.


### Budgets do not read the way they behave

Disruption budgets are configured **per NodePool**, but within each NodePool, the disrupting counter is **shared across all reasons**. `BuildDisruptionBudgetMapping` counts ALL `MarkedForDeletion()` nodes as disrupting regardless of why they were marked. If Drift consumes 10 of 15 allowed disruptions, Consolidation sees only 5 remaining even if its configured budget is 5%. Drift's consumption directly reduces Consolidation's effective budget.
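The shared-counter behavior described above can be illustrated with a small sketch (names are hypothetical and only illustrate the semantics; this is not Karpenter's actual code):

```python
# Today (simplified): one "disrupting" counter per NodePool is compared
# against every reason's budget, so one reason's activity shrinks what
# every other reason is allowed to do.

def remaining_for_reason(reason_budget: int, total_disrupting: int) -> int:
    """Disruptions a reason may still start under a shared counter."""
    return max(reason_budget - total_disrupting, 0)

# The NodePool allows 15 disruptions total; Drift has already marked 10.
assert remaining_for_reason(15, 10) == 5
# A reason whose own configured budget resolves to 5 nodes sees nothing
# left at all, because Drift's 10 marked nodes count against it too.
assert remaining_for_reason(5, 10) == 0
```

The second assertion is the confusing case: Consolidation's "5" was never really its own slice, just an upper bound measured against everyone's activity combined.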

I agree we should do something here to fix how confusing this budget definition stuff is!


- **The catch all budget becomes the parent**: it defines the total disruption capacity for the NodePool.
- **Per reason budgets become children**: each owns its slice of disruption capacity independently.
- **Budgets add up the way users expect**: "10% for drift" plus "5% for consolidation" totals 15%. Any difference between that sum and the parent is a shared excess pool available to any reason.

This is where communicating drift 'urgency' may be helpful because you can say "hey, I have 100 drifted nodes, but 6 days to manage them, maybe I shouldn't burn all of my disruption budget on these drifts right now, but do just enough to be on track to complete them before 6 days is up". Maybe this token budget model is how we do this under-the-hood, but I think as long as the drifts are on track to be completed in a timely fashion we don't need to take up all the budget.

- **`ceil`**: maximum bandwidth including borrowed capacity. A class can burst above its rate by borrowing unused bandwidth from siblings, up to this ceiling.
- **parent**: enforces the global cap across all children.

When a class is not using its full `rate`, the unused portion flows up to the parent and becomes available to sibling classes. A class consuming above its `rate` is borrowing from the shared surplus. The parent's `rate` is the hard ceiling for all children combined.

The fact that unused rate bubbles up here worries me because the consolidation functions are behind the drift ones in the order and could end up getting starved still if the 'rate' looks like it is there when drift runs (when really it just hasn't looked at consolidation opportunities yet).


When a class is not using its full `rate`, the unused portion flows up to the parent and becomes available to sibling classes. A class consuming above its `rate` is borrowing from the shared surplus. The parent's `rate` is the hard ceiling for all children combined.

HTB is **work conserving up to each class's `ceil`**. No capacity sits idle when there is demand for it. If a class has no traffic, its entire `rate` is available to siblings that want to borrow. However, a borrowing class will never exceed its own `ceil`, even if more surplus is available. This means capacity is fully utilized across classes that have work to do, while each class's maximum consumption is still bounded.
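The "work conserving up to each class's `ceil`" rule can be sketched as follows (a minimal sketch with assumed names, not the RFC's implementation):

```python
def htb_headroom(rate: int, ceil: int, used: int, surplus: int) -> int:
    """How much more a class may consume right now.

    It gets its own unused rate plus whatever surplus its siblings left
    behind, but never so much that it would exceed its own ceil.
    """
    own = max(rate - used, 0)
    cap = max(ceil - used, 0)        # ceil bounds total consumption
    return min(own + surplus, cap)

# rate=10, ceil=15: even with 20 units of sibling surplus on offer,
# a class already at its rate may only borrow up to its ceil.
assert htb_headroom(10, 15, 10, 20) == 5
# An idle class's full rate is available to siblings (work conserving),
# while the class itself can still use all of it when demand returns.
assert htb_headroom(10, 15, 0, 0) == 10
```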

This addresses my worry above, but it's adding a little bit of complexity for operators to reason about. How do we communicate to operators when their settings are problematic, and how do they fix it when that happens? Also, drift and consolidation for efficiency aren't mutually exclusive, and we could be doing more to find 'two-birds-with-one-stone' disruptions here.


## Proposed Model

Each reason becomes an HTB class with its own `rate` (the budget the user configured) and a `ceil` equal to the global cap. When a reason is not using its full budget, the unused portion is available to other reasons. This means budgets read as independent slices but unused capacity is not wasted.

What do you imagine would be reasonable defaults here? When and why should people change those?

├── NodePool "batch"
│ rate: 30 nodes # batch workloads get more disruption capacity
│ ceil: 50 nodes # can use full cluster budget if others are idle

Ah. So the ceil is computed from the allocated rate plus the 'excess' auto-magically here?

```
// HTB: per reason with borrowing from excess pool
own_remaining = max(rate[reason] - used[reason], 0)
free_pool = max(excess_pool - pool_used, 0)
```
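The pseudocode quoted above can be completed into a runnable sketch. The final `allowed` expression is my guess at the intended combination; the variable names follow the RFC's pseudocode, not real Karpenter code:

```python
def htb_allowed(rate: dict, used: dict, excess_pool: int,
                pool_used: int, reason: str) -> int:
    """Per-reason allowance with borrowing from the shared excess pool."""
    own_remaining = max(rate[reason] - used[reason], 0)
    free_pool = max(excess_pool - pool_used, 0)
    # Assumed combination: a reason may use its own remaining budget
    # plus whatever is left of the shared excess pool.
    return own_remaining + free_pool

rate = {"drift": 10, "consolidation": 5}
used = {"drift": 10, "consolidation": 0}
# Drift exhausted its own rate but can still draw on a 5-node excess.
assert htb_allowed(rate, used, excess_pool=5, pool_used=0, reason="drift") == 5
```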

Definitely agree there is value in having some different budget pools instead of just the one global one in the code.


Also, it helps that the proposed pools are tied to different functions at the moment, so there is some natural scope for each budget to work in.


With a standalone DisruptionBudget CRD, backward compatibility is straightforward. The existing inline NodePool budgets keep their current behavior. HTB semantics only apply when a NodePool references a DisruptionBudget CR. Users opt in explicitly by creating the CR and wiring the reference. No silent semantic changes, no feature flags needed.
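One way the opt-in wiring could look. This is entirely illustrative: the RFC text here does not fix a schema, so the group/version, field names, and `budgetRef` shape below are all hypothetical:

```yaml
apiVersion: karpenter.sh/v1alpha1        # hypothetical group/version
kind: DisruptionBudget
metadata:
  name: cluster-default
spec:
  parent:                                # parent cap across all reasons
    nodes: "20"
  reasons:
    - reason: Drifted                    # child rate for drift
      nodes: "10"
    - reason: Underutilized              # child rate for consolidation
      nodes: "5"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    budgetRef:                           # hypothetical opt-in reference
      name: cluster-default
```

A NodePool without a `budgetRef` would keep today's inline budget behavior, which is what makes the opt-in explicit.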

## Open Questions

It would be good to walk through some alternative ways that this could be approached or solved, discuss the pros/cons relative to this proposal, and give some reasons why this is the better choice. I'd be particularly interested in you describing the simplest possible solution you could imagine to the starvation problems you are seeing, and why going beyond that for this proposal is worthwhile. I do think the budget definitions need work so that the interface presented lines up better with the system behavior; it's genuinely confusing as things stand, and it's not easy to diagnose issues like the one you mentioned with drift starving out consolidation.

I would also encourage thinking through ways we can get better at finding candidates that make the cluster more efficient and keep the drift progressing at the same time. Even fairly efficient nodes slated for drift might make a lot of sense to disrupt if we can get the disrupted pods onto existing post-drift nodes. Also, we should think through how to organically push new pod placements away from nodes that need to be drifted, and how to steer ReplicaSets towards terminating pods on nodes slated for drift before nodes that are up-to-date. There are a lot of ways to approach making less work for disruption to do here, and to steer the disruptions towards nodes that need to go anyway.


Disruption budgets are configured **per NodePool**, but within each NodePool, the disrupting counter is **shared across all reasons**. `BuildDisruptionBudgetMapping` counts ALL `MarkedForDeletion()` nodes as disrupting regardless of why they were marked. If Drift consumes 10 of 15 allowed disruptions, Consolidation sees only 5 remaining even if its configured budget is 5%. Drift's consumption directly reduces Consolidation's effective budget.

This creates two problems for users trying to reason about their budgets:

I like coming at this from the perspective of configuration and understanding. I do think it would be helpful to walk through some examples or cases where bad stuff happens because of the current state of things. In the meeting you talked about how drift can 'starve' out consolidation; I think it would be really good to walk folks through how and why that happens here.

- Drift can exceed its 10-node budget by borrowing from the excess pool when other reasons are not using it.
- Even when Drift borrows all 5 excess nodes, Consolidation still has its configured 5. The budget reads correctly.
- Unused capacity is not wasted. If one reason is idle, others can use the slack. But each reason's configured budget is always respected.
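A toy numeric walkthrough of the three bullets above, with assumed numbers (Drift rate 10, Consolidation rate 5, parent cap 20, so the excess pool is 5 nodes); the function is an illustrative sketch, not Karpenter code:

```python
parent_cap = 20
rate = {"drift": 10, "consolidation": 5}
excess_pool = parent_cap - sum(rate.values())   # 5 shared nodes

def allowance(reason: str, used: dict, pool_used: int) -> int:
    """Own remaining rate plus whatever excess pool is still free."""
    own = max(rate[reason] - used[reason], 0)
    return own + max(excess_pool - pool_used, 0)

used = {"drift": 0, "consolidation": 0}
# Drift may burst to 10 (own) + 5 (excess) = 15 nodes.
assert allowance("drift", used, pool_used=0) == 15

# Even after Drift borrows all 5 excess nodes, Consolidation still has
# its configured 5: borrowing never eats into a sibling's own rate.
used = {"drift": 15, "consolidation": 0}
assert allowance("consolidation", used, pool_used=5) == 5
```

Contrast this with the shared-counter model, where Drift marking 15 nodes would leave Consolidation nothing.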


Having a walk through of a toy/example drift starving consolidation case without these changes would help here too!

