
docs: RFC for changing how Node Disruption Budget are defined and functions#2930

Open
GnatorX wants to merge 1 commit into kubernetes-sigs:main from GnatorX:garvinp-htb-disrupt-guarantee

Conversation

@GnatorX
Contributor

@GnatorX GnatorX commented Mar 23, 2026

Fixes #N/A

Description

  • Add design doc for HTB (Hierarchical Token Bucket) based disruption budget model
  • The current budget model is confusing: per reason budgets share a global disrupting
    counter, so they do not behave independently despite the configuration suggesting
    otherwise. "10% for drift" and "5% for consolidation" looks like a partition but
    behaves like a shared pool with soft hints.
  • HTB makes budgets mean what users write: each reason owns its configured budget
    independently, unused capacity flows to siblings via an excess pool, and the catch
    all budget acts as the parent cap.
  • Proposes a standalone DisruptionBudget CRD that NodePools reference, enabling
    cluster wide vs NodePool specific scoping and multi level budget hierarchies
    (cluster > NodePool > reason).
  • Includes walkthroughs for pure isolation (no excess pool) and borrowing (with
    excess pool) scenarios, mapping to existing API, implementation requirements,
    and backward compatibility analysis.

How was this change tested?
Design doc only, no code changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: GnatorX
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from mwielgus March 23, 2026 20:55
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 23, 2026
@k8s-ci-robot k8s-ci-robot requested a review from tallaxes March 23, 2026 20:55
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 23, 2026
@GnatorX GnatorX changed the title RFC for Node Disruption Budget V2 RFC for changing how Node Disruption Budget are defined and functions Mar 23, 2026
@GnatorX GnatorX changed the title RFC for changing how Node Disruption Budget are defined and functions docs: RFC for changing how Node Disruption Budget are defined and functions Mar 23, 2026
@GnatorX
Contributor Author

GnatorX commented Mar 26, 2026

#994

@nathangeology
Contributor

Hey Garvin, it was good chatting yesterday. This comment isn't on the specifics of the proposal, but it could maybe reframe and simplify this proposal a bit. I was thinking: what if we add an annotation to the NodePool like 'expected-time-to-drift' or 'drift-time-sla' or something like that? You could set 7 days or 30 minutes or whatever there. Then Karpenter does two things differently from today.

One, it rate limits the drift budget such that it will complete each new drift 'action' within the given SLA. There's some thought that needs to go into how to define and track that. We also need to think about what happens to drift if the label is raised or lowered mid-drift (I think we want it to speed up or slow down drift after the annotation update). Lastly, we want to think about the overall rate limit and how we handle cases where the disruption limits prevent achieving the drift SLA in the given time (probably a warning emitted, plus respect the limit and complete the drift as fast as possible?). This sort of controls and separates out the drift disruption from the other efficiency-based disruptions, and gives a (hopefully) clear interface for setting the amount of disruption due to drift that should be happening.

The second big change to how Karpenter handles this is that we should consider adjusting the candidate sorting for single/multi-node consolidation to put nodes that are pending a drift action first in the sorting. This makes it so we are more likely to 'knock two birds out with one stone' and boost both efficiency and ongoing drifts at the same time.

What do you think? Does this make sense as an interface plus actions, relative to the issues you deal with in your clusters? I like the idea of aligning the simplest-possible interface with the end goal (getting drifts done in a timely fashion without blocking cost-saving disruptions too much) instead of making budgets have a lot more nuance and settings to manage.
There are still some things we could do from your other proposal on fairness in conjunction with this to ensure that multi/single node disruption isn't totally starved by an aggressive drift too.

nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 1, 2026
…kp-ttx)

Analyze the HTB-based disruption budget model proposed in
kubernetes-sigs#2930 and compare it against an SLO-based
drift budget alternative. Covers root cause analysis of the shared
disrupting counter, evaluation of both approaches, and a phased
recommendation favoring the simpler SLO approach.

sim: kp-ttx
@GnatorX
Contributor Author

GnatorX commented Apr 2, 2026

Need some time to digest the full comment since I am currently out for the next 2 weeks.

My initial gut feeling on the 'expected-time-to-drift' annotation is that I am not a huge fan. I don't like the idea of using annotations to define behavior, and I would rather have a cleaner solution.

I am not sure of the current state of cross-disruption tracking, but given the current single-threaded disruption I feel like separating out drift might not do as much good until we have stronger concurrent disruption within Karpenter, even though I do like the idea of having separate categories of disruptions that might be able to act independently.

I still think #2927 has a lot of value: it solves a fundamental issue with how Karpenter works today without huge rework. I am very open to looking into what we want disruptions to look like.

Will comment once I read through and think through the comment more deeply.

@nathangeology
Contributor

> Need some time to digest the full comment since I am currently out for the next 2 weeks.
>
> My initial gut feeling on the 'expected-time-to-drift' annotation is that I am not a huge fan. I don't like the idea of using annotations to define behavior, and I would rather have a cleaner solution.
>
> I am not sure of the current state of cross-disruption tracking, but given the current single-threaded disruption I feel like separating out drift might not do as much good until we have stronger concurrent disruption within Karpenter, even though I do like the idea of having separate categories of disruptions that might be able to act independently.
>
> I still think #2927 has a lot of value: it solves a fundamental issue with how Karpenter works today without huge rework. I am very open to looking into what we want disruptions to look like.
>
> Will comment once I read through and think through the comment more deeply.

Thanks. As I've been thinking about this more deeply, one question that keeps coming up for me is why we aren't doing more to integrate drift with the efficiency consolidation. Consider the difference if we pushed non-static drift to the back of the consolidation stack, but at the same time changed the candidate sorting to put all the drifted nodes first in line to be considered. This would mean that the drifted nodes with the lowest pod counts would be first in line, followed by drifted nodes with higher pod counts, followed by everything else sorted as it is today. This would give you the chance to find 'two-birds-with-one-stone' consolidations first. Then, afterwards, you would still do the current non-static drift disruption logic to keep the ball rolling there and avoid starving out drifts.

I still think that saying "hey, I want my drifts to complete in 7 days or 1 hour (ASAP, basically)" ought to be an easy way to communicate the urgency of executing drifts, and it lets you sort of auto-magically parcel out the right amount of disruption budget to execute drifts on time based on time remaining, remaining work, and total available disruptions during the remaining time. I think combining that with some changes to help consolidation find consolidations among drift candidates first would go a long way towards avoiding one or the other getting starved.

Would we need this and the other PR on fairness in that world? Maybe! We'd have to start actually trying stuff and see how it goes, I think. I'll talk with Derek and Jason soon to see what they think about all this too, and maybe we can do a call soon. I'll be out all next week so it'll have to wait until the week of the 13th one way or another.

nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 2, 2026
…C (kp-d32)

Analyze the HTB-based disruption budget model proposed in
kubernetes-sigs#2930. Covers strengths, complexity concerns,
simpler alternatives, problem statement assessment, missing
tradeoffs, and specific improvement suggestions.

Key finding: the RFC correctly identifies the budget semantics
mismatch but overbuilds the solution. Simpler cooperative
approaches (drift-priority sorting + SLO annotation) should
be pursued first.

sim: kp-d32

@nathangeology nathangeology left a comment


Thanks for putting this together. I hope the feedback is useful to you. I'll be out next week so I might be slow to respond, but I'm happy to collaborate a bit on this when I get back the week of the 13th.


### Budgets do not read the way they behave

Disruption budgets are configured **per NodePool**, but within each NodePool, the disrupting counter is **shared across all reasons**. `BuildDisruptionBudgetMapping` counts ALL `MarkedForDeletion()` nodes as disrupting regardless of why they were marked. If Drift consumes 10 of 15 allowed disruptions, Consolidation sees only 5 remaining even if its configured budget is 5%. Drift's consumption directly reduces Consolidation's effective budget.
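The shared-counter behavior described above can be illustrated with a small sketch (names are hypothetical and only illustrate the semantics; this is not Karpenter's actual code):

```python
# Today (simplified): one "disrupting" counter per NodePool is compared
# against every reason's budget, so one reason's activity shrinks what
# every other reason is allowed to do.

def remaining_for_reason(reason_budget: int, total_disrupting: int) -> int:
    """Disruptions a reason may still start under a shared counter."""
    return max(reason_budget - total_disrupting, 0)

# The NodePool allows 15 disruptions total; Drift has already marked 10.
assert remaining_for_reason(15, 10) == 5
# A reason whose own configured budget resolves to 5 nodes sees nothing
# left at all, because Drift's 10 marked nodes count against it too.
assert remaining_for_reason(5, 10) == 0
```

The second assertion is the confusing case: Consolidation's "5" was never really its own slice, just an upper bound measured against everyone's activity combined.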

I agree we should do something here to fix how confusing this budget definition stuff is!


- **The catch all budget becomes the parent**: it defines the total disruption capacity for the NodePool.
- **Per reason budgets become children**: each owns its slice of disruption capacity independently.
- **Budgets add up the way users expect**: "10% for drift" plus "5% for consolidation" totals 15%. Any difference between that sum and the parent is a shared excess pool available to any reason.

This is where communicating drift 'urgency' may be helpful because you can say "hey, I have 100 drifted nodes, but 6 days to manage them, maybe I shouldn't burn all of my disruption budget on these drifts right now, but do just enough to be on track to complete them before 6 days is up". Maybe this token budget model is how we do this under-the-hood, but I think as long as the drifts are on track to be completed in a timely fashion we don't need to take up all the budget.

- **`ceil`**: maximum bandwidth including borrowed capacity. A class can burst above its rate by borrowing unused bandwidth from siblings, up to this ceiling.
- **parent**: enforces the global cap across all children.

When a class is not using its full `rate`, the unused portion flows up to the parent and becomes available to sibling classes. A class consuming above its `rate` is borrowing from the shared surplus. The parent's `rate` is the hard ceiling for all children combined.

The fact that unused rate bubbles up here worries me because the consolidation functions are behind the drift ones in the order and could end up getting starved still if the 'rate' looks like it is there when drift runs (when really it just hasn't looked at consolidation opportunities yet).


When a class is not using its full `rate`, the unused portion flows up to the parent and becomes available to sibling classes. A class consuming above its `rate` is borrowing from the shared surplus. The parent's `rate` is the hard ceiling for all children combined.

HTB is **work conserving up to each class's `ceil`**. No capacity sits idle when there is demand for it. If a class has no traffic, its entire `rate` is available to siblings that want to borrow. However, a borrowing class will never exceed its own `ceil`, even if more surplus is available. This means capacity is fully utilized across classes that have work to do, while each class's maximum consumption is still bounded.
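The "work conserving up to each class's `ceil`" rule can be sketched as follows (a minimal sketch with assumed names, not the RFC's implementation):

```python
def htb_headroom(rate: int, ceil: int, used: int, surplus: int) -> int:
    """How much more a class may consume right now.

    It gets its own unused rate plus whatever surplus its siblings left
    behind, but never so much that it would exceed its own ceil.
    """
    own = max(rate - used, 0)
    cap = max(ceil - used, 0)        # ceil bounds total consumption
    return min(own + surplus, cap)

# rate=10, ceil=15: even with 20 units of sibling surplus on offer,
# a class already at its rate may only borrow up to its ceil.
assert htb_headroom(10, 15, 10, 20) == 5
# An idle class's full rate is available to siblings (work conserving),
# while the class itself can still use all of it when demand returns.
assert htb_headroom(10, 15, 0, 0) == 10
```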

This addresses my worry above, but it's adding a little bit of complexity for operators to reason about. How do we communicate to operators when their settings are problematic, and how do they fix it when that happens? Also, drift and consolidation for efficiency aren't mutually exclusive, and we could be doing more to find 'two-birds-with-one-stone' disruptions here.


## Proposed Model

Each reason becomes an HTB class with its own `rate` (the budget the user configured) and a `ceil` equal to the global cap. When a reason is not using its full budget, the unused portion is available to other reasons. This means budgets read as independent slices but unused capacity is not wasted.

What do you imagine would be reasonable defaults here? When and why should people change those?

├── NodePool "batch"
│ rate: 30 nodes # batch workloads get more disruption capacity
│ ceil: 50 nodes # can use full cluster budget if others are idle

Ah. So the ceil is computed from the allocated rate plus the 'excess' auto-magically here?

```
// HTB: per reason with borrowing from excess pool
own_remaining = max(rate[reason] - used[reason], 0)
free_pool = max(excess_pool - pool_used, 0)
```
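The pseudocode quoted above can be completed into a runnable sketch. The final `allowed` expression is my guess at the intended combination; the variable names follow the RFC's pseudocode, not real Karpenter code:

```python
def htb_allowed(rate: dict, used: dict, excess_pool: int,
                pool_used: int, reason: str) -> int:
    """Per-reason allowance with borrowing from the shared excess pool."""
    own_remaining = max(rate[reason] - used[reason], 0)
    free_pool = max(excess_pool - pool_used, 0)
    # Assumed combination: a reason may use its own remaining budget
    # plus whatever is left of the shared excess pool.
    return own_remaining + free_pool

rate = {"drift": 10, "consolidation": 5}
used = {"drift": 10, "consolidation": 0}
# Drift exhausted its own rate but can still draw on a 5-node excess.
assert htb_allowed(rate, used, excess_pool=5, pool_used=0, reason="drift") == 5
```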

Definitely agree there is value in having some different budget pools instead of just the one global one in the code.


Also, it helps that the proposed pools are tied to different functions at the moment, so there is some natural scope for each budget to work in.


With a standalone DisruptionBudget CRD, backward compatibility is straightforward. The existing inline NodePool budgets keep their current behavior. HTB semantics only apply when a NodePool references a DisruptionBudget CR. Users opt in explicitly by creating the CR and wiring the reference. No silent semantic changes, no feature flags needed.
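One way the opt-in wiring could look. This is entirely illustrative: the RFC text here does not fix a schema, so the group/version, field names, and `budgetRef` shape below are all hypothetical:

```yaml
apiVersion: karpenter.sh/v1alpha1        # hypothetical group/version
kind: DisruptionBudget
metadata:
  name: cluster-default
spec:
  parent:                                # parent cap across all reasons
    nodes: "20"
  reasons:
    - reason: Drifted                    # child rate for drift
      nodes: "10"
    - reason: Underutilized              # child rate for consolidation
      nodes: "5"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    budgetRef:                           # hypothetical opt-in reference
      name: cluster-default
```

A NodePool without a `budgetRef` would keep today's inline budget behavior, which is what makes the opt-in explicit.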

## Open Questions

It would be good to walk through some alternative ways that this could be approached or solved, discuss the pros/cons relative to this proposal, and give some reasons why this is the better choice. I'd be particularly interested in you describing the simplest possible solution you could imagine to the starvation problems you are seeing, and why going beyond that for this proposal is worthwhile. I do think the budget definitions need work so that the interface presented lines up better with the system behavior; it's genuinely confusing as things stand, and it's not easy to diagnose issues like the one you mentioned with drift starving out consolidation.

I would also encourage thinking through ways we can get better at finding candidates that make the cluster more efficient and keep the drift progressing at the same time. Even fairly efficient nodes slated for drift might make a lot of sense to disrupt if we can get the disrupted pods onto existing post-drift nodes. Also, we should think through how to organically push new pod placements away from nodes that need to be drifted, and how to steer ReplicaSets towards terminating pods on nodes slated for drift before nodes that are up-to-date. There are a lot of ways to approach making less work for disruption to do here, and to steer the disruptions towards nodes that need to go anyway.


Disruption budgets are configured **per NodePool**, but within each NodePool, the disrupting counter is **shared across all reasons**. `BuildDisruptionBudgetMapping` counts ALL `MarkedForDeletion()` nodes as disrupting regardless of why they were marked. If Drift consumes 10 of 15 allowed disruptions, Consolidation sees only 5 remaining even if its configured budget is 5%. Drift's consumption directly reduces Consolidation's effective budget.

This creates two problems for users trying to reason about their budgets:

I like coming at this from the perspective of configuration and understanding. I do think it would be helpful to walk through some examples or cases where bad stuff happens because of the current state of things. In the meeting you talked about how drift can 'starve' out consolidation; I think it would be really good to walk folks through how and why that happens here.

- Drift can exceed its 10-node budget by borrowing from the excess pool when other reasons are not using it.
- Even when Drift borrows all 5 excess nodes, Consolidation still has its configured 5. The budget reads correctly.
- Unused capacity is not wasted. If one reason is idle, others can use the slack. But each reason's configured budget is always respected.
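A toy numeric walkthrough of the three bullets above, with assumed numbers (Drift rate 10, Consolidation rate 5, parent cap 20, so the excess pool is 5 nodes); the function is an illustrative sketch, not Karpenter code:

```python
parent_cap = 20
rate = {"drift": 10, "consolidation": 5}
excess_pool = parent_cap - sum(rate.values())   # 5 shared nodes

def allowance(reason: str, used: dict, pool_used: int) -> int:
    """Own remaining rate plus whatever excess pool is still free."""
    own = max(rate[reason] - used[reason], 0)
    return own + max(excess_pool - pool_used, 0)

used = {"drift": 0, "consolidation": 0}
# Drift may burst to 10 (own) + 5 (excess) = 15 nodes.
assert allowance("drift", used, pool_used=0) == 15

# Even after Drift borrows all 5 excess nodes, Consolidation still has
# its configured 5: borrowing never eats into a sibling's own rate.
used = {"drift": 15, "consolidation": 0}
assert allowance("consolidation", used, pool_used=5) == 5
```

Contrast this with the shared-counter model, where Drift marking 15 nodes would leave Consolidation nothing.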


Having a walk through of a toy/example drift starving consolidation case without these changes would help here too!

