Skip to content

Commit 8fb2856

Browse files
authored
feat(autoscaling): CPURequestsRemoveLimitsMemoryRequestsAndLimits + cluster burstable default (#49314)
## Summary ### 1. `CPURequestsRemoveLimitsMemoryRequestsAndLimits` controlled value Implements the new `CPURequestsRemoveLimitsMemoryRequestsAndLimits` container `controlledValues` enum from the datadog-operator ([c59fc90](DataDog/datadog-operator@c59fc90)). When set on a container constraint: - **CPU**: only requests are controlled; any existing CPU limit is actively removed from the live pod, letting the container burst freely and avoid CPU throttling. - **Memory**: both requests and limits are controlled (standard `RequestsAndLimits` semantics). The removal flows via a sentinel quantity (`-1`) stored on `ContainerResources.Limits[cpu]` by `applyVerticalConstraints` and recognised by the pod patcher, which actively deletes any pre-existing CPU limit from the live pod (an absent entry would mean "leave untouched"). ### 2. Recognise `spec.options.burstable` in `IsBurstable()` `IsBurstable()` now resolves from `spec.options.burstable` first, then the existing preview annotation. There is no cluster-level default — burstable mode requires an explicit per-DPA opt-in via either signal. ### 3. Report effective burstable mode in `status.options.burstable` `BuildStatus` populates `status.options.burstable` so consumers can observe the resolved value: - When `spec.options.burstable` is set: reported as-is (true or false). - When the preview annotation enables burstable: reported as `true`. - Otherwise: omitted (no opt-in). ### 4. Fix stale-sentinel propagation when the backend stops emitting a vertical recommendation `SetActiveScalingValues` previously self-assigned `p.scalingValues.Vertical` when the active vertical source was `nil` (because `selectScalingValues(nil)` returns `p.scalingValues`). This kept any burstable sentinel from the previous cycle alive and caused `applyVerticalConstraints(burstable=false)` to early-return, suppressing the rollout when burstable was toggled off. Now `Vertical` is explicitly reset to `nil` so the next sync sees "no recommendation" and clears live state. ### 5. Dependency bump `go.mod`: `datadog-operator/api` → `v0.0.0-20260515125012-8e158b708444`. ## Test plan - [x] `TestApplyVerticalConstraints_CPURequestsRemoveLimits` — CPU limit stripped from recommendation, memory limit preserved, sentinel inserted. - [x] `TestPatchContainerResources_*` / `TestPatchPod_*` — CPU limit actively removed from live pod, idempotent when already absent. - [x] `IsBurstable priority matrix` — spec > annotation > default false. - [x] `BuildStatus` reports `options.burstable` for spec (true/false) and for annotation-enabled, omits when neither is set. - [x] `TestSetActiveScalingValues_NilSource_ClearsVertical` — nil vertical source clears `scalingValues.Vertical` instead of self-assigning the stale sentinel. - [x] Full package suite passes: `dda inv test --targets=./pkg/clusteragent/autoscaling/workload/...` - [x] Lint clean: `dda inv linter.go --targets ./pkg/clusteragent/autoscaling/workload/...` - [x] Manual QA in kind cluster: DPA with preview annotation reports `status.options.burstable: true`; DPA without explicit opt-in reports no options (defaults to non-burstable). 🤖 PR description and code assisted by Claude:claude-opus-4-7 Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>
1 parent 48d67af commit 8fb2856

10 files changed

Lines changed: 509 additions & 121 deletions

go.mod

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -164,7 +164,7 @@ require (
164164
github.com/DataDog/datadog-agent/pkg/util/winutil v0.78.1
165165
github.com/DataDog/datadog-agent/pkg/version v0.78.1
166166
github.com/DataDog/datadog-go/v5 v5.8.3
167-
github.com/DataDog/datadog-operator/api v0.0.0-20260505153817-d1d2b65124cc
167+
github.com/DataDog/datadog-operator/api v0.0.0-20260515125012-8e158b708444
168168
github.com/DataDog/datadog-traceroute v1.0.15
169169
github.com/DataDog/dd-trace-go/v2 v2.8.2
170170
github.com/DataDog/ebpf-manager v0.7.18

go.sum

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

pkg/clusteragent/autoscaling/workload/controller_test.go

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1520,6 +1520,9 @@ func TestProfileManagedDPA(t *testing.T) {
15201520
condition(datadoghqcommon.DatadogPodAutoscalerHorizontalAbleToScaleCondition, corev1.ConditionUnknown, "", "", testTime),
15211521
condition(datadoghqcommon.DatadogPodAutoscalerVerticalAbleToApply, corev1.ConditionUnknown, "", "", testTime),
15221522
},
1523+
Options: &datadoghqcommon.DatadogPodAutoscalerOptionsStatus{
1524+
Burstable: pointer.Ptr(true),
1525+
},
15231526
},
15241527
}
15251528
f.ExpectCreateAction(mustUnstructured(t, expectedDPA))

pkg/clusteragent/autoscaling/workload/controller_vertical_helpers.go

Lines changed: 90 additions & 74 deletions
Original file line numberDiff line numberDiff line change
@@ -42,10 +42,9 @@ const (
4242

4343
const inPlaceResizeSupportedCacheTTL = 15 * time.Minute
4444

45-
// removeLimitSentinel is stored in a container Limits map by applyVerticalConstraints (burstable
46-
// mode) to signal "delete this limit from the pod instead of setting it to this value".
47-
// Negative quantities are never valid as real Kubernetes resource values, making the intent
48-
// unambiguous and easy to identify with quantity.Sign() < 0.
45+
// removeLimitSentinel is a sentinel quantity stored in ContainerResources.Limits to
46+
// signal that an existing limit must be actively deleted from the live pod. Negative
47+
// quantities are never valid Kubernetes resource values, making the intent unambiguous.
4948
var removeLimitSentinel = resource.MustParse("-1")
5049

5150
// isInPlaceResizeSupported checks whether the API server exposes the pods/resize
@@ -336,9 +335,10 @@ func hasLimitIncrease(
336335
}
337336

338337
// applyVerticalConstraints applies the container constraints from the PodAutoscaler spec to the
339-
// recommendations, and in burstable mode stores removeLimitSentinel (-1) on the CPU limit of every
340-
// container so that:
341-
// - the ResourcesHash changes when burstable mode is toggled (triggering pod re-patches)
338+
// recommendations. When the CPU limit must be removed from the live pod — autoscaler-wide in
339+
// burstable mode, or per-container when ControlledValues is CPURequestsRemoveLimitsMemoryRequestsAndLimits —
340+
// it stores removeLimitSentinel (-1) on that container's CPU limit so that:
341+
// - the ResourcesHash changes when the removal is toggled (triggering pod re-patches)
342342
// - hasLimitIncrease sees cpuLimit.Sign() <= 0 and correctly identifies the transition as a limit increase
343343
// - patchContainerResources removes the CPU limit from the pod when it encounters the sentinel
344344
func applyVerticalConstraints(verticalRecs *model.VerticalScalingValues, constraints *datadoghqcommon.DatadogPodAutoscalerConstraints, burstable bool) (limitErr, err error) {
@@ -380,90 +380,101 @@ func applyVerticalConstraints(verticalRecs *model.VerticalScalingValues, constra
380380
if !found {
381381
constraint = wildcardConstraint
382382
}
383-
if constraint == nil {
384-
kept = append(kept, cr)
385-
continue
386-
}
387-
388-
// Enabled=false: drop this container's recommendations entirely
389-
if constraint.Enabled != nil && !*constraint.Enabled {
390-
modified = true
391-
continue
392-
}
393383

394-
// Resolve which resources are controlled.
395-
// nil defaults to [cpu, memory]; empty list is equivalent to Enabled=false.
396-
controlled := constraint.ControlledResources
397-
if controlled == nil {
398-
controlled = []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory}
399-
}
400-
if len(controlled) == 0 {
401-
modified = true
402-
continue
403-
}
384+
// removeCPULimit is true when the CPU limit must be deleted from the live pod.
385+
// It applies autoscaler-wide in burstable mode, and per-container when the
386+
// constraint's ControlledValues requests CPU-limit removal (resolved below).
387+
// The actual stamping happens once, after constraint processing.
388+
removeCPULimit := burstable
404389

405-
// Remove resources not in the controlled list from requests and limits
406-
for name := range cr.Requests {
407-
if !slices.Contains(controlled, name) {
408-
delete(cr.Requests, name)
390+
if constraint != nil {
391+
// Enabled=false: drop this container's recommendations entirely
392+
if constraint.Enabled != nil && !*constraint.Enabled {
409393
modified = true
394+
continue
410395
}
411-
}
412-
for name := range cr.Limits {
413-
if !slices.Contains(controlled, name) {
414-
delete(cr.Limits, name)
396+
397+
// Resolve which resources are controlled.
398+
// nil defaults to [cpu, memory]; empty list is equivalent to Enabled=false.
399+
controlled := constraint.ControlledResources
400+
if controlled == nil {
401+
controlled = []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory}
402+
}
403+
if len(controlled) == 0 {
415404
modified = true
405+
continue
416406
}
417-
}
418407

419-
// ControlledValues=RequestsOnly: strip all limits
420-
if constraint.ControlledValues != nil && *constraint.ControlledValues == datadoghqcommon.DatadogPodAutoscalerContainerControlledValuesRequestsOnly {
421-
if len(cr.Limits) > 0 {
422-
cr.Limits = nil
423-
modified = true
408+
// Remove resources not in the controlled list from requests and limits
409+
for name := range cr.Requests {
410+
if !slices.Contains(controlled, name) {
411+
delete(cr.Requests, name)
412+
modified = true
413+
}
414+
}
415+
for name := range cr.Limits {
416+
if !slices.Contains(controlled, name) {
417+
delete(cr.Limits, name)
418+
modified = true
419+
}
424420
}
425-
}
426421

427-
// Resolve min/max bounds for clamping.
428-
// New top-level MinAllowed/MaxAllowed apply to both requests and limits.
429-
// Deprecated Requests.MinAllowed/MaxAllowed apply to requests only.
430-
reqMin, reqMax, limMin, limMax := resolveMinMaxBounds(constraint)
431-
432-
// Clamp existing requests and limits to their respective bounds.
433-
// Track which containers were clamped for the VerticalScalingLimited condition.
434-
requestsClamped := clampResourceList(cr.Requests, reqMin, reqMax)
435-
limitsClamped := clampResourceList(cr.Limits, limMin, limMax)
436-
if requestsClamped || limitsClamped {
437-
clampedContainers = append(clampedContainers, cr.Name)
438-
modified = true
439-
}
422+
// ControlledValues=RequestsOnly: strip all limits
423+
if constraint.ControlledValues != nil && *constraint.ControlledValues == datadoghqcommon.DatadogPodAutoscalerContainerControlledValuesRequestsOnly {
424+
if len(cr.Limits) > 0 {
425+
cr.Limits = nil
426+
modified = true
427+
}
428+
}
429+
430+
// ControlledValues=CPURequestsRemoveLimitsMemoryRequestsAndLimits: like burstable,
431+
// the CPU limit must be removed from the live pod (handled by the shared stamping below).
432+
if constraint.ControlledValues != nil &&
433+
*constraint.ControlledValues == datadoghqcommon.DatadogPodAutoscalerContainerControlledValuesCPURequestsRemoveLimitsMemoryRequestsAndLimits {
434+
removeCPULimit = true
435+
}
440436

441-
// Maintain invariant: limits >= requests for all resources where both exist
442-
for resourceName, reqQty := range cr.Requests {
443-
if limQty, hasLimit := cr.Limits[resourceName]; hasLimit && limQty.Cmp(reqQty) < 0 {
444-
cr.Limits[resourceName] = reqQty.DeepCopy()
437+
// Resolve min/max bounds for clamping.
438+
// New top-level MinAllowed/MaxAllowed apply to both requests and limits.
439+
// Deprecated Requests.MinAllowed/MaxAllowed apply to requests only.
440+
reqMin, reqMax, limMin, limMax := resolveMinMaxBounds(constraint)
441+
442+
// Clamp existing requests and limits to their respective bounds.
443+
// Track which containers were clamped for the VerticalScalingLimited condition.
444+
requestsClamped := clampResourceList(cr.Requests, reqMin, reqMax)
445+
limitsClamped := clampResourceList(cr.Limits, limMin, limMax)
446+
if requestsClamped || limitsClamped {
447+
clampedContainers = append(clampedContainers, cr.Name)
445448
modified = true
446449
}
447-
}
448450

449-
kept = append(kept, cr)
450-
}
451+
// Maintain invariant: limits >= requests for all resources where both exist.
452+
// Skip sentinel limits (negative) — they are internal markers, not values.
453+
for resourceName, reqQty := range cr.Requests {
454+
if limQty, hasLimit := cr.Limits[resourceName]; hasLimit && limQty.Sign() >= 0 && limQty.Cmp(reqQty) < 0 {
455+
cr.Limits[resourceName] = reqQty.DeepCopy()
456+
modified = true
457+
}
458+
}
459+
}
451460

452-
// In burstable mode, stamp the CPU limit to removeLimitSentinel (-1) on every container.
453-
// This has three effects:
454-
// 1. The ResourcesHash changes when burstable is toggled (pods get re-patched).
455-
// 2. hasLimitIncrease sees cpuLimit.Sign() <= 0, correctly identifying non-burstable →
456-
// burstable as a limit increase (unlimited > any finite limit).
457-
// 3. patchContainerResources sees Sign() < 0 and removes the CPU limit from the running
458-
// pod instead of setting it.
459-
if burstable {
460-
for i := range kept {
461-
if kept[i].Limits == nil {
462-
kept[i].Limits = corev1.ResourceList{}
461+
// Stamp the CPU limit with removeLimitSentinel (-1) when it must be removed from the
462+
// live pod. Done after clamping/invariant so the sentinel value is never altered (both
463+
// skip negative quantities). This has three effects:
464+
// 1. The ResourcesHash changes when removal is toggled (pods get re-patched).
465+
// 2. hasLimitIncrease sees cpuLimit.Sign() <= 0, correctly identifying the transition
466+
// to "no limit" as a limit increase (unlimited > any finite limit).
467+
// 3. patchContainerResources sees Sign() < 0 and removes the CPU limit from the running
468+
// pod instead of setting it (an absent entry would mean "leave untouched").
469+
if removeCPULimit {
470+
if cr.Limits == nil {
471+
cr.Limits = corev1.ResourceList{}
463472
}
464-
kept[i].Limits[corev1.ResourceCPU] = removeLimitSentinel.DeepCopy()
473+
cr.Limits[corev1.ResourceCPU] = removeLimitSentinel.DeepCopy()
465474
modified = true
466475
}
476+
477+
kept = append(kept, cr)
467478
}
468479

469480
verticalRecs.ContainerResources = kept
@@ -507,12 +518,17 @@ func resolveMinMaxBounds(c *datadoghqcommon.DatadogPodAutoscalerContainerConstra
507518

508519
// clampResourceList clamps each resource quantity in the list to [min, max].
509520
// Returns true if any values were modified.
521+
// removeLimitSentinel entries are skipped: they are internal markers for downstream
522+
// consumers (signalling "delete this limit from the pod") and must survive clamping.
510523
func clampResourceList(rl corev1.ResourceList, minAllowed, maxAllowed corev1.ResourceList) bool {
511524
if rl == nil {
512525
return false
513526
}
514527
modified := false
515528
for name, qty := range rl {
529+
if qty.Cmp(removeLimitSentinel) == 0 { // preserve the remove-limit sentinel as-is
530+
continue
531+
}
516532
clamped := false
517533
if minQty, ok := minAllowed[name]; ok && qty.Cmp(minQty) < 0 {
518534
qty = minQty.DeepCopy()

0 commit comments

Comments
 (0)