fix(autoscaling): release .spec.replicas ownership when DPA stops scaling#52040
fix(autoscaling): release .spec.replicas ownership when DPA stops scaling#52040celenechang wants to merge 5 commits into
Conversation
…ling
When the cluster agent scales a workload via the DatadogPodAutoscaler,
Kubernetes records a `datadog-cluster-agent` field-manager entry on the
target's `.spec.replicas` via the scale subresource. The cluster agent
never relinquished that entry, so once the DPA stopped scaling (mode
switch to vertical-only, or DPA deletion) the stale manager remained
forever — causing server-side-apply writers such as Helm to fail with
`Apply failed with 1 conflict: conflict with "datadog-cluster-agent"`
unless the user passed --force-conflicts indefinitely.
This adds a `releaseReplicasOwnership` method on the scaler that does a
JSON patch to drop the cluster agent's `scale`-subresource entry from
`metadata.managedFields`, and wires it into:
- controller_horizontal.go: on horizontal-disabled transition,
gated on HorizontalLastActions so we don't issue a GET every
reconcile after the cleanup has run.
- controller.go: before deletePodAutoscaler in both the remote-owned
and profile-managed delete paths.
Cleanup is best-effort: errors are logged but never block deletion or
mode-switch logic. The method is idempotent and treats target-NotFound
as a no-op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54db06af6c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…h RBAC
Addresses two review findings:
1. The previous commit only released `.spec.replicas` ownership on the
remote-owned and profile-managed delete paths. Local-owned DPAs
(deleted via `kubectl delete dpa ...`) reach the `podAutoscaler == nil`
branch at the top of syncPodAutoscaler and clear the internal store
without calling release, so any workload that was previously scaled
by that DPA still leaked a stale `datadog-cluster-agent` scale
managedFields entry. The release call is now made on that path too.
2. The cluster agent's in-tree RBAC manifests only grant `get` on
`apps/{deployments,statefulsets,replicasets,daemonsets}` and core
`replicationcontrollers`, which meant the new JSON patch on
`metadata.managedFields` would be rejected as Forbidden. Add the
`patch` verb on the scaleable resources (excluding daemonsets, which
have no `/scale` subresource and therefore cannot be DPA targets)
across all eight `Dockerfiles/manifests/*/cluster-agent-rbac.yaml`
files. Out-of-tree manifests (Helm chart, Operator) need the same
bump separately — called out in the release note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
Files inventory check summaryFile checks results against ancestor 94b2da60: Results for datadog-agent_7.81.0~devel.git.645.fdef0c5.pipeline.117879779-1_amd64.deb:No change detected |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a06ee93b0e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
30 successful checks with minimal change (< 2 KiB)
|
When releaseReplicasOwnership computes managedFields indices off a GET
and then issues a JSON PATCH using those indices, a concurrent writer
(Helm SSA, kubectl, another controller) can mutate managedFields between
the two calls. If managedFields shifts, the numeric `remove` op no
longer points at the cluster agent's scale entry and could silently
drop a foreign controller's ownership entry.
Precede each `remove` with two `test` ops asserting
`/metadata/managedFields/<idx>/manager == "datadog-cluster-agent"` and
`/metadata/managedFields/<idx>/subresource == "scale"`. JSON patches
are atomic — if any test fails the whole patch is rejected as 422
Invalid, and we just retry on the next reconcile with a fresh snapshot.
Tests:
- TestScalerReleaseReplicasOwnership_PatchIsTestGuarded asserts the
emitted patch body contains the expected test+test+remove triplet.
- TestScalerReleaseReplicasOwnership_TestOpFailureRejectsPatch uses a
reactor to surface a stale GET vs. a different stored state, and
confirms the patch is rejected rather than silently misapplied.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb70ab1dfb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 94b2da6 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +1.63 | [+0.58, +2.68] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.12 | [+0.08, +0.15] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.00 | [-0.05, +0.05] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.28 | [-0.54, -0.02] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 144.58MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 731.56KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 480.29MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.13MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 181.82MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.37MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 338.46 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 405.60MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.93GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (ddprof)
Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | comparison | 10 | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
The disable-path cleanup was clearing HorizontalLastActions even when releaseReplicasOwnership returned an error. Since the release call is gated on `len(HorizontalLastActions) > 0`, clearing on failure meant the very next reconcile would skip the release entirely and the stale `datadog-cluster-agent` scale managedFields entry would leak forever — defeating the whole point of the fix when transient failures (missing RBAC, JSON-patch test op race, API blip) showed up. Only ClearHorizontalState after a successful release. On failure, return Requeue without an error so the workqueue retries with normal backoff (a workqueue error would consume the bounded retry budget; we want unbounded retries here because the failure can be transient). Added TestHorizontalControllerReleaseOwnershipOnDisable_FailureRetainsState asserting both the Requeue result and that HorizontalLastActions is preserved across a failed release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4a0faa311
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if err := hr.scaler.releaseReplicasOwnership(ctx, autoscalerInternal.Namespace(), autoscalerInternal.Spec().TargetRef.Name, targetGVK); err != nil { | ||
| log.Warnf("Failed to release replicas ownership for %s %s/%s after disabling horizontal scaling, will retry: %v", targetGVK.Kind, autoscalerInternal.Namespace(), autoscalerInternal.Spec().TargetRef.Name, err) | ||
| return autoscaling.Requeue, nil |
There was a problem hiding this comment.
Add rollout patch RBAC before retrying cleanup
When horizontal scaling is disabled after prior actions, this new path treats any releaseReplicasOwnership error as retryable and does not clear horizontal state. The repo enables Argo Rollout targets when the CRD exists (provider.go adds argoproj.io/v1alpha1 rollouts), but the manifest changes only grant patch on apps resources and replicationcontrollers, while the existing argoproj.io rule remains get/list/watch. In clusters autoscaling Rollouts from these bundled manifests, this call will be forbidden forever, causing every reconcile to requeue and preventing horizontal state from clearing.
Useful? React with 👍 / 👎.
…NotFound releaseReplicasOwnership iterated REST mappings and short-circuited to nil as soon as ANY mapping returned NotFound, even if an earlier mapping had returned a real error (Forbidden, transient API failure) that was stored in firstErr. For multi-version CRD targets with partial RBAC rollout — e.g. patch granted on apps/v1 but not on a parallel apps/v1beta mapping — this caused the function to look successful while the actual managedFields entry was never touched, and the disable-path then cleared horizontal state so no future reconcile would retry. Multi-version mappings all resolve to the same etcd object, so iterating serves no purpose. Use only mappings[0] (the preferred mapping) and let its error propagate. RESTMappings error now wrapped with %w so callers can errors.Is/As distinguish e.g. apimeta.NoKindMatchError from transient failures. Added TestScalerReleaseReplicasOwnership_ForbiddenSurfacesEvenWithMultipleMappings covering the regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b84a9ab to
fdef0c5
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fdef0c5f1d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| - apiGroups: ["apps"] | ||
| resources: ["statefulsets", "replicasets", "deployments"] | ||
| verbs: ["patch"] |
There was a problem hiding this comment.
Grant patch permission for Rollout targets
When a DatadogPodAutoscaler targets an Argo Rollout, the new releaseReplicasOwnership path patches the target GVR directly, but this RBAC addition only grants patch on built-in apps/core workloads while the same manifest still only grants get/list/watch on argoproj.io rollouts below. In that supported Rollout scenario, switching horizontal scaling off now hits a permanent Forbidden error and the disable path keeps requeueing instead of clearing horizontal state; deletion cleanup also cannot remove the stale managedFields entry. Please add patch on argoproj.io rollouts in each generated manifest variant.
Useful? React with 👍 / 👎.
Summary
Resolves a stale-field-manager leak in the workload autoscaler: when the cluster agent scales a target workload via the
scalesubresource, Kubernetes records adatadog-cluster-agentmanaged-fields entry on.spec.replicas. The cluster agent never released that entry, so once the DPA stopped scaling (mode switch to vertical-only, or DPA deletion) the stale entry remained forever — causing server-side-apply writers such as Helm to fail with:…unless the user passed
--force-conflictsindefinitely. Plainhelm upgrade --install(client-side apply) was not affected.Changes
scaler.go— newreleaseReplicasOwnership(ctx, ns, name, gvk)method: GETs the parent resource via the dynamic client, finds managed-fields entries whereManager == "datadog-cluster-agent"ANDSubresource == "scale", and issues a JSON patch to drop them. Idempotent (no-op when no such entry exists). Treats target-NotFound as already-released.controller_horizontal.go— on the!IsHorizontalScalingEnabled()branch, call the release beforeClearHorizontalState(). Gated onHorizontalLastActionsbeing non-empty so we don't issue a GET on every reconcile while horizontal stays disabled (clearing horizontal state nils actions, so subsequent ticks skip).controller.go— call the release in bothDeleted()branches (remote-owned and profile-managed) beforedeletePodAutoscaler.Warnf) but never propagated. Deletion and mode-switch logic are never blocked.Non-goals / what this PR does NOT do
Scales().Update()to server-side apply. The release path uses a JSON patch onmetadata.managedFieldsdirectly, keeping blast radius minimal. Migrating to SSA is a separate, larger discussion.datadog-cluster-agent.Test plan
scaler_test.go:Subresource: "scale"is left untouchedcontroller_horizontal_test.go: disabling horizontal scaling with prior actions triggers exactly one release; without prior actions, no release.TestLeaderCreateDeleteRemoteextended to assert release is called beforedeletePodAutoscaler.go test -tags "kubeapiserver test" ./pkg/clusteragent/autoscaling/workload/...— all pass.gofmt,go vetclean.Reviewer notes
"datadog-cluster-agent"is derived from the binary's user-agent, which matches the manager string seen in customer-reported managedFields entries. Codified as aconstinscaler.go.🤖 Generated with Claude Code