Skip to content

fix(autoscaling): release .spec.replicas ownership when DPA stops scaling#52040

Draft
celenechang wants to merge 5 commits into
mainfrom
celene/dpa-release-replicas-ownership
Draft

fix(autoscaling): release .spec.replicas ownership when DPA stops scaling#52040
celenechang wants to merge 5 commits into
mainfrom
celene/dpa-release-replicas-ownership

Conversation

@celenechang

Copy link
Copy Markdown
Contributor

Summary

Resolves a stale-field-manager leak in the workload autoscaler: when the cluster agent scales a target workload via the scale subresource, Kubernetes records a datadog-cluster-agent managed-fields entry on .spec.replicas. The cluster agent never released that entry, so once the DPA stopped scaling (mode switch to vertical-only, or DPA deletion) the stale entry remained forever — causing server-side-apply writers such as Helm to fail with:

error: Apply failed with 1 conflict: conflict with "datadog-cluster-agent" with subresource "scale" using apps/v1: .spec.replicas

…unless the user passed --force-conflicts indefinitely. Plain helm upgrade --install (client-side apply) was not affected.

Changes

  • scaler.go — new releaseReplicasOwnership(ctx, ns, name, gvk) method: GETs the parent resource via the dynamic client, finds managed-fields entries where Manager == "datadog-cluster-agent" AND Subresource == "scale", and issues a JSON patch to drop them. Idempotent (no-op when no such entry exists). Treats target-NotFound as already-released.
  • controller_horizontal.go — on the !IsHorizontalScalingEnabled() branch, call the release before ClearHorizontalState(). Gated on HorizontalLastActions being non-empty so we don't issue a GET on every reconcile while horizontal stays disabled (clearing horizontal state nils actions, so subsequent ticks skip).
  • controller.go — call the release in both Deleted() branches (remote-owned and profile-managed) before deletePodAutoscaler.
  • Cleanup is best-effort: errors are logged (Warnf) but never propagated. Deletion and mode-switch logic are never blocked.

Non-goals / what this PR does NOT do

  • Does not switch DCA scale writes from Scales().Update() to server-side apply. The release path uses a JSON patch on metadata.managedFields directly, keeping blast radius minimal. Migrating to SSA is a separate, larger discussion.
  • Does not strip managed-fields entries from non-scale subresources, or from any manager other than datadog-cluster-agent.

Test plan

  • New unit tests in scaler_test.go:
    • removes the DCA scale entry, preserves other managers
    • no-op when no DCA entry exists (no PATCH issued)
    • target-NotFound returns nil
    • DCA entry without Subresource: "scale" is left untouched
  • New unit test in controller_horizontal_test.go: disabling horizontal scaling with prior actions triggers exactly one release; without prior actions, no release.
  • Existing test TestLeaderCreateDeleteRemote extended to assert release is called before deletePodAutoscaler.
  • go test -tags "kubeapiserver test" ./pkg/clusteragent/autoscaling/workload/... — all pass.
  • gofmt, go vet clean.

Reviewer notes

  • The field-manager name "datadog-cluster-agent" is derived from the binary's user-agent, which matches the manager string seen in customer-reported managedFields entries. Codified as a const in scaler.go.
  • For the disable path, I considered switching scale writes to SSA so that ownership is releasable via an empty Apply, but that touches every scale event for every DPA customer and felt out of scope.

🤖 Generated with Claude Code

…ling

When the cluster agent scales a workload via the DatadogPodAutoscaler,
Kubernetes records a `datadog-cluster-agent` field-manager entry on the
target's `.spec.replicas` via the scale subresource. The cluster agent
never relinquished that entry, so once the DPA stopped scaling (mode
switch to vertical-only, or DPA deletion) the stale manager remained
forever — causing server-side-apply writers such as Helm to fail with
`Apply failed with 1 conflict: conflict with "datadog-cluster-agent"`
unless the user passed --force-conflicts indefinitely.

This adds a `releaseReplicasOwnership` method on the scaler that does a
JSON patch to drop the cluster agent's `scale`-subresource entry from
`metadata.managedFields`, and wires it into:
  - controller_horizontal.go: on horizontal-disabled transition,
    gated on HorizontalLastActions so we don't issue a GET every
    reconcile after the cleanup has run.
  - controller.go: before deletePodAutoscaler in both the remote-owned
    and profile-managed delete paths.

Cleanup is best-effort: errors are logged but never block deletion or
mode-switch logic. The method is idempotent and treats target-NotFound
as a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@dd-octo-sts dd-octo-sts Bot added the internal Identify a non-fork PR label Jun 9, 2026
@github-actions github-actions Bot added the medium review PR review might take time label Jun 9, 2026
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 6 Pipeline jobs failed

DataDog/datadog-agent | lint_cross_macos-arm64   View in Datadog   GitLab

DataDog/datadog-agent | lint_cross_macos-x64   View in Datadog   GitLab

DataDog/datadog-agent | lint_cross_windows-x64   View in Datadog   GitLab

View all 6 failed jobs.

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 83.78%
Overall Coverage: 50.61% (-4.53%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fdef0c5 | Docs | Datadog PR Page | Give us feedback!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54db06af6c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread pkg/clusteragent/autoscaling/workload/scaler.go
Comment thread pkg/clusteragent/autoscaling/workload/controller.go
…h RBAC

Addresses two review findings:

  1. The previous commit only released `.spec.replicas` ownership on the
     remote-owned and profile-managed delete paths. Local-owned DPAs
     (deleted via `kubectl delete dpa ...`) reach the `podAutoscaler == nil`
     branch at the top of syncPodAutoscaler and clear the internal store
     without calling release, so any workload that was previously scaled
     by that DPA still leaked a stale `datadog-cluster-agent` scale
     managedFields entry. The release call is now made on that path too.

  2. The cluster agent's in-tree RBAC manifests only grant `get` on
     `apps/{deployments,statefulsets,replicasets,daemonsets}` and core
     `replicationcontrollers`, which meant the new JSON patch on
     `metadata.managedFields` would be rejected as Forbidden. Add the
     `patch` verb on the scaleable resources (excluding daemonsets, which
     have no `/scale` subresource and therefore cannot be DPA targets)
     across all eight `Dockerfiles/manifests/*/cluster-agent-rbac.yaml`
     files. Out-of-tree manifests (Helm chart, Operator) need the same
     bump separately — called out in the release note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@github-actions github-actions Bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Jun 9, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Files inventory check summary

File checks results against ancestor 94b2da60:

Results for datadog-agent_7.81.0~devel.git.645.fdef0c5.pipeline.117879779-1_amd64.deb:

No change detected

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a06ee93b0e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread pkg/clusteragent/autoscaling/workload/scaler.go Outdated
@dd-octo-sts

dd-octo-sts Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 94b2da6
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

Quality gate Change Size (prev → curr → max)
docker_cluster_agent_amd64 +12.01 KiB (0.01% increase, -1.64% of buffer) 209.753 → 209.765 → 210.470
docker_cluster_agent_arm64 -64.0 KiB (0.03% reduction, +89.26% of buffer) 222.910 → 222.847 → 222.980
30 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
agent_deb_amd64 751.655 MiB
agent_deb_amd64_fips 707.422 MiB
agent_heroku_amd64 310.320 MiB
agent_rpm_amd64 751.639 MiB
agent_rpm_amd64_fips 707.406 MiB
agent_rpm_arm64 727.042 MiB
agent_rpm_arm64_fips 686.557 MiB
agent_suse_amd64 751.639 MiB
agent_suse_amd64_fips 707.406 MiB
agent_suse_arm64 727.042 MiB
agent_suse_arm64_fips 686.557 MiB
docker_agent_amd64 811.084 MiB
docker_agent_arm64 811.516 MiB
docker_agent_jmx_amd64 1002.025 MiB
docker_agent_jmx_arm64 991.109 MiB
docker_cws_instrumentation_amd64 7.447 MiB
docker_cws_instrumentation_arm64 6.877 MiB
docker_dogstatsd_amd64 39.845 MiB
docker_dogstatsd_arm64 37.946 MiB
docker_host_profiler_amd64 304.525 MiB
docker_host_profiler_arm64 315.637 MiB
dogstatsd_deb_amd64 30.502 MiB
dogstatsd_deb_arm64 28.499 MiB
dogstatsd_rpm_amd64 30.502 MiB
dogstatsd_suse_amd64 30.502 MiB
iot_agent_deb_amd64 45.762 MiB
iot_agent_deb_arm64 42.476 MiB
iot_agent_deb_armhf 43.270 MiB
iot_agent_rpm_amd64 45.762 MiB
iot_agent_suse_amd64 45.761 MiB

When releaseReplicasOwnership computes managedFields indices off a GET
and then issues a JSON PATCH using those indices, a concurrent writer
(Helm SSA, kubectl, another controller) can mutate managedFields between
the two calls. If managedFields shifts, the numeric `remove` op no
longer points at the cluster agent's scale entry and could silently
drop a foreign controller's ownership entry.

Precede each `remove` with two `test` ops asserting
`/metadata/managedFields/<idx>/manager == "datadog-cluster-agent"` and
`/metadata/managedFields/<idx>/subresource == "scale"`. JSON patches
are atomic — if any test fails the whole patch is rejected as 422
Invalid, and we just retry on the next reconcile with a fresh snapshot.

Tests:
  - TestScalerReleaseReplicasOwnership_PatchIsTestGuarded asserts the
    emitted patch body contains the expected test+test+remove triplet.
  - TestScalerReleaseReplicasOwnership_TestOpFailureRejectsPatch uses a
    reactor to surface a stale GET vs. a different stored state, and
    confirms the patch is rejected rather than silently misapplied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb70ab1dfb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread pkg/clusteragent/autoscaling/workload/controller_horizontal.go
@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented Jun 9, 2026

Copy link
Copy Markdown

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 73ad0ae8-e85c-4145-a332-045fad81b80c

Baseline: 94b2da6
Comparison: fdef0c5
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
quality_gate_logs % cpu utilization +1.63 [+0.58, +2.68] 1 Logs bounds checks dashboard
quality_gate_idle_all_features memory utilization +0.12 [+0.08, +0.15] 1 Logs bounds checks dashboard
quality_gate_idle memory utilization +0.00 [-0.05, +0.05] 1 Logs bounds checks dashboard
quality_gate_metrics_logs memory utilization -0.28 [-0.54, -0.02] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed observed_value links
quality_gate_idle intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle memory_usage 10/10 144.58MiB ≤ 147MiB bounds checks dashboard
quality_gate_idle total_bytes_received 10/10 731.56KiB ≤ 819.20KiB bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 480.29MiB ≤ 495MiB bounds checks dashboard
quality_gate_idle_all_features total_bytes_received 10/10 1.13MiB ≤ 1.25MiB bounds checks dashboard
quality_gate_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_logs memory_usage 10/10 181.82MiB ≤ 195MiB bounds checks dashboard
quality_gate_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_logs total_bytes_received 10/10 264.37MiB ≤ 292MiB bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 338.46 ≤ 2000 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 3 ≤ 6 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 405.60MiB ≤ 430MiB bounds checks dashboard
quality_gate_metrics_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_metrics_logs total_bytes_received 10/10 0.93GiB ≤ 1.04GiB bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

Replicate Execution Details

We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.

Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.

Experiment Variant Replicates Failure Logs Debug Dashboard
experiment_with_failures baseline 0 (x8) 1 (x6) Oom killed Debug Dashboard

The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.

❌ Retried Profiling Replicate Execution Failures (ddprof)

Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.

Experiment Variant Replicates Failure Debug Dashboard
quality_gate_idle baseline 10 Oom killed Debug Dashboard
quality_gate_idle_all_features baseline 10 Oom killed Debug Dashboard
quality_gate_idle_all_features comparison 10 Oom killed Debug Dashboard
quality_gate_logs baseline 10 Oom killed Debug Dashboard
quality_gate_logs comparison 10 Oom killed Debug Dashboard
quality_gate_metrics_logs baseline 10 Oom killed Debug Dashboard
quality_gate_metrics_logs comparison 10 Oom killed Debug Dashboard

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.

The disable-path cleanup was clearing HorizontalLastActions even when
releaseReplicasOwnership returned an error. Since the release call is
gated on `len(HorizontalLastActions) > 0`, clearing on failure meant
the very next reconcile would skip the release entirely and the stale
`datadog-cluster-agent` scale managedFields entry would leak forever —
defeating the whole point of the fix when transient failures (missing
RBAC, JSON-patch test op race, API blip) showed up.

Only ClearHorizontalState after a successful release. On failure,
return Requeue without an error so the workqueue retries with normal
backoff (a workqueue error would consume the bounded retry budget; we
want unbounded retries here because the failure can be transient).

Added TestHorizontalControllerReleaseOwnershipOnDisable_FailureRetainsState
asserting both the Requeue result and that HorizontalLastActions is
preserved across a failed release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f4a0faa311

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +74 to +76
if err := hr.scaler.releaseReplicasOwnership(ctx, autoscalerInternal.Namespace(), autoscalerInternal.Spec().TargetRef.Name, targetGVK); err != nil {
log.Warnf("Failed to release replicas ownership for %s %s/%s after disabling horizontal scaling, will retry: %v", targetGVK.Kind, autoscalerInternal.Namespace(), autoscalerInternal.Spec().TargetRef.Name, err)
return autoscaling.Requeue, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add rollout patch RBAC before retrying cleanup

When horizontal scaling is disabled after prior actions, this new path treats any releaseReplicasOwnership error as retryable and does not clear horizontal state. The repo enables Argo Rollout targets when the CRD exists (provider.go adds argoproj.io/v1alpha1 rollouts), but the manifest changes only grant patch on apps resources and replicationcontrollers, while the existing argoproj.io rule remains get/list/watch. In clusters autoscaling Rollouts from these bundled manifests, this call will be forbidden forever, causing every reconcile to requeue and preventing horizontal state from clearing.

Useful? React with 👍 / 👎.

…NotFound

releaseReplicasOwnership iterated REST mappings and short-circuited to
nil as soon as ANY mapping returned NotFound, even if an earlier mapping
had returned a real error (Forbidden, transient API failure) that was
stored in firstErr. For multi-version CRD targets with partial RBAC
rollout — e.g. patch granted on apps/v1 but not on a parallel apps/v1beta
mapping — this caused the function to look successful while the actual
managedFields entry was never touched, and the disable-path then cleared
horizontal state so no future reconcile would retry.

Multi-version mappings all resolve to the same etcd object, so iterating
serves no purpose. Use only mappings[0] (the preferred mapping) and let
its error propagate. RESTMappings error now wrapped with %w so callers
can errors.Is/As distinguish e.g. apimeta.NoKindMatchError from transient
failures.

Added TestScalerReleaseReplicasOwnership_ForbiddenSurfacesEvenWithMultipleMappings
covering the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chatgpt-codex-connector[bot]

This comment was marked as outdated.

@celenechang celenechang force-pushed the celene/dpa-release-replicas-ownership branch from b84a9ab to fdef0c5 Compare June 10, 2026 03:01
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@DataDog DataDog deleted a comment from github-actions Bot Jun 10, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdef0c5f1d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +241 to +243
- apiGroups: ["apps"]
resources: ["statefulsets", "replicasets", "deployments"]
verbs: ["patch"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Grant patch permission for Rollout targets

When a DatadogPodAutoscaler targets an Argo Rollout, the new releaseReplicasOwnership path patches the target GVR directly, but this RBAC addition only grants patch on built-in apps/core workloads while the same manifest still only grants get/list/watch on argoproj.io rollouts below. In that supported Rollout scenario, switching horizontal scaling off now hits a permanent Forbidden error and the disable path keeps requeueing instead of clearing horizontal state; deletion cleanup also cannot remove the stale managedFields entry. Please add patch on argoproj.io rollouts in each generated manifest variant.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

internal Identify a non-fork PR long review PR is complex, plan time to review it team/container-integrations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant