
[CONTP-1365] feat(health): Add AD annotation health check#48962

Draft
Mathew-Estafanous wants to merge 8 commits into main from mathew.estafanous/ad-annotation-health-check


@Mathew-Estafanous Mathew-Estafanous commented Apr 7, 2026

What does this PR do?

Adds health platform reporting for autodiscovery (AD) annotation misconfigurations on Kubernetes pods. When a pod has malformed or mismatched ad.datadoghq.com/* annotations (e.g., referencing a container name that doesn't exist in the pod spec, invalid JSON syntax, or mismatched array lengths), the agent now reports these as structured health issues via the health platform component.
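The three misconfiguration classes named above (unknown container name, invalid JSON, mismatched array lengths) can be illustrated with a minimal sketch. This is not the code added by this PR: the function name `validate_ad_annotations`, the returned tuple shape, and the set of suffixes inspected are assumptions made for illustration only.

```python
import json

AD_PREFIX = "ad.datadoghq.com/"
# Suffixes whose values are parallel JSON arrays and must have equal lengths.
ARRAY_SUFFIXES = ("check_names", "init_configs", "instances")
# Suffixes this sketch inspects; other ad.datadoghq.com/* keys are skipped.
KNOWN_SUFFIXES = ARRAY_SUFFIXES + ("checks", "logs")


def validate_ad_annotations(annotations, container_names):
    """Return a list of (identifier, message) tuples, one per problem found.

    Hypothetical helper: the real agent logic may differ.
    """
    problems = []
    array_lengths = {}  # identifier -> {suffix: number of array entries}
    for key, value in annotations.items():
        if not key.startswith(AD_PREFIX):
            continue
        ident, sep, suffix = key[len(AD_PREFIX):].rpartition(".")
        if not sep or suffix not in KNOWN_SUFFIXES:
            continue
        # Class 1: the identifier does not name a container in the pod spec.
        if ident not in container_names:
            problems.append((ident, f"no container named {ident!r} in pod spec"))
        # Class 2: the annotation value is not valid JSON.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError as err:
            problems.append((ident, f"{suffix}: invalid JSON ({err.msg})"))
            continue
        if suffix in ARRAY_SUFFIXES and isinstance(parsed, list):
            array_lengths.setdefault(ident, {})[suffix] = len(parsed)
    # Class 3: the parallel arrays for one identifier differ in length.
    for ident, lengths in array_lengths.items():
        if len(set(lengths.values())) > 1:
            problems.append((ident, f"mismatched array lengths: {lengths}"))
    return problems
```

Calling it with `{"ad.datadoghq.com/wrongname.checks": "{}"}` and the container set `{"redis"}` yields a single problem for the identifier `wrongname`, which mirrors the scenario used in the validation steps.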

Motivation

Users with misconfigured AD annotations on their pods currently have no proactive feedback — the agent silently fails to schedule checks, and diagnosing the issue requires digging through agent logs or running agent configcheck. By surfacing these as health platform issues, users get actionable alerts with specific remediation steps pointing them to the affected pod and the exact error.

Describe how you validated your changes

Manual validation with injector-dev

  1. Deploy the agent with the health platform enabled and a test workload with a misconfigured annotation (e.g., wrongname instead of redis), using the injector-dev configuration and the redis-with-password.yaml manifest below:
platform:
  type: "minikube"
  name: "autodiscovery-demo"
  reset: false
helm:
  versions:
    agent:
      build: {}
    cluster_agent:
      build: {}
  manifests:
    - path: "redis-with-password.yaml"
      namespace: cache
  config:
    datadog:
      clusterName: mathewe-ahp-dev
      ignoreAutoConfig:
        - "redisdb"
      clusterChecks:
        enabled: true
      kubelet:
        tlsVerify: false
      envDict:
        DD_HEALTH_PLATFORM_ENABLED: "true"
    clusterAgent:
      enabled: true
---
apiVersion: v1
kind: Secret
metadata:
  name: redis-secret
stringData:
  password: "123456789"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        ad.datadoghq.com/rediss.checks: |
          {
            "redisdb": {
              "init_config": {},
              "instances": [
                {
                  "host": "%%host%%",
                  "port": "%%port%%",
                  "password": "ENC[k8s_secret@%%kube_namespace%%/redis-secret/password]"
                }
              ],
              "logs": [
                {
                  "source": "redis",
                  "service": "redis"
                }
              ]
            }
          }
    spec:
      containers:
      - name: redis
        image: redis:latest
        ports:
        - containerPort: 6379
        env:
          - name: REDIS_PASS
            valueFrom:
              secretKeyRef:
                name: redis-secret
                key: password
        command: ["sh", "-c", "redis-server --requirepass \"$REDIS_PASS\""]

The scenario deploys:

  • Agent and Cluster Agent built from the branch
  • A Redis deployment in the cache namespace with the annotation ad.datadoghq.com/wrongname.checks, where wrongname doesn't match the actual container name redis
  • Health platform enabled via DD_HEALTH_PLATFORM_ENABLED: "true"
  2. Verify the annotation error is reported as a health issue by exec-ing into the agent pod and checking the health platform status:
kubectl exec -it <agent-pod> -- agent diagnose
[image]
  3. Confirm the issue references the correct entity name and an error message indicating that wrongname doesn't match a container identifier.

  4. Fix the annotation by changing wrongname to redis in the deployment, then verify the health issue is cleared.
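The report-then-clear behaviour exercised by the validation steps can be pictured with a small sketch. It is purely illustrative: `IssueRegistry`, `reconcile`, and the entity string format are hypothetical names, and nothing here reflects the actual health platform component API.

```python
class IssueRegistry:
    """Hypothetical in-memory store of open annotation issues per pod entity."""

    def __init__(self):
        self._open = {}  # entity -> sorted tuple of problem messages

    def reconcile(self, entity, problems):
        """Store the latest problems for an entity.

        Returns "reported" when the set of problems is new or changed,
        "cleared" when a previously open issue has no problems left,
        and "unchanged" otherwise.
        """
        current = tuple(sorted(problems))
        previous = self._open.get(entity, ())
        if current == previous:
            return "unchanged"
        if current:
            self._open[entity] = current
            return "reported"
        del self._open[entity]
        return "cleared"

    def open_issues(self):
        """Return a copy of the currently open issues."""
        return dict(self._open)
```

In this sketch, the first reconcile for a pod with the wrongname annotation returns "reported", and a later reconcile with an empty problem list, after the annotation is fixed, returns "cleared".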

dd-octo-sts bot added the internal (Identify a non-fork PR), team/container-platform (The Container Platform Team), and team/agent-configuration labels Apr 7, 2026
github-actions bot added the long review (PR is complex, plan time to review it) label Apr 7, 2026

agent-platform-auto-pr bot commented Apr 7, 2026

Go Package Import Differences

Baseline: 9791466
Comparison: 91580ce

binary | os | arch | change
agent | linux | amd64 | +1, -0
agent | linux | arm64 | +1, -0
agent | windows | amd64 | +1, -0
agent | darwin | amd64 | +1, -0
agent | darwin | arm64 | +1, -0
iot-agent | linux | amd64 | +1, -0
iot-agent | linux | arm64 | +1, -0
heroku-agent | linux | amd64 | +1, -0

Package added in every binary listed above:
+github.com/DataDog/datadog-agent/comp/healthplatform/impl/issues/adannotation


agent-platform-auto-pr bot commented Apr 7, 2026

Files inventory check summary

File checks results against ancestor 97914667:

Results for datadog-agent_7.79.0~devel.git.635.91580ce.pipeline.107190700-1_amd64.deb:

No change detected


agent-platform-auto-pr bot commented Apr 7, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 9791466
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

Quality gate Change Size in MiB (prev → curr → max)
agent_deb_amd64 +16.0 KiB (0.00% increase) 755.917 → 755.933 → 757.690
agent_deb_amd64_fips +16.0 KiB (0.00% increase) 712.649 → 712.665 → 718.060
agent_heroku_amd64 +12.03 KiB (0.00% increase) 313.341 → 313.353 → 322.130
agent_msi +13.0 KiB (0.00% increase) 610.358 → 610.371 → 656.110
agent_rpm_amd64 +16.0 KiB (0.00% increase) 755.901 → 755.917 → 757.660
agent_rpm_amd64_fips +16.0 KiB (0.00% increase) 712.633 → 712.649 → 718.040
agent_rpm_arm64 +12.0 KiB (0.00% increase) 734.059 → 734.071 → 739.380
agent_rpm_arm64_fips +12.03 KiB (0.00% increase) 693.863 → 693.875 → 700.780
agent_suse_amd64 +16.0 KiB (0.00% increase) 755.901 → 755.917 → 757.660
agent_suse_amd64_fips +16.0 KiB (0.00% increase) 712.633 → 712.649 → 718.040
agent_suse_arm64 +12.0 KiB (0.00% increase) 734.059 → 734.071 → 739.380
agent_suse_arm64_fips +12.03 KiB (0.00% increase) 693.863 → 693.875 → 700.780
docker_agent_amd64 +16.0 KiB (0.00% increase) 816.164 → 816.180 → 820.010
docker_agent_arm64 +12.0 KiB (0.00% increase) 819.139 → 819.151 → 826.060
docker_agent_jmx_amd64 +16.0 KiB (0.00% increase) 1007.080 → 1007.095 → 1010.890
docker_agent_jmx_arm64 +12.0 KiB (0.00% increase) 998.833 → 998.845 → 1005.660
docker_cluster_agent_amd64 +4.0 KiB (0.00% increase) 205.547 → 205.551 → 207.600
iot_agent_deb_amd64 +8.0 KiB (0.02% increase) 44.158 → 44.166 → 44.970
iot_agent_deb_arm64 +8.03 KiB (0.02% increase) 41.158 → 41.166 → 42.560
iot_agent_deb_armhf +12.02 KiB (0.03% increase) 41.890 → 41.902 → 42.740
iot_agent_rpm_amd64 +8.0 KiB (0.02% increase) 44.159 → 44.166 → 44.970
iot_agent_suse_amd64 +8.0 KiB (0.02% increase) 44.159 → 44.166 → 44.970
9 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
docker_cluster_agent_arm64 219.821 MiB
docker_cws_instrumentation_amd64 7.142 MiB
docker_cws_instrumentation_arm64 6.689 MiB
docker_dogstatsd_amd64 39.500 MiB
docker_dogstatsd_arm64 37.707 MiB
dogstatsd_deb_amd64 30.148 MiB
dogstatsd_deb_arm64 28.285 MiB
dogstatsd_rpm_amd64 30.148 MiB
dogstatsd_suse_amd64 30.148 MiB
On-wire sizes (compressed)
Quality gate Change Size in MiB (prev → curr → max)
agent_deb_amd64 +71.37 KiB (0.04% increase) 175.750 → 175.820 → 179.410
agent_deb_amd64_fips -17.84 KiB (0.01% reduction) 167.266 → 167.249 → 174.660
agent_heroku_amd64 neutral 75.189 MiB → 80.310
agent_msi neutral 139.926 MiB → 147.550
agent_rpm_amd64 neutral 177.822 MiB → 182.280
agent_rpm_amd64_fips +27.44 KiB (0.02% increase) 168.690 → 168.717 → 174.430
agent_rpm_arm64 +130.76 KiB (0.08% increase) 160.169 → 160.297 → 163.800
agent_rpm_arm64_fips -23.2 KiB (0.01% reduction) 152.209 → 152.187 → 157.120
agent_suse_amd64 neutral 177.822 MiB → 182.280
agent_suse_amd64_fips +27.44 KiB (0.02% increase) 168.690 → 168.717 → 174.430
agent_suse_arm64 +130.76 KiB (0.08% increase) 160.169 → 160.297 → 163.800
agent_suse_arm64_fips -23.2 KiB (0.01% reduction) 152.209 → 152.187 → 157.120
docker_agent_amd64 neutral 269.652 MiB → 274.040
docker_agent_arm64 +9.23 KiB (0.00% increase) 256.717 → 256.726 → 262.520
docker_agent_jmx_amd64 +5.99 KiB (0.00% increase) 338.297 → 338.303 → 342.660
docker_agent_jmx_arm64 +13.53 KiB (0.00% increase) 321.347 → 321.360 → 327.100
docker_cluster_agent_amd64 neutral 72.018 MiB → 73.460
docker_cluster_agent_arm64 neutral 67.539 MiB → 68.680
docker_cws_instrumentation_amd64 neutral 2.999 MiB → 3.330
docker_cws_instrumentation_arm64 neutral 2.729 MiB → 3.090
docker_dogstatsd_amd64 neutral 15.268 MiB → 15.870
docker_dogstatsd_arm64 neutral 14.574 MiB → 14.890
dogstatsd_deb_amd64 neutral 7.955 MiB → 8.830
dogstatsd_deb_arm64 neutral 6.837 MiB → 7.750
dogstatsd_rpm_amd64 neutral 7.967 MiB → 8.840
dogstatsd_suse_amd64 neutral 7.967 MiB → 8.840
iot_agent_deb_amd64 +6.98 KiB (0.06% increase) 11.625 → 11.632 → 13.210
iot_agent_deb_arm64 +4.15 KiB (0.04% increase) 9.935 → 9.939 → 11.620
iot_agent_deb_armhf neutral 10.142 MiB → 11.780
iot_agent_rpm_amd64 +3.83 KiB (0.03% increase) 11.641 → 11.645 → 13.230
iot_agent_suse_amd64 +3.83 KiB (0.03% increase) 11.641 → 11.645 → 13.230
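To read the "prev → curr → max" rows above, the arithmetic can be sketched as follows. This assumes the three numbers are sizes in MiB and that a gate passes when the current size does not exceed the maximum; `parse_gate_row` and its output keys are illustrative names, not part of the quality gate tooling.

```python
import re

# Matches rows such as:
#   agent_deb_amd64 +16.0 KiB (0.00% increase) 755.917 → 755.933 → 757.690
ROW = re.compile(
    r"^(?P<gate>\S+) .*?(?P<prev>\d+\.\d+) → (?P<curr>\d+\.\d+) → (?P<max>\d+\.\d+)$"
)


def parse_gate_row(line):
    """Parse one changed-size row; return None for rows in another shape."""
    match = ROW.match(line.strip())
    if match is None:
        return None  # e.g. "neutral" rows that only list current and max size
    prev, curr, limit = (float(match[k]) for k in ("prev", "curr", "max"))
    return {
        "gate": match["gate"],
        "delta_kib": round((curr - prev) * 1024, 1),  # MiB difference in KiB
        "headroom_mib": round(limit - curr, 3),       # room left below the max
        "within_limit": curr <= limit,                # assumed pass condition
    }
```

Because the table prints sizes rounded to three decimals, a delta recomputed this way can differ slightly from the reported change (for example about 16.4 KiB against the reported +16.0 KiB for agent_deb_amd64).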


cit-pr-commenter-54b7da bot commented Apr 7, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 33bdfb6b-c77c-4942-8f2e-3bfb3bad7494

Baseline: 9791466
Comparison: 91580ce
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization -0.81 [-3.82, +2.20] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
ddot_metrics_sum_cumulative memory utilization +0.79 [+0.65, +0.94] 1 Logs
quality_gate_idle_all_features memory utilization +0.45 [+0.42, +0.49] 1 Logs bounds checks dashboard
otlp_ingest_metrics memory utilization +0.24 [+0.08, +0.41] 1 Logs
ddot_metrics memory utilization +0.18 [-0.00, +0.35] 1 Logs
file_to_blackhole_1000ms_latency egress throughput +0.05 [-0.39, +0.49] 1 Logs
file_to_blackhole_100ms_latency egress throughput +0.03 [-0.11, +0.16] 1 Logs
quality_gate_metrics_logs memory utilization +0.01 [-0.22, +0.24] 1 Logs bounds checks dashboard
file_to_blackhole_500ms_latency egress throughput +0.00 [-0.41, +0.41] 1 Logs
uds_dogstatsd_to_api ingress throughput -0.00 [-0.21, +0.20] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.01 [-0.11, +0.10] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput -0.03 [-0.24, +0.18] 1 Logs
file_to_blackhole_0ms_latency egress throughput -0.07 [-0.61, +0.46] 1 Logs
quality_gate_idle memory utilization -0.14 [-0.19, -0.09] 1 Logs bounds checks dashboard
file_tree memory utilization -0.17 [-0.23, -0.11] 1 Logs
tcp_syslog_to_blackhole ingress throughput -0.19 [-0.36, -0.02] 1 Logs
ddot_metrics_sum_delta memory utilization -0.21 [-0.38, -0.03] 1 Logs
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.28 [-0.34, -0.22] 1 Logs
ddot_metrics_sum_cumulativetodelta_exporter memory utilization -0.41 [-0.63, -0.19] 1 Logs
ddot_logs memory utilization -0.57 [-0.63, -0.50] 1 Logs
docker_containers_memory memory utilization -0.61 [-0.70, -0.52] 1 Logs
docker_containers_cpu % cpu utilization -0.81 [-3.82, +2.20] 1 Logs
otlp_ingest_logs memory utilization -1.22 [-1.32, -1.11] 1 Logs
quality_gate_logs % cpu utilization -2.02 [-3.62, -0.41] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed observed_value links
docker_containers_cpu simple_check_run 10/10 713 ≥ 26
docker_containers_memory memory_usage 10/10 275.64MiB ≤ 370MiB
docker_containers_memory simple_check_run 10/10 687 ≥ 26
file_to_blackhole_0ms_latency memory_usage 10/10 0.19GiB ≤ 1.20GiB
file_to_blackhole_0ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_1000ms_latency memory_usage 10/10 0.23GiB ≤ 1.20GiB
file_to_blackhole_1000ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_100ms_latency memory_usage 10/10 0.20GiB ≤ 1.20GiB
file_to_blackhole_100ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_500ms_latency memory_usage 10/10 0.22GiB ≤ 1.20GiB
file_to_blackhole_500ms_latency missed_bytes 10/10 0B = 0B
quality_gate_idle intake_connections 10/10 3 = 3 bounds checks dashboard
quality_gate_idle memory_usage 10/10 175.38MiB ≤ 181MiB bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 3 = 3 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 493.99MiB ≤ 550MiB bounds checks dashboard
quality_gate_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_logs memory_usage 10/10 206.73MiB ≤ 220MiB bounds checks dashboard
quality_gate_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 364.65 ≤ 2000 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 3 ≤ 6 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 435.28MiB ≤ 475MiB bounds checks dashboard
quality_gate_metrics_logs missed_bytes 10/10 0B = 0B bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we flag a change in performance as a "regression" (a change worth investigating further) if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
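The three criteria above translate directly into a predicate. The following is a minimal sketch with illustrative names (`is_regression` and its parameters are not taken from the Regression Detector itself); the 5.00% tolerance is the value stated in the explanation.

```python
def is_regression(delta_mean_pct, ci_low, ci_high, erratic=False, tolerance_pct=5.0):
    """Apply the three stated criteria to one experiment.

    delta_mean_pct  -- estimated change in the mean, in percent
    ci_low, ci_high -- bounds of the confidence interval for that change
    erratic         -- True if the experiment is configured as erratic
    """
    big_enough = abs(delta_mean_pct) >= tolerance_pct   # criterion 1
    ci_excludes_zero = not (ci_low <= 0.0 <= ci_high)   # criterion 2
    return big_enough and ci_excludes_zero and not erratic  # criterion 3
```

Applied to the quality_gate_logs row (-2.02 with interval [-3.62, -0.41]), the interval excludes zero but the change is below the 5.00% tolerance, so the predicate returns False, which is consistent with "No significant changes detected".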

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.

Mathew-Estafanous self-assigned this Apr 8, 2026
Mathew-Estafanous added the qa/done (QA done before merge and regressions are covered by tests) label Apr 9, 2026
Mathew-Estafanous force-pushed the mathew.estafanous/ad-annotation-health-check branch from d062796 to fd75ae6 Apr 9, 2026 19:20
Mathew-Estafanous changed the title from "feat(health): Add AD annotation health check" to "[CONTP-1365] feat(health): Add AD annotation health check" Apr 10, 2026
Mathew-Estafanous added the changelog/no-changelog (No changelog entry needed) label Apr 10, 2026
Mathew-Estafanous force-pushed the mathew.estafanous/ad-annotation-health-check branch from 2ba89c1 to 5dbb183 Apr 10, 2026 18:59

Labels

  • changelog/no-changelog: No changelog entry needed
  • internal: Identify a non-fork PR
  • long review: PR is complex, plan time to review it
  • qa/done: QA done before merge and regressions are covered by tests
  • team/agent-configuration
  • team/agent-discovery
  • team/agent-health
  • team/agent-integrations
  • team/agent-log-pipelines
  • team/agent-runtimes
  • team/container-platform: The Container Platform Team
