
Add K8s GPU metrics collection (CW, DCGM, Prometheus) #21

Merged
maksimov merged 13 commits into master from feature/k8s-gpu-metrics
Apr 19, 2026
Conversation

maksimov (Collaborator) commented Apr 19, 2026

Summary

  • Add GPU utilization metrics collection for K8s nodes via a per-node fallback chain: CloudWatch Container Insights → DCGM exporter scrape → Prometheus query
  • Add --prom-url and --prom-endpoint CLI flags for configuring Prometheus GPU metrics source
  • Add ruleK8sLowGPUUtil analysis rule flagging K8s GPU nodes with <10% utilization
  • Enables utilization-based waste detection for K8s GPU nodes (previously limited to allocation-based only)

Metrics sources

  1. CloudWatch Container Insights — queries node_gpu_utilization and node_gpu_memory_utilization for nodes with EC2 instance IDs. Requires the CW Observability EKS add-on.
  2. DCGM exporter scrape — auto-discovers dcgm-exporter pods by label, scrapes /metrics on port 9400 via K8s API proxy. No configuration needed.
  3. Prometheus query — avg_over_time(DCGM_FI_DEV_GPU_UTIL{node=~"..."}[7d]) via direct URL or in-cluster service proxy. Requires --prom-url or --prom-endpoint.

Each source skips nodes already enriched by a prior source. The chain runs between K8s discovery and analysis, gated by --skip-metrics.
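The per-node fallback described above can be sketched as follows. This is an illustrative sketch only: the type and function names (GPUInstance, metricsSource, enrichAll) are hypothetical, not the PR's actual identifiers.

```go
package main

import "fmt"

type GPUInstance struct {
	InstanceID  string
	K8sNodeName string
	GPUUtil     *float64 // nil until some source enriches the node
}

// metricsSource wraps one source in the chain
// (CloudWatch Container Insights, DCGM scrape, Prometheus).
type metricsSource struct {
	name   string
	enrich func(inst *GPUInstance) (util float64, ok bool)
}

// enrichAll runs the chain per node: later sources skip nodes that an
// earlier source already enriched, and --skip-metrics bypasses it all.
func enrichAll(instances []*GPUInstance, sources []metricsSource, skipMetrics bool) {
	if skipMetrics {
		return
	}
	for _, inst := range instances {
		for _, src := range sources {
			if inst.GPUUtil != nil {
				break // already enriched by a prior source
			}
			if util, ok := src.enrich(inst); ok {
				inst.GPUUtil = &util
			}
		}
	}
}

func main() {
	nodes := []*GPUInstance{{InstanceID: "i-aaa"}, {InstanceID: "i-bbb"}}
	chain := []metricsSource{
		{"cloudwatch", func(*GPUInstance) (float64, bool) { return 0, false }}, // e.g. no CW add-on
		{"dcgm", func(*GPUInstance) (float64, bool) { return 7.5, true }},
	}
	enrichAll(nodes, chain, false)
	for _, n := range nodes {
		fmt.Println(n.InstanceID, *n.GPUUtil)
	}
}
```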

New flags

  • --prom-url — full URL to a Prometheus-compatible API (e.g., AMP, Grafana Cloud)
  • --prom-endpoint — in-cluster service as namespace/service:port (proxied through K8s API)
  • Mutually exclusive; error if both set
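A minimal sketch of the mutual-exclusion check, using the standard flag package; the real CLI's flag wiring may differ, and validatePromFlags is a hypothetical name.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// validatePromFlags rejects the case where both Prometheus source
// flags are set, since they name conflicting metrics sources.
func validatePromFlags(promURL, promEndpoint string) error {
	if promURL != "" && promEndpoint != "" {
		return errors.New("--prom-url and --prom-endpoint are mutually exclusive")
	}
	return nil
}

func main() {
	promURL := flag.String("prom-url", "", "full URL to a Prometheus-compatible API")
	promEndpoint := flag.String("prom-endpoint", "", "in-cluster service as namespace/service:port")
	flag.Parse()
	if err := validatePromFlags(*promURL, *promEndpoint); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
}
```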

New dependencies

  • prometheus/common/expfmt for parsing Prometheus text format from DCGM scrapes

Test plan

  • go build ./... passes
  • go test ./... passes (20 new tests: 5 CW, 6 DCGM, 5 Prometheus, 4 analysis rule)
  • Run gpuaudit scan --help and verify --prom-url and --prom-endpoint flags appear
  • Run gpuaudit scan --prom-url x --prom-endpoint y and verify mutual exclusion error
  • Run against a cluster with DCGM exporter and verify GPU util metrics appear
  • Run with --skip-metrics and verify no metrics enrichment occurs

maksimov added 13 commits April 19, 2026 22:42
Three-source fallback chain: CloudWatch Container Insights,
DCGM exporter scrape, and Prometheus query. Per-node fallback
with new ruleK8sLowGPUUtil analysis rule.

Discovers dcgm-exporter pods via label selectors and scrapes their
Prometheus metrics endpoint via kubectl proxy to populate GPU and
GPU memory utilization on K8s node instances. Skips nodes that
already have utilization data and gracefully handles scrape errors.

Add --prom-url and --prom-endpoint flags (mutually exclusive) for
Prometheus GPU metrics. Orchestrate the 3-source fallback chain
(CloudWatch Container Insights → DCGM scrape → Prometheus) between
K8s discovery and analysis.

DCGM enrichment matched pods to instances by InstanceID, but
pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal)
while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to
GPUInstance and use it for DCGM matching.

Also stop retrying CW queries after the first error — all nodes will
get the same AccessDenied when credentials aren't available.

DCGM: stop spamming per-node warnings when scrapes fail consistently
(likely RBAC). Log one warning, bail after 3 consecutive failures.

Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL
node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label
matches K8s hostnames, not EC2 instance IDs.

Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and
master additions (spot recommendations, multi-account scanning, diff
command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their
full test suites.
@maksimov maksimov merged commit 29c8fcc into master Apr 19, 2026
2 checks passed
@maksimov maksimov deleted the feature/k8s-gpu-metrics branch April 19, 2026 23:19