
Add K8s GPU metrics collection (CW, DCGM, Prometheus) #21

Merged
maksimov merged 13 commits into master from feature/k8s-gpu-metrics
Apr 19, 2026
Conversation

maksimov (Collaborator) commented Apr 19, 2026

Summary

  • Add GPU utilization metrics collection for K8s nodes via a per-node fallback chain: CloudWatch Container Insights → DCGM exporter scrape → Prometheus query
  • Add --prom-url and --prom-endpoint CLI flags for configuring Prometheus GPU metrics source
  • Add ruleK8sLowGPUUtil analysis rule flagging K8s GPU nodes with <10% utilization
  • Enables utilization-based waste detection for K8s GPU nodes (previously limited to allocation-based only)

Metrics sources

  1. CloudWatch Container Insights — queries node_gpu_utilization and node_gpu_memory_utilization for nodes with EC2 instance IDs. Requires the CW Observability EKS add-on.
  2. DCGM exporter scrape — auto-discovers dcgm-exporter pods by label, scrapes /metrics on port 9400 via K8s API proxy. No configuration needed.
  3. Prometheus query — avg_over_time(DCGM_FI_DEV_GPU_UTIL{node=~"..."}[7d]) via direct URL or in-cluster service proxy. Requires --prom-url or --prom-endpoint.

Each source skips nodes already enriched by a prior source. The chain runs between K8s discovery and analysis, gated by --skip-metrics.
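The per-node fallback described above can be sketched as follows. This is an illustrative sketch only: the type and function names (GPUInstance, metricsSource, enrichAll) are hypothetical, not the PR's actual identifiers.

```go
package main

import "fmt"

type GPUInstance struct {
	InstanceID  string
	K8sNodeName string
	GPUUtil     *float64 // nil until some source enriches the node
}

// metricsSource wraps one source in the chain
// (CloudWatch Container Insights, DCGM scrape, Prometheus).
type metricsSource struct {
	name   string
	enrich func(inst *GPUInstance) (util float64, ok bool)
}

// enrichAll runs the chain per node: later sources skip nodes that an
// earlier source already enriched, and --skip-metrics bypasses it all.
func enrichAll(instances []*GPUInstance, sources []metricsSource, skipMetrics bool) {
	if skipMetrics {
		return
	}
	for _, inst := range instances {
		for _, src := range sources {
			if inst.GPUUtil != nil {
				break // already enriched by a prior source
			}
			if util, ok := src.enrich(inst); ok {
				inst.GPUUtil = &util
			}
		}
	}
}

func main() {
	nodes := []*GPUInstance{{InstanceID: "i-aaa"}, {InstanceID: "i-bbb"}}
	chain := []metricsSource{
		{"cloudwatch", func(*GPUInstance) (float64, bool) { return 0, false }}, // e.g. no CW add-on
		{"dcgm", func(*GPUInstance) (float64, bool) { return 7.5, true }},
	}
	enrichAll(nodes, chain, false)
	for _, n := range nodes {
		fmt.Println(n.InstanceID, *n.GPUUtil)
	}
}
```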

New flags

  • --prom-url — full URL to a Prometheus-compatible API (e.g., AMP, Grafana Cloud)
  • --prom-endpoint — in-cluster service as namespace/service:port (proxied through K8s API)
  • Mutually exclusive; error if both set
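A minimal sketch of the mutual-exclusion check, using the standard flag package; the real CLI's flag wiring may differ, and validatePromFlags is a hypothetical name.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// validatePromFlags rejects the case where both Prometheus source
// flags are set, since they name conflicting metrics sources.
func validatePromFlags(promURL, promEndpoint string) error {
	if promURL != "" && promEndpoint != "" {
		return errors.New("--prom-url and --prom-endpoint are mutually exclusive")
	}
	return nil
}

func main() {
	promURL := flag.String("prom-url", "", "full URL to a Prometheus-compatible API")
	promEndpoint := flag.String("prom-endpoint", "", "in-cluster service as namespace/service:port")
	flag.Parse()
	if err := validatePromFlags(*promURL, *promEndpoint); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
}
```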

New dependencies

  • prometheus/common/expfmt for parsing Prometheus text format from DCGM scrapes

Test plan

  • go build ./... passes
  • go test ./... passes (20 new tests: 5 CW, 6 DCGM, 5 Prometheus, 4 analysis rule)
  • Run gpuaudit scan --help and verify --prom-url and --prom-endpoint flags appear
  • Run gpuaudit scan --prom-url x --prom-endpoint y and verify mutual exclusion error
  • Run against a cluster with DCGM exporter and verify GPU util metrics appear
  • Run with --skip-metrics and verify no metrics enrichment occurs

maksimov added 13 commits April 19, 2026 22:42
Three-source fallback chain: CloudWatch Container Insights,
DCGM exporter scrape, and Prometheus query. Per-node fallback
with new ruleK8sLowGPUUtil analysis rule.

Discovers dcgm-exporter pods via label selectors and scrapes their
Prometheus metrics endpoint via kubectl proxy to populate GPU and
GPU memory utilization on K8s node instances. Skips nodes that
already have utilization data and gracefully handles scrape errors.

Add --prom-url and --prom-endpoint flags (mutually exclusive) for
Prometheus GPU metrics. Orchestrate the 3-source fallback chain
(CloudWatch Container Insights → DCGM scrape → Prometheus) between
K8s discovery and analysis.

DCGM enrichment matched pods to instances by InstanceID, but
pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal)
while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to
GPUInstance and use it for DCGM matching.

Also stop retrying CW queries after the first error — all nodes will
get the same AccessDenied when credentials aren't available.

DCGM: stop spamming per-node warnings when scrapes fail consistently
(likely RBAC). Log one warning, bail after 3 consecutive failures.

Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL
node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label
matches K8s hostnames, not EC2 instance IDs.

Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and
master additions (spot recommendations, multi-account scanning, diff
command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their
full test suites.
@maksimov maksimov merged commit 29c8fcc into master Apr 19, 2026
2 checks passed
@maksimov maksimov deleted the feature/k8s-gpu-metrics branch April 19, 2026 23:19