Add K8s GPU metrics collection (CW, DCGM, Prometheus) #21
Merged
Three-source fallback chain: CloudWatch Container Insights, DCGM exporter scrape, and Prometheus query. Per-node fallback with new ruleK8sLowGPUUtil analysis rule.
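The per-node fallback can be sketched as follows. This is a minimal illustration, not the PR's code: the `Source` type, `RunChain`, `setUtil`, and the `GPUUtil` pointer field are hypothetical names. Each source is handed only the nodes that still have no utilization value, and the chain stops early once every node is covered.

```go
package main

import "fmt"

// GPUInstance is a trimmed-down, hypothetical stand-in for the tool's node record.
type GPUInstance struct {
	K8sNodeName string
	GPUUtil     *float64 // nil until some source supplies a value
}

// Source is a hypothetical abstraction over one metrics backend.
type Source struct {
	Name   string
	Enrich func(pending []*GPUInstance)
}

// RunChain calls each source in order, passing only the nodes that still
// lack data. It returns how many nodes each source was asked about.
func RunChain(nodes []*GPUInstance, sources []Source) map[string]int {
	asked := make(map[string]int)
	for _, src := range sources {
		var pending []*GPUInstance
		for _, n := range nodes {
			if n.GPUUtil == nil {
				pending = append(pending, n)
			}
		}
		if len(pending) == 0 {
			break // every node is already enriched
		}
		asked[src.Name] = len(pending)
		src.Enrich(pending)
	}
	return asked
}

// setUtil stores a utilization value on a node.
func setUtil(n *GPUInstance, v float64) { n.GPUUtil = &v }

func main() {
	nodes := []*GPUInstance{{K8sNodeName: "node-a"}, {K8sNodeName: "node-b"}}
	asked := RunChain(nodes, []Source{
		// Pretend CloudWatch only knows about the first node.
		{Name: "cloudwatch", Enrich: func(p []*GPUInstance) { setUtil(p[0], 5) }},
		// DCGM fills whatever is left.
		{Name: "dcgm", Enrich: func(p []*GPUInstance) {
			for _, n := range p {
				setUtil(n, 42)
			}
		}},
		// Prometheus is never consulted because nothing is pending.
		{Name: "prometheus", Enrich: func(p []*GPUInstance) { panic("not reached") }},
	})
	fmt.Println(asked["cloudwatch"], asked["dcgm"], asked["prometheus"]) // 2 1 0
}
```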
Discovers dcgm-exporter pods via label selectors and scrapes their Prometheus metrics endpoint via kubectl proxy to populate GPU and GPU memory utilization on K8s node instances. Skips nodes that already have utilization data and gracefully handles scrape errors.
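To illustrate what happens to a scraped body, here is a rough stdlib-only sketch that averages the DCGM_FI_DEV_GPU_UTIL samples in a Prometheus text-format response. It is a deliberate simplification: the PR itself parses with prometheus/common/expfmt, and both the helper name `AvgGauge` and the choice to average across GPUs are assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// AvgGauge averages every sample of the named metric found in a Prometheus
// text-format body. ok is false when no sample was found.
func AvgGauge(body, name string) (avg float64, ok bool) {
	var sum float64
	var n int
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// Reject longer metric names that merely share the prefix.
		if rest == "" || !strings.ContainsRune("{ \t", rune(rest[0])) {
			continue
		}
		// Drop the label set; the value is the first field after it.
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue // skip malformed samples instead of failing the scrape
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

// sampleBody is made-up exporter output for two GPUs on one node.
const sampleBody = `# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 40
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 60
`

func main() {
	avg, ok := AvgGauge(sampleBody, "DCGM_FI_DEV_GPU_UTIL")
	fmt.Println(avg, ok) // 50 true
}
```

A real parser such as expfmt also handles escaped label values and timestamps more rigorously, which is a good reason to prefer it outside of a sketch.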
Add --prom-url and --prom-endpoint flags (mutually exclusive) for Prometheus GPU metrics. Orchestrate the 3-source fallback chain (CloudWatch Container Insights → DCGM scrape → Prometheus) between K8s discovery and analysis.
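The mutual-exclusion check might look roughly like this. The sketch uses Go's standard `flag` package purely for illustration, since the commit does not say which CLI library gpuaudit uses; `ParsePromFlags` and the error text are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// ParsePromFlags parses the two Prometheus flags and rejects the case
// where both are set.
func ParsePromFlags(args []string) (promURL, promEndpoint string, err error) {
	fs := flag.NewFlagSet("scan", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	fs.StringVar(&promURL, "prom-url", "", "full URL to a Prometheus-compatible API")
	fs.StringVar(&promEndpoint, "prom-endpoint", "", "in-cluster service as namespace/service:port")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	if promURL != "" && promEndpoint != "" {
		return "", "", errors.New("--prom-url and --prom-endpoint are mutually exclusive")
	}
	return promURL, promEndpoint, nil
}

func main() {
	_, _, err := ParsePromFlags([]string{"--prom-url", "x", "--prom-endpoint", "y"})
	fmt.Println(err)
}
```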
DCGM enrichment matched pods to instances by InstanceID, but pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal) while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to GPUInstance and use it for DCGM matching. Also stop retrying CW queries after the first error — all nodes will get the same AccessDenied when credentials aren't available.
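A small sketch of the matching fix, assuming per-node utilization has already been collected into a map keyed by pod.Spec.NodeName. Only `InstanceID` and `K8sNodeName` come from the commit message; `GPUUtil`, `ApplyByNodeName`, and the placeholder instance ID are invented for the example.

```go
package main

import "fmt"

// GPUInstance mirrors the two identifiers named in the commit message.
type GPUInstance struct {
	InstanceID  string   // EC2 instance ID
	K8sNodeName string   // K8s hostname, which is what pod.Spec.NodeName reports
	GPUUtil     *float64 // hypothetical field; nil means "not enriched yet"
}

// ApplyByNodeName copies utilization onto instances, matching on the K8s
// hostname rather than the EC2 ID. It returns the number of nodes updated.
func ApplyByNodeName(instances []*GPUInstance, utilByNode map[string]float64) int {
	byName := make(map[string]*GPUInstance, len(instances))
	for _, inst := range instances {
		if inst.K8sNodeName != "" {
			byName[inst.K8sNodeName] = inst
		}
	}
	matched := 0
	for node, util := range utilByNode {
		inst, ok := byName[node]
		if !ok || inst.GPUUtil != nil {
			continue // unknown node, or already enriched by an earlier source
		}
		u := util
		inst.GPUUtil = &u
		matched++
	}
	return matched
}

func main() {
	inst := &GPUInstance{
		InstanceID:  "i-EXAMPLE", // placeholder, not a real instance ID
		K8sNodeName: "ip-10-22-1-100.ec2.internal",
	}
	matched := ApplyByNodeName([]*GPUInstance{inst},
		map[string]float64{"ip-10-22-1-100.ec2.internal": 7.5})
	fmt.Println(matched, *inst.GPUUtil) // 1 7.5
}
```

Keying the lookup on the instance ID instead would match nothing here, which is the behavior the commit describes fixing.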
DCGM: stop spamming per-node warnings when scrapes fail consistently (likely RBAC). Log one warning, bail after 3 consecutive failures. Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label matches K8s hostnames, not EC2 instance IDs.
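The "one warning, bail after 3 consecutive failures" behavior could be structured along these lines. `ScrapeNodes`, its callback parameters, and `errForbidden` are hypothetical; only the threshold of 3 comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveFailures is the bail-out threshold from the commit message.
const maxConsecutiveFailures = 3

// errForbidden simulates the kind of RBAC failure the commit mentions.
var errForbidden = errors.New("forbidden")

// ScrapeNodes scrapes each node in turn. It warns once on the first failure
// and gives up after maxConsecutiveFailures failures in a row.
func ScrapeNodes(nodes []string, scrape func(node string) (float64, error), warn func(msg string)) map[string]float64 {
	results := make(map[string]float64)
	failures := 0
	warned := false
	for _, node := range nodes {
		v, err := scrape(node)
		if err != nil {
			failures++
			if !warned {
				warn(fmt.Sprintf("DCGM scrape failed on %s: %v", node, err))
				warned = true
			}
			if failures >= maxConsecutiveFailures {
				break // likely a systemic problem; stop trying
			}
			continue
		}
		failures = 0 // a success resets the streak
		results[node] = v
	}
	return results
}

func main() {
	calls, warnings := 0, 0
	ScrapeNodes([]string{"a", "b", "c", "d", "e"},
		func(string) (float64, error) { calls++; return 0, errForbidden },
		func(string) { warnings++ })
	fmt.Println(calls, warnings) // 3 1
}
```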
Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and master additions (spot recommendations, multi-account scanning, diff command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their full test suites.
Summary
- --prom-url and --prom-endpoint CLI flags for configuring the Prometheus GPU metrics source
- ruleK8sLowGPUUtil analysis rule flagging K8s GPU nodes with <10% utilization
Metrics sources
- CloudWatch Container Insights: node_gpu_utilization and node_gpu_memory_utilization for nodes with EC2 instance IDs. Requires the CW Observability EKS add-on.
- DCGM exporter: /metrics on port 9400 via K8s API proxy. No configuration needed.
- Prometheus: avg_over_time(DCGM_FI_DEV_GPU_UTIL{node=~"..."}[7d]) via direct URL or in-cluster service proxy. Requires --prom-url or --prom-endpoint.
Each source skips nodes already enriched by a prior source. The chain runs between K8s discovery and analysis, gated by --skip-metrics.
New flags
- --prom-url: full URL to a Prometheus-compatible API (e.g., AMP, Grafana Cloud)
- --prom-endpoint: in-cluster service as namespace/service:port (proxied through K8s API)
New dependencies
- prometheus/common/expfmt for parsing Prometheus text format from DCGM scrapes
Test plan
- go build ./... passes
- go test ./... passes (20 new tests: 5 CW, 6 DCGM, 5 Prometheus, 4 analysis rule)
- Run gpuaudit scan --help and verify --prom-url and --prom-endpoint flags appear
- Run gpuaudit scan --prom-url x --prom-endpoint y and verify mutual exclusion error
- Run with --skip-metrics and verify no metrics enrichment occurs