gpuaudit/feature/add_csv_output_format#28
Closed
sospeter-57 wants to merge 67 commits into gpuaudit:master from
Conversation
Removed reference to ARCHITECTURE.md from the README.
SmallerAlternatives now prefers same-family instances first (e.g. g6e.xlarge for g6e.12xlarge), then same-GPU-model instances, then everything else. Previously it picked the globally cheapest single-GPU instance, which could recommend a T4 to replace an L40S. Table columns widened to show more of the instance names, types, and recommendation text from real-world scan output.
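The three-tier preference ordering can be sketched roughly like this (a minimal Python illustration; all function and field names here are hypothetical, not the project's actual identifiers):

```python
def family(instance_type):
    # "g6e.12xlarge" -> "g6e"
    return instance_type.split(".")[0]

def rank_alternative(current, candidate, gpu_model_of):
    """Lower rank = more preferred: same family, then same GPU model, then others."""
    if family(candidate) == family(current):
        return 0
    if gpu_model_of(candidate) == gpu_model_of(current):
        return 1
    return 2

def pick_smaller_alternative(current, candidates, gpu_model_of, price_of):
    # Sort by (preference tier, price) instead of price alone, so a
    # same-family g6e.xlarge beats a globally cheaper T4 instance.
    return min(candidates,
               key=lambda c: (rank_alternative(current, c, gpu_model_of),
                              price_of(c)))
```

Sorting on the (tier, price) tuple keeps the old "cheapest wins" behaviour, but only as a tie-breaker within each tier.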
…s below threshold
Scans EKS clusters for managed node groups running GPU instance types. Adds --skip-eks flag and EKS IAM permissions to iam-policy output. Closes #1
Scans K8s clusters via kubeconfig to find nodes with nvidia.com/gpu allocatable resources and pods requesting GPUs. Reports idle GPU nodes (no pods scheduled) and partially allocated nodes as waste signals. Adds --kubeconfig, --kube-context, and --skip-k8s flags. AWS scan failure is now non-fatal when K8s scan is enabled. Refs #1
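The idle/partially-allocated classification described above amounts to comparing GPU requests against allocatable GPUs per node. A minimal Python sketch (the data shapes are assumptions for illustration, not the tool's actual types):

```python
def classify_gpu_nodes(nodes, pods):
    """nodes: {node_name: allocatable nvidia.com/gpu count};
    pods: list of (node_name, gpus_requested) tuples.
    Returns (idle, partial) lists of node names as waste signals."""
    requested = {}
    for node_name, gpus in pods:
        requested[node_name] = requested.get(node_name, 0) + gpus

    idle, partial = [], []
    for name, allocatable in nodes.items():
        if allocatable == 0:
            continue  # not a GPU node
        used = requested.get(name, 0)
        if used == 0:
            idle.append(name)      # GPU node with no GPU pods scheduled
        elif used < allocatable:
            partial.append(name)   # some GPUs allocatable but unrequested
    return idle, partial
```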
Strip domain suffix (.ec2.internal etc.) from node names in output for readability.
When instance type is not in the pricing DB, check karpenter.k8s.aws/instance-gpu-name and nvidia.com/gpu.product node labels to identify the GPU model.
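The label fallback might look something like this (the two label keys are the ones named above; the pricing-DB shape and function name are illustrative assumptions):

```python
# Checked in order when the instance type is unknown to the pricing DB.
GPU_NAME_LABELS = (
    "karpenter.k8s.aws/instance-gpu-name",
    "nvidia.com/gpu.product",
)

def gpu_model_for_node(instance_type, node_labels, pricing_db):
    """Prefer the pricing DB; fall back to node labels set by
    Karpenter or the NVIDIA GPU feature discovery."""
    entry = pricing_db.get(instance_type)
    if entry is not None:
        return entry["gpu_model"]
    for key in GPU_NAME_LABELS:
        if key in node_labels:
            return node_labels[key]
    return None
```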
Compares two scan results by instance ID. Detects added, removed, and changed instances across 6 fields (instance type, pricing model, cost, state, GPU allocation, waste severity). Computes cost deltas.
gpuaudit diff old.json new.json [--format table|json] Closes #5
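The diff logic described above, keyed by instance ID across the six fields, can be sketched like this (a minimal Python illustration; field and record names are assumptions, not the command's actual schema):

```python
# The 6 compared fields, under illustrative snake_case names.
DIFF_FIELDS = ("instance_type", "pricing_model", "cost",
               "state", "gpu_allocation", "waste_severity")

def diff_scans(old, new):
    """old/new: {instance_id: record dict}. Returns added/removed IDs,
    per-instance changed fields as (before, after) pairs, and the
    total cost delta between the two scans."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for iid in set(old) & set(new):
        fields = {f: (old[iid].get(f), new[iid].get(f))
                  for f in DIFF_FIELDS
                  if old[iid].get(f) != new[iid].get(f)}
        if fields:
            changed[iid] = fields
    cost_delta = (sum(r.get("cost", 0) for r in new.values())
                  - sum(r.get("cost", 0) for r in old.values()))
    return added, removed, changed, cost_delta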
The recommendation said "No GPU pods scheduled for X days" but X was the node's total uptime, not the idle duration. We don't know when the node became idle — only that it currently has zero GPU pods. Changed wording to "Node up X days with 0 GPU pods scheduled."
Covers CLI flags (--targets, --role, --org), architecture for parallel cross-account scanning via STS AssumeRole, output changes with per-target sub-summaries, and IAM role setup docs (Terraform + CloudFormation).
Add multi-account AWS scanning
Three-source fallback chain: CloudWatch Container Insights, DCGM exporter scrape, and Prometheus query. Per-node fallback with new ruleK8sLowGPUUtil analysis rule.
Discovers dcgm-exporter pods via label selectors and scrapes their Prometheus metrics endpoint via kubectl proxy to populate GPU and GPU memory utilization on K8s node instances. Skips nodes that already have utilization data and gracefully handles scrape errors.
Add --prom-url and --prom-endpoint flags (mutually exclusive) for Prometheus GPU metrics. Orchestrate the 3-source fallback chain (CloudWatch Container Insights → DCGM scrape → Prometheus) between K8s discovery and analysis.
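The per-node fallback chain reduces to "try each source in order, stop at the first hit, skip nodes that already have data". A minimal sketch under assumed data shapes (the source callables stand in for the CloudWatch, DCGM, and Prometheus collectors):

```python
def enrich_utilization(nodes, sources):
    """nodes: list of dicts with an optional 'gpu_util' value.
    sources: ordered callables node -> utilization or None, e.g.
    [cloudwatch_insights, dcgm_scrape, prometheus_query]."""
    for node in nodes:
        if node.get("gpu_util") is not None:
            continue  # already populated by an earlier source; skip
        for source in sources:
            util = source(node)
            if util is not None:
                node["gpu_util"] = util
                break  # first successful source wins for this node
```

Because the fallback is per-node, one node can be filled from CloudWatch while its neighbour falls through to a DCGM scrape.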
DCGM enrichment matched pods to instances by InstanceID, but pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal) while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to GPUInstance and use it for DCGM matching. Also stop retrying CW queries after the first error — all nodes will get the same AccessDenied when credentials aren't available.
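The fix amounts to indexing instances by their K8s node name instead of their EC2 instance ID when joining in DCGM metrics. A Python sketch of the corrected matching (field names are illustrative stand-ins for the Go struct fields):

```python
def match_dcgm_to_instances(instances, dcgm_metrics):
    """dcgm_metrics is keyed by K8s hostname (pod.Spec.NodeName,
    e.g. ip-10-22-1-100.ec2.internal), so the join key must be the
    instance's K8s node name, not its EC2 instance ID."""
    by_k8s_name = {inst["k8s_node_name"]: inst
                   for inst in instances if inst.get("k8s_node_name")}
    for node_name, util in dcgm_metrics.items():
        inst = by_k8s_name.get(node_name)
        if inst is not None:
            inst["gpu_util"] = util
```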
DCGM: stop spamming per-node warnings when scrapes fail consistently (likely RBAC). Log one warning, bail after 3 consecutive failures. Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label matches K8s hostnames, not EC2 instance IDs.
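The "log once, bail after 3 consecutive failures" behaviour can be sketched as follows (a hypothetical Python illustration; the scrape and logging callables are stand-ins):

```python
MAX_CONSECUTIVE_FAILURES = 3

def scrape_all(pods, scrape, log_warning):
    """Scrape each DCGM pod, but log only one warning and stop
    entirely after 3 consecutive failures (likely a persistent
    RBAC problem rather than a flaky pod)."""
    results = {}
    failures = 0
    warned = False
    for pod in pods:
        try:
            results[pod] = scrape(pod)
            failures = 0  # a success resets the streak
        except Exception as exc:
            failures += 1
            if not warned:
                log_warning(f"DCGM scrape failed for {pod}: {exc}")
                warned = True
            if failures >= MAX_CONSECUTIVE_FAILURES:
                break  # scrapes are failing consistently; give up
    return results
```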
Add spot instance recommendations
Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and master additions (spot recommendations, multi-account scanning, diff command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their full test suites.
Add K8s GPU metrics collection (CW, DCGM, Prometheus)
…ll as ToCSVRecords helper function
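A ToCSVRecords-style helper typically flattens each scan record into a row under a fixed header. A minimal Python sketch of that shape (the column names here are assumptions, not the PR's actual schema):

```python
import csv
import io

CSV_HEADER = ["instance_id", "instance_type", "gpu_model",
              "monthly_cost", "recommendation"]

def to_csv(records):
    """records: list of dicts. Missing fields become empty cells,
    so partial records still produce well-formed rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(CSV_HEADER)
    for r in records:
        writer.writerow([r.get(col, "") for col in CSV_HEADER])
    return buf.getvalue()
```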
Collaborator
Thanks for your PR! Did you do some weird rebase and pick up the existing commits? Could you make a new PR with a single commit for your feature, please? I could actually have messed up your fork as well. If so, could you re-create the fork and submit a new PR, please?