gpuaudit/feature/add_csv_output_format#28

Closed
sospeter-57 wants to merge 67 commits into gpuaudit:master from
sospeter-57:gpuaudit/feature/Add_CSV_output_format

Conversation

@sospeter-57

No description provided.

maksimov and others added 30 commits April 4, 2026 18:55
Removed reference to ARCHITECTURE.md from the README.
SmallerAlternatives now prefers same-family instances first (e.g.
g6e.xlarge for g6e.12xlarge), then same-GPU-model, then others.
Previously it picked the globally cheapest single-GPU which could
recommend a T4 to replace an L40S.
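
The preference order described above can be sketched as a tiered sort. This is a minimal illustration, not the project's code: `instanceSpec`, `family`, `tier`, and `rankAlternatives` are hypothetical names, the prices are made-up placeholders, and the rule of keeping only candidates with fewer GPUs is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// instanceSpec is a hypothetical, simplified pricing-DB entry.
type instanceSpec struct {
	Name      string
	GPUModel  string
	GPUCount  int
	HourlyUSD float64 // placeholder values, not real prices
}

// family returns the part of an instance type before the first dot,
// e.g. "g6e" for "g6e.12xlarge".
func family(instanceType string) string {
	fam, _, _ := strings.Cut(instanceType, ".")
	return fam
}

// tier ranks a candidate: 0 = same family, 1 = same GPU model, 2 = other.
func tier(current, cand instanceSpec) int {
	switch {
	case family(cand.Name) == family(current.Name):
		return 0
	case cand.GPUModel == current.GPUModel:
		return 1
	default:
		return 2
	}
}

// rankAlternatives keeps candidates with fewer GPUs than the current
// instance and orders them by tier, then by price within a tier.
func rankAlternatives(current instanceSpec, candidates []instanceSpec) []instanceSpec {
	var out []instanceSpec
	for _, c := range candidates {
		if c.GPUCount < current.GPUCount {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		ti, tj := tier(current, out[i]), tier(current, out[j])
		if ti != tj {
			return ti < tj
		}
		return out[i].HourlyUSD < out[j].HourlyUSD
	})
	return out
}

func main() {
	current := instanceSpec{"g6e.12xlarge", "L40S", 4, 10.0}
	candidates := []instanceSpec{
		{"g4dn.xlarge", "T4", 1, 0.5},
		{"g6.xlarge", "L4", 1, 0.8},
		{"g6e.xlarge", "L40S", 1, 1.9},
	}
	for _, alt := range rankAlternatives(current, candidates) {
		fmt.Println(alt.Name)
	}
}
```

With these inputs the same-family `g6e.xlarge` is listed first even though the T4 instance is cheaper, which is the behavior the commit message describes.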

Table columns widened to show more of instance names, types, and
recommendation text from real-world scan output.
Scans EKS clusters for managed node groups running GPU instance types.
Adds --skip-eks flag and EKS IAM permissions to iam-policy output.

Closes #1
Scans K8s clusters via kubeconfig to find nodes with nvidia.com/gpu
allocatable resources and pods requesting GPUs. Reports idle GPU nodes
(no pods scheduled) and partially allocated nodes as waste signals.

Adds --kubeconfig, --kube-context, and --skip-k8s flags. AWS scan
failure is now non-fatal when K8s scan is enabled.

Refs #1
Strip domain suffix (.ec2.internal etc.) from node names in output
for readability.
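
As a rough sketch of the suffix stripping described above, the helper below keeps everything before the first dot. `shortNodeName` is a hypothetical name, and cutting at the first dot is an assumption; the actual change may match specific suffixes instead.

```go
package main

import (
	"fmt"
	"strings"
)

// shortNodeName drops the domain suffix from a node name by cutting at
// the first dot. Names without a dot are returned unchanged.
// Note: a node named by a bare IP address would also be truncated.
func shortNodeName(name string) string {
	if i := strings.IndexByte(name, '.'); i > 0 {
		return name[:i]
	}
	return name
}

func main() {
	fmt.Println(shortNodeName("ip-10-22-1-100.ec2.internal"))
	fmt.Println(shortNodeName("gpu-node-1"))
}
```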
When instance type is not in the pricing DB, check
karpenter.k8s.aws/instance-gpu-name and nvidia.com/gpu.product
node labels to identify the GPU model.
Compares two scan results by instance ID. Detects added, removed,
and changed instances across 6 fields (instance type, pricing model,
cost, state, GPU allocation, waste severity). Computes cost deltas.
gpuaudit diff old.json new.json [--format table|json]

Closes #5
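
The comparison by instance ID can be sketched roughly as follows. The types and field names (`scanInstance`, `diffResult`, `diffScans`) are hypothetical, and only two of the six compared fields are modeled here.

```go
package main

import (
	"fmt"
	"sort"
)

// scanInstance is a hypothetical, reduced view of one scanned instance.
type scanInstance struct {
	InstanceID   string
	InstanceType string
	MonthlyCost  float64
}

// diffResult lists instance IDs per change kind plus the net cost change.
type diffResult struct {
	Added, Removed, Changed []string
	CostDelta               float64
}

// diffScans indexes the old scan by instance ID, then classifies each
// instance as added, removed, or changed and accumulates the cost delta.
func diffScans(oldScan, newScan []scanInstance) diffResult {
	oldByID := make(map[string]scanInstance, len(oldScan))
	for _, inst := range oldScan {
		oldByID[inst.InstanceID] = inst
	}
	var res diffResult
	seen := make(map[string]bool, len(newScan))
	for _, cur := range newScan {
		seen[cur.InstanceID] = true
		prev, ok := oldByID[cur.InstanceID]
		switch {
		case !ok:
			res.Added = append(res.Added, cur.InstanceID)
			res.CostDelta += cur.MonthlyCost
		case prev != cur: // any tracked field differs
			res.Changed = append(res.Changed, cur.InstanceID)
			res.CostDelta += cur.MonthlyCost - prev.MonthlyCost
		}
	}
	for _, prev := range oldScan {
		if !seen[prev.InstanceID] {
			res.Removed = append(res.Removed, prev.InstanceID)
			res.CostDelta -= prev.MonthlyCost
		}
	}
	sort.Strings(res.Added)
	sort.Strings(res.Removed)
	sort.Strings(res.Changed)
	return res
}

func main() {
	oldScan := []scanInstance{{"i-a", "g5.xlarge", 100}, {"i-b", "g5.xlarge", 200}}
	newScan := []scanInstance{{"i-a", "g5.2xlarge", 150}, {"i-c", "g4dn.xlarge", 50}}
	res := diffScans(oldScan, newScan)
	fmt.Println(res.Added, res.Removed, res.Changed, res.CostDelta)
}
```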
The recommendation said "No GPU pods scheduled for X days" but X was
the node's total uptime, not the idle duration. We don't know when
the node became idle — only that it currently has zero GPU pods.
Changed wording to "Node up X days with 0 GPU pods scheduled."
Covers CLI flags (--targets, --role, --org), architecture for
parallel cross-account scanning via STS AssumeRole, output changes
with per-target sub-summaries, and IAM role setup docs (Terraform
+ CloudFormation).
maksimov and others added 25 commits April 19, 2026 21:15
Three-source fallback chain: CloudWatch Container Insights,
DCGM exporter scrape, and Prometheus query. Per-node fallback
with new ruleK8sLowGPUUtil analysis rule.
Discovers dcgm-exporter pods via label selectors and scrapes their
Prometheus metrics endpoint via kubectl proxy to populate GPU and
GPU memory utilization on K8s node instances. Skips nodes that
already have utilization data and gracefully handles scrape errors.
Add --prom-url and --prom-endpoint flags (mutually exclusive) for
Prometheus GPU metrics. Orchestrate the 3-source fallback chain
(CloudWatch Container Insights → DCGM scrape → Prometheus) between
K8s discovery and analysis.
DCGM enrichment matched pods to instances by InstanceID, but
pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal)
while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to
GPUInstance and use it for DCGM matching.

Also stop retrying CW queries after the first error — all nodes will
get the same AccessDenied when credentials aren't available.
DCGM: stop spamming per-node warnings when scrapes fail consistently
(likely RBAC). Log one warning, bail after 3 consecutive failures.

Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL
node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label
matches K8s hostnames, not EC2 instance IDs.
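
The "one warning, bail after 3 consecutive failures" behavior can be illustrated with a small loop. This is a sketch under assumptions: `scrapeNodes`, the injected `scrape` function, and `errDenied` are hypothetical stand-ins, not the project's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// maxConsecutiveFailures mirrors the threshold named in the commit message.
const maxConsecutiveFailures = 3

// errDenied is a stand-in error used only by the demo below.
var errDenied = errors.New("forbidden")

// scrapeNodes calls scrape for each node, logs only the first failure,
// and stops once maxConsecutiveFailures scrapes in a row have failed.
// A successful scrape resets the failure counter.
func scrapeNodes(nodes []string, scrape func(node string) (float64, error)) map[string]float64 {
	util := make(map[string]float64)
	failures := 0
	warned := false
	for _, node := range nodes {
		v, err := scrape(node)
		if err != nil {
			failures++
			if !warned {
				log.Printf("warning: DCGM scrape failed (possibly RBAC): %v", err)
				warned = true
			}
			if failures >= maxConsecutiveFailures {
				break
			}
			continue
		}
		failures = 0
		util[node] = v
	}
	return util
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	calls := 0
	got := scrapeNodes(nodes, func(string) (float64, error) {
		calls++
		return 0, errDenied
	})
	fmt.Println(len(got), calls)
}
```

When every scrape fails, the loop stops after the third attempt instead of warning once per node.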
Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and
master additions (spot recommendations, multi-account scanning, diff
command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their
full test suites.
Add K8s GPU metrics collection (CW, DCGM, Prometheus)
@sospeter-57 sospeter-57 reopened this Apr 22, 2026
@sospeter-57 sospeter-57 changed the title from "Gpuaudit/feature/add csv output format" to "gpuaudit/feature/add_csv_output_format" Apr 22, 2026
@maksimov
Collaborator

maksimov commented Apr 22, 2026

Thanks for your PR! Did you do some weird rebase and pick up the existing commits? Could you make a new PR with a single commit for your feature, please?

I could have messed up your fork as well, actually. Could you re-create the fork and submit the new PR, please?

@sospeter-57 sospeter-57 closed this by deleting the head repository Apr 23, 2026