gpuaudit/feature/add_csv_output_format by sospeter-57 · Pull Request #28 · gpuaudit/cli

sospeter-57 · 2026-04-22T01:16:39Z

No description provided.

SmallerAlternatives now prefers same-family instances first (e.g. g6e.xlarge for g6e.12xlarge), then same-GPU-model, then others. Previously it picked the globally cheapest single-GPU which could recommend a T4 to replace an L40S. Table columns widened to show more of instance names, types, and recommendation text from real-world scan output.
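The preference order described above can be sketched roughly as follows. This is an illustration only, not the project's actual SmallerAlternatives code: the `Instance` type, its fields, the `tier` and `rankAlternatives` helpers, and the prices are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance is a hypothetical, simplified pricing-DB entry.
type Instance struct {
	Type       string
	Family     string
	GPUModel   string
	GPUCount   int
	HourlyCost float64
}

// tier returns 0 for a same-family candidate, 1 for a candidate with the
// same GPU model, and 2 for everything else.
func tier(current, cand Instance) int {
	switch {
	case cand.Family == current.Family:
		return 0
	case cand.GPUModel == current.GPUModel:
		return 1
	default:
		return 2
	}
}

// rankAlternatives keeps candidates that are cheaper and have fewer GPUs
// than the current instance, then orders them by tier and, within a tier,
// by hourly cost.
func rankAlternatives(current Instance, catalog []Instance) []Instance {
	var out []Instance
	for _, c := range catalog {
		if c.Type != current.Type && c.GPUCount < current.GPUCount && c.HourlyCost < current.HourlyCost {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		ti, tj := tier(current, out[i]), tier(current, out[j])
		if ti != tj {
			return ti < tj
		}
		return out[i].HourlyCost < out[j].HourlyCost
	})
	return out
}

func main() {
	// Prices are made up for the example.
	current := Instance{Type: "g6e.12xlarge", Family: "g6e", GPUModel: "L40S", GPUCount: 4, HourlyCost: 10.0}
	catalog := []Instance{
		{Type: "g4dn.xlarge", Family: "g4dn", GPUModel: "T4", GPUCount: 1, HourlyCost: 0.5},
		{Type: "g6e.xlarge", Family: "g6e", GPUModel: "L40S", GPUCount: 1, HourlyCost: 1.8},
	}
	for _, alt := range rankAlternatives(current, catalog) {
		fmt.Println(alt.Type)
	}
}
```

Sorting by tier before price is what keeps a cheaper T4 instance from outranking a same-family L40S instance.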

Scans K8s clusters via kubeconfig to find nodes with nvidia.com/gpu allocatable resources and pods requesting GPUs. Reports idle GPU nodes (no pods scheduled) and partially allocated nodes as waste signals. Adds --kubeconfig, --kube-context, and --skip-k8s flags. AWS scan failure is now non-fatal when K8s scan is enabled. Refs #1
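A minimal sketch of the idle and partial-allocation signals, assuming the allocatable and requested GPU counts have already been collected per node. The `nodeGPU` type and `wasteSignal` function are hypothetical names, not the project's API.

```go
package main

import "fmt"

// nodeGPU is a hypothetical summary of one Kubernetes node.
type nodeGPU struct {
	Name        string
	Allocatable int64 // nvidia.com/gpu allocatable on the node
	Requested   int64 // sum of GPU requests from pods scheduled on the node
}

// wasteSignal returns "idle" when no GPUs are requested, "partial" when
// only some are, and "" when the node is fully allocated or has no GPUs.
func wasteSignal(n nodeGPU) string {
	switch {
	case n.Allocatable == 0:
		return "" // not a GPU node
	case n.Requested == 0:
		return "idle"
	case n.Requested < n.Allocatable:
		return "partial"
	default:
		return ""
	}
}

func main() {
	nodes := []nodeGPU{
		{Name: "node-a", Allocatable: 8, Requested: 0},
		{Name: "node-b", Allocatable: 8, Requested: 2},
		{Name: "node-c", Allocatable: 4, Requested: 4},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %q\n", n.Name, wasteSignal(n))
	}
}
```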

The recommendation said "No GPU pods scheduled for X days" but X was the node's total uptime, not the idle duration. We don't know when the node became idle — only that it currently has zero GPU pods. Changed wording to "Node up X days with 0 GPU pods scheduled."

Discovers dcgm-exporter pods via label selectors and scrapes their Prometheus metrics endpoint via kubectl proxy to populate GPU and GPU memory utilization on K8s node instances. Skips nodes that already have utilization data and gracefully handles scrape errors.

DCGM enrichment matched pods to instances by InstanceID, but pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal) while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to GPUInstance and use it for DCGM matching. Also stop retrying CW queries after the first error — all nodes will get the same AccessDenied when credentials aren't available.

DCGM: stop spamming per-node warnings when scrapes fail consistently (likely RBAC). Log one warning, bail after 3 consecutive failures. Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label matches K8s hostnames, not EC2 instance IDs.
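The "one warning, bail after 3 consecutive failures" behaviour could look roughly like this. The `scrapeNodes` function, its callback signature, and the `errForbidden` value are assumptions made for the sketch; the real scrape goes through the Kubernetes API proxy.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// maxConsecutiveFailures mirrors the "3 consecutive failures" in the commit message.
const maxConsecutiveFailures = 3

// errForbidden stands in for an RBAC-style scrape error in this example.
var errForbidden = errors.New("forbidden")

// scrapeNodes calls scrape for each node, logs at most one warning to
// stderr, and stops once maxConsecutiveFailures scrapes fail in a row.
func scrapeNodes(nodes []string, scrape func(node string) (float64, error)) map[string]float64 {
	results := make(map[string]float64)
	failures := 0
	warned := false
	for _, node := range nodes {
		util, err := scrape(node)
		if err != nil {
			failures++
			if !warned {
				fmt.Fprintf(os.Stderr, "warning: DCGM scrape failed for %s: %v\n", node, err)
				warned = true
			}
			if failures >= maxConsecutiveFailures {
				break // likely a systemic problem such as missing RBAC
			}
			continue
		}
		failures = 0 // a success resets the streak
		results[node] = util
	}
	return results
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	results := scrapeNodes(nodes, func(string) (float64, error) {
		return 0, errForbidden
	})
	fmt.Println("nodes with data:", len(results))
}
```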

maksimov · 2026-04-22T13:20:07Z

Thanks for your PR! Did you do some weird rebase and pick up the existing commits? Could you make a new PR with a single commit for your feature, please?

I could have messed up your fork as well, actually. Could you re-create the fork and submit a new PR, please?

maksimov and others added 30 commits April 4, 2026 18:55

Remove ARCHITECTURE.md reference from README

e8d4c49

Removed reference to ARCHITECTURE.md from the README.

Add Makefile with cross-compilation targets

d10acc0

Wrap long recommendation text instead of truncating

5c8f3b0

Let recommendation text flow without wrapping

ced3665

Add deploy script and cross-compiled binaries to gitignore

0d335c1

Send progress and warning messages to stderr instead of stdout

d83ef4d

Add --exclude-tag flag to filter out instances by tag

8f2abe5

Add --min-idle-days to filter out recently idle instances

6ec9394

Fix --min-idle-days to strip idle signals from multi-signal instances

697d80d

Replace --min-idle-days with --min-uptime-days to suppress all signals below threshold

5618b00

Add Apache 2.0 copyright headers to all source files

4efc6a8

Update GitHub repository links in README

7570f45

Move module path to github.com/gpuaudit/gpuaudit

296cc65

Strip debug symbols to reduce binary size by ~35%

cb8d08e

Rename module path to github.com/gpuaudit/cli

debdd3f

Make EC2 discovery failure non-fatal so SageMaker scan can still proceed

2983348

Add EKS GPU node group discovery

efab177

Scans EKS clusters for managed node groups running GPU instance types. Adds --skip-eks flag and EKS IAM permissions to iam-policy output. Closes #1

Shorten K8s node names to hostname only

e5678f4

Strip domain suffix (.ec2.internal etc.) from node names in output for readability.
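A minimal sketch of the shortening step, assuming the hostname is everything before the first dot. The function name `shortNodeName` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// shortNodeName returns the part of a node name before the first dot,
// or the name unchanged when it has no domain suffix.
func shortNodeName(name string) string {
	if i := strings.IndexByte(name, '.'); i > 0 {
		return name[:i]
	}
	return name
}

func main() {
	fmt.Println(shortNodeName("ip-10-22-1-100.ec2.internal"))
}
```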

Fall back to Karpenter and GPU Operator labels for GPU model

308795a

When instance type is not in the pricing DB, check karpenter.k8s.aws/instance-gpu-name and nvidia.com/gpu.product node labels to identify the GPU model.
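The label fallback might be sketched as below. The two label keys come from the commit message; the function name `gpuModelFromLabels` and the choice to check the Karpenter label first are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuLabelKeys lists the node labels to consult, in the assumed order.
var gpuLabelKeys = []string{
	"karpenter.k8s.aws/instance-gpu-name",
	"nvidia.com/gpu.product",
}

// gpuModelFromLabels returns the first non-empty GPU model label value.
// The boolean is false when neither label is set.
func gpuModelFromLabels(labels map[string]string) (string, bool) {
	for _, key := range gpuLabelKeys {
		if v := strings.TrimSpace(labels[key]); v != "" {
			return v, true
		}
	}
	return "", false
}

func main() {
	labels := map[string]string{"nvidia.com/gpu.product": "NVIDIA-L40S"}
	if model, ok := gpuModelFromLabels(labels); ok {
		fmt.Println(model)
	}
}
```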

Add diff command design spec and implementation plan

a630aa0

Add diff package with Compare function and tests

f963144

Compares two scan results by instance ID. Detects added, removed, and changed instances across 6 fields (instance type, pricing model, cost, state, GPU allocation, waste severity). Computes cost deltas.
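A simplified sketch of comparing two scans by instance ID. It tracks only two of the six fields, and the `Instance` and `Diff` types and the `Compare` signature are assumptions rather than the diff package's real API.

```go
package main

import "fmt"

// Instance is a hypothetical, trimmed-down scan record.
type Instance struct {
	ID           string
	InstanceType string
	MonthlyCost  float64
}

// Diff holds the IDs of added, removed, and changed instances and the
// net change in monthly cost between the two scans.
type Diff struct {
	Added, Removed, Changed []string
	CostDelta               float64
}

// Compare matches instances by ID and classifies each one.
func Compare(oldScan, newScan []Instance) Diff {
	oldByID := make(map[string]Instance, len(oldScan))
	for _, inst := range oldScan {
		oldByID[inst.ID] = inst
	}
	var d Diff
	seen := make(map[string]bool, len(newScan))
	for _, n := range newScan {
		seen[n.ID] = true
		o, ok := oldByID[n.ID]
		switch {
		case !ok:
			d.Added = append(d.Added, n.ID)
			d.CostDelta += n.MonthlyCost
		case o != n: // any tracked field differs
			d.Changed = append(d.Changed, n.ID)
			d.CostDelta += n.MonthlyCost - o.MonthlyCost
		}
	}
	for _, o := range oldScan {
		if !seen[o.ID] {
			d.Removed = append(d.Removed, o.ID)
			d.CostDelta -= o.MonthlyCost
		}
	}
	return d
}

func main() {
	oldScan := []Instance{{ID: "i-a", InstanceType: "g5.xlarge", MonthlyCost: 100}}
	newScan := []Instance{{ID: "i-b", InstanceType: "g6e.xlarge", MonthlyCost: 300}}
	fmt.Printf("%+v\n", Compare(oldScan, newScan))
}
```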

Add diff table and JSON output formatters

cc63318

Add diff subcommand to compare two scan results

68abdfa

gpuaudit diff old.json new.json [--format table|json] Closes #5

Fix box alignment in diff table output

de3487f

Update README with K8s scanning, diff command, and current output format

39a4926

Add multi-target scanning design spec

60cf644

Covers CLI flags (--targets, --role, --org), architecture for parallel cross-account scanning via STS AssumeRole, output changes with per-target sub-summaries, and IAM role setup docs (Terraform + CloudFormation).

Add multi-target scanning implementation plan

0330be7

maksimov and others added 25 commits April 19, 2026 21:15

Implement EnrichSpotPrices with DescribeSpotPriceHistory

8acbdf2

Wire EnrichSpotPrices into scanRegion after EC2 discovery

0b2bbf5

Correct spot instance cost using live spot prices

d29c126

Add ruleSpotEligible analysis rule for spot recommendations

c8f4330

Add ec2:DescribeSpotPriceHistory to IAM policy output

cb18d63

Address review: update signal type comment, add pagination note, guard div-by-zero

6e39bbb

Merge pull request #18 from gpuaudit/feature/multi-account-scanning

2043078

Add multi-account AWS scanning

Add K8s GPU metrics collection design spec

22cf265

Three-source fallback chain: CloudWatch Container Insights, DCGM exporter scrape, and Prometheus query. Per-node fallback with new ruleK8sLowGPUUtil analysis rule.

Add K8s GPU metrics collection implementation plan

ee8e309

Add EnrichK8sGPUMetrics for CloudWatch Container Insights GPU metrics

879f2c1

Add ProxyGet to K8sClient interface for pod API proxy

9a176fa

Add Prometheus query enrichment for K8s GPU metrics

98003ef

Add ruleK8sLowGPUUtil for utilization-based K8s GPU waste detection

1a35f95

Wire K8s GPU metrics fallback chain into CLI scan flow

89d9cb3

Add --prom-url and --prom-endpoint flags (mutually exclusive) for Prometheus GPU metrics. Orchestrate the 3-source fallback chain (CloudWatch Container Insights → DCGM scrape → Prometheus) between K8s discovery and analysis.
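The per-node fallback across the three sources can be sketched as an ordered list of sources, where each source only fills in nodes that still lack data. The `Node` and `Source` types and the `enrich` function are hypothetical; the real enrichment functions query CloudWatch, DCGM exporter pods, and Prometheus.

```go
package main

import "fmt"

// Node is a hypothetical K8s node record. GPUUtil stays nil until some
// source provides a value.
type Node struct {
	Name    string
	GPUUtil *float64
	Source  string
}

// Source is one metrics backend. Fetch reports false when it has no
// data for the node.
type Source struct {
	Name  string
	Fetch func(node string) (float64, bool)
}

// enrich tries each source in order and skips nodes that already have
// utilization data from an earlier source.
func enrich(nodes []*Node, sources []Source) {
	for _, src := range sources {
		for _, n := range nodes {
			if n.GPUUtil != nil {
				continue
			}
			if v, ok := src.Fetch(n.Name); ok {
				val := v
				n.GPUUtil = &val
				n.Source = src.Name
			}
		}
	}
}

func main() {
	nodes := []*Node{{Name: "node-a"}, {Name: "node-b"}}
	sources := []Source{
		{Name: "cloudwatch", Fetch: func(n string) (float64, bool) { return 12, n == "node-a" }},
		{Name: "dcgm", Fetch: func(string) (float64, bool) { return 55, true }},
	}
	enrich(nodes, sources)
	for _, n := range nodes {
		fmt.Println(n.Name, n.Source, *n.GPUUtil)
	}
}
```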

Include time window in low GPU utilization recommendation text

fa00dff

Skip CW enrichment when AWS creds unavailable, reduce DCGM noise

51db9f4

Merge pull request #19 from gpuaudit/feature/spot-recommendations

2fcb210

Add spot instance recommendations

Merge origin/master into feature/k8s-gpu-metrics

0a1b2a1

Resolve conflicts between K8s GPU metrics (rules 7-9, CLI wiring) and master additions (spot recommendations, multi-account scanning, diff command). Keep both ruleSpotEligible and ruleK8sLowGPUUtil with their full test suites.

Merge pull request #21 from gpuaudit/feature/k8s-gpu-metrics

29c8fcc

Add K8s GPU metrics collection (CW, DCGM, Prometheus)

Add support for csv output format. Implementation for FormatCSV as well as ToCSVRecords helper function

fd998a9

Add CSV output formatter and tests

17900fa
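Since the PR's diff is not shown here, the following is only a guess at what a CSV formatter of this shape might look like. The names FormatCSV, ToCSVRecords, and GPUInstance appear in the commit messages, but the struct fields, column names, and function signatures below are assumptions.

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
	"strconv"
)

// GPUInstance here is a hypothetical subset of the real struct.
type GPUInstance struct {
	InstanceID     string
	InstanceType   string
	MonthlyCost    float64
	Recommendation string
}

// ToCSVRecords converts instances into rows of strings, header first.
func ToCSVRecords(instances []GPUInstance) [][]string {
	records := [][]string{{"instance_id", "instance_type", "monthly_cost", "recommendation"}}
	for _, inst := range instances {
		records = append(records, []string{
			inst.InstanceID,
			inst.InstanceType,
			strconv.FormatFloat(inst.MonthlyCost, 'f', 2, 64),
			inst.Recommendation,
		})
	}
	return records
}

// FormatCSV writes the records to w. encoding/csv quotes fields that
// contain commas, quotes, or newlines, which matters for free-text
// recommendations.
func FormatCSV(w io.Writer, instances []GPUInstance) error {
	// WriteAll flushes the writer and returns any write error.
	return csv.NewWriter(w).WriteAll(ToCSVRecords(instances))
}

func main() {
	instances := []GPUInstance{{
		InstanceID:     "i-0123456789abcdef0",
		InstanceType:   "g6e.12xlarge",
		MonthlyCost:    123.4,
		Recommendation: "Node up 12 days with 0 GPU pods scheduled",
	}}
	if err := FormatCSV(os.Stdout, instances); err != nil {
		log.Fatal(err)
	}
}
```

Keeping the record conversion separate from the writer makes the rows easy to test without capturing output.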

Merge upstream/master into feature branch

4a7e24a

sospeter-57 closed this Apr 22, 2026

sospeter-57 reopened this Apr 22, 2026

sospeter-57 changed the title ~~Gpuaudit/feature/add csv output format~~ gpuaudit/feature/add_csv_output_format Apr 22, 2026

sospeter-57 closed this by deleting the head repository Apr 23, 2026

sospeter-57 wants to merge 67 commits into gpuaudit:master from sospeter-57:gpuaudit/feature/Add_CSV_output_format
