Scan your cloud for GPU waste and get actionable recommendations to cut your spend.

```
$ gpuaudit scan --skip-eks
Found 38 GPU nodes across 47 nodes in gpu-cluster

gpuaudit — GPU Cost Audit for AWS
Account: 123456789012 | Regions: us-east-1 | Duration: 4.2s

┌──────────────────────────────────────────────────────────┐
│ GPU Fleet Summary                                        │
├──────────────────────────────────────────────────────────┤
│ Total GPU instances:     38                              │
│ Total monthly GPU spend: $127,450                        │
│ Estimated monthly waste: $18,200 (14%)                   │
└──────────────────────────────────────────────────────────┘

CRITICAL — 3 instance(s), $15,400/mo potential savings

Instance                      Type                    Monthly  Signal  Recommendation
────────────────────────────  ──────────────────────  ───────  ──────  ──────────────────────────────────────────
gpu-cluster/ip-10-15-255-248  g6e.16xlarge (1× L40S)  $6,752   idle    Node up 13 days with 0 GPU pods scheduled.
gpu-cluster/ip-10-22-250-15   g6e.16xlarge (1× L40S)  $6,752   idle    Node up 1 day with 0 GPU pods scheduled.
...
```
gpuaudit scans four sources:

- EC2 — GPU instances (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1) with CloudWatch metrics
- SageMaker — Endpoints with GPU utilization and invocation metrics
- EKS — Managed GPU node groups via the AWS EKS API
- Kubernetes — GPU nodes and pod allocation via the Kubernetes API (Karpenter, self-managed, any CNI)
It detects seven kinds of waste:

- Idle GPU instances — running but doing nothing (low CPU + near-zero network for 24+ hours)
- Oversized GPU — multi-GPU instances where utilization suggests a single GPU would suffice
- Pricing mismatch — on-demand instances running 30+ days that should be Reserved Instances
- Stale instances — non-production instances running 90+ days
- SageMaker low utilization — endpoints with <10% GPU utilization
- SageMaker oversized — endpoints using <30% GPU memory on multi-GPU instances
- K8s unallocated GPUs — nodes with GPU capacity but no pods requesting GPUs
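The idle rule above boils down to "low CPU plus near-zero network for 24+ hours". A minimal sketch of such a check in Go; the thresholds, field names, and the `LooksIdle` helper are illustrative assumptions, not gpuaudit's actual implementation:

```go
package main

import "fmt"

// InstanceMetrics holds a CloudWatch-style metrics window for one instance.
// Field names here are illustrative, not gpuaudit's real types.
type InstanceMetrics struct {
	AvgCPUPercent float64 // average CPUUtilization over the window
	NetworkBytes  float64 // total NetworkIn + NetworkOut over the window
	WindowHours   float64 // length of the observation window
}

// LooksIdle flags an instance that is running but doing nothing:
// low CPU and near-zero network traffic for at least 24 hours.
func LooksIdle(m InstanceMetrics) bool {
	const (
		cpuThreshold = 3.0      // percent; assumed cutoff
		netThreshold = 10 << 20 // ~10 MiB over the whole window; assumed cutoff
	)
	return m.WindowHours >= 24 &&
		m.AvgCPUPercent < cpuThreshold &&
		m.NetworkBytes < netThreshold
}

func main() {
	busy := InstanceMetrics{AvgCPUPercent: 41.5, NetworkBytes: 9e9, WindowHours: 24}
	idle := InstanceMetrics{AvgCPUPercent: 0.8, NetworkBytes: 2048, WindowHours: 72}
	fmt.Println(LooksIdle(busy), LooksIdle(idle)) // false true
}
```

The 24-hour minimum window matters: a node that has been quiet for an hour between jobs is not waste.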
Install:

```sh
go install github.com/gpuaudit/cli/cmd/gpuaudit@latest
```

Or build from source:

```sh
git clone https://github.com/gpuaudit/cli.git
cd cli
go build -o gpuaudit ./cmd/gpuaudit
```

Usage:

```sh
# Uses default AWS credentials (~/.aws/credentials or environment variables)
gpuaudit scan

# Specific profile and region
gpuaudit scan --profile production --region us-east-1

# Kubernetes cluster scan (uses KUBECONFIG or ~/.kube/config)
gpuaudit scan --skip-eks

# Specific kubeconfig and context
gpuaudit scan --kubeconfig ~/.kube/config --kube-context gpu-cluster

# JSON output for automation
gpuaudit scan --format json -o report.json

# Compare two scans to see what changed
gpuaudit diff old-report.json new-report.json

# Slack Block Kit payload (pipe to webhook)
gpuaudit scan --format slack -o - | \
  curl -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK"

# Skip specific scanners
gpuaudit scan --skip-metrics    # faster, less accurate
gpuaudit scan --skip-sagemaker
gpuaudit scan --skip-eks        # skip AWS EKS API (use --skip-k8s for Kubernetes API)
gpuaudit scan --skip-k8s
```

Save scan results as JSON, then diff them later:
```sh
gpuaudit scan --format json -o scan-apr-08.json
# ... time passes, changes happen ...
gpuaudit scan --format json -o scan-apr-15.json
gpuaudit diff scan-apr-08.json scan-apr-15.json
```

```
gpuaudit diff — 2026-04-08 12:00 UTC → 2026-04-15 12:00 UTC

┌──────────────────────────────────────────────────────────┐
│ Cost Delta                                               │
├──────────────────────────────────────────────────────────┤
│ Monthly spend:    $142,000 → $127,450 (-$14,550)         │
│ Estimated waste:  $31,000 → $18,200 (-$12,800)           │
│ Instances:        45 → 38 (-9 removed, +2 added)         │
└──────────────────────────────────────────────────────────┘

REMOVED — 9 instance(s), -$16,200/mo
...
```
`gpuaudit diff` matches instances by ID and reports added, removed, and changed instances with per-field diffs (instance type, pricing model, cost, state, GPU allocation, waste severity).
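A sketch of that ID-based matching in Go; the `Instance` record is trimmed to two fields for illustration, and `DiffByID` is an assumed helper, not gpuaudit's actual diff code:

```go
package main

import "fmt"

// Instance is a trimmed-down stand-in for a scan record; the real
// report schema carries more fields (pricing model, state, GPU allocation).
type Instance struct {
	ID          string
	MonthlyCost float64
}

// DiffByID matches two scans on instance ID and reports what changed.
// changed maps ID -> [old cost, new cost].
func DiffByID(prev, cur []Instance) (added, removed []string, changed map[string][2]float64) {
	prevByID := map[string]Instance{}
	for _, i := range prev {
		prevByID[i.ID] = i
	}
	changed = map[string][2]float64{}
	seen := map[string]bool{}
	for _, i := range cur {
		seen[i.ID] = true
		old, ok := prevByID[i.ID]
		switch {
		case !ok:
			added = append(added, i.ID)
		case old.MonthlyCost != i.MonthlyCost:
			changed[i.ID] = [2]float64{old.MonthlyCost, i.MonthlyCost}
		}
	}
	for _, i := range prev {
		if !seen[i.ID] {
			removed = append(removed, i.ID)
		}
	}
	return added, removed, changed
}

func main() {
	prev := []Instance{{"i-a", 6752}, {"i-b", 1200}}
	cur := []Instance{{"i-b", 980}, {"i-c", 3400}}
	a, r, c := DiffByID(prev, cur)
	fmt.Println(a, r, c) // [i-c] [i-a] map[i-b:[1200 980]]
}
```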
Scan multiple AWS accounts in a single invocation using STS AssumeRole.
Deploy a read-only IAM role (`gpuaudit-reader`) to each target account; example Terraform and CloudFormation for this role are below.
```sh
# Scan specific accounts
gpuaudit scan --targets 111111111111,222222222222 --role gpuaudit-reader

# Scan entire AWS Organization
gpuaudit scan --org --role gpuaudit-reader

# Exclude management account
gpuaudit scan --org --role gpuaudit-reader --skip-self

# With external ID
gpuaudit scan --targets 111111111111 --role gpuaudit-reader --external-id my-secret
```

Cross-account role setup, as Terraform for each target account:

```hcl
variable "management_account_id" {
  description = "AWS account ID where gpuaudit runs"
  type        = string
}

resource "aws_iam_role" "gpuaudit_reader" {
  name = "gpuaudit-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "gpuaudit_reader" {
  name   = "gpuaudit-policy"
  role   = aws_iam_role.gpuaudit_reader.id
  policy = file("gpuaudit-policy.json") # from: gpuaudit iam-policy > gpuaudit-policy.json
}
```

Deploy to all accounts using Terraform workspaces or CloudFormation StackSets.
The same role as a CloudFormation template, suitable for StackSets:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ManagementAccountId:
    Type: String
Resources:
  GpuAuditRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: gpuaudit-reader
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${ManagementAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: gpuaudit-policy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeInstanceTypes
                  - ec2:DescribeRegions
                  - sagemaker:ListEndpoints
                  - sagemaker:DescribeEndpoint
                  - sagemaker:DescribeEndpointConfig
                  - eks:ListClusters
                  - eks:ListNodegroups
                  - eks:DescribeNodegroup
                  - cloudwatch:GetMetricData
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                  - ce:GetCostAndUsage
                  - ce:GetReservationUtilization
                  - ce:GetSavingsPlansUtilization
                  - pricing:GetProducts
                Resource: "*"
```

gpuaudit is read-only. It never modifies your infrastructure. Generate the minimal IAM policy:
```sh
gpuaudit iam-policy
```

For Kubernetes scanning, gpuaudit needs `get`/`list` on nodes and pods cluster-wide.
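A minimal ClusterRole granting that access might look like the following; the role/binding names and the `User` subject are illustrative, so bind to whichever identity actually runs the scan:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpuaudit-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpuaudit-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpuaudit-reader
subjects:
  - kind: User
    name: gpuaudit   # illustrative; use the identity running gpuaudit
```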
Look up GPU pricing:

```sh
# List all GPU instance pricing
gpuaudit pricing

# Filter by GPU model
gpuaudit pricing --gpu H100
gpuaudit pricing --gpu L4
```

Output formats:

| Format   | Flag                       | Use case                           |
|----------|----------------------------|------------------------------------|
| Table    | `--format table` (default) | Terminal viewing                   |
| JSON     | `--format json`            | Automation, CI/CD, `gpuaudit diff` |
| Markdown | `--format markdown`        | PRs, wikis, docs                   |
| Slack    | `--format slack`           | Slack webhook integration          |
A scan runs in five stages:

- Discovery — Scans EC2, SageMaker, EKS node groups, and the Kubernetes API across multiple regions for GPU resources
- Metrics — Collects 7-day CloudWatch metrics: CPU, network I/O for EC2; GPU utilization, GPU memory, invocations for SageMaker
- K8s allocation — Lists pods requesting `nvidia.com/gpu` resources and maps them to nodes
- Analysis — Applies 7 waste detection rules with severity levels (critical/warning/info)
- Recommendations — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings
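The recommendation stage can be sketched as a pricing lookup; the prices and the `downsizeSavings` helper below are illustrative placeholders, not gpuaudit's bundled pricing database:

```go
package main

import "fmt"

// monthlyOnDemand is a tiny excerpt-style pricing map.
// These figures are illustrative placeholders.
var monthlyOnDemand = map[string]float64{
	"g6e.16xlarge": 6752,
	"g6e.4xlarge":  2180,
}

// downsizeSavings estimates monthly savings from replacing instance
// type `from` with the smaller type `to`. ok is false when either
// type is unknown or the swap would not save money.
func downsizeSavings(from, to string) (savings float64, ok bool) {
	a, ok1 := monthlyOnDemand[from]
	b, ok2 := monthlyOnDemand[to]
	if !ok1 || !ok2 || b >= a {
		return 0, false
	}
	return a - b, true
}

func main() {
	if s, ok := downsizeSavings("g6e.16xlarge", "g6e.4xlarge"); ok {
		fmt.Printf("downsize saves $%.0f/mo\n", s) // downsize saves $4572/mo
	}
}
```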
```
gpuaudit/
├── cmd/gpuaudit/        CLI entry point (cobra)
├── internal/
│   ├── models/          Core data types (GPUInstance, WasteSignal, Recommendation)
│   ├── pricing/         Bundled GPU pricing database (40+ instance types)
│   ├── analysis/        Waste detection rules engine (7 rules)
│   ├── diff/            Scan comparison logic
│   ├── output/          Formatters (table, JSON, markdown, Slack, diff)
│   └── providers/
│       ├── aws/         EC2, SageMaker, EKS, CloudWatch, Cost Explorer
│       └── k8s/         Kubernetes API GPU node/pod discovery
└── LICENSE              Apache 2.0
```
Roadmap:

- DCGM GPU metrics via Kubernetes (actual GPU utilization, not just allocation)
- SageMaker training job analysis
- GCP + Azure support
- GitHub Action for scheduled scans
Apache 2.0