
gpuaudit

Scan your cloud for GPU waste and get actionable recommendations to cut your spend.

$ gpuaudit scan --skip-eks

  Found 38 GPU nodes across 47 nodes in gpu-cluster

  gpuaudit — GPU Cost Audit for AWS
  Account: 123456789012 | Regions: us-east-1 | Duration: 4.2s

  ┌──────────────────────────────────────────────────────────┐
  │  GPU Fleet Summary                                       │
  ├──────────────────────────────────────────────────────────┤
  │  Total GPU instances:       38                           │
  │  Total monthly GPU spend:  $127450                       │
  │  Estimated monthly waste:   $18200      (  14%)          │
  └──────────────────────────────────────────────────────────┘

  CRITICAL — 3 instance(s), $15400/mo potential savings

  Instance                             Type                       Monthly  Signal            Recommendation
  ──────────────────────────────────── ────────────────────────── ────────  ────────────────  ──────────────────────────────────────────────
  gpu-cluster/ip-10-15-255-248         g6e.16xlarge (1× L40S)     $  6752  idle              Node up 13 days with 0 GPU pods scheduled.
  gpu-cluster/ip-10-22-250-15          g6e.16xlarge (1× L40S)     $  6752  idle              Node up 1 day with 0 GPU pods scheduled.
  ...

What it scans

  • EC2 — GPU instances (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1) with CloudWatch metrics
  • SageMaker — Endpoints with GPU utilization and invocation metrics
  • EKS — Managed GPU node groups via the AWS EKS API
  • Kubernetes — GPU nodes and pod allocation via the Kubernetes API (Karpenter, self-managed, any CNI)

What it detects

  • Idle GPU instances — running but doing nothing (low CPU + near-zero network for 24+ hours)
  • Oversized GPU — multi-GPU instances where utilization suggests a single GPU would suffice
  • Pricing mismatch — on-demand instances running 30+ days that should be Reserved Instances
  • Stale instances — non-production instances running 90+ days
  • SageMaker low utilization — endpoints with <10% GPU utilization
  • SageMaker oversized — endpoints using <30% GPU memory on multi-GPU instances
  • K8s unallocated GPUs — nodes with GPU capacity but no pods requesting GPUs
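To make the first rule concrete, the idle check boils down to a thresholded scan over a metrics window. The sketch below is illustrative, not gpuaudit's actual implementation: the function name and exact thresholds are hypothetical stand-ins for "low CPU + near-zero network for 24+ hours".

```python
# Hypothetical sketch of the idle-instance rule ("low CPU + near-zero
# network for 24+ hours"). Names and thresholds are illustrative only.
def is_idle(samples, cpu_pct_max=5.0, net_bytes_max=1_000_000, min_hours=24):
    """samples: list of (cpu_pct, net_bytes) tuples, one per hour,
    oldest first. Returns True when the most recent min_hours samples
    all sit below both thresholds."""
    if len(samples) < min_hours:
        return False  # not enough history to judge
    recent = samples[-min_hours:]
    return all(cpu <= cpu_pct_max and net <= net_bytes_max
               for cpu, net in recent)
```

An instance flagged this way would surface with critical severity and a terminate recommendation, as in the sample output above.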

Install

go install github.com/gpuaudit/cli/cmd/gpuaudit@latest

Or build from source:

git clone https://github.com/gpuaudit/cli.git
cd cli
go build -o gpuaudit ./cmd/gpuaudit

Quick start

# Uses default AWS credentials (~/.aws/credentials or environment variables)
gpuaudit scan

# Specific profile and region
gpuaudit scan --profile production --region us-east-1

# Kubernetes cluster scan (uses KUBECONFIG or ~/.kube/config)
gpuaudit scan --skip-eks

# Specific kubeconfig and context
gpuaudit scan --kubeconfig ~/.kube/config --kube-context gpu-cluster

# JSON output for automation
gpuaudit scan --format json -o report.json

# Compare two scans to see what changed
gpuaudit diff old-report.json new-report.json

# Slack Block Kit payload (pipe to webhook)
gpuaudit scan --format slack -o - | \
  curl -X POST -H 'Content-Type: application/json' -d @- $SLACK_WEBHOOK

# Skip specific scanners
gpuaudit scan --skip-metrics    # faster, less accurate
gpuaudit scan --skip-sagemaker
gpuaudit scan --skip-eks        # skip AWS EKS API (use --skip-k8s for Kubernetes API)
gpuaudit scan --skip-k8s
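The JSON output lends itself to CI gating. The snippet below is a hedged sketch: the field name `estimated_monthly_waste` is an assumption about the report schema, not a documented contract — inspect a real report.json before relying on it.

```python
import json
import sys

# CI gate sketch: fail the job when estimated waste crosses a budget.
# "estimated_monthly_waste" is an ASSUMED report key, not documented.
def check_waste(report_path, budget=5000):
    with open(report_path) as f:
        report = json.load(f)
    waste = report.get("estimated_monthly_waste", 0)
    if waste > budget:
        print(f"GPU waste ${waste}/mo exceeds budget ${budget}/mo")
        return 1
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(check_waste(sys.argv[1]))
```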

Comparing scans

Save scan results as JSON, then diff them later:

gpuaudit scan --format json -o scan-apr-08.json
# ... time passes, changes happen ...
gpuaudit scan --format json -o scan-apr-15.json
gpuaudit diff scan-apr-08.json scan-apr-15.json
  gpuaudit diff — 2026-04-08 12:00 UTC → 2026-04-15 12:00 UTC

  ┌──────────────────────────────────────────────────────────┐
  │  Cost Delta                                              │
  ├──────────────────────────────────────────────────────────┤
  │  Monthly spend:   $142000    → $127450    (-$14550)      │
  │  Estimated waste:  $31000    → $18200     (-$12800)      │
  │  Instances:        45 → 38   (-9 removed, +2 added)      │
  └──────────────────────────────────────────────────────────┘

  REMOVED — 9 instance(s), -$16200/mo
  ...

The diff matches instances by ID and reports added, removed, and changed instances with per-field diffs (instance type, pricing model, cost, state, GPU allocation, waste severity).
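Conceptually, matching by ID is set arithmetic over the two reports. A minimal sketch, using illustrative data shapes rather than gpuaudit's internal types:

```python
# Diff-by-ID sketch: each scan is a dict mapping instance ID to a dict
# of fields (type, pricing, cost, ...). Illustrative, not gpuaudit code.
def diff_scans(old, new):
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for iid in set(old) & set(new):
        field_diffs = {k: (old[iid].get(k), new[iid].get(k))
                       for k in set(old[iid]) | set(new[iid])
                       if old[iid].get(k) != new[iid].get(k)}
        if field_diffs:
            changed[iid] = field_diffs  # per-field (before, after) pairs
    return added, removed, changed
```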

Multi-Account Scanning

Scan multiple AWS accounts in a single invocation using STS AssumeRole.

Prerequisites

Deploy a read-only IAM role (gpuaudit-reader) to each target account. See Cross-Account Role Setup below.

Usage

# Scan specific accounts
gpuaudit scan --targets 111111111111,222222222222 --role gpuaudit-reader

# Scan entire AWS Organization
gpuaudit scan --org --role gpuaudit-reader

# Exclude management account
gpuaudit scan --org --role gpuaudit-reader --skip-self

# With external ID
gpuaudit scan --targets 111111111111 --role gpuaudit-reader --external-id my-secret

Cross-Account Role Setup

Terraform

variable "management_account_id" {
  description = "AWS account ID where gpuaudit runs"
  type        = string
}

resource "aws_iam_role" "gpuaudit_reader" {
  name = "gpuaudit-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "gpuaudit_reader" {
  name   = "gpuaudit-policy"
  role   = aws_iam_role.gpuaudit_reader.id
  policy = file("gpuaudit-policy.json")  # from: gpuaudit iam-policy > gpuaudit-policy.json
}

Deploy to all accounts using Terraform workspaces or CloudFormation StackSets.

CloudFormation StackSet

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ManagementAccountId:
    Type: String
Resources:
  GpuAuditRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: gpuaudit-reader
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${ManagementAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: gpuaudit-policy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeInstanceTypes
                  - ec2:DescribeRegions
                  - sagemaker:ListEndpoints
                  - sagemaker:DescribeEndpoint
                  - sagemaker:DescribeEndpointConfig
                  - eks:ListClusters
                  - eks:ListNodegroups
                  - eks:DescribeNodegroup
                  - cloudwatch:GetMetricData
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                  - ce:GetCostAndUsage
                  - ce:GetReservationUtilization
                  - ce:GetSavingsPlansUtilization
                  - pricing:GetProducts
                Resource: "*"

IAM permissions

gpuaudit is read-only. It never modifies your infrastructure. Generate the minimal IAM policy:

gpuaudit iam-policy

For Kubernetes scanning, gpuaudit needs get/list on nodes and pods cluster-wide.
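A minimal ClusterRole granting that access might look like the following. The role and binding names are illustrative, and the subject is an example — bind to whichever user or service account actually runs gpuaudit.

```yaml
# Illustrative RBAC for gpuaudit's Kubernetes scan: read-only get/list
# on nodes and pods, cluster-wide. Names and subject are examples.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpuaudit-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpuaudit-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpuaudit-reader
subjects:
  - kind: User
    name: gpuaudit   # example subject; replace with your user/SA
```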

GPU pricing reference

# List all GPU instance pricing
gpuaudit pricing

# Filter by GPU model
gpuaudit pricing --gpu H100
gpuaudit pricing --gpu L4

Output formats

Format     Flag                        Use case
Table      --format table (default)    Terminal viewing
JSON       --format json               Automation, CI/CD, gpuaudit diff
Markdown   --format markdown           PRs, wikis, docs
Slack      --format slack              Slack webhook integration
How it works

  1. Discovery — Scans EC2, SageMaker, EKS node groups, and Kubernetes API across multiple regions for GPU resources
  2. Metrics — Collects 7-day CloudWatch metrics: CPU, network I/O for EC2; GPU utilization, GPU memory, invocations for SageMaker
  3. K8s allocation — Lists pods requesting nvidia.com/gpu resources and maps them to nodes
  4. Analysis — Applies 7 waste detection rules with severity levels (critical/warning/info)
  5. Recommendations — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings

Project structure

gpuaudit/
├── cmd/gpuaudit/          CLI entry point (cobra)
├── internal/
│   ├── models/            Core data types (GPUInstance, WasteSignal, Recommendation)
│   ├── pricing/           Bundled GPU pricing database (40+ instance types)
│   ├── analysis/          Waste detection rules engine (7 rules)
│   ├── diff/              Scan comparison logic
│   ├── output/            Formatters (table, JSON, markdown, Slack, diff)
│   └── providers/
│       ├── aws/           EC2, SageMaker, EKS, CloudWatch, Cost Explorer
│       └── k8s/           Kubernetes API GPU node/pod discovery
└── LICENSE                Apache 2.0

Roadmap

  • DCGM GPU metrics via Kubernetes (actual GPU utilization, not just allocation)
  • SageMaker training job analysis
  • GCP + Azure support
  • GitHub Action for scheduled scans

License

Apache 2.0