
gpuaudit

Scan your cloud for GPU waste and get actionable recommendations to cut your spend.

$ gpuaudit scan --skip-eks

  Found 38 GPU nodes across 47 nodes in gpu-cluster

  gpuaudit — GPU Cost Audit for AWS
  Account: 123456789012 | Regions: us-east-1 | Duration: 4.2s

  ┌──────────────────────────────────────────────────────────┐
  │  GPU Fleet Summary                                       │
  ├──────────────────────────────────────────────────────────┤
  │  Total GPU instances:       38                           │
  │  Total monthly GPU spend:  $127450                       │
  │  Estimated monthly waste:   $18200      (  14%)          │
  └──────────────────────────────────────────────────────────┘

  CRITICAL — 3 instance(s), $15400/mo potential savings

  Instance                             Type                       Monthly  Signal            Recommendation
  ──────────────────────────────────── ────────────────────────── ────────  ────────────────  ──────────────────────────────────────────────
  gpu-cluster/ip-10-15-255-248         g6e.16xlarge (1× L40S)     $  6752  idle              Node up 13 days with 0 GPU pods scheduled.
  gpu-cluster/ip-10-22-250-15          g6e.16xlarge (1× L40S)     $  6752  idle              Node up 1 day with 0 GPU pods scheduled.
  ...

What it scans

  • EC2 — GPU instances (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1) with CloudWatch metrics
  • SageMaker — Endpoints with GPU utilization and invocation metrics
  • EKS — Managed GPU node groups via the AWS EKS API
  • Kubernetes — GPU nodes and pod allocation via the Kubernetes API (Karpenter, self-managed, any CNI)

What it detects

  • Idle GPU instances — running but doing nothing (low CPU + near-zero network for 24+ hours)
  • Oversized GPU — multi-GPU instances where utilization suggests a single GPU would suffice
  • Pricing mismatch — on-demand instances running 30+ days that should be Reserved Instances
  • Stale instances — non-production instances running 90+ days
  • SageMaker low utilization — endpoints with <10% GPU utilization
  • SageMaker oversized — endpoints using <30% GPU memory on multi-GPU instances
  • K8s unallocated GPUs — nodes with GPU capacity but no pods requesting GPUs

Install

go install github.com/gpuaudit/cli/cmd/gpuaudit@latest

Or build from source:

git clone https://github.com/gpuaudit/cli.git
cd cli
go build -o gpuaudit ./cmd/gpuaudit

Quick start

# Uses default AWS credentials (~/.aws/credentials or environment variables)
gpuaudit scan

# Specific profile and region
gpuaudit scan --profile production --region us-east-1

# Kubernetes cluster scan (uses KUBECONFIG or ~/.kube/config)
gpuaudit scan --skip-eks

# Specific kubeconfig and context
gpuaudit scan --kubeconfig ~/.kube/config --kube-context gpu-cluster

# JSON output for automation
gpuaudit scan --format json -o report.json

# Compare two scans to see what changed
gpuaudit diff old-report.json new-report.json

# Slack Block Kit payload (pipe to webhook)
gpuaudit scan --format slack -o - | \
  curl -X POST -H 'Content-Type: application/json' -d @- $SLACK_WEBHOOK

# Skip specific scanners
gpuaudit scan --skip-metrics    # faster, less accurate
gpuaudit scan --skip-sagemaker
gpuaudit scan --skip-eks        # skip AWS EKS API (use --skip-k8s for Kubernetes API)
gpuaudit scan --skip-k8s

Comparing scans

Save scan results as JSON, then diff them later:

gpuaudit scan --format json -o scan-apr-08.json
# ... time passes, changes happen ...
gpuaudit scan --format json -o scan-apr-15.json
gpuaudit diff scan-apr-08.json scan-apr-15.json

  gpuaudit diff — 2026-04-08 12:00 UTC → 2026-04-15 12:00 UTC

  ┌──────────────────────────────────────────────────────────┐
  │  Cost Delta                                              │
  ├──────────────────────────────────────────────────────────┤
  │  Monthly spend:   $142000    → $127450    (-$14550)      │
  │  Estimated waste:  $31000    → $18200     (-$12800)      │
  │  Instances:        45 → 38   (-9 removed, +2 added)     │
  └──────────────────────────────────────────────────────────┘

  REMOVED — 9 instance(s), -$16200/mo
  ...

Matches instances by ID. Reports added, removed, and changed instances with per-field diffs (instance type, pricing model, cost, state, GPU allocation, waste severity).
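That matching step amounts to a keyed set comparison. A schematic Go sketch, assuming a simplified report keyed by instance ID (the real JSON schema carries more per-instance fields than cost alone):

```go
package main

import "fmt"

// report maps instance ID -> monthly cost; a stand-in for the full scan JSON.
type report map[string]float64

// diffReports matches instances by ID across two scans and buckets them into
// added, removed, and changed (old cost vs. new cost).
func diffReports(prev, curr report) (added, removed []string, changed map[string][2]float64) {
	changed = map[string][2]float64{}
	for id, cost := range curr {
		if prevCost, ok := prev[id]; !ok {
			added = append(added, id) // present now, absent before
		} else if prevCost != cost {
			changed[id] = [2]float64{prevCost, cost}
		}
	}
	for id := range prev {
		if _, ok := curr[id]; !ok {
			removed = append(removed, id) // present before, gone now
		}
	}
	return
}

func main() {
	prev := report{"i-aaa": 6752, "i-bbb": 3200}
	curr := report{"i-aaa": 2100, "i-ccc": 980}
	added, removed, changed := diffReports(prev, curr)
	fmt.Println(added, removed, changed)
}
```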

Multi-account scanning

Scan multiple AWS accounts in a single invocation using STS AssumeRole.

Prerequisites

Deploy a read-only IAM role (gpuaudit-reader) to each target account. See Cross-Account Role Setup below.

Usage

# Scan specific accounts
gpuaudit scan --targets 111111111111,222222222222 --role gpuaudit-reader

# Scan entire AWS Organization
gpuaudit scan --org --role gpuaudit-reader

# Exclude management account
gpuaudit scan --org --role gpuaudit-reader --skip-self

# With external ID
gpuaudit scan --targets 111111111111 --role gpuaudit-reader --external-id my-secret

Cross-Account Role Setup

Terraform

variable "management_account_id" {
  description = "AWS account ID where gpuaudit runs"
  type        = string
}

resource "aws_iam_role" "gpuaudit_reader" {
  name = "gpuaudit-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "gpuaudit_reader" {
  name   = "gpuaudit-policy"
  role   = aws_iam_role.gpuaudit_reader.id
  policy = file("gpuaudit-policy.json")  # from: gpuaudit iam-policy > gpuaudit-policy.json
}

Deploy to all accounts using Terraform workspaces or CloudFormation StackSets.

CloudFormation StackSet

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ManagementAccountId:
    Type: String
Resources:
  GpuAuditRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: gpuaudit-reader
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${ManagementAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: gpuaudit-policy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeInstanceTypes
                  - ec2:DescribeRegions
                  - sagemaker:ListEndpoints
                  - sagemaker:DescribeEndpoint
                  - sagemaker:DescribeEndpointConfig
                  - eks:ListClusters
                  - eks:ListNodegroups
                  - eks:DescribeNodegroup
                  - cloudwatch:GetMetricData
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                  - ce:GetCostAndUsage
                  - ce:GetReservationUtilization
                  - ce:GetSavingsPlansUtilization
                  - pricing:GetProducts
                Resource: "*"

IAM permissions

gpuaudit is read-only. It never modifies your infrastructure. Generate the minimal IAM policy:

gpuaudit iam-policy

For Kubernetes scanning, gpuaudit needs get/list on nodes and pods cluster-wide.
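A minimal ClusterRole granting exactly that access might look like the following; the role name is an arbitrary choice, and you would bind it to whatever service account or user runs gpuaudit:

```yaml
# Read-only ClusterRole for gpuaudit's Kubernetes scanner.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpuaudit-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
```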

GPU pricing reference

# List all GPU instance pricing
gpuaudit pricing

# Filter by GPU model
gpuaudit pricing --gpu H100
gpuaudit pricing --gpu L4

Output formats

Format     Flag                        Use case
─────────  ──────────────────────────  ─────────────────────────────────
Table      --format table (default)    Terminal viewing
JSON       --format json               Automation, CI/CD, gpuaudit diff
Markdown   --format markdown           PRs, wikis, docs
Slack      --format slack              Slack webhook integration

How it works

  1. Discovery — Scans EC2, SageMaker, EKS node groups, and Kubernetes API across multiple regions for GPU resources
  2. Metrics — Collects 7-day CloudWatch metrics: CPU, network I/O for EC2; GPU utilization, GPU memory, invocations for SageMaker
  3. K8s allocation — Lists pods requesting nvidia.com/gpu resources and maps them to nodes
  4. Analysis — Applies 7 waste detection rules with severity levels (critical/warning/info)
  5. Recommendations — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings
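Step 3 boils down to summing each pod's nvidia.com/gpu requests per node and subtracting from the node's GPU capacity; nodes left at full capacity are the unallocated-GPU candidates. A schematic sketch with illustrative types (gpuaudit's real scanner reads this from the Kubernetes API):

```go
package main

import "fmt"

// pod is a stand-in for a scheduled pod's GPU resource requests.
type pod struct {
	Node        string
	GPURequests int64 // sum of nvidia.com/gpu requests across containers
}

// unallocatedGPUs returns, per node, GPU capacity minus GPUs requested by
// pods scheduled there. A node whose result equals its capacity has GPUs
// but no workload asking for them.
func unallocatedGPUs(capacity map[string]int64, pods []pod) map[string]int64 {
	free := make(map[string]int64, len(capacity))
	for node, c := range capacity {
		free[node] = c
	}
	for _, p := range pods {
		free[p.Node] -= p.GPURequests
	}
	return free
}

func main() {
	capacity := map[string]int64{"ip-10-15-255-248": 1, "ip-10-0-1-7": 4}
	pods := []pod{{Node: "ip-10-0-1-7", GPURequests: 2}}
	fmt.Println(unallocatedGPUs(capacity, pods))
}
```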

Project structure

gpuaudit/
├── cmd/gpuaudit/          CLI entry point (cobra)
├── internal/
│   ├── models/            Core data types (GPUInstance, WasteSignal, Recommendation)
│   ├── pricing/           Bundled GPU pricing database (40+ instance types)
│   ├── analysis/          Waste detection rules engine (7 rules)
│   ├── diff/              Scan comparison logic
│   ├── output/            Formatters (table, JSON, markdown, Slack, diff)
│   └── providers/
│       ├── aws/           EC2, SageMaker, EKS, CloudWatch, Cost Explorer
│       └── k8s/           Kubernetes API GPU node/pod discovery
└── LICENSE                Apache 2.0

Roadmap

  • DCGM GPU metrics via Kubernetes (actual GPU utilization, not just allocation)
  • SageMaker training job analysis
  • GCP + Azure support
  • GitHub Action for scheduled scans

License

Apache 2.0
