
gpuaudit

Scan your cloud for GPU waste and get actionable recommendations to cut your spend.

$ gpuaudit scan --skip-eks

  Found 38 GPU nodes across 47 nodes in gpu-cluster

  gpuaudit — GPU Cost Audit for AWS
  Account: 123456789012 | Regions: us-east-1 | Duration: 4.2s

  ┌──────────────────────────────────────────────────────────┐
  │  GPU Fleet Summary                                       │
  ├──────────────────────────────────────────────────────────┤
  │  Total GPU instances:       38                           │
  │  Total monthly GPU spend:  $127450                       │
  │  Estimated monthly waste:   $18200      (  14%)          │
  └──────────────────────────────────────────────────────────┘

  CRITICAL — 3 instance(s), $15400/mo potential savings

  Instance                             Type                       Monthly  Signal            Recommendation
  ──────────────────────────────────── ────────────────────────── ────────  ────────────────  ──────────────────────────────────────────────
  gpu-cluster/ip-10-15-255-248         g6e.16xlarge (1× L40S)     $  6752  idle              Node up 13 days with 0 GPU pods scheduled.
  gpu-cluster/ip-10-22-250-15          g6e.16xlarge (1× L40S)     $  6752  idle              Node up 1 day with 0 GPU pods scheduled.
  ...

What it scans

  • EC2 — GPU instances (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1) with CloudWatch metrics
  • SageMaker — Endpoints with GPU utilization and invocation metrics
  • EKS — Managed GPU node groups via the AWS EKS API
  • Kubernetes — GPU nodes and pod allocation via the Kubernetes API (Karpenter, self-managed, any CNI)

What it detects

  • Idle GPU instances — running but doing nothing (low CPU + near-zero network for 24+ hours)
  • Oversized GPU — multi-GPU instances where utilization suggests a single GPU would suffice
  • Pricing mismatch — on-demand instances running 30+ days that should be Reserved Instances
  • Stale instances — non-production instances running 90+ days
  • SageMaker low utilization — endpoints with <10% GPU utilization
  • SageMaker oversized — endpoints using <30% GPU memory on multi-GPU instances
  • K8s unallocated GPUs — nodes with GPU capacity but no pods requesting GPUs
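To make the first rule concrete, the idle check boils down to a thresholded scan over a metrics window. The sketch below is illustrative, not gpuaudit's actual implementation: the function name and exact thresholds are hypothetical stand-ins for "low CPU + near-zero network for 24+ hours".

```python
# Hypothetical sketch of the idle-instance rule ("low CPU + near-zero
# network for 24+ hours"). Names and thresholds are illustrative only.
def is_idle(samples, cpu_pct_max=5.0, net_bytes_max=1_000_000, min_hours=24):
    """samples: list of (cpu_pct, net_bytes) tuples, one per hour,
    oldest first. Returns True when the most recent min_hours samples
    all sit below both thresholds."""
    if len(samples) < min_hours:
        return False  # not enough history to judge
    recent = samples[-min_hours:]
    return all(cpu <= cpu_pct_max and net <= net_bytes_max
               for cpu, net in recent)
```

An instance flagged this way would surface with critical severity and a terminate recommendation, as in the sample output above.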

Install

go install github.com/gpuaudit/cli/cmd/gpuaudit@latest

Or build from source:

git clone https://github.com/gpuaudit/cli.git
cd cli
go build -o gpuaudit ./cmd/gpuaudit

Quick start

# Uses default AWS credentials (~/.aws/credentials or environment variables)
gpuaudit scan

# Specific profile and region
gpuaudit scan --profile production --region us-east-1

# Kubernetes cluster scan (uses KUBECONFIG or ~/.kube/config)
gpuaudit scan --skip-eks

# Specific kubeconfig and context
gpuaudit scan --kubeconfig ~/.kube/config --kube-context gpu-cluster

# JSON output for automation
gpuaudit scan --format json -o report.json

# Compare two scans to see what changed
gpuaudit diff old-report.json new-report.json

# Slack Block Kit payload (pipe to webhook)
gpuaudit scan --format slack -o - | \
  curl -X POST -H 'Content-Type: application/json' -d @- $SLACK_WEBHOOK

# Skip specific scanners
gpuaudit scan --skip-metrics    # faster, less accurate
gpuaudit scan --skip-sagemaker
gpuaudit scan --skip-eks        # skip AWS EKS API (use --skip-k8s for Kubernetes API)
gpuaudit scan --skip-k8s
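The JSON output lends itself to CI gating. The snippet below is a hedged sketch: the field name `estimated_monthly_waste` is an assumption about the report schema, not a documented contract — inspect a real report.json before relying on it.

```python
import json
import sys

# CI gate sketch: fail the job when estimated waste crosses a budget.
# "estimated_monthly_waste" is an ASSUMED report key, not documented.
def check_waste(report_path, budget=5000):
    with open(report_path) as f:
        report = json.load(f)
    waste = report.get("estimated_monthly_waste", 0)
    if waste > budget:
        print(f"GPU waste ${waste}/mo exceeds budget ${budget}/mo")
        return 1
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(check_waste(sys.argv[1]))
```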

Comparing scans

Save scan results as JSON, then diff them later:

gpuaudit scan --format json -o scan-apr-08.json
# ... time passes, changes happen ...
gpuaudit scan --format json -o scan-apr-15.json
gpuaudit diff scan-apr-08.json scan-apr-15.json
  gpuaudit diff — 2026-04-08 12:00 UTC → 2026-04-15 12:00 UTC

  ┌──────────────────────────────────────────────────────────┐
  │  Cost Delta                                              │
  ├──────────────────────────────────────────────────────────┤
  │  Monthly spend:   $142000    → $127450    (-$14550)      │
  │  Estimated waste:  $31000    → $18200     (-$12800)      │
  │  Instances:        45 → 38   (-9 removed, +2 added)      │
  └──────────────────────────────────────────────────────────┘

  REMOVED — 9 instance(s), -$16200/mo
  ...

The diff matches instances by ID and reports added, removed, and changed instances with per-field diffs (instance type, pricing model, cost, state, GPU allocation, waste severity).
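Conceptually, matching by ID is set arithmetic over the two reports. A minimal sketch, using illustrative data shapes rather than gpuaudit's internal types:

```python
# Diff-by-ID sketch: each scan is a dict mapping instance ID to a dict
# of fields (type, pricing, cost, ...). Illustrative, not gpuaudit code.
def diff_scans(old, new):
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for iid in set(old) & set(new):
        field_diffs = {k: (old[iid].get(k), new[iid].get(k))
                       for k in set(old[iid]) | set(new[iid])
                       if old[iid].get(k) != new[iid].get(k)}
        if field_diffs:
            changed[iid] = field_diffs  # per-field (before, after) pairs
    return added, removed, changed
```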

Multi-Account Scanning

Scan multiple AWS accounts in a single invocation using STS AssumeRole.

Prerequisites

Deploy a read-only IAM role (gpuaudit-reader) to each target account. See Cross-Account Role Setup below.

Usage

# Scan specific accounts
gpuaudit scan --targets 111111111111,222222222222 --role gpuaudit-reader

# Scan entire AWS Organization
gpuaudit scan --org --role gpuaudit-reader

# Exclude management account
gpuaudit scan --org --role gpuaudit-reader --skip-self

# With external ID
gpuaudit scan --targets 111111111111 --role gpuaudit-reader --external-id my-secret

Cross-Account Role Setup

Terraform

variable "management_account_id" {
  description = "AWS account ID where gpuaudit runs"
  type        = string
}

resource "aws_iam_role" "gpuaudit_reader" {
  name = "gpuaudit-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "gpuaudit_reader" {
  name   = "gpuaudit-policy"
  role   = aws_iam_role.gpuaudit_reader.id
  policy = file("gpuaudit-policy.json")  # from: gpuaudit iam-policy > gpuaudit-policy.json
}

Deploy to all accounts using Terraform workspaces or CloudFormation StackSets.

CloudFormation StackSet

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ManagementAccountId:
    Type: String
Resources:
  GpuAuditRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: gpuaudit-reader
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${ManagementAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: gpuaudit-policy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeInstanceTypes
                  - ec2:DescribeRegions
                  - sagemaker:ListEndpoints
                  - sagemaker:DescribeEndpoint
                  - sagemaker:DescribeEndpointConfig
                  - eks:ListClusters
                  - eks:ListNodegroups
                  - eks:DescribeNodegroup
                  - cloudwatch:GetMetricData
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                  - ce:GetCostAndUsage
                  - ce:GetReservationUtilization
                  - ce:GetSavingsPlansUtilization
                  - pricing:GetProducts
                Resource: "*"

IAM permissions

gpuaudit is read-only. It never modifies your infrastructure. Generate the minimal IAM policy:

gpuaudit iam-policy

For Kubernetes scanning, gpuaudit needs get/list on nodes and pods cluster-wide.
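A minimal ClusterRole granting that access might look like the following. The role and binding names are illustrative, and the subject is an example — bind to whichever user or service account actually runs gpuaudit.

```yaml
# Illustrative RBAC for gpuaudit's Kubernetes scan: read-only get/list
# on nodes and pods, cluster-wide. Names and subject are examples.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpuaudit-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpuaudit-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpuaudit-reader
subjects:
  - kind: User
    name: gpuaudit   # example subject; replace with your user/SA
```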

GPU pricing reference

# List all GPU instance pricing
gpuaudit pricing

# Filter by GPU model
gpuaudit pricing --gpu H100
gpuaudit pricing --gpu L4

Output formats

Format     Flag                        Use case
Table      --format table (default)    Terminal viewing
JSON       --format json               Automation, CI/CD, gpuaudit diff
Markdown   --format markdown           PRs, wikis, docs
Slack      --format slack              Slack webhook integration
How it works

  1. Discovery — Scans EC2, SageMaker, EKS node groups, and Kubernetes API across multiple regions for GPU resources
  2. Metrics — Collects 7-day CloudWatch metrics: CPU, network I/O for EC2; GPU utilization, GPU memory, invocations for SageMaker
  3. K8s allocation — Lists pods requesting nvidia.com/gpu resources and maps them to nodes
  4. Analysis — Applies 7 waste detection rules with severity levels (critical/warning/info)
  5. Recommendations — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings

Project structure

gpuaudit/
├── cmd/gpuaudit/          CLI entry point (cobra)
├── internal/
│   ├── models/            Core data types (GPUInstance, WasteSignal, Recommendation)
│   ├── pricing/           Bundled GPU pricing database (40+ instance types)
│   ├── analysis/          Waste detection rules engine (7 rules)
│   ├── diff/              Scan comparison logic
│   ├── output/            Formatters (table, JSON, markdown, Slack, diff)
│   └── providers/
│       ├── aws/           EC2, SageMaker, EKS, CloudWatch, Cost Explorer
│       └── k8s/           Kubernetes API GPU node/pod discovery
└── LICENSE                Apache 2.0

Roadmap

  • DCGM GPU metrics via Kubernetes (actual GPU utilization, not just allocation)
  • SageMaker training job analysis
  • GCP + Azure support
  • GitHub Action for scheduled scans

License

Apache 2.0