All-in-one monitoring tool for GPU clusters without Kubernetes
Simplifies the installation, configuration, and operation of the Prometheus stack through a single CLI/UI, with GPU-specific diagnostic features
Setting up GPU cluster monitoring without K8s takes 2-3 days:
| Task | Traditional Approach | AAMI |
|---|---|---|
| Install Prometheus + Grafana + Alertmanager | Half day | Automatic |
| Deploy DCGM exporter | 2-3 hours | Automatic |
| Write alert rules | Half day (learning PromQL) | Presets provided |
| Slack/Email integration | 2-3 hours | CLI/UI configuration |
| Air-gap environment support | 1-2 days | Bundle provided |
| Total Time | 2-3 days | 30 minutes |
"Setting up Prometheus + Grafana + Alertmanager + DCGM on a GPU cluster takes 2-3 days.
With AAMI, it takes 30 minutes. Air-gap is also supported.
And when Xid 79 occurs, it tells you what it is, why it happened, and what to do."
# Online installation
curl -fsSL https://get.aami.dev | bash
aami init
# Air-gap installation
aami bundle create --output aami-offline.tar.gz # On internet-connected machine
aami init --offline ./aami-offline.tar.gz # On air-gapped machine
# Add node
aami nodes add gpu-node-01 --ip 192.168.1.101 --user root --key ~/.ssh/id_rsa
# Bulk add
aami nodes add --file hosts.txt
# List nodes
aami nodes list
┌──────────────┬───────────────┬──────┬────────┬─────────┐
│ Name         │ IP            │ GPUs │ Status │ Alerts  │
├──────────────┼───────────────┼──────┼────────┼─────────┤
│ gpu-node-01  │ 192.168.1.101 │ 8    │ ✅     │ 0       │
│ gpu-node-02  │ 192.168.1.102 │ 8    │ ⚠️     │ 1       │
└──────────────┴───────────────┴──────┴────────┴─────────┘
aami alerts apply-preset gpu-production
# → 8 alert rules applied instantly
| Alert | Condition | Severity |
|---|---|---|
| GPU Temperature Overheat | temp > 85°C for 5 minutes | Critical |
| GPU Memory Leak | memory > 95% AND util < 5% | Warning |
| ECC Error Threshold | ECC errors > 100/24h | Critical |
| NVLink Error | NVLink error count increase | Warning |
| Xid Error Detected | Xid error detected | Critical |
| Node Down | node_exporter not responding | Critical |
aami explain xid 79
┌─────────────────────────────────────────────────────────────────┐
│ Xid 79: GPU has fallen off the bus                              │
├─────────────────────────────────────────────────────────────────┤
│ Severity: Critical                                              │
│                                                                 │
│ Meaning:                                                        │
│   GPU disconnected from PCIe bus. System cannot communicate     │
│   with the GPU.                                                 │
│                                                                 │
│ Common Causes:                                                  │
│   1. PCIe slot contact failure                                  │
│   2. Unstable power supply                                      │
│   3. GPU hardware defect                                        │
│                                                                 │
│ Recommended Actions:                                            │
│   1. Immediately remove the node from workload                  │
│   2. Attempt GPU reseat (reinstallation)                        │
│   3. Consider GPU replacement if issue recurs                   │
└─────────────────────────────────────────────────────────────────┘
Configure alerts with clicks instead of YAML editing:
┌─────────────────────────────────────────────────────────────────┐
│ Alert Rules                                             [+ New] │
├─────────────────────────────────────────────────────────────────┤
│ GPU Temperature Critical                                        │
│   Condition: GPU temp > [85]°C for [5] minutes                  │
│   Severity: [Critical ▼]                                        │
├─────────────────────────────────────────────────────────────────┤
│ Notification Channels                                   [+ Add] │
├─────────────────────────────────────────────────────────────────┤
│ ✅ Slack: #gpu-alerts                             [Test] [Edit] │
│ ✅ Email: infra-team@company.com                  [Test] [Edit] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                          Control Node                           │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │  AAMI CLI   │  │   AAMI UI   │  │   SSH Executor (Go)     │  │
│  │             │  │    (Web)    │  │ - Parallel (100 conc.)  │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         └────────┬───────┘                      │               │
│                  ▼                              │               │
│  ┌─────────────────────────────────┐            │               │
│  │           config.yaml           │            │               │
│  │       (File-based, no DB)       │            │               │
│  └─────────────────────────────────┘            │               │
│                  │                              │               │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Prometheus + Alertmanager + Grafana (Container/Binary)  │   │
│   └─────────────────────────────────────────────────────────┘   │
└────────────────────────────────┬────────────────────────────────┘
                                 │ SSH (Install) / HTTP (Metrics)
     ┌───────────────────────────┼───────────────────────────┐
     ▼                           ▼                           ▼
┌──────────┐                ┌──────────┐               ┌──────────┐
│ GPU Node │                │ GPU Node │               │ GPU Node │
│    01    │                │    02    │      ...      │    N     │
├──────────┤                ├──────────┤               ├──────────┤
│• node    │                │• node    │               │• node    │
│  exporter│                │  exporter│               │  exporter│
│• dcgm    │                │• dcgm    │               │• dcgm    │
│  exporter│                │  exporter│               │  exporter│
└──────────┘                └──────────┘               └──────────┘
| Area | Choice | Reason |
|---|---|---|
| Node Access | SSH (Agentless) | Air-gap friendly, no additional agent required |
| Data Storage | YAML File | Simplified installation without DB, Git version control |
| GPU Metrics | DCGM Exporter | Official NVIDIA, detailed metrics |
- Control Node: Linux (Ubuntu 20.04+, RHEL 8+)
- GPU Nodes: SSH accessible, NVIDIA Driver 450.80+
- Optional: Docker or Podman (for container deployment)
# 1. Install AAMI
curl -fsSL https://get.aami.dev | bash
# 2. Initialize
aami init
# 3. Register nodes
cat << EOF > hosts.txt
gpu-node-01 192.168.1.101
gpu-node-02 192.168.1.102
gpu-node-03 192.168.1.103
EOF
aami nodes add --file hosts.txt --user root --key ~/.ssh/id_rsa
# 4. Apply alert preset
aami alerts apply-preset gpu-production
# 5. Configure notifications
aami config notifications slack --webhook https://hooks.slack.com/xxx
# 6. Check status
aami status
# Create bundle on internet-connected machine
aami bundle create --output aami-offline-v1.0.0.tar.gz
# Install on air-gapped machine
aami init --offline ./aami-offline-v1.0.0.tar.gz
# /etc/aami/config.yaml
cluster:
  name: gpu-cluster-prod
nodes:
  - name: gpu-node-01
    ip: 192.168.1.101
    ssh_user: root
    ssh_key: /root/.ssh/id_rsa
    labels:
      gpu_type: a100
alerts:
  presets:
    - gpu-production
notifications:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    channel: "#gpu-alerts"
prometheus:
  retention: 15d
  storage_path: /var/lib/aami/prometheus
| Feature | AAMI | kube-prometheus-stack | Ansible + Prometheus | Zabbix |
|---|---|---|---|---|
| K8s Required | Not required | Required | Not required | Not required |
| Installation Time | 30 min | 10 min (with K8s) | 2-3 days | Half day |
| Air-gap | Bundle provided | | | |
| GPU Native | DCGM included | Custom required | | |
| Xid Interpretation | Built-in | None | None | None |
| Operations CLI | Built-in | kubectl | ansible-playbook | None |
| Area | Technology |
|---|---|
| CLI | Go 1.21+ (single binary) |
| Monitoring | Prometheus, Grafana, Alertmanager |
| GPU Metrics | DCGM Exporter (NVIDIA), ROCm Exporter (AMD, planned) |
| Configuration Storage | YAML (No DB) |
| Node Communication | SSH (Agentless) |
| Large Scale | Prometheus Federation |
| Scheduler Integration | Slurm |
aami/
├── cmd/                  # Application entrypoints
├── internal/             # Core packages
│   ├── cli/              # CLI commands
│   ├── config/           # Configuration management
│   ├── ssh/              # SSH executor
│   ├── installer/        # Component installers
│   ├── xid/              # Xid error interpretation
│   ├── health/           # GPU health scoring
│   ├── nvlink/           # NVLink topology
│   ├── federation/       # Prometheus federation
│   ├── slurm/            # Slurm integration
│   ├── multicluster/     # Multi-cluster management
│   ├── backup/           # Backup & restore
│   └── upgrade/          # Upgrade management
├── configs/              # Default configuration templates
├── docs/                 # Documentation
├── examples/             # Examples
├── scripts/              # Installation/utility scripts
└── deploy/
    └── offline/          # Air-gap bundles
- One-click installation (`aami init`)
- Air-gap bundler (`aami bundle`)
- Node management CLI (`aami nodes`)
- Alert presets (`aami alerts`)
- Xid interpretation (`aami explain xid`)
- NVLink topology visualization
- GPU Health Score
- Upgrade/Backup
- Operations tools
- Prometheus Federation (1k+ nodes)
- Slurm integration (Job-GPU correlation)
- Multi-cluster management
- ROCm exporter integration
- AMD error code interpretation
- Unified alert rules for NVIDIA/AMD
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions