all-smi provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.
```bash
# Start API server on TCP port
all-smi api --port 9090

# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5

# Include process information
all-smi api --port 9090 --processes
```

Metrics are available at `http://localhost:9090/metrics`.
For local IPC scenarios, API mode supports Unix Domain Sockets:
```bash
# Use default socket path
all-smi api --socket
# Linux: /var/run/all-smi.sock (or /tmp/all-smi.sock)
# macOS: /tmp/all-smi.sock

# Use custom socket path
all-smi api --socket /custom/path/all-smi.sock

# TCP and Unix socket simultaneously
all-smi api --port 9090 --socket

# Unix socket only (disable TCP)
all-smi api --port 0 --socket
```

Access metrics via Unix socket:

```bash
curl --unix-socket /tmp/all-smi.sock http://localhost/metrics
```

```python
# Python example
import requests_unixsocket

session = requests_unixsocket.Session()
r = session.get('http+unix://%2Ftmp%2Fall-smi.sock/metrics')
```

Security: Socket permissions are set to 0600 (owner-only access).
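If installing a third-party package is not an option, the same request can be made with the Python standard library alone. The following is a minimal sketch, not part of all-smi: the `UnixHTTPConnection` class and `fetch_metrics` helper are hypothetical names, and the default socket path is assumed to be `/tmp/all-smi.sock`.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        # "localhost" only fills the Host header; it is never resolved.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        # Replace the default TCP connect with an AF_UNIX stream socket.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_metrics(socket_path: str = "/tmp/all-smi.sock") -> str:
    """Return the body of GET /metrics served on the given socket (assumed path)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/metrics")
        response = conn.getresponse()
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status}")
        return response.read().decode("utf-8")
    finally:
        conn.close()
```

Because the socket is created with owner-only permissions, the client has to run as the same user as the all-smi process.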
Alongside /metrics, API mode exposes two JSON endpoints that share the
same schema as the snapshot subcommand (schema: 1).
One-shot JSON payload reflecting the latest collection cycle. When the
last cached frame is older than 2 × interval (or no cycle has
completed yet), the handler forces a fresh collection so the response
never silently serves stale data.
```http
GET /snapshot

Content-Type: application/json
Cache-Control: no-store
X-Accel-Buffering: no

{
  "schema": 1,
  "timestamp": "2026-04-20T12:34:56Z",
  "hostname": "gpu-01",
  "gpus": [ … ],
  "cpus": [ … ],
  "memory": [ … ],
  "chassis": [ … ],
  "errors": []
}
```
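The staleness rule for `/snapshot` can be illustrated with a small sketch. It only restates the rule in code and is not the server's implementation; the function name and parameters are hypothetical.

```python
from typing import Optional


def needs_fresh_collection(last_frame_age_s: Optional[float], interval_s: float) -> bool:
    """Return True when the handler would have to collect before answering.

    last_frame_age_s is None when no collection cycle has completed yet.
    """
    if last_frame_age_s is None:
        return True
    # "Older than 2 x interval" is read here as a strict comparison (assumption).
    return last_frame_age_s > 2 * interval_s
```

With the default 3-second interval, a cached frame is served as long as it is at most 6 seconds old.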
Query parameters

| Parameter | Default | Meaning |
|---|---|---|
| `include` | `gpu,cpu,memory,chassis` | Comma-separated sections. Valid values: `gpu`, `cpu`, `memory`, `chassis`, `process`, `storage`. |
| `pretty` | `0` | Pretty-print the JSON body when `1` / `true` / `yes`. |
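As an illustration of the query grammar, the sketch below builds a `/snapshot` URL and rejects section names outside the documented set. The helper name `snapshot_url` is hypothetical, and client-side validation is an assumption; how the server treats unknown sections is not specified here.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

# Section names listed in the parameter table above.
VALID_SECTIONS = ("gpu", "cpu", "memory", "chassis", "process", "storage")


def snapshot_url(base_url: str,
                 include: Optional[Iterable[str]] = None,
                 pretty: bool = False) -> str:
    """Build a /snapshot URL with optional include and pretty parameters."""
    params = {}
    if include:
        sections = list(include)
        unknown = [s for s in sections if s not in VALID_SECTIONS]
        if unknown:
            raise ValueError(f"unknown sections: {unknown}")
        params["include"] = ",".join(sections)
    if pretty:
        params["pretty"] = "1"
    # Keep commas literal so the include list stays readable.
    query = urlencode(params, safe=",")
    url = base_url.rstrip("/") + "/snapshot"
    return f"{url}?{query}" if query else url
```

For example, `snapshot_url("http://localhost:9090", ["gpu", "process"], pretty=True)` yields a URL that requests only GPU and process data, pretty-printed.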
Server-Sent Events stream. Emits one JSON frame per collection cycle.
```http
GET /events
Accept: text/event-stream
```

```
event: snapshot
id: 2026-04-20T12:34:56Z
data: {"schema":1,"timestamp":"2026-04-20T12:34:56Z", …}

event: snapshot
id: 2026-04-20T12:34:57Z
data: { … }
```
Query parameters

| Parameter | Default | Meaning |
|---|---|---|
| `include` | `gpu,cpu,memory,chassis` | Same grammar as `/snapshot`. |
| `throttle` | = collection interval | Minimum seconds between emitted events. Clamped to >= the collection interval. |
| `heartbeat` | `30` | Keep-alive comment interval in seconds. |
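The `throttle` clamp amounts to taking the larger of the requested value and the collection interval. A tiny sketch of that rule, with a hypothetical function name:

```python
from typing import Optional


def effective_throttle(requested_s: Optional[float], interval_s: float) -> float:
    """Seconds between emitted events after applying the documented clamp."""
    if requested_s is None:
        # No throttle given: the default equals the collection interval.
        return interval_s
    # Requests below the collection interval are raised to it.
    return max(requested_s, interval_s)
```

So with a 3-second collection interval, `throttle=1` still produces at most one event every 3 seconds, while `throttle=10` produces one every 10 seconds.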
Keep-alive and lag

The server emits a bare SSE comment every `heartbeat` seconds so reverse
proxies do not idle-timeout the connection:

```
: keep-alive
```

When a client falls behind the server's broadcast buffer, the stream
inserts a synthetic `lag` event and continues with the freshest live
frame:

```
event: lag
data: {"dropped": 3}
```
Reconnect
Last-Event-ID is accepted but never triggers history replay — all-smi
does not retain historical frames. On reconnect the client resumes with
the next live frame regardless of the ID it sent.
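A client has to handle three kinds of input on this stream: `snapshot` events, `lag` events, and keep-alive comments. The sketch below is a minimal line-based parser for that subset of the SSE format. The `parse_sse` name is hypothetical, it assumes every `data:` payload is JSON, and it does not implement reconnection.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per dispatched event from an iterable of text lines."""
    event = "message"               # SSE default event name
    event_id: Optional[str] = None  # carries over between events, like Last-Event-ID
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any data was seen.
            if data:
                yield {"event": event, "id": event_id,
                       "data": json.loads("\n".join(data))}
            event, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, e.g. ": keep-alive"; nothing to dispatch.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "id":
            event_id = value
        elif field == "data":
            data.append(value)
```

A consumer would typically switch on `event`: apply `snapshot` payloads, and treat a `lag` payload's `dropped` count as a signal that intermediate frames were skipped.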
Reverse-proxy guidance

The response advertises `X-Accel-Buffering: no` and `Cache-Control: no-store`. nginx operators should additionally set `proxy_buffering off;` on the `location` block. haproxy users should turn off `option http-buffer-request` for the SSE route.
Example clients

- `examples/sse_client.html`: browser `EventSource` demo.
- `curl -N http://localhost:9090/events`: raw inspection.

| Env var | Default | Effect |
|---|---|---|
| `ALL_SMI_API_CORS_ALLOWED_ORIGINS` | (empty → no CORS) | Comma-separated list of origins permitted to read `/metrics`, `/snapshot`, and `/events` cross-origin. Set to `*` to revert to the pre-0.21 wildcard; a warning is logged in that case. |
| `ALL_SMI_API_MAX_SSE_SUBSCRIBERS` | `256` | Cap on concurrent `/events` subscribers. Extra clients get `503 Service Unavailable` with `Retry-After: 5`. Set to `0` to disable the cap. |
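Clients that may hit the subscriber cap should honor the `Retry-After` header before reconnecting. The following sketch shows one way to derive the wait time from a response; the function name and the 5-second fallback are assumptions chosen to match the value quoted above.

```python
from typing import Mapping, Optional


def retry_delay_seconds(status: int,
                        headers: Mapping[str, str],
                        default: float = 5.0) -> Optional[float]:
    """Return how long to wait before reconnecting, or None if no retry is needed."""
    if status != 503:
        return None
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                # Retry-After may also be an HTTP date; fall back to the default.
                return default
    return default
```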
The default CORS posture blocks cross-origin browser reads of the live
telemetry stream. That includes process command lines (when
--processes is enabled), usernames, GPU utilization, and power
consumption — keep the allowlist narrow.
Process label truncation. command, process_name, and user
fields are truncated at 256 / 128 / 128 bytes respectively on all
output surfaces (Prometheus /metrics, JSON /snapshot, and SSE
/events). Longer strings are emitted with a trailing
...(N bytes truncated) marker. This caps scrape-response
amplification and limits the blast radius of secrets that happen to
live in argv.
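The truncation behavior can be approximated as follows. This is a sketch under stated assumptions rather than the exporter's code: it assumes the limit applies to the kept prefix, that the cut never splits a UTF-8 character, and that `N` counts the bytes removed. The `truncate_label` name and `LIMITS` table are hypothetical.

```python
# Byte limits per field, as documented above.
LIMITS = {"command": 256, "process_name": 128, "user": 128}


def truncate_label(value: str, max_bytes: int) -> str:
    """Truncate value to at most max_bytes of UTF-8 and append a marker."""
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    kept = raw[:max_bytes].decode("utf-8", errors="ignore")
    dropped = len(raw) - len(kept.encode("utf-8"))
    return f"{kept}...({dropped} bytes truncated)"
```

Parsers that consume these labels should therefore treat a trailing `...(N bytes truncated)` suffix as metadata, not as part of the original command line.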
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | GPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | GPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | GPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | GPU temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | GPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | GPU frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | GPU device information | info | gpu_index, gpu_name, driver_version |
The all_smi_gpu_info metric includes standardized labels for AI acceleration libraries across all GPU/accelerator platforms. These unified labels allow platform-agnostic queries and dashboards:
| Label | Description | Example Values |
|---|---|---|
| `lib_name` | Name of the AI acceleration library | CUDA, ROCm, Metal |
| `lib_version` | Version of the AI acceleration library | 13.0, 7.0.2, Metal 3 |

| Platform | lib_name | lib_version Source | Platform-Specific Label |
|---|---|---|---|
| NVIDIA GPU | `CUDA` | CUDA version | cuda_version |
| AMD GPU | `ROCm` | ROCm version | rocm_version |
| NVIDIA Jetson | `CUDA` | CUDA version | cuda_version |
| Apple Silicon | `Metal` | Metal version | N/A |
Note: Platform-specific labels (e.g., cuda_version, rocm_version) are maintained for backward compatibility with existing queries and dashboards.
```promql
# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)

# Get all CUDA devices with version 12 or higher
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}

# Alert on outdated ROCm versions (< 7.0)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"[7-9].*"} == 1

# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)

# Find all devices using Metal (Apple Silicon)
all_smi_gpu_info{lib_name="Metal"}

# Monitor library version consistency across cluster
count by (lib_name, lib_version) (all_smi_gpu_info) > 1
```
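Prometheus anchors label regexes at both ends, so the version patterns above must match the whole `lib_version` string. The sketch below approximates that behavior with Python's `re.fullmatch` to check which versions each pattern accepts. Prometheus uses RE2 rather than Python's engine, but these simple patterns should behave the same in both; the helper name is hypothetical.

```python
import re

# Patterns copied from the queries above.
CUDA_12_PLUS = r"1[2-9].*|[2-9][0-9].*"
ROCM_7_TO_9 = r"[7-9].*"


def label_matches(pattern: str, value: str) -> bool:
    """Approximate a Prometheus =~ matcher, which is fully anchored."""
    return re.fullmatch(pattern, value) is not None
```

Note that `ROCM_7_TO_9` does not match a two-digit major version such as `10.0`, so the `!~` alert would treat such a release as outdated; the pattern would need extending if that ever applies.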
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_pcie_gen_current` | Current PCIe generation | - | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_pcie_width_current` | Current PCIe link width | - | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_performance_state` | GPU performance state (P0=0 … P15=15; omitted when not reported) | - | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_temperature_threshold_slowdown_celsius` | Slowdown temperature threshold | celsius | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_temperature_threshold_shutdown_celsius` | Shutdown temperature threshold | celsius | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_temperature_threshold_max_operating_celsius` | Maximum operating temperature threshold | celsius | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_temperature_threshold_acoustic_celsius` | Acoustic (fan-noise) temperature threshold | celsius | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_clock_graphics_max_mhz` | Maximum graphics clock | MHz | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_clock_memory_max_mhz` | Maximum memory clock | MHz | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_power_limit_current_watts` | Current power limit | watts | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_power_limit_max_watts` | Maximum power limit | watts | gpu, instance, gpu_uuid, gpu_index |
Notes:

- Threshold metrics (`temperature_threshold_*`) and `performance_state` are NVIDIA-only. Each metric is emitted only when the driver exposes the value; hosts where the driver does not report a given threshold simply omit that metric line.
- `performance_state` maps NVML `PerformanceState` variants: P0 (maximum performance) = 0 through P15 = 15. The `Unknown` sentinel is suppressed (`None`) rather than emitted.
- The acoustic threshold is available on newer drivers and some GPU SKUs; older drivers leave it absent.
Extended NVIDIA hardware detail metrics (NUMA topology, GSP firmware, NvLink topology, and GPU Performance Monitoring). All metrics in this group share the same four-label base set as other NVIDIA-specific metrics (gpu, instance, gpu_uuid, gpu_index) and are omitted entirely for non-NVIDIA devices, on older drivers that do not expose the underlying NVML APIs, or when the field value is unavailable.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_numa_node_id` | NUMA node the GPU is attached to; omitted when the host has no NUMA topology or the driver does not report one | gauge | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_gsp_firmware_mode` | GSP firmware mode: 0=disabled, 1=enabled, 2=default; omitted on pre-R525 drivers or non-datacenter SKUs | gauge | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_gsp_firmware_version_info` | Info-style metric (value always 1) carrying the GSP firmware version string in a `version` label | gauge | gpu, instance, gpu_uuid, gpu_index, version |
| `all_smi_nvlink_remote_device_type` | Info-style metric (value always 1) per active NvLink; classification in `remote_type` label | gauge | gpu, instance, gpu_uuid, gpu_index, link_index, remote_type |
| `all_smi_gpu_sm_occupancy` | GPM-reported SM occupancy fraction (0.0–1.0); omitted on pre-Hopper GPUs or when GPM has not yet sampled | gauge | gpu, instance, gpu_uuid, gpu_index |
| `all_smi_gpu_memory_bandwidth_utilization` | GPM-reported DRAM bandwidth utilization fraction (0.0–1.0); omitted when GPM is unsupported or unsampled | gauge | gpu, instance, gpu_uuid, gpu_index |
Label values for `all_smi_nvlink_remote_device_type`:

| Label | Values | Description |
|---|---|---|
| `link_index` | `"0"`, `"1"`, … | NvLink port index on the GPU |
| `remote_type` | `"gpu"`, `"switch"`, `"ibmnpu"`, `"unknown"` | Classification of the remote endpoint |
Notes:

- `all_smi_gpu_numa_node_id`: NVML reports `-1` for GPUs without a NUMA attachment; this value is canonicalised to `None` and the metric is omitted rather than emitting a negative number.
- `all_smi_gpu_gsp_firmware_version_info`: the `version` label carries a string such as `"550.54.15"`. Because the version is static for the lifetime of the driver, it is cached after the first successful NVML call.
- `all_smi_nvlink_remote_device_type`: one metric row is emitted per active NvLink. A GPU with no active links produces no rows for this metric family.
- `all_smi_gpu_sm_occupancy` and `all_smi_gpu_memory_bandwidth_utilization`: GPM requires a two-sample handshake before values are available. Until the handshake completes the exporter holds `None` for both fields and emits nothing, preventing spurious zero readings. These metrics are currently plumbing only: values are populated on Hopper and later hardware when the GPM handshake succeeds.
- To simulate the full set of extended hardware detail metrics (including the thermal thresholds and `performance_state` listed in the NVIDIA GPU Specific Metrics table above) in development/testing without a modern NVIDIA driver, set `ALL_SMI_MOCK_HARDWARE_DETAILS=1` when running with the `mock` feature. When unset, the mock omits these families to simulate an older driver that does not expose the underlying NVML APIs.
NVIDIA vGPU metrics are emitted only on hosts with vGPU SR-IOV enabled. Non-vGPU hosts produce no output for these metric families.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_vgpu_host_mode` | vGPU host mode (0=NonSriov, 1=Sriov, 2=Disabled) | gauge | gpu_index, gpu_uuid, gpu, instance, host, host_mode |
| `all_smi_vgpu_scheduler_state` | vGPU scheduler ARR mode (0=unsupported, 1=off, 2=ARR) | gauge | gpu_index, gpu_uuid, gpu, instance, host, arr_supported |
| `all_smi_vgpu_scheduler_policy` | vGPU scheduler policy id | gauge | gpu_index, gpu_uuid, gpu, instance, host |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_vgpu_utilization` | Per-vGPU GPU utilization percentage (0-100) | percent | gpu_index, gpu_uuid, gpu, instance, host, vgpu_id, vgpu_uuid, vgpu_type, vgpu_vm_id |
| `all_smi_vgpu_memory_utilization` | Per-vGPU memory bandwidth utilization percentage (0-100) | percent | gpu_index, gpu_uuid, gpu, instance, host, vgpu_id, vgpu_uuid, vgpu_type, vgpu_vm_id |
| `all_smi_vgpu_memory_used_bytes` | Per-vGPU framebuffer memory used | bytes | gpu_index, gpu_uuid, gpu, instance, host, vgpu_id, vgpu_uuid, vgpu_type, vgpu_vm_id |
| `all_smi_vgpu_memory_total_bytes` | Per-vGPU framebuffer memory budget | bytes | gpu_index, gpu_uuid, gpu, instance, host, vgpu_id, vgpu_uuid, vgpu_type, vgpu_vm_id |
| `all_smi_vgpu_active` | Per-vGPU liveness (1=accounting PID active, 0=idle) | gauge | gpu_index, gpu_uuid, gpu, instance, host, vgpu_id, vgpu_uuid, vgpu_type, vgpu_vm_id |
Notes:

- `vgpu_utilization` is only emitted when NVML accounting data is available for the vGPU instance.
- `vgpu_memory_utilization` is only emitted when NVML reports memory bandwidth usage.
- The `vgpu_vm_id` label carries the owning VM identifier so remote scrapers can reconstruct the same VM column shown in the TUI.
- To simulate vGPU responses in development/testing without real vGPU hardware, set `ALL_SMI_MOCK_VGPU=1` when running with the `mock` feature.
NVIDIA MIG (Multi-Instance GPU) metrics are emitted only on hosts where at least one GPU has MIG mode enabled or has active MIG instances. Non-MIG hosts produce no output for these metric families.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_mig_mode` | MIG mode per parent GPU (1=enabled, 0=disabled) | gauge | gpu_index, gpu_uuid, gpu, instance, host |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_mig_instance_utilization_gpu` | Per-MIG-instance GPU SM utilization percentage (0-100) | percent | gpu_index, gpu_uuid, gpu, instance, host, mig_instance, mig_uuid, mig_profile, gpu_instance_id, compute_instance_id |
| `all_smi_mig_instance_utilization_memory` | Per-MIG-instance memory bandwidth utilization (0-100) | percent | gpu_index, gpu_uuid, gpu, instance, host, mig_instance, mig_uuid, mig_profile, gpu_instance_id, compute_instance_id |
| `all_smi_mig_instance_memory_used_bytes` | Per-MIG-instance framebuffer memory used | bytes | gpu_index, gpu_uuid, gpu, instance, host, mig_instance, mig_uuid, mig_profile, gpu_instance_id, compute_instance_id |
| `all_smi_mig_instance_memory_total_bytes` | Per-MIG-instance framebuffer memory total carve-out | bytes | gpu_index, gpu_uuid, gpu, instance, host, mig_instance, mig_uuid, mig_profile, gpu_instance_id, compute_instance_id |
Label descriptions for per-instance metrics:

| Label | Description |
|---|---|
| `gpu_index` | Physical GPU index on the host |
| `gpu_uuid` | UUID of the parent physical GPU |
| `gpu` | Name of the parent GPU model |
| `instance` | Prometheus instance label (hostname or host:port) |
| `host` | Hostname of the server |
| `mig_instance` | MIG instance ordinal index within the parent GPU |
| `mig_uuid` | UUID of the MIG compute instance (assigned by NVML) |
| `mig_profile` | MIG profile name (e.g. 1g.5gb, 3g.20gb, 7g.40gb) |
| `gpu_instance_id` | NVML GPU instance ID; empty string when not reported by the driver |
| `compute_instance_id` | NVML compute instance ID; empty string when not reported by the driver |
Notes:

- `all_smi_mig_instance_utilization_gpu` and `all_smi_mig_instance_utilization_memory` are emitted only when NVML reports utilization for that instance; the metric line is absent when the value is unavailable.
- `gpu_instance_id` and `compute_instance_id` are emitted as empty string labels when NVML cannot report them, preserving round-trip fidelity in the remote parser.
- MIG instances appear as nested rows under their parent GPU in the TUI, matched by `gpu_uuid` with a hostname+GPU-name fallback.
- `all_smi_gpu_mig_mode` is emitted for every GPU that supports MIG (enabled or disabled), so consumers can detect mode transitions even when no instances are active.
- To simulate MIG responses in development/testing without MIG hardware, set `ALL_SMI_MOCK_MIG=1` when running with the `mock` feature.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_dla_utilization` | DLA (Deep Learning Accelerator) utilization | percent | gpu_index, gpu_name |
AMD GPUs (Radeon and Instinct series) provide comprehensive monitoring through ROCm and the DRM subsystem:
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_fan_speed_rpm` | GPU fan speed | RPM | gpu_index, gpu_name |
| `all_smi_amd_rocm_version` | AMD ROCm version installed | info | instance, version |
| `all_smi_gpu_memory_gtt_bytes` | GTT (GPU Translation Table) memory usage | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_vram_bytes` | VRAM (Video RAM) usage | bytes | gpu_index, gpu_name |
Additional Details Available (in all_smi_gpu_info labels):
- Driver Version: AMDGPU kernel driver version (e.g., "30.10.1")
- ROCm Version: ROCm software stack version (e.g., "7.0.2")
- PCIe Information: Current link generation and width, max GPU/system link capabilities
- VBIOS: Version and date information
- Power Management: Current, minimum, and maximum power cap values
- ASIC Information: Device ID, revision ID, ASIC name
- Memory Clock: Current memory clock frequency
Process Tracking:

- AMD GPU process detection uses `fdinfo` from `/proc/<pid>/fdinfo/` for accurate memory tracking
- Tracks both VRAM and GTT memory usage per process
- Available with `--processes` flag in API mode

Platform Requirements:

- Requires ROCm drivers and `libamdgpu_top` library
- Requires sudo access to `/dev/dri` devices or user in `video`/`render` groups
- Only available in glibc builds (not musl static builds)
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_ane_utilization` | ANE utilization | mW | gpu_index, gpu_name |
| `all_smi_ane_power_watts` | ANE power consumption | watts | gpu_index, gpu_name |
| `all_smi_thermal_pressure_info` | Thermal pressure level | info | gpu_index, gpu_name, level |
Note: For Apple Silicon (M1/M2/M3/M4), gpu_temperature_celsius is not available; thermal pressure level is provided instead.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | NPU ASIC temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | NPU AI clock frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | NPU device information | info | gpu_index, gpu_name, driver_version |
| `all_smi_npu_firmware_info` | NPU firmware version | info | npu, instance, npu_uuid, npu_index, firmware |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_tenstorrent_board_info` | Board and architecture information | info | npu, instance, npu_uuid, npu_index, board_type, board_id, architecture |
| `all_smi_tenstorrent_collection_method_info` | Data collection method used | info | npu, instance, npu_uuid, npu_index, method |
| Firmware Versions | | | |
| `all_smi_tenstorrent_arc_firmware_info` | ARC firmware version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tenstorrent_eth_firmware_info` | Ethernet firmware version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tenstorrent_ddr_firmware_info` | DDR firmware version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tenstorrent_spibootrom_firmware_info` | SPI Boot ROM firmware version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tenstorrent_firmware_date_info` | Firmware build date | info | npu, instance, npu_uuid, npu_index, date |
| Temperature Sensors | | | |
| `all_smi_tenstorrent_asic_temperature_celsius` | ASIC temperature | celsius | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_vreg_temperature_celsius` | Voltage regulator temperature | celsius | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_inlet_temperature_celsius` | Inlet temperature | celsius | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_outlet1_temperature_celsius` | Outlet 1 temperature | celsius | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_outlet2_temperature_celsius` | Outlet 2 temperature | celsius | npu, instance, npu_uuid, npu_index |
| Clock Frequencies | | | |
| `all_smi_tenstorrent_aiclk_mhz` | AI clock frequency | MHz | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_axiclk_mhz` | AXI clock frequency | MHz | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_arcclk_mhz` | ARC clock frequency | MHz | npu, instance, npu_uuid, npu_index |
| Power and Electrical | | | |
| `all_smi_tenstorrent_voltage_volts` | Core voltage | volts | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_current_amperes` | Current draw | amperes | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_power_raw_watts` | Raw power consumption | watts | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_tdp_limit_watts` | TDP limit | watts | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_tdc_limit_amperes` | TDC limit | amperes | npu, instance, npu_uuid, npu_index |
| Status and Health | | | |
| `all_smi_tenstorrent_heartbeat` | Device heartbeat counter | counter | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_arc0_health` | ARC0 health counter | counter | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_arc3_health` | ARC3 health counter | counter | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_faults` | Fault register value | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_throttler` | Throttler state register | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_pcie_status_info` | PCIe status register | info | npu, instance, npu_uuid, npu_index, status |
| `all_smi_tenstorrent_eth_status_info` | Ethernet status register | info | npu, instance, npu_uuid, npu_index, port, status |
| `all_smi_tenstorrent_ddr_status` | DDR status register | gauge | npu, instance, npu_uuid, npu_index |
| Fan Metrics | | | |
| `all_smi_tenstorrent_fan_speed_percent` | Fan speed percentage | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_fan_rpm` | Fan speed in RPM | gauge | npu, instance, npu_uuid, npu_index |
| PCIe Information | | | |
| `all_smi_tenstorrent_pcie_generation` | PCIe generation | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_pcie_width` | PCIe link width | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tenstorrent_pcie_address_info` | PCIe address | info | npu, instance, npu_uuid, npu_index, address |
| `all_smi_tenstorrent_pcie_device_info` | PCIe device identification | info | npu, instance, npu_uuid, npu_index, vendor_id, device_id |
| DRAM Information | | | |
| `all_smi_tenstorrent_dram_info` | DRAM configuration | info | npu, instance, npu_uuid, npu_index, speed |
Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | NPU device information | info | gpu_index, gpu_name, driver_version |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_rebellions_device_info` | Device model and variant information | info | npu, instance, npu_uuid, npu_index, model, variant |
| `all_smi_rebellions_firmware_info` | NPU firmware version | info | npu, instance, npu_uuid, npu_index, firmware_version |
| `all_smi_rebellions_kmd_info` | Kernel Mode Driver version | info | npu, instance, npu_uuid, npu_index, kmd_version |
| `all_smi_rebellions_device_status` | Device operational status | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_rebellions_performance_state` | NPU performance state (P0-P15) | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_rebellions_pcie_generation` | PCIe generation (Gen4) | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_rebellions_pcie_width` | PCIe link width (x16) | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_rebellions_memory_bandwidth_gbps` | Memory bandwidth capacity | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_rebellions_compute_tops` | Compute capacity in TOPS | gauge | npu, instance, npu_uuid, npu_index |
Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | NPU device information | info | gpu_index, gpu_name, driver_version |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_furiosa_device_info` | Device architecture and model info | info | npu, instance, npu_uuid, npu_index, architecture, model |
| `all_smi_furiosa_firmware_info` | NPU firmware version | info | npu, instance, npu_uuid, npu_index, firmware_version |
| `all_smi_furiosa_pert_info` | PERT (runtime) version | info | npu, instance, npu_uuid, npu_index, pert_version |
| `all_smi_furiosa_liveness_status` | Device liveness status | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_furiosa_core_count` | Number of cores in NPU | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_furiosa_core_status` | Core availability status | gauge | npu, instance, npu_uuid, npu_index, core |
| `all_smi_furiosa_pe_utilization` | Processing Element utilization | percent | npu, instance, npu_uuid, npu_index, core |
| `all_smi_furiosa_core_frequency_mhz` | Per-core frequency | MHz | npu, instance, npu_uuid, npu_index, core |
| `all_smi_furiosa_power_governor_info` | Power governor mode | info | npu, instance, npu_uuid, npu_index, governor |
| `all_smi_furiosa_error_count` | Cumulative error count | counter | npu, instance, npu_uuid, npu_index |
| `all_smi_furiosa_pcie_generation` | PCIe generation | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_furiosa_pcie_width` | PCIe link width | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_furiosa_memory_bandwidth_utilization` | Memory bandwidth utilization | percent | npu, instance, npu_uuid, npu_index |
Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | NPU device information | info | gpu_index, gpu_name, driver_version |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gaudi_device_info` | Device model and information | info | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_internal_name_info` | Internal device name (e.g., HL-325L) | info | npu, instance, npu_uuid, npu_index, internal_name |
| `all_smi_gaudi_driver_info` | Habana driver version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_gaudi_aip_utilization_percent` | AIP (AI Processor) utilization | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_memory_used_bytes` | HBM memory used | bytes | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_memory_total_bytes` | HBM total memory | bytes | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_memory_utilization_percent` | HBM memory utilization percentage | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_power_draw_watts` | Current power consumption | watts | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_power_max_watts` | Maximum power limit | watts | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_power_utilization_percent` | Power utilization percentage | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_gaudi_temperature_celsius` | AIP temperature | celsius | npu, instance, npu_uuid, npu_index |
Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the hl-smi command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | TPU utilization percentage | percent | gpu_index, gpu_name |
| `all_smi_gpu_memory_used_bytes` | TPU memory used | bytes | gpu_index, gpu_name |
| `all_smi_gpu_memory_total_bytes` | TPU memory total | bytes | gpu_index, gpu_name |
| `all_smi_gpu_temperature_celsius` | TPU temperature | celsius | gpu_index, gpu_name |
| `all_smi_gpu_power_consumption_watts` | TPU power consumption | watts | gpu_index, gpu_name |
| `all_smi_gpu_frequency_mhz` | TPU clock frequency | MHz | gpu_index, gpu_name |
| `all_smi_gpu_info` | TPU device information | info | gpu_index, gpu_name, driver_version |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_tpu_utilization_percent` | TPU duty cycle utilization | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_memory_used_bytes` | TPU HBM memory used | bytes | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_memory_total_bytes` | TPU HBM memory total | bytes | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_memory_utilization_percent` | TPU HBM memory utilization percentage | percent | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_chip_version_info` | TPU chip version information | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tpu_accelerator_type_info` | TPU accelerator type information | info | npu, instance, npu_uuid, npu_index, type |
| `all_smi_tpu_core_count` | Number of TPU cores | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_tensorcore_count` | Number of TensorCores per chip | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_memory_type_info` | TPU memory type (HBM2/HBM2e/HBM3e) | info | npu, instance, npu_uuid, npu_index, type |
| `all_smi_tpu_runtime_version_info` | TPU runtime/library version | info | npu, instance, npu_uuid, npu_index, version |
| `all_smi_tpu_power_max_watts` | TPU maximum power limit | watts | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_queue_size` | Number of pending HLO programs | gauge | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_exec_mean_microseconds` | HLO execution timing (mean) | µs | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_exec_p50_microseconds` | HLO execution timing (P50) | µs | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_exec_p90_microseconds` | HLO execution timing (P90) | µs | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_exec_p95_microseconds` | HLO execution timing (P95) | µs | npu, instance, npu_uuid, npu_index |
| `all_smi_tpu_hlo_exec_p999_microseconds` | HLO execution timing (P99.9) | µs | npu, instance, npu_uuid, npu_index |
Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the tpu-info command-line tool running in streaming mode. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.
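
As an illustration of how the TPU memory metrics relate to each other, the sketch below parses a small excerpt of Prometheus text output and derives per-chip HBM utilization from the used and total byte gauges. The sample lines, label values, and helper names (`parse_metrics`, `hbm_utilization`) are assumptions made for this example and are not part of all-smi; the parser is deliberately minimal and is not a full implementation of the exposition format.

```python
import re

# Hypothetical excerpt of /metrics output; label values are invented for illustration.
SAMPLE = """\
# HELP all_smi_tpu_memory_used_bytes TPU HBM memory used
# TYPE all_smi_tpu_memory_used_bytes gauge
all_smi_tpu_memory_used_bytes{npu="TPU",instance="node-0",npu_uuid="tpu-0",npu_index="0"} 8589934592
all_smi_tpu_memory_total_bytes{npu="TPU",instance="node-0",npu_uuid="tpu-0",npu_index="0"} 17179869184
all_smi_tpu_memory_used_bytes{npu="TPU",instance="node-0",npu_uuid="tpu-1",npu_index="1"} 4294967296
all_smi_tpu_memory_total_bytes{npu="TPU",instance="node-0",npu_uuid="tpu-1",npu_index="1"} 17179869184
"""

# name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this minimal parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def hbm_utilization(text):
    """Return {npu_index: percent used}, computed from the used/total byte gauges."""
    used, total = {}, {}
    for name, labels, value in parse_metrics(text):
        key = labels.get("npu_index")
        if name == "all_smi_tpu_memory_used_bytes":
            used[key] = value
        elif name == "all_smi_tpu_memory_total_bytes":
            total[key] = value
    # Skip chips whose total is missing or zero to avoid division errors.
    return {k: 100.0 * used[k] / total[k] for k in used if total.get(k)}
```

The result should line up with `all_smi_tpu_memory_utilization_percent` when that metric is present; deriving it client-side is mainly useful as a cross-check.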
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_cpu_utilization` | CPU utilization percentage | percent | - |
| `all_smi_cpu_socket_count` | Number of CPU sockets | count | - |
| `all_smi_cpu_core_count` | Total number of CPU cores | count | - |
| `all_smi_cpu_thread_count` | Total number of CPU threads | count | - |
| `all_smi_cpu_frequency_mhz` | CPU frequency | MHz | - |
| `all_smi_cpu_temperature_celsius` | CPU temperature | celsius | - |
| `all_smi_cpu_power_consumption_watts` | CPU power consumption | watts | - |
| `all_smi_cpu_socket_utilization` | Per-socket CPU utilization | percent | socket |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_cpu_p_core_count` | Number of performance cores | count | - |
| `all_smi_cpu_e_core_count` | Number of efficiency cores | count | - |
| `all_smi_cpu_gpu_core_count` | Number of integrated GPU cores | count | - |
| `all_smi_cpu_p_core_utilization` | P-core utilization percentage | percent | - |
| `all_smi_cpu_e_core_utilization` | E-core utilization percentage | percent | - |
| `all_smi_cpu_p_cluster_frequency_mhz` | P-cluster frequency | MHz | - |
| `all_smi_cpu_e_cluster_frequency_mhz` | E-cluster frequency | MHz | - |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_memory_total_bytes` | Total system memory | bytes | - |
| `all_smi_memory_used_bytes` | Used system memory | bytes | - |
| `all_smi_memory_available_bytes` | Available system memory | bytes | - |
| `all_smi_memory_free_bytes` | Free system memory | bytes | - |
| `all_smi_memory_utilization` | Memory utilization percentage | percent | - |
| `all_smi_swap_total_bytes` | Total swap space | bytes | - |
| `all_smi_swap_used_bytes` | Used swap space | bytes | - |
| `all_smi_swap_free_bytes` | Free swap space | bytes | - |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_memory_buffers_bytes` | Memory used for buffers | bytes | - |
| `all_smi_memory_cached_bytes` | Memory used for cache | bytes | - |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_disk_total_bytes` | Total disk space | bytes | mount_point |
| `all_smi_disk_available_bytes` | Available disk space | bytes | mount_point |

Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.
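
Only total and available bytes are listed for storage, so a "percent used" figure has to be derived by the consumer. A minimal sketch of that arithmetic follows; the function name `disk_used_percent` is hypothetical and the inputs are assumed to be the values of `all_smi_disk_total_bytes` and `all_smi_disk_available_bytes` for the same `mount_point`.

```python
def disk_used_percent(total_bytes, available_bytes):
    """Return the share of a filesystem that is not available, in percent.

    Returns None when the total is zero or negative, since no meaningful
    ratio can be computed in that case.
    """
    if total_bytes <= 0:
        return None
    used_bytes = total_bytes - available_bytes
    return 100.0 * used_bytes / total_bytes
```

The equivalent PromQL would be `(1 - all_smi_disk_available_bytes / all_smi_disk_total_bytes) * 100`.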
Chassis metrics provide visibility into system-wide power consumption, thermal conditions, and cooling status at the node level. These metrics aggregate information from CPU, GPU, ANE, and BMC sensors.
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_power_watts` | Total chassis power consumption (CPU+GPU+ANE) | watts | hostname, instance |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_thermal_pressure_info` | Thermal pressure level | info | hostname, instance, level |
| `all_smi_chassis_cpu_power_watts` | CPU power consumption | watts | hostname, instance |
| `all_smi_chassis_gpu_power_watts` | GPU power consumption | watts | hostname, instance |
| `all_smi_chassis_ane_power_watts` | ANE (Apple Neural Engine) power | watts | hostname, instance |
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_inlet_temperature_celsius` | Chassis inlet temperature | celsius | hostname, instance |
| `all_smi_chassis_outlet_temperature_celsius` | Chassis outlet temperature | celsius | hostname, instance |
| `all_smi_chassis_fan_speed_rpm` | Fan speed | RPM | hostname, instance, fan_id, fan_name |

Note: Chassis metrics provide a unified view of node-level power consumption and thermal conditions, useful for cluster-wide capacity planning and power monitoring.
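
To make the capacity-planning use case concrete, here is a small sketch that totals per-node chassis power and reports the remaining headroom against a budget. The function name, the rack budget, and the host names are hypothetical; the input is assumed to be a mapping from `hostname` to the current value of `all_smi_chassis_power_watts`, and the 3000 W per-node limit simply mirrors the threshold used in this document's example queries.

```python
def summarize_chassis_power(node_watts, rack_budget_watts, node_limit_watts=3000.0):
    """Summarize node-level power draw against a (hypothetical) rack budget.

    node_watts: dict mapping hostname -> chassis power in watts.
    Returns the total draw, the remaining headroom, and the sorted list of
    hosts drawing more than node_limit_watts.
    """
    total = sum(node_watts.values())
    return {
        "total_watts": total,
        "headroom_watts": rack_budget_watts - total,
        "hot_nodes": sorted(
            host for host, watts in node_watts.items() if watts > node_limit_watts
        ),
    }
```

A negative `headroom_watts` indicates that the combined draw exceeds the budget passed in.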
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_runtime_environment` | Current runtime environment (container or VM) | gauge | hostname, environment |
| `all_smi_container_runtime_info` | Container runtime environment information | gauge | hostname, runtime, container_id |
| `all_smi_kubernetes_pod_info` | Kubernetes pod information (K8s only) | gauge | hostname, pod_name, namespace |
| `all_smi_virtualization_info` | Virtualization environment information | gauge | hostname, vm_type, hypervisor |

Runtime environment metrics are detected at startup and provide information about the execution context:

- Container environments: Docker, Kubernetes, Podman, containerd, LXC, CRI-O, Backend.AI
- Virtualization platforms: VMware, VirtualBox, KVM, QEMU, Hyper-V, Xen, AWS EC2, Google Cloud, Azure, DigitalOcean, Parallels
| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_process_memory_bytes` | GPU memory used by process | bytes | gpu_index, gpu_name, pid, process_name, user |
| `all_smi_gpu_process_sm_util` | Process GPU SM utilization | percent | gpu_index, gpu_name, pid, process_name, user |
| `all_smi_gpu_process_mem_util` | Process GPU memory utilization | percent | gpu_index, gpu_name, pid, process_name, user |
| `all_smi_gpu_process_enc_util` | Process GPU encoder utilization | percent | gpu_index, gpu_name, pid, process_name, user |
| `all_smi_gpu_process_dec_util` | Process GPU decoder utilization | percent | gpu_index, gpu_name, pid, process_name, user |
| Platform | GPU Metrics | CPU Metrics | Memory Metrics | Process Metrics |
|---|---|---|---|---|
| Linux + NVIDIA | ✓ Full | ✓ Full | ✓ Full | ✓ Full |
| Linux + Intel Gaudi | ✓ Full | ✓ Full | ✓ Full | ✗ N/A******* |
| Linux + Tenstorrent | ✓ Full*** | ✓ Full | ✓ Full | ✗ N/A**** |
| Linux + Rebellions | ✓ Full | ✓ Full | ✓ Full | ✗ N/A***** |
| Linux + Furiosa | ✓ Full | ✓ Full | ✓ Full | ✗ N/A****** |
| Linux + Google TPU | ✓ Full | ✓ Full | ✓ Full | ✗ N/A******** |
| macOS + Apple Silicon | ✓ Partial* | ✓ Enhanced** | ✓ Full | ✓ Basic |
| NVIDIA Jetson | ✓ Full + DLA | ✓ Full | ✓ Full | ✓ Full |
*Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead)

**Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies

***Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers

****Tenstorrent NPUs do not expose per-process GPU usage information

*****Rebellions NPUs do not expose per-process GPU usage information

******Furiosa NPUs do not expose per-process GPU usage information

*******Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi

********Google Cloud TPUs do not expose per-process GPU usage information via tpu-info
# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)
# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100
# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80
# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)
# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts
# GPUs within 5°C of the slowdown threshold (approaching throttle)
all_smi_gpu_temperature_threshold_slowdown_celsius - all_smi_gpu_temperature_celsius < 5
# GPUs within 2°C of the shutdown threshold (critical)
all_smi_gpu_temperature_threshold_shutdown_celsius - all_smi_gpu_temperature_celsius < 2
# Thermal headroom to slowdown per GPU
all_smi_gpu_temperature_threshold_slowdown_celsius - all_smi_gpu_temperature_celsius
# GPUs not in the maximum performance state P0 (this also matches idle GPUs that have clocked down)
all_smi_gpu_performance_state > 0
# All vGPU-enabled physical GPUs by SR-IOV host mode
all_smi_vgpu_host_mode{host_mode="Sriov"}
# vGPU instances with high utilization (> 80%)
all_smi_vgpu_utilization > 80
# vGPU framebuffer occupancy per instance
all_smi_vgpu_memory_used_bytes / all_smi_vgpu_memory_total_bytes * 100
# Count active vGPU instances per physical GPU
count by (gpu_uuid) (all_smi_vgpu_active == 1)
# GPUs using Adaptive Round Robin scheduler
all_smi_vgpu_scheduler_state == 2
# vGPU memory bandwidth saturation
all_smi_vgpu_memory_utilization > 90
# GPUs with MIG mode enabled
all_smi_gpu_mig_mode == 1
# MIG instances with high GPU SM utilization (> 80%)
all_smi_mig_instance_utilization_gpu > 80
# MIG framebuffer occupancy per instance
all_smi_mig_instance_memory_used_bytes / all_smi_mig_instance_memory_total_bytes * 100
# Count active MIG instances per parent GPU
count by (gpu_uuid) (all_smi_mig_instance_memory_total_bytes)
# MIG instances by profile type
count by (mig_profile) (all_smi_mig_instance_memory_total_bytes)
# Total memory carved out for MIG across the cluster
sum(all_smi_mig_instance_memory_total_bytes)
# AMD GPUs with high fan speed (potential cooling issues)
all_smi_gpu_fan_speed_rpm > 3000
# VRAM utilization percentage
(all_smi_gpu_memory_vram_bytes / all_smi_gpu_memory_total_bytes) * 100
# AMD GPUs approaching power cap
all_smi_gpu_power_consumption_watts / all_smi_amd_power_cap_watts > 0.9
# Total memory in use (VRAM + GTT), in bytes
all_smi_gpu_memory_vram_bytes + all_smi_gpu_memory_gtt_bytes
# AMD GPU thermal efficiency (utilization per degree)
all_smi_gpu_utilization / all_smi_gpu_temperature_celsius
# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization
# ANE power consumption in watts
all_smi_ane_power_watts
# NPUs with high temperature on any sensor
max by (instance) ({
__name__=~"all_smi_tenstorrent_.*_temperature_celsius",
instance=~"tt.*"
}) > 80
# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type)
(all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)
# Throttling detection
all_smi_tenstorrent_throttler > 0
# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0
# NPUs in low performance state
all_smi_rebellions_performance_state > 0
# Devices with non-operational status
all_smi_rebellions_device_status != 1
# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts
# Memory capacity nearly exhausted (more than 90% used)
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9
# NPUs with unavailable cores
all_smi_furiosa_core_status == 0
# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)
# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1
# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}
# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80
# NPUs with high AIP utilization
all_smi_gaudi_aip_utilization_percent > 80
# HBM memory utilization across cluster
avg by (instance) (all_smi_gaudi_memory_utilization_percent)
# NPUs approaching power limit
all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.9
# Power efficiency (AIP utilization per watt)
all_smi_gaudi_aip_utilization_percent / all_smi_gaudi_power_draw_watts
# NPUs running hot (temperature > 70°C)
all_smi_gaudi_temperature_celsius > 70
# Total HBM memory usage across all Gaudi NPUs
sum(all_smi_gaudi_memory_used_bytes)
# Gaudi NPUs by device variant
count by (internal_name) (all_smi_gaudi_internal_name_info)
# Driver version consistency check
count by (version) (all_smi_gaudi_driver_info) > 1
# TPU utilization across all chips
avg(all_smi_tpu_utilization_percent)
# HBM memory utilization percentage
all_smi_tpu_memory_utilization_percent
# Count TPUs by accelerator type
count by (type) (all_smi_tpu_accelerator_type_info)
# Monitor HLO queue size
all_smi_tpu_hlo_queue_size > 5
# Alert on high HLO execution latency (P90 above 1 second = 1,000,000 µs)
all_smi_tpu_hlo_exec_p90_microseconds > 1000000
# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)
# Processes using more than 1 GiB of GPU memory
all_smi_gpu_process_memory_bytes > 1073741824
# Total power consumption across all nodes
sum(all_smi_chassis_power_watts)
# Nodes with high power consumption (> 3000W)
all_smi_chassis_power_watts > 3000
# Power breakdown by component (Apple Silicon)
sum by (hostname) (all_smi_chassis_cpu_power_watts)
sum by (hostname) (all_smi_chassis_gpu_power_watts)
sum by (hostname) (all_smi_chassis_ane_power_watts)
# Nodes with non-nominal thermal pressure
all_smi_chassis_thermal_pressure_info{level!="Nominal"}
# Average chassis power per node
avg(all_smi_chassis_power_watts)
# Nodes with high inlet temperature
all_smi_chassis_inlet_temperature_celsius > 35
# Delta between inlet and outlet temperature (thermal dissipation)
all_smi_chassis_outlet_temperature_celsius - all_smi_chassis_inlet_temperature_celsius
# Fan speed monitoring
avg by (hostname) (all_smi_chassis_fan_speed_rpm)
# All containers running in Kubernetes
all_smi_container_runtime_info{runtime="Kubernetes"}
# All instances running in AWS EC2
all_smi_virtualization_info{vm_type="AWS EC2"}
# Containers running in Backend.AI
all_smi_runtime_environment{environment="Backend.AI"}
# Group metrics by runtime environment
sum by (environment) (all_smi_gpu_utilization * on(hostname) group_left(environment) all_smi_runtime_environment)
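
The queries above can also be issued programmatically through the Prometheus HTTP API. The sketch below only builds the request URL for the instant-query endpoint (`/api/v1/query`); it assumes a Prometheus server that already scrapes all-smi and is reachable at the hypothetical address `http://localhost:9091`, and the helper name `instant_query_url` is invented for this example.

```python
from urllib.parse import urlencode

# Assumption: address of a Prometheus server (not the all-smi API itself).
PROMETHEUS_URL = "http://localhost:9091"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against Prometheus' /api/v1/query."""
    # urlencode percent-encodes characters such as parentheses and braces,
    # which appear in almost every PromQL expression.
    query_string = urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen(instant_query_url("avg(all_smi_gpu_utilization)"))`; the JSON response carries the matching series under `data.result`.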
Create a comprehensive monitoring dashboard with:
- GPU utilization heatmap
- Memory usage time series
- Power consumption stacked graph
- Temperature alerts
- Process resource usage table
groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: all_smi_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_name }} is running hot"
      - alert: GPUMemoryExhausted
        expr: (all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
      - alert: TenstorrentNPUFault
        expr: all_smi_tenstorrent_faults > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} has fault condition"
      - alert: TenstorrentNPUThrottling
        expr: all_smi_tenstorrent_throttler > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} is throttling"
      - alert: RebellionsNPULowPerformance
        expr: all_smi_rebellions_performance_state > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rebellions NPU {{ $labels.instance }} stuck in low performance state P{{ $value }}"
      - alert: FuriosaNPUCoreFailure
        expr: all_smi_furiosa_core_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} has unavailable core {{ $labels.core }}"
      - alert: FuriosaNPUHighErrorRate
        expr: rate(all_smi_furiosa_error_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} experiencing high error rate"
      - alert: GaudiNPUHighTemperature
        expr: all_smi_gaudi_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} is running hot at {{ $value }}°C"
      - alert: GaudiNPUPowerLimitApproaching
        expr: all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} approaching power limit"
      - alert: GaudiNPUHBMMemoryExhausted
        expr: all_smi_gaudi_memory_utilization_percent > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} HBM memory nearly exhausted"
      - alert: ChassisHighPowerConsumption
        expr: all_smi_chassis_power_watts > 3500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} power consumption is high at {{ $value }}W"
      - alert: ChassisThermalPressureElevated
        expr: all_smi_chassis_thermal_pressure_info{level!="Nominal"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} thermal pressure elevated to {{ $labels.level }}"
      - alert: ChassisHighInletTemperature
        expr: all_smi_chassis_inlet_temperature_celsius > 40
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} inlet temperature is high at {{ $value }}°C"

The metrics update interval can be configured:
- Default: 3 seconds
- Minimum recommended: 1 second
- Maximum recommended: 60 seconds
Higher update rates provide more real-time data but increase system load. For production monitoring, 5-10 seconds is typically sufficient.
- All metrics follow Prometheus naming conventions
- Labels are used to differentiate between multiple devices
- Info metrics (ending in `_info`) provide static metadata
- Some metrics may not be available on all platforms
- Process metrics require the `--processes` flag and may impact performance
- Tenstorrent NPU metrics include comprehensive hardware monitoring data:
  - Multiple temperature sensors (ASIC, voltage regulator, inlet/outlet)
  - Detailed firmware versions and health counters
  - Power limits (TDP/TDC) and throttling information
  - PCIe and DDR status registers for diagnostics
- Tenstorrent utilization is calculated based on power consumption as a proxy metric
- Rebellions NPU metrics include:
  - Performance state monitoring (P0-P15) for power management
  - Device status and KMD version tracking
  - Support for ATOM, ATOM+, and ATOM Max variants
  - PCIe Gen4 x16 interface metrics
- Furiosa NPU metrics include:
  - Per-core PE utilization monitoring
  - Core availability status tracking
  - Power governor mode information
  - Error counting and liveness monitoring
  - RNGD architecture with 8 cores per NPU
- Intel Gaudi NPU metrics include:
  - AIP (AI Processor) utilization monitoring
  - HBM memory usage and utilization tracking (up to 128GB per device)
  - Power consumption with configurable power limits (up to 850W)
  - Temperature monitoring
  - Automatic device name mapping (HL-325L → Intel Gaudi 3 PCIe LP)
  - Support for Gaudi 1/2/3 across PCIe, OAM, UBB, and HLS form factors
  - Background process monitoring via hl-smi with circular buffer
- Chassis/Node-level metrics include:
  - Total chassis power consumption aggregating CPU, GPU, and ANE power
  - Thermal pressure monitoring (Apple Silicon)
  - Individual power component breakdown (CPU, GPU, ANE)
  - Inlet/outlet temperature monitoring (BMC-enabled servers)
  - Fan speed monitoring with per-fan granularity
- NVIDIA vGPU metrics include:
  - Host-level SR-IOV mode and scheduler configuration per physical GPU
  - Per-vGPU instance utilization, framebuffer memory, and liveness
  - VM identifier label (`vgpu_vm_id`) for correlating instances with virtual machines
  - Completely silent on non-vGPU hosts: no empty metric families are emitted
  - Requires NVIDIA vGPU-capable hardware and the GRID/vGPU driver stack
  - Set `ALL_SMI_MOCK_VGPU=1` (with a `--features mock` build) to simulate vGPU data for development
- NVIDIA MIG metrics include:
  - Per-GPU MIG mode status (`all_smi_gpu_mig_mode`); emitted for every MIG-capable GPU regardless of whether instances are active
  - Per-instance SM utilization, memory bandwidth utilization, and framebuffer used/total bytes
  - MIG profile name (`mig_profile`, e.g. `1g.5gb`, `3g.20gb`) and NVML instance IDs for correlating with `nvidia-smi mig`
  - TUI renders MIG instances as nested rows under each parent GPU, matched by UUID with hostname+GPU-name fallback
  - Completely silent on non-MIG hosts: no empty metric families are emitted
  - Set `ALL_SMI_MOCK_MIG=1` (with a `--features mock` build) to simulate MIG data for development
- NVIDIA extended hardware detail metrics include:
  - NUMA topology (`all_smi_gpu_numa_node_id`), GSP firmware mode/version, NvLink remote endpoint classification, and GPM SM occupancy / memory bandwidth utilization
  - Thermal thresholds (`all_smi_gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius`) and the canonical `all_smi_gpu_performance_state` gauge
  - Emitted only when the driver exposes the underlying NVML APIs; older drivers silently omit these metrics
  - Set `ALL_SMI_MOCK_HARDWARE_DETAILS=1` (with a `--features mock` build) to have the mock emit the full extended hardware-detail set; when unset, the mock simulates an older driver
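
Because info metrics carry their payload in labels rather than in the sample value, consuming them usually means reading a label. The following sketch counts the distinct values of one label on an info metric, which is one way to spot version drift across devices. The sample lines, version strings, and the helper name `label_counts` are invented for illustration; only the metric and label names come from the tables in this document.

```python
import re
from collections import Counter

# Hypothetical scrape excerpt; the version strings are made up for this example.
SAMPLE = """\
all_smi_tpu_runtime_version_info{instance="node-0",npu_index="0",version="0.4.0"} 1
all_smi_tpu_runtime_version_info{instance="node-0",npu_index="1",version="0.4.0"} 1
all_smi_tpu_runtime_version_info{instance="node-1",npu_index="0",version="0.3.2"} 1
"""

# Matches key="value" pairs, allowing backslash-escaped characters in the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def label_counts(text, metric, label):
    """Count how many series of `metric` carry each value of `label`."""
    counts = Counter()
    for line in text.splitlines():
        # Only consider sample lines of the requested metric that have labels.
        if not line.startswith(metric + "{"):
            continue
        labels = dict(LABEL_RE.findall(line))
        if label in labels:
            counts[labels[label]] += 1
    return counts
```

More than one key in the returned counter means the devices report different versions, which is the same check the `count by (version) (...) > 1` style of query expresses in PromQL.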