API Metrics Reference

all-smi provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.

Starting API Mode

# Start API server on TCP port
all-smi api --port 9090

# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5

# Include process information
all-smi api --port 9090 --processes

Metrics are available at http://localhost:9090/metrics

Unix Domain Socket Support (Unix Only)

For local IPC scenarios, API mode supports Unix Domain Sockets:

# Use default socket path
all-smi api --socket
# Linux: /var/run/all-smi.sock (or /tmp/all-smi.sock)
# macOS: /tmp/all-smi.sock

# Use custom socket path
all-smi api --socket /custom/path/all-smi.sock

# TCP and Unix socket simultaneously
all-smi api --port 9090 --socket

# Unix socket only (disable TCP)
all-smi api --port 0 --socket

Access metrics via Unix socket:

curl --unix-socket /tmp/all-smi.sock http://localhost/metrics

# Python example
import requests_unixsocket
session = requests_unixsocket.Session()
r = session.get('http+unix://%2Ftmp%2Fall-smi.sock/metrics')

Security: Socket permissions are set to 0600 (owner-only access).

JSON Endpoints (Streaming + One-Shot)

Alongside /metrics, API mode exposes two JSON endpoints that share the same schema as the snapshot subcommand (schema: 1).

`GET /snapshot`

One-shot JSON payload reflecting the latest collection cycle. When the last cached frame is older than 2 × interval (or no cycle has completed yet), the handler forces a fresh collection so the response never silently serves stale data.

GET /snapshot
Content-Type: application/json
Cache-Control: no-store
X-Accel-Buffering: no

{
  "schema": 1,
  "timestamp": "2026-04-20T12:34:56Z",
  "hostname": "gpu-01",
  "gpus":   [ … ],
  "cpus":   [ … ],
  "memory": [ … ],
  "chassis":[ … ],
  "errors": []
}

Query parameters

Parameter	Default	Meaning
`include`	`gpu,cpu,memory,chassis`	Comma-separated sections. Valid values: `gpu`, `cpu`, `memory`, `chassis`, `process`, `storage`.
`pretty`	`0`	Pretty-print the JSON body when `1` / `true` / `yes`.

`GET /events`

Server-Sent Events stream. Emits one JSON frame per collection cycle.

GET /events
Accept: text/event-stream

event: snapshot
id: 2026-04-20T12:34:56Z
data: {"schema":1,"timestamp":"2026-04-20T12:34:56Z", …}

event: snapshot
id: 2026-04-20T12:34:57Z
data: { … }

Query parameters

Parameter	Default	Meaning
`include`	`gpu,cpu,memory,chassis`	Same grammar as `/snapshot`.
`throttle`	= collection interval	Minimum seconds between emitted events. Clamped to `>=` the collection interval.
`heartbeat`	`30`	Keep-alive comment interval in seconds.

Keep-alive and lag

The server emits a bare SSE comment every heartbeat seconds so reverse proxies do not idle-timeout the connection:

: keep-alive

When a client falls behind the server's broadcast buffer, the stream inserts a synthetic lag event and continues with the freshest live frame:

event: lag
data: {"dropped": 3}

Reconnect

Last-Event-ID is accepted but never triggers history replay — all-smi does not retain historical frames. On reconnect the client resumes with the next live frame regardless of the ID it sent.

Reverse-proxy guidance

The response advertises X-Accel-Buffering: no and Cache-Control: no-store. nginx operators should additionally set proxy_buffering off; on the location block. haproxy users want option http-buffer-request turned off for the SSE route.

Example clients

examples/sse_client.html — browser EventSource demo.
curl -N http://localhost:9090/events — raw inspection.

Security knobs

Env var	Default	Effect
`ALL_SMI_API_CORS_ALLOWED_ORIGINS`	(empty → no CORS)	Comma-separated list of origins permitted to read `/metrics`, `/snapshot`, and `/events` cross-origin. Set to `*` to revert to the pre-0.21 wildcard; a warning is logged in that case.
`ALL_SMI_API_MAX_SSE_SUBSCRIBERS`	`256`	Cap on concurrent `/events` subscribers. Extra clients get `503 Service Unavailable` with `Retry-After: 5`. Set to `0` to disable the cap.

The default CORS posture blocks cross-origin browser reads of the live telemetry stream. That includes process command lines (when --processes is enabled), usernames, GPU utilization, and power consumption — keep the allowlist narrow.

Process label truncation. command, process_name, and user fields are truncated at 256 / 128 / 128 bytes respectively on all output surfaces (Prometheus /metrics, JSON /snapshot, and SSE /events). Longer strings are emitted with a trailing ...(N bytes truncated) marker. This caps scrape-response amplification and limits the blast radius of secrets that happen to live in argv.

Available Metrics

GPU Metrics (All Platforms)

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	GPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	GPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	GPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	GPU temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	GPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	GPU frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	GPU device information	info	`gpu_index`, `gpu_name`, `driver_version`

Unified AI Acceleration Library Labels

The all_smi_gpu_info metric includes standardized labels for AI acceleration libraries across all GPU/accelerator platforms. These unified labels allow platform-agnostic queries and dashboards:

Label	Description	Example Values
`lib_name`	Name of the AI acceleration library	`CUDA`, `ROCm`, `Metal`
`lib_version`	Version of the AI acceleration library	`13.0`, `7.0.2`, `Metal 3`

Platform-Specific Library Mappings

Platform	lib_name	lib_version Source	Platform-Specific Label
NVIDIA GPU	`CUDA`	CUDA version	`cuda_version`
AMD GPU	`ROCm`	ROCm version	`rocm_version`
NVIDIA Jetson	`CUDA`	CUDA version	`cuda_version`
Apple Silicon	`Metal`	Metal version	N/A

Note: Platform-specific labels (e.g., cuda_version, rocm_version) are maintained for backward compatibility with existing queries and dashboards.

Example PromQL Queries

# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)

# Get all CUDA devices with version 12 or higher
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}

# Alert on outdated ROCm versions (< 7.0)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"[7-9].*"} == 1

# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)

# Find all devices using Metal (Apple Silicon)
all_smi_gpu_info{lib_name="Metal"}

# Monitor library version consistency across cluster
count by (lib_name, lib_version) (all_smi_gpu_info) > 1

NVIDIA GPU Specific Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_pcie_gen_current`	Current PCIe generation	-	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_pcie_width_current`	Current PCIe link width	-	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_performance_state`	GPU performance state (P0=0 … P15=15; omitted when not reported)	-	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_temperature_threshold_slowdown_celsius`	Slowdown temperature threshold	celsius	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_temperature_threshold_shutdown_celsius`	Shutdown temperature threshold	celsius	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_temperature_threshold_max_operating_celsius`	Maximum operating temperature threshold	celsius	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_temperature_threshold_acoustic_celsius`	Acoustic (fan-noise) temperature threshold	celsius	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_clock_graphics_max_mhz`	Maximum graphics clock	MHz	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_clock_memory_max_mhz`	Maximum memory clock	MHz	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_power_limit_current_watts`	Current power limit	watts	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_power_limit_max_watts`	Maximum power limit	watts	`gpu`, `instance`, `gpu_uuid`, `gpu_index`

Notes:

Threshold metrics (temperature_threshold_*) and performance_state are NVIDIA-only. Each metric is emitted only when the driver exposes the value; hosts where the driver does not report a given threshold simply omit that metric line.
performance_state maps NVML PerformanceState variants: P0 (maximum performance) = 0 through P15 = 15. The Unknown sentinel is suppressed (None) rather than emitted.
The acoustic threshold is available on newer drivers and some GPU SKUs; older drivers leave it absent.

NVIDIA Hardware Details Metrics

Extended NVIDIA hardware detail metrics (NUMA topology, GSP firmware, NvLink topology, and GPU Performance Monitoring). All metrics in this group share the same four-label base set as other NVIDIA-specific metrics (gpu, instance, gpu_uuid, gpu_index) and are omitted entirely for non-NVIDIA devices, on older drivers that do not expose the underlying NVML APIs, or when the field value is unavailable.

Metric	Description	Unit	Labels
`all_smi_gpu_numa_node_id`	NUMA node the GPU is attached to; omitted when the host has no NUMA topology or the driver does not report one	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_gsp_firmware_mode`	GSP firmware mode: `0`=disabled, `1`=enabled, `2`=default; omitted on pre-R525 drivers or non-datacenter SKUs	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_gsp_firmware_version_info`	Info-style metric (value always 1) carrying the GSP firmware version string in a `version` label	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`, `version`
`all_smi_nvlink_remote_device_type`	Info-style metric (value always 1) per active NvLink; classification in `remote_type` label	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`, `link_index`, `remote_type`
`all_smi_gpu_sm_occupancy`	GPM-reported SM occupancy fraction (0.0–1.0); omitted on pre-Hopper GPUs or when GPM has not yet sampled	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`
`all_smi_gpu_memory_bandwidth_utilization`	GPM-reported DRAM bandwidth utilization fraction (0.0–1.0); omitted when GPM is unsupported or unsampled	gauge	`gpu`, `instance`, `gpu_uuid`, `gpu_index`

Label values for all_smi_nvlink_remote_device_type:

Label	Values	Description
`link_index`	`"0"`, `"1"`, …	NvLink port index on the GPU
`remote_type`	`"gpu"`, `"switch"`, `"ibmnpu"`, `"unknown"`	Classification of the remote endpoint

Notes:

all_smi_gpu_numa_node_id: NVML reports -1 for GPUs without a NUMA attachment; this value is canonicalised to None and the metric is omitted rather than emitting a negative number.
all_smi_gpu_gsp_firmware_version_info: the version label carries a string such as "550.54.15". Because the version is static for the lifetime of the driver, it is cached after the first successful NVML call.
all_smi_nvlink_remote_device_type: one metric row is emitted per active NvLink. A GPU with no active links produces no rows for this metric family.
all_smi_gpu_sm_occupancy and all_smi_gpu_memory_bandwidth_utilization: GPM requires a two-sample handshake before values are available. Until the handshake completes the exporter holds None for both fields and emits nothing, preventing spurious zero readings. These metrics are currently plumbing only — values are populated on Hopper and later hardware when the GPM handshake succeeds.
To simulate the full set of extended hardware detail metrics (including the thermal thresholds and performance_state listed in the NVIDIA GPU Specific Metrics table above) in development/testing without a modern NVIDIA driver, set ALL_SMI_MOCK_HARDWARE_DETAILS=1 when running with the mock feature. When unset, the mock omits these families to simulate an older driver that does not expose the underlying NVML APIs.

NVIDIA vGPU Metrics

NVIDIA vGPU metrics are emitted only on hosts with vGPU SR-IOV enabled. Non-vGPU hosts produce no output for these metric families.

Host-Level vGPU Metrics

Metric	Description	Unit	Labels
`all_smi_vgpu_host_mode`	vGPU host mode (0=NonSriov, 1=Sriov, 2=Disabled)	gauge	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `host_mode`
`all_smi_vgpu_scheduler_state`	vGPU scheduler ARR mode (0=unsupported, 1=off, 2=ARR)	gauge	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `arr_supported`
`all_smi_vgpu_scheduler_policy`	vGPU scheduler policy id	gauge	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`

Per-vGPU Instance Metrics

Metric	Description	Unit	Labels
`all_smi_vgpu_utilization`	Per-vGPU GPU utilization percentage (0-100)	percent	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `vgpu_id`, `vgpu_uuid`, `vgpu_type`, `vgpu_vm_id`
`all_smi_vgpu_memory_utilization`	Per-vGPU memory bandwidth utilization percentage (0-100)	percent	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `vgpu_id`, `vgpu_uuid`, `vgpu_type`, `vgpu_vm_id`
`all_smi_vgpu_memory_used_bytes`	Per-vGPU framebuffer memory used	bytes	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `vgpu_id`, `vgpu_uuid`, `vgpu_type`, `vgpu_vm_id`
`all_smi_vgpu_memory_total_bytes`	Per-vGPU framebuffer memory budget	bytes	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `vgpu_id`, `vgpu_uuid`, `vgpu_type`, `vgpu_vm_id`
`all_smi_vgpu_active`	Per-vGPU liveness (1=accounting PID active, 0=idle)	gauge	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `vgpu_id`, `vgpu_uuid`, `vgpu_type`, `vgpu_vm_id`

Notes:

vgpu_utilization is only emitted when NVML accounting data is available for the vGPU instance.
vgpu_memory_utilization is only emitted when NVML reports memory bandwidth usage.
The vgpu_vm_id label carries the owning VM identifier so remote scrapers can reconstruct the same VM column shown in the TUI.
To simulate vGPU responses in development/testing without real vGPU hardware, set ALL_SMI_MOCK_VGPU=1 when running with the mock feature.

NVIDIA MIG Metrics

NVIDIA MIG (Multi-Instance GPU) metrics are emitted only on hosts where at least one GPU has MIG mode enabled or has active MIG instances. Non-MIG hosts produce no output for these metric families.

Per-GPU MIG Mode

Metric	Description	Unit	Labels
`all_smi_gpu_mig_mode`	MIG mode per parent GPU (1=enabled, 0=disabled)	gauge	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`

Per-MIG-Instance Metrics

Metric	Description	Unit	Labels
`all_smi_mig_instance_utilization_gpu`	Per-MIG-instance GPU SM utilization percentage (0-100)	percent	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `mig_instance`, `mig_uuid`, `mig_profile`, `gpu_instance_id`, `compute_instance_id`
`all_smi_mig_instance_utilization_memory`	Per-MIG-instance memory bandwidth utilization (0-100)	percent	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `mig_instance`, `mig_uuid`, `mig_profile`, `gpu_instance_id`, `compute_instance_id`
`all_smi_mig_instance_memory_used_bytes`	Per-MIG-instance framebuffer memory used	bytes	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `mig_instance`, `mig_uuid`, `mig_profile`, `gpu_instance_id`, `compute_instance_id`
`all_smi_mig_instance_memory_total_bytes`	Per-MIG-instance framebuffer memory total carve-out	bytes	`gpu_index`, `gpu_uuid`, `gpu`, `instance`, `host`, `mig_instance`, `mig_uuid`, `mig_profile`, `gpu_instance_id`, `compute_instance_id`

Label descriptions for per-instance metrics:

Label	Description
`gpu_index`	Physical GPU index on the host
`gpu_uuid`	UUID of the parent physical GPU
`gpu`	Name of the parent GPU model
`instance`	Prometheus instance label (hostname or host:port)
`host`	Hostname of the server
`mig_instance`	MIG instance ordinal index within the parent GPU
`mig_uuid`	UUID of the MIG compute instance (assigned by NVML)
`mig_profile`	MIG profile name (e.g. `1g.5gb`, `3g.20gb`, `7g.40gb`)
`gpu_instance_id`	NVML GPU instance ID; empty string when not reported by the driver
`compute_instance_id`	NVML compute instance ID; empty string when not reported by the driver

Notes:

all_smi_mig_instance_utilization_gpu and all_smi_mig_instance_utilization_memory are emitted only when NVML reports utilization for that instance; the metric line is absent when the value is unavailable.
gpu_instance_id and compute_instance_id are emitted as empty string labels when NVML cannot report them, preserving round-trip fidelity in the remote parser.
MIG instances appear as nested rows under their parent GPU in the TUI, matched by gpu_uuid with a hostname+GPU-name fallback.
all_smi_gpu_mig_mode is emitted for every GPU that supports MIG (enabled or disabled), so consumers can detect mode transitions even when no instances are active.
To simulate MIG responses in development/testing without MIG hardware, set ALL_SMI_MOCK_MIG=1 when running with the mock feature.

NVIDIA Jetson Specific Metrics

Metric	Description	Unit	Labels
`all_smi_dla_utilization`	DLA (Deep Learning Accelerator) utilization	percent	`gpu_index`, `gpu_name`

AMD GPU Specific Metrics

AMD GPUs (Radeon and Instinct series) provide comprehensive monitoring through ROCm and the DRM subsystem:

Metric	Description	Unit	Labels
`all_smi_gpu_fan_speed_rpm`	GPU fan speed	RPM	`gpu_index`, `gpu_name`
`all_smi_amd_rocm_version`	AMD ROCm version installed	info	`instance`, `version`
`all_smi_gpu_memory_gtt_bytes`	GTT (GPU Translation Table) memory usage	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_vram_bytes`	VRAM (Video RAM) usage	bytes	`gpu_index`, `gpu_name`

Additional Details Available (in all_smi_gpu_info labels):

Driver Version: AMDGPU kernel driver version (e.g., "30.10.1")
ROCm Version: ROCm software stack version (e.g., "7.0.2")
PCIe Information: Current link generation and width, max GPU/system link capabilities
VBIOS: Version and date information
Power Management: Current, minimum, and maximum power cap values
ASIC Information: Device ID, revision ID, ASIC name
Memory Clock: Current memory clock frequency

Process Tracking:

AMD GPU process detection uses fdinfo from /proc/<pid>/fdinfo/ for accurate memory tracking
Tracks both VRAM and GTT memory usage per process
Available with --processes flag in API mode

Platform Requirements:

Requires ROCm drivers and libamdgpu_top library
Requires sudo access to /dev/dri devices or user in video/render groups
Only available in glibc builds (not musl static builds)

Apple Silicon GPU Specific Metrics

Metric	Description	Unit	Labels
`all_smi_ane_utilization`	ANE utilization	mW	`gpu_index`, `gpu_name`
`all_smi_ane_power_watts`	ANE power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_thermal_pressure_info`	Thermal pressure level	info	`gpu_index`, `gpu_name`, `level`

Note: For Apple Silicon (M1/M2/M3/M4), gpu_temperature_celsius is not available; thermal pressure level is provided instead.

Tenstorrent NPU Metrics

Basic NPU Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	NPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	NPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	NPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	NPU ASIC temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	NPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	NPU AI clock frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	NPU device information	info	`gpu_index`, `gpu_name`, `driver_version`
`all_smi_npu_firmware_info`	NPU firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `firmware`

Tenstorrent-Specific Metrics

Metric	Description	Unit	Labels
`all_smi_tenstorrent_board_info`	Board and architecture information	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `board_type`, `board_id`, `architecture`
`all_smi_tenstorrent_collection_method_info`	Data collection method used	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `method`
Firmware Versions
`all_smi_tenstorrent_arc_firmware_info`	ARC firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tenstorrent_eth_firmware_info`	Ethernet firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tenstorrent_ddr_firmware_info`	DDR firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tenstorrent_spibootrom_firmware_info`	SPI Boot ROM firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tenstorrent_firmware_date_info`	Firmware build date	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `date`
Temperature Sensors
`all_smi_tenstorrent_asic_temperature_celsius`	ASIC temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_vreg_temperature_celsius`	Voltage regulator temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_inlet_temperature_celsius`	Inlet temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_outlet1_temperature_celsius`	Outlet 1 temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_outlet2_temperature_celsius`	Outlet 2 temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`
Clock Frequencies
`all_smi_tenstorrent_aiclk_mhz`	AI clock frequency	MHz	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_axiclk_mhz`	AXI clock frequency	MHz	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_arcclk_mhz`	ARC clock frequency	MHz	`npu`, `instance`, `npu_uuid`, `npu_index`
Power and Electrical
`all_smi_tenstorrent_voltage_volts`	Core voltage	volts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_current_amperes`	Current draw	amperes	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_power_raw_watts`	Raw power consumption	watts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_tdp_limit_watts`	TDP limit	watts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_tdc_limit_amperes`	TDC limit	amperes	`npu`, `instance`, `npu_uuid`, `npu_index`
Status and Health
`all_smi_tenstorrent_heartbeat`	Device heartbeat counter	counter	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_arc0_health`	ARC0 health counter	counter	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_arc3_health`	ARC3 health counter	counter	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_faults`	Fault register value	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_throttler`	Throttler state register	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_pcie_status_info`	PCIe status register	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `status`
`all_smi_tenstorrent_eth_status_info`	Ethernet status register	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `port`, `status`
`all_smi_tenstorrent_ddr_status`	DDR status register	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
Fan Metrics
`all_smi_tenstorrent_fan_speed_percent`	Fan speed percentage	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_fan_rpm`	Fan speed in RPM	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
PCIe Information
`all_smi_tenstorrent_pcie_generation`	PCIe generation	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_pcie_width`	PCIe link width	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tenstorrent_pcie_address_info`	PCIe address	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `address`
`all_smi_tenstorrent_pcie_device_info`	PCIe device identification	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `vendor_id`, `device_id`
DRAM Information
`all_smi_tenstorrent_dram_info`	DRAM configuration	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `speed`

Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.

Rebellions NPU Metrics

Basic NPU Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	NPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	NPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	NPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	NPU temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	NPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	NPU clock frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	NPU device information	info	`gpu_index`, `gpu_name`, `driver_version`

Rebellions-Specific Metrics

Metric	Description	Unit	Labels
`all_smi_rebellions_device_info`	Device model and variant information	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `model`, `variant`
`all_smi_rebellions_firmware_info`	NPU firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `firmware_version`
`all_smi_rebellions_kmd_info`	Kernel Mode Driver version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `kmd_version`
`all_smi_rebellions_device_status`	Device operational status	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_rebellions_performance_state`	NPU performance state (P0-P15)	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_rebellions_pcie_generation`	PCIe generation (Gen4)	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_rebellions_pcie_width`	PCIe link width (x16)	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_rebellions_memory_bandwidth_gbps`	Memory bandwidth capacity	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_rebellions_compute_tops`	Compute capacity in TOPS	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`

Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.

Furiosa NPU Metrics

Basic NPU Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	NPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	NPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	NPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	NPU temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	NPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	NPU clock frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	NPU device information	info	`gpu_index`, `gpu_name`, `driver_version`

Furiosa-Specific Metrics

Metric	Description	Unit	Labels
`all_smi_furiosa_device_info`	Device architecture and model info	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `architecture`, `model`
`all_smi_furiosa_firmware_info`	NPU firmware version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `firmware_version`
`all_smi_furiosa_pert_info`	PERT (runtime) version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `pert_version`
`all_smi_furiosa_liveness_status`	Device liveness status	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_furiosa_core_count`	Number of cores in NPU	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_furiosa_core_status`	Core availability status	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`, `core`
`all_smi_furiosa_pe_utilization`	Processing Element utilization	percent	`npu`, `instance`, `npu_uuid`, `npu_index`, `core`
`all_smi_furiosa_core_frequency_mhz`	Per-core frequency	MHz	`npu`, `instance`, `npu_uuid`, `npu_index`, `core`
`all_smi_furiosa_power_governor_info`	Power governor mode	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `governor`
`all_smi_furiosa_error_count`	Cumulative error count	counter	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_furiosa_pcie_generation`	PCIe generation	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_furiosa_pcie_width`	PCIe link width	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_furiosa_memory_bandwidth_utilization`	Memory bandwidth utilization	percent	`npu`, `instance`, `npu_uuid`, `npu_index`

Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.

Intel Gaudi NPU Metrics

Basic NPU Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	NPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	NPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	NPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	NPU temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	NPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	NPU clock frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	NPU device information	info	`gpu_index`, `gpu_name`, `driver_version`

Intel Gaudi-Specific Metrics

Metric	Description	Unit	Labels
`all_smi_gaudi_device_info`	Device model and information	info	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_internal_name_info`	Internal device name (e.g., HL-325L)	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `internal_name`
`all_smi_gaudi_driver_info`	Habana driver version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_gaudi_aip_utilization_percent`	AIP (AI Processor) utilization	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_memory_used_bytes`	HBM memory used	bytes	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_memory_total_bytes`	HBM total memory	bytes	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_memory_utilization_percent`	HBM memory utilization percentage	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_power_draw_watts`	Current power consumption	watts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_power_max_watts`	Maximum power limit	watts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_power_utilization_percent`	Power utilization percentage	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_gaudi_temperature_celsius`	AIP temperature	celsius	`npu`, `instance`, `npu_uuid`, `npu_index`

Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the hl-smi command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.

Google TPU Metrics

Basic NPU Metrics

Metric	Description	Unit	Labels
`all_smi_gpu_utilization`	TPU utilization percentage	percent	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_used_bytes`	TPU memory used	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_memory_total_bytes`	TPU memory total	bytes	`gpu_index`, `gpu_name`
`all_smi_gpu_temperature_celsius`	TPU temperature	celsius	`gpu_index`, `gpu_name`
`all_smi_gpu_power_consumption_watts`	TPU power consumption	watts	`gpu_index`, `gpu_name`
`all_smi_gpu_frequency_mhz`	TPU clock frequency	MHz	`gpu_index`, `gpu_name`
`all_smi_gpu_info`	TPU device information	info	`gpu_index`, `gpu_name`, `driver_version`

TPU-Specific Metrics

Metric	Description	Unit	Labels
`all_smi_tpu_utilization_percent`	TPU duty cycle utilization	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_memory_used_bytes`	TPU HBM memory used	bytes	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_memory_total_bytes`	TPU HBM memory total	bytes	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_memory_utilization_percent`	TPU HBM memory utilization percentage	percent	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_chip_version_info`	TPU chip version information	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tpu_accelerator_type_info`	TPU accelerator type information	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `type`
`all_smi_tpu_core_count`	Number of TPU cores	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_tensorcore_count`	Number of TensorCores per chip	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_memory_type_info`	TPU memory type (HBM2/HBM2e/HBM3e)	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `type`
`all_smi_tpu_runtime_version_info`	TPU runtime/library version	info	`npu`, `instance`, `npu_uuid`, `npu_index`, `version`
`all_smi_tpu_power_max_watts`	TPU maximum power limit	watts	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_queue_size`	Number of pending HLO programs	gauge	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_exec_mean_microseconds`	HLO execution timing (mean)	µs	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_exec_p50_microseconds`	HLO execution timing (P50)	µs	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_exec_p90_microseconds`	HLO execution timing (P90)	µs	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_exec_p95_microseconds`	HLO execution timing (P95)	µs	`npu`, `instance`, `npu_uuid`, `npu_index`
`all_smi_tpu_hlo_exec_p999_microseconds`	HLO execution timing (P99.9)	µs	`npu`, `instance`, `npu_uuid`, `npu_index`

Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the tpu-info command-line tool running in streaming mode. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.

CPU Metrics (All Platforms)

Metric	Description	Unit	Labels
`all_smi_cpu_utilization`	CPU utilization percentage	percent	-
`all_smi_cpu_socket_count`	Number of CPU sockets	count	-
`all_smi_cpu_core_count`	Total number of CPU cores	count	-
`all_smi_cpu_thread_count`	Total number of CPU threads	count	-
`all_smi_cpu_frequency_mhz`	CPU frequency	MHz	-
`all_smi_cpu_temperature_celsius`	CPU temperature	celsius	-
`all_smi_cpu_power_consumption_watts`	CPU power consumption	watts	-
`all_smi_cpu_socket_utilization`	Per-socket CPU utilization	percent	`socket`

Apple Silicon CPU Specific Metrics

Metric	Description	Unit	Labels
`all_smi_cpu_p_core_count`	Number of performance cores	count	-
`all_smi_cpu_e_core_count`	Number of efficiency cores	count	-
`all_smi_cpu_gpu_core_count`	Number of integrated GPU cores	count	-
`all_smi_cpu_p_core_utilization`	P-core utilization percentage	percent	-
`all_smi_cpu_e_core_utilization`	E-core utilization percentage	percent	-
`all_smi_cpu_p_cluster_frequency_mhz`	P-cluster frequency	MHz	-
`all_smi_cpu_e_cluster_frequency_mhz`	E-cluster frequency	MHz	-

Memory Metrics (All Platforms)

Metric	Description	Unit	Labels
`all_smi_memory_total_bytes`	Total system memory	bytes	-
`all_smi_memory_used_bytes`	Used system memory	bytes	-
`all_smi_memory_available_bytes`	Available system memory	bytes	-
`all_smi_memory_free_bytes`	Free system memory	bytes	-
`all_smi_memory_utilization`	Memory utilization percentage	percent	-
`all_smi_swap_total_bytes`	Total swap space	bytes	-
`all_smi_swap_used_bytes`	Used swap space	bytes	-
`all_smi_swap_free_bytes`	Free swap space	bytes	-

Linux-Specific Memory Metrics

Metric	Description	Unit	Labels
`all_smi_memory_buffers_bytes`	Memory used for buffers	bytes	-
`all_smi_memory_cached_bytes`	Memory used for cache	bytes	-

Storage Metrics

Metric	Description	Unit	Labels
`all_smi_disk_total_bytes`	Total disk space	bytes	`mount_point`
`all_smi_disk_available_bytes`	Available disk space	bytes	`mount_point`

Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.

Chassis/Node-Level Metrics

Chassis metrics provide visibility into system-wide power consumption, thermal conditions, and cooling status at the node level. These metrics aggregate information from CPU, GPU, ANE, and BMC sensors.

Common Chassis Metrics (All Platforms)

Metric	Description	Unit	Labels
`all_smi_chassis_power_watts`	Total chassis power consumption (CPU+GPU+ANE)	watts	`hostname`, `instance`

Apple Silicon Chassis Metrics

Metric	Description	Unit	Labels
`all_smi_chassis_thermal_pressure_info`	Thermal pressure level	info	`hostname`, `instance`, `level`
`all_smi_chassis_cpu_power_watts`	CPU power consumption	watts	`hostname`, `instance`
`all_smi_chassis_gpu_power_watts`	GPU power consumption	watts	`hostname`, `instance`
`all_smi_chassis_ane_power_watts`	ANE (Apple Neural Engine) power	watts	`hostname`, `instance`

Server Chassis Metrics (BMC-enabled Systems)

Metric	Description	Unit	Labels
`all_smi_chassis_inlet_temperature_celsius`	Chassis inlet temperature	celsius	`hostname`, `instance`
`all_smi_chassis_outlet_temperature_celsius`	Chassis outlet temperature	celsius	`hostname`, `instance`
`all_smi_chassis_fan_speed_rpm`	Fan speed	RPM	`hostname`, `instance`, `fan_id`, `fan_name`

Note: Chassis metrics provide a unified view of node-level power consumption and thermal conditions, useful for cluster-wide capacity planning and power monitoring.

Runtime Environment Metrics

Metric	Description	Unit	Labels
`all_smi_runtime_environment`	Current runtime environment (container or VM)	gauge	`hostname`, `environment`
`all_smi_container_runtime_info`	Container runtime environment information	gauge	`hostname`, `runtime`, `container_id`
`all_smi_kubernetes_pod_info`	Kubernetes pod information (K8s only)	gauge	`hostname`, `pod_name`, `namespace`
`all_smi_virtualization_info`	Virtualization environment information	gauge	`hostname`, `vm_type`, `hypervisor`

Runtime environment metrics are detected at startup and provide information about the execution context:

Container environments: Docker, Kubernetes, Podman, containerd, LXC, CRI-O, Backend.AI
Virtualization platforms: VMware, VirtualBox, KVM, QEMU, Hyper-V, Xen, AWS EC2, Google Cloud, Azure, DigitalOcean, Parallels

Process Metrics (When --processes Flag is Used)

Metric	Description	Unit	Labels
`all_smi_gpu_process_memory_bytes`	GPU memory used by process	bytes	`gpu_index`, `gpu_name`, `pid`, `process_name`, `user`
`all_smi_gpu_process_sm_util`	Process GPU SM utilization	percent	`gpu_index`, `gpu_name`, `pid`, `process_name`, `user`
`all_smi_gpu_process_mem_util`	Process GPU memory utilization	percent	`gpu_index`, `gpu_name`, `pid`, `process_name`, `user`
`all_smi_gpu_process_enc_util`	Process GPU encoder utilization	percent	`gpu_index`, `gpu_name`, `pid`, `process_name`, `user`
`all_smi_gpu_process_dec_util`	Process GPU decoder utilization	percent	`gpu_index`, `gpu_name`, `pid`, `process_name`, `user`

Platform Support Matrix

Platform	GPU Metrics	CPU Metrics	Memory Metrics	Process Metrics
Linux + NVIDIA	✓ Full	✓ Full	✓ Full	✓ Full
Linux + Intel Gaudi	✓ Full	✓ Full	✓ Full	✗ N/A*******
Linux + Tenstorrent	✓ Full***	✓ Full	✓ Full	✗ N/A****
Linux + Rebellions	✓ Full	✓ Full	✓ Full	✗ N/A*****
Linux + Furiosa	✓ Full	✓ Full	✓ Full	✗ N/A******
Linux + Google TPU	✓ Full	✓ Full	✓ Full	✗ N/A********
macOS + Apple Silicon	✓ Partial*	✓ Enhanced**	✓ Full	✓ Basic
NVIDIA Jetson	✓ Full + DLA	✓ Full	✓ Full	✓ Full

*Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead) **Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies ***Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers ****Tenstorrent NPUs do not expose per-process GPU usage information *****Rebellions NPUs do not expose per-process GPU usage information ******Furiosa NPUs do not expose per-process GPU usage information *******Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi ********Google Cloud TPUs do not expose per-process GPU usage information via tpu-info

Example Prometheus Queries

Basic Monitoring

# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)

# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100

# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80

Power Monitoring

# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)

# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts

NVIDIA Thermal Thresholds and P-State

# GPUs within 5°C of the slowdown threshold (approaching throttle)
all_smi_gpu_temperature_threshold_slowdown_celsius - all_smi_gpu_temperature_celsius < 5

# GPUs within 2°C of the shutdown threshold (critical)
all_smi_gpu_temperature_threshold_shutdown_celsius - all_smi_gpu_temperature_celsius < 2

# Thermal headroom to slowdown per GPU
all_smi_gpu_temperature_threshold_slowdown_celsius - all_smi_gpu_temperature_celsius

# GPUs currently in a degraded performance state (not P0)
all_smi_gpu_performance_state > 0

NVIDIA vGPU Specific

# All vGPU-enabled physical GPUs by SR-IOV host mode
all_smi_vgpu_host_mode{host_mode="Sriov"}

# vGPU instances with high utilization (> 80%)
all_smi_vgpu_utilization > 80

# vGPU framebuffer occupancy per instance
all_smi_vgpu_memory_used_bytes / all_smi_vgpu_memory_total_bytes * 100

# Count active vGPU instances per physical GPU
count by (gpu_uuid) (all_smi_vgpu_active == 1)

# GPUs using Adaptive Round Robin scheduler
all_smi_vgpu_scheduler_state == 2

# vGPU memory bandwidth saturation
all_smi_vgpu_memory_utilization > 90

NVIDIA MIG Specific

# GPUs with MIG mode enabled
all_smi_gpu_mig_mode == 1

# MIG instances with high GPU SM utilization (> 80%)
all_smi_mig_instance_utilization_gpu > 80

# MIG framebuffer occupancy per instance
all_smi_mig_instance_memory_used_bytes / all_smi_mig_instance_memory_total_bytes * 100

# Count active MIG instances per parent GPU
count by (gpu_uuid) (all_smi_mig_instance_memory_total_bytes)

# MIG instances by profile type
count by (mig_profile) (all_smi_mig_instance_memory_total_bytes)

# Total memory carved out for MIG across the cluster
sum(all_smi_mig_instance_memory_total_bytes)

AMD GPU Specific

# AMD GPUs with high fan speed (potential cooling issues)
all_smi_gpu_fan_speed_rpm > 3000

# VRAM utilization percentage
(all_smi_gpu_memory_vram_bytes / all_smi_gpu_memory_total_bytes) * 100

# AMD GPUs approaching power cap
all_smi_gpu_power_consumption_watts / all_smi_amd_power_cap_watts > 0.9

# Memory bandwidth usage (VRAM + GTT)
all_smi_gpu_memory_vram_bytes + all_smi_gpu_memory_gtt_bytes

# AMD GPU thermal efficiency (utilization per degree)
all_smi_gpu_utilization / all_smi_gpu_temperature_celsius

Apple Silicon Specific

# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization

# ANE power consumption in watts
all_smi_ane_power_watts

Tenstorrent NPU Specific

# NPUs with high temperature on any sensor
max by (instance) ({
  __name__=~"all_smi_tenstorrent_.*_temperature_celsius",
  instance=~"tt.*"
}) > 80

# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type) 
  (all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)

# Throttling detection
all_smi_tenstorrent_throttler > 0

# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0

Rebellions NPU Specific

# NPUs in low performance state
all_smi_rebellions_performance_state > 0

# Devices with non-operational status
all_smi_rebellions_device_status != 1

# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts

# Memory bandwidth saturation check
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9

Furiosa NPU Specific

# NPUs with unavailable cores
all_smi_furiosa_core_status == 0

# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)

# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1

# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}

# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80

Intel Gaudi NPU Specific

# NPUs with high AIP utilization
all_smi_gaudi_aip_utilization_percent > 80

# HBM memory utilization across cluster
avg by (instance) (all_smi_gaudi_memory_utilization_percent)

# NPUs approaching power limit
all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.9

# Power efficiency (AIP utilization per watt)
all_smi_gaudi_aip_utilization_percent / all_smi_gaudi_power_draw_watts

# NPUs running hot (temperature > 70°C)
all_smi_gaudi_temperature_celsius > 70

# Total HBM memory usage across all Gaudi NPUs
sum(all_smi_gaudi_memory_used_bytes)

# Gaudi NPUs by device variant
count by (internal_name) (all_smi_gaudi_internal_name_info)

# Driver version consistency check
count by (version) (all_smi_gaudi_driver_info) > 1

Google TPU Specific

# TPU utilization across all chips
avg(all_smi_tpu_utilization_percent)

# HBM memory utilization percentage
all_smi_tpu_memory_utilization_percent

# Count TPUs by accelerator type
count by (type) (all_smi_tpu_accelerator_type_info)

# Monitor HLO queue size
all_smi_tpu_hlo_queue_size > 5

# Alert on high HLO execution latency
all_smi_tpu_hlo_exec_p90_microseconds > 1000000

Process Monitoring

# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)

# Processes using more than 1GB GPU memory
all_smi_gpu_process_memory_bytes > 1073741824

Chassis/Node-Level Monitoring

# Total power consumption across all nodes
sum(all_smi_chassis_power_watts)

# Nodes with high power consumption (> 3000W)
all_smi_chassis_power_watts > 3000

# Power breakdown by component (Apple Silicon)
sum by (hostname) (all_smi_chassis_cpu_power_watts)
sum by (hostname) (all_smi_chassis_gpu_power_watts)
sum by (hostname) (all_smi_chassis_ane_power_watts)

# Nodes with non-nominal thermal pressure
all_smi_chassis_thermal_pressure_info{level!="Nominal"}

# Average chassis power per node
avg(all_smi_chassis_power_watts)

# Nodes with high inlet temperature
all_smi_chassis_inlet_temperature_celsius > 35

# Delta between inlet and outlet temperature (thermal dissipation)
all_smi_chassis_outlet_temperature_celsius - all_smi_chassis_inlet_temperature_celsius

# Fan speed monitoring
avg by (hostname) (all_smi_chassis_fan_speed_rpm)

Runtime Environment Monitoring

# All containers running in Kubernetes
all_smi_container_runtime_info{runtime="Kubernetes"}

# All instances running in AWS EC2
all_smi_virtualization_info{vm_type="AWS EC2"}

# Containers running in Backend.AI
all_smi_runtime_environment{environment="Backend.AI"}

# Group metrics by runtime environment
sum by (environment) (all_smi_gpu_utilization) * on(hostname) group_left(environment) all_smi_runtime_environment

Integration Examples

Grafana Dashboard

Create a comprehensive monitoring dashboard with:

GPU utilization heatmap
Memory usage time series
Power consumption stacked graph
Temperature alerts
Process resource usage table

AlertManager Rules

groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: all_smi_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_name }} is running hot"
          
      - alert: GPUMemoryExhausted
        expr: (all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
          
      - alert: TenstorrentNPUFault
        expr: all_smi_tenstorrent_faults > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} has fault condition"
          
      - alert: TenstorrentNPUThrottling
        expr: all_smi_tenstorrent_throttler > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} is throttling"
          
      - alert: RebellionsNPULowPerformance
        expr: all_smi_rebellions_performance_state > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rebellions NPU {{ $labels.instance }} stuck in low performance state P{{ $value }}"
          
      - alert: FuriosaNPUCoreFailure
        expr: all_smi_furiosa_core_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} has unavailable core {{ $labels.core }}"
          
      - alert: FuriosaNPUHighErrorRate
        expr: rate(all_smi_furiosa_error_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} experiencing high error rate"

      - alert: GaudiNPUHighTemperature
        expr: all_smi_gaudi_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} is running hot at {{ $value }}°C"

      - alert: GaudiNPUPowerLimitApproaching
        expr: all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} approaching power limit"

      - alert: GaudiNPUHBMMemoryExhausted
        expr: all_smi_gaudi_memory_utilization_percent > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} HBM memory nearly exhausted"

      - alert: ChassisHighPowerConsumption
        expr: all_smi_chassis_power_watts > 3500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} power consumption is high at {{ $value }}W"

      - alert: ChassisThermalPressureElevated
        expr: all_smi_chassis_thermal_pressure_info{level!="Nominal"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} thermal pressure elevated to {{ $labels.level }}"

      - alert: ChassisHighInletTemperature
        expr: all_smi_chassis_inlet_temperature_celsius > 40
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} inlet temperature is high at {{ $value }}°C"

Update Intervals

The metrics update interval can be configured:

Default: 3 seconds
Minimum recommended: 1 second
Maximum recommended: 60 seconds

Higher update rates provide more real-time data but increase system load. For production monitoring, 5-10 seconds is typically sufficient.

Notes

All metrics follow Prometheus naming conventions
Labels are used to differentiate between multiple devices
Info metrics (ending in _info) provide static metadata
Some metrics may not be available on all platforms
Process metrics require the --processes flag and may impact performance
Tenstorrent NPU metrics include comprehensive hardware monitoring data:
- Multiple temperature sensors (ASIC, voltage regulator, inlet/outlet)
- Detailed firmware versions and health counters
- Power limits (TDP/TDC) and throttling information
- PCIe and DDR status registers for diagnostics
Tenstorrent utilization is calculated based on power consumption as a proxy metric
Rebellions NPU metrics include:
- Performance state monitoring (P0-P15) for power management
- Device status and KMD version tracking
- Support for ATOM, ATOM+, and ATOM Max variants
- PCIe Gen4 x16 interface metrics
Furiosa NPU metrics include:
- Per-core PE utilization monitoring
- Core availability status tracking
- Power governor mode information
- Error counting and liveness monitoring
- RNGD architecture with 8 cores per NPU
Intel Gaudi NPU metrics include:
- AIP (AI Processor) utilization monitoring
- HBM memory usage and utilization tracking (up to 128GB per device)
- Power consumption with configurable power limits (up to 850W)
- Temperature monitoring
- Automatic device name mapping (HL-325L → Intel Gaudi 3 PCIe LP)
- Support for Gaudi 1/2/3 across PCIe, OAM, UBB, and HLS form factors
- Background process monitoring via hl-smi with circular buffer
Chassis/Node-level metrics include:
- Total chassis power consumption aggregating CPU, GPU, and ANE power
- Thermal pressure monitoring (Apple Silicon)
- Individual power component breakdown (CPU, GPU, ANE)
- Inlet/outlet temperature monitoring (BMC-enabled servers)
- Fan speed monitoring with per-fan granularity
NVIDIA vGPU metrics include:
- Host-level SR-IOV mode and scheduler configuration per physical GPU
- Per-vGPU instance utilization, framebuffer memory, and liveness
- VM identifier label (vgpu_vm_id) for correlating instances with virtual machines
- Completely silent on non-vGPU hosts — no empty metric families are emitted
- Requires NVIDIA vGPU-capable hardware and the GRID/vGPU driver stack
- Set ALL_SMI_MOCK_VGPU=1 (with --features mock build) to simulate vGPU data for development
NVIDIA MIG metrics include:
- Per-GPU MIG mode status (all_smi_gpu_mig_mode); emitted for every MIG-capable GPU regardless of whether instances are active
- Per-instance SM utilization, memory bandwidth utilization, and framebuffer used/total bytes
- MIG profile name (mig_profile, e.g. 1g.5gb, 3g.20gb) and NVML instance IDs for correlating with nvidia-smi mig
- TUI renders MIG instances as nested rows under each parent GPU, matched by UUID with hostname+GPU-name fallback
- Completely silent on non-MIG hosts — no empty metric families are emitted
- Set ALL_SMI_MOCK_MIG=1 (with --features mock build) to simulate MIG data for development
NVIDIA extended hardware detail metrics include:
- NUMA topology (all_smi_gpu_numa_node_id), GSP firmware mode/version, NvLink remote endpoint classification, and GPM SM occupancy / memory bandwidth utilization
- Thermal thresholds (all_smi_gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius) and the canonical all_smi_gpu_performance_state gauge
- Emitted only when the driver exposes the underlying NVML APIs; older drivers silently omit these metrics
- Set ALL_SMI_MOCK_HARDWARE_DETAILS=1 (with --features mock build) to have the mock emit the full extended hardware-detail set; when unset, the mock simulates an older driver

FilesExpand file tree

API.md

Latest commit

History

API.md

File metadata and controls

API Metrics Reference

Starting API Mode

Unix Domain Socket Support (Unix Only)

JSON Endpoints (Streaming + One-Shot)

GET /snapshot

GET /events

Security knobs

Available Metrics

GPU Metrics (All Platforms)

Unified AI Acceleration Library Labels

Platform-Specific Library Mappings

Example PromQL Queries

NVIDIA GPU Specific Metrics

NVIDIA Hardware Details Metrics

NVIDIA vGPU Metrics

Host-Level vGPU Metrics

Per-vGPU Instance Metrics

NVIDIA MIG Metrics

Per-GPU MIG Mode

Per-MIG-Instance Metrics

NVIDIA Jetson Specific Metrics

AMD GPU Specific Metrics

Apple Silicon GPU Specific Metrics

Tenstorrent NPU Metrics

Basic NPU Metrics

Tenstorrent-Specific Metrics

Rebellions NPU Metrics

Basic NPU Metrics

Rebellions-Specific Metrics

Furiosa NPU Metrics

Basic NPU Metrics

Furiosa-Specific Metrics

Intel Gaudi NPU Metrics

Basic NPU Metrics

Intel Gaudi-Specific Metrics

Google TPU Metrics

Basic NPU Metrics

TPU-Specific Metrics

CPU Metrics (All Platforms)

Apple Silicon CPU Specific Metrics

Memory Metrics (All Platforms)

Linux-Specific Memory Metrics

Storage Metrics

Chassis/Node-Level Metrics

Common Chassis Metrics (All Platforms)

Apple Silicon Chassis Metrics

Server Chassis Metrics (BMC-enabled Systems)

Runtime Environment Metrics

Process Metrics (When --processes Flag is Used)

Platform Support Matrix

Example Prometheus Queries

Basic Monitoring

Power Monitoring

NVIDIA Thermal Thresholds and P-State

NVIDIA vGPU Specific

NVIDIA MIG Specific

AMD GPU Specific

Apple Silicon Specific

Tenstorrent NPU Specific

Rebellions NPU Specific

Furiosa NPU Specific

Intel Gaudi NPU Specific

Google TPU Specific

Process Monitoring

Chassis/Node-Level Monitoring

Runtime Environment Monitoring

Integration Examples

Grafana Dashboard

AlertManager Rules

Update Intervals

Notes

`GET /snapshot`

`GET /events`