feat(runners): show runner hardware (instance type + accelerator) in admin by chocobar · Pull Request #2734 · helixml/helix

chocobar · 2026-06-25T15:25:47Z

Improvement 1 of 2 (runner hardware visibility). On the admin Runners page, selecting a runner now shows its architecture so you can pick a compatible profile.

What it adds

- Neuron detection in gpudetect (glob /dev/neuron* → one GPUStatus per chip, vendor=neuron). Today neuron runners report an empty GPU list.
- Instance type via IMDSv2 in sandbox-heartbeat (1.5s timeout, fail-silent).
- instance_type on the heartbeat payload + sandbox_instances row (GORM AutoMigrate adds the column); debug DTO surfaces per-runner instance_type/gpu_vendor/gpus.
- Hardware card in Runners.tsx: instance type + accelerator inventory (derives Neuron family + NeuronCore count from the instance type, e.g. "1 × AWS Inferentia2 — 2 NeuronCores"; GPU model for NVIDIA/AMD).
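
For illustration only, the Neuron detection described above could look roughly like the sketch below. The GPUStatus struct here is a hypothetical stand-in with made-up fields, not the real gpudetect type, and the exact regex used in the PR is not shown in this description, so neuronDeviceRe is an assumption. The function names are likewise invented for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
)

// GPUStatus is a hypothetical stand-in for the gpudetect type;
// the real struct likely carries more fields.
type GPUStatus struct {
	Index  int
	Vendor string
	Device string
}

// neuronDeviceRe is an assumed pattern: it accepts /dev/neuron0, /dev/neuron1, ...
// and rejects anything else the glob might pick up (e.g. /dev/neuron_foo).
var neuronDeviceRe = regexp.MustCompile(`^/dev/neuron(\d+)$`)

// parseNeuronDevices turns glob results into one GPUStatus per chip,
// sorted by device index.
func parseNeuronDevices(paths []string) []GPUStatus {
	var out []GPUStatus
	for _, p := range paths {
		m := neuronDeviceRe.FindStringSubmatch(p)
		if m == nil {
			continue
		}
		idx, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		out = append(out, GPUStatus{Index: idx, Vendor: "neuron", Device: p})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Index < out[j].Index })
	return out
}

// detectNeuron globs the device nodes; on a non-neuron host the glob
// matches nothing and the result is empty.
func detectNeuron() []GPUStatus {
	paths, err := filepath.Glob("/dev/neuron*")
	if err != nil {
		return nil
	}
	return parseNeuronDevices(paths)
}

func main() {
	for _, g := range detectNeuron() {
		fmt.Printf("%d %s %s\n", g.Index, g.Vendor, g.Device)
	}
}
```

Splitting the glob from the parsing keeps the regex unit-testable without device nodes present, which is one plausible way to arrive at the "unit test for the neuron device regex" mentioned under Validated.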

Bare-metal safe (e.g. prime)

The NVIDIA nvidia-smi path is untouched. Neuron glob returns nothing on non-neuron hosts; IMDS fails silently off-AWS (no hang) → instance type just shows "Bare metal / unknown" and the GPU model still renders as today.

Validated

Live on inf2.8xlarge: IMDSv2 → inf2.8xlarge, glob → /dev/neuron0. Unit test for the neuron device regex. All changed Go packages build; Runners.tsx type-checks.

Next: Improvement 2 (explicit per-substrate runner placement).

🤖 Generated with Claude Code

…admin

Improvement 1: the admin Runners page now shows a selected runner's architecture so an operator can pick a compatible profile.

- gpudetect: detect AWS Neuron devices (glob /dev/neuron*), one GPUStatus per chip with vendor=neuron. Returns nothing on non-neuron hosts.
- sandbox-heartbeat: detect cloud instance type via IMDSv2 (1.5s timeout, fail-silent). Empty on bare-metal (prime) and non-AWS - no hang.
- heartbeat payload + sandbox_instances row gain instance_type; GORM AutoMigrate adds the column.
- debug DTO surfaces per-runner instance_type, gpu_vendor and gpus.
- Runners.tsx: new Hardware card showing instance type + accelerator inventory (derives Neuron family + NeuronCore count from instance type; shows GPU model for NVIDIA/AMD; 'Bare metal / unknown' when no instance type).

Bare-metal-safe: NVIDIA path untouched; neuron glob + IMDS both no-op cleanly off-AWS.

Validated live on inf2.8xlarge (IMDS -> inf2.8xlarge, /dev/neuron0 detected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Review fixes:

- detectInstanceType: validate the IMDS body against an instance-type regex and cap the read, so a rogue HTTP 200 at the link-local IP (captive portal / on-prem metadata on a non-AWS host) can't store an oversized string that overflows varchar(100) and fails the whole heartbeat UPDATE. Also check the IMDSv2 token status. Adds a unit test for the validation.
- Cache the instance type after first success (it's immutable) so IMDS isn't hit every 30s heartbeat.
- Runners.tsx: prefer the per-accelerator vendor for Neuron (GPU_VENDOR env is empty on neuron hosts); fix 'AWS AWS Neuron' double-prefix when instance type is unavailable.
- Regenerate swagger for the new debug-DTO fields (swag v1.16.4).

chocobar · 2026-06-25T15:39:50Z

Review pass (4 agents) + fixes applied. Verified good: no inference-router regression (neuron GPUs correctly fail NVIDIA constraints), NeuronCore counts fact-checked vs AWS docs, timeout cannot block the heartbeat loop (literal IP → no DNS; 1.5s cap; bounded by the existing 5s probe budget), data flow confirmed end-to-end.

Fixed from review:

- IMDS body is now validated and length-capped (regex ^[a-z][a-z0-9-]{0,30}\.[a-z0-9]+$), and the token status is checked. This closes a real bug where a rogue HTTP 200 at the link-local IP on a non-AWS host could overflow varchar(100) and fail the whole heartbeat UPDATE (flapping the runner offline). Unit test added.
- Instance type is cached after the first success (it is immutable), so IMDS is no longer hit on every 30s heartbeat.
- UI: prefer the per-accelerator vendor for Neuron; fixed an 'AWS AWS Neuron' double-prefix.
- Swagger regenerated for the new DTO fields.

Deployment note: the detection runs in sandbox-heartbeat, baked into the sandbox image — so existing runners show 'Bare metal / unknown' until ./stack build-sandbox rebuilds and redeploys. The API/frontend half is hot-reloadable; the detection half is not.

chocobar and others added 2 commits June 25, 2026 16:25

chocobar merged commit 094f846 into main Jun 25, 2026
5 checks passed

chocobar deleted the feat/runner-hardware-visibility branch June 25, 2026 15:49

chocobar merged 2 commits into main from feat/runner-hardware-visibility
