
feat(runners): show runner hardware (instance type + accelerator) in admin #2734

Merged
chocobar merged 2 commits into main from feat/runner-hardware-visibility
Jun 25, 2026

@chocobar

Collaborator

Improvement 1 of 2 (runner hardware visibility). On the admin Runners page, selecting a runner now shows its architecture so you can pick a compatible profile.

What it adds

  • Neuron detection in gpudetect (glob /dev/neuron* → one GPUStatus per chip, vendor=neuron). Before this change, Neuron runners reported an empty GPU list.
  • Instance type via IMDSv2 in sandbox-heartbeat (1.5s timeout, fail-silent).
  • instance_type on the heartbeat payload + sandbox_instances row (GORM AutoMigrate adds the column); debug DTO surfaces per-runner instance_type/gpu_vendor/gpus.
  • Hardware card in Runners.tsx: instance type + accelerator inventory (derives Neuron family + NeuronCore count from the instance type, e.g. "1 × AWS Inferentia2 — 2 NeuronCores"; GPU model for NVIDIA/AMD).
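
For reviewers who want to see the derivation in the last bullet spelled out, here is a minimal Go sketch of turning an instance type plus a detected chip count into an accelerator summary. The real logic lives in Runners.tsx and may differ; `NeuronInventory`, `DeriveNeuronInventory` and the family table are hypothetical names for illustration. The per-chip NeuronCore counts (4 for Inferentia, 2 for Inferentia2 and Trainium) are assumptions that should be checked against the AWS docs before reuse.

```go
package main

import (
	"fmt"
	"strings"
)

// NeuronInventory is a hypothetical summary type for this sketch only.
type NeuronInventory struct {
	Family string // display name, e.g. "AWS Inferentia2"
	Chips  int    // number of Neuron devices detected on the host
	Cores  int    // total NeuronCores across all chips
}

// neuronFamilies maps an instance-family prefix to a display name and
// the number of NeuronCores per chip. Assumed values, not taken from the PR.
var neuronFamilies = map[string]struct {
	name         string
	coresPerChip int
}{
	"inf1": {"AWS Inferentia", 4},
	"inf2": {"AWS Inferentia2", 2},
	"trn1": {"AWS Trainium", 2},
}

// DeriveNeuronInventory returns ok=false for unknown families, malformed
// instance types, or a non-positive chip count, so callers can fall back
// to a generic label.
func DeriveNeuronInventory(instanceType string, chips int) (NeuronInventory, bool) {
	family, _, found := strings.Cut(instanceType, ".")
	if !found || chips <= 0 {
		return NeuronInventory{}, false
	}
	f, known := neuronFamilies[family]
	if !known {
		return NeuronInventory{}, false
	}
	return NeuronInventory{Family: f.name, Chips: chips, Cores: chips * f.coresPerChip}, true
}

func main() {
	if inv, ok := DeriveNeuronInventory("inf2.8xlarge", 1); ok {
		// prints "1 x AWS Inferentia2, 2 NeuronCores"
		fmt.Printf("%d x %s, %d NeuronCores\n", inv.Chips, inv.Family, inv.Cores)
	}
}
```

Taking the chip count from the detected devices rather than from a per-size lookup table keeps the table small, at the cost of trusting the device glob.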

Bare-metal safe (e.g. prime)

The NVIDIA nvidia-smi path is untouched. Neuron glob returns nothing on non-neuron hosts; IMDS fails silently off-AWS (no hang) → instance type just shows "Bare metal / unknown" and the GPU model still renders as today.
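
To make the fail-silent behaviour concrete, here is a rough Go sketch of an IMDSv2 lookup with a 1.5s overall budget. It is not the code in sandbox-heartbeat; `FetchInstanceType` and `imdsDo` are hypothetical names, and the 256-byte read cap is an arbitrary assumption. The token and metadata paths and headers are the standard IMDSv2 ones.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// imdsBase is the link-local metadata address (a literal IP, so no DNS lookup).
const imdsBase = "http://169.254.169.254"

// FetchInstanceType returns the instance type, or "" on any failure.
// Both requests share a single 1.5s deadline.
func FetchInstanceType(baseURL string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
	defer cancel()

	// Step 1: obtain a short-lived session token.
	token, ok := imdsDo(ctx, http.MethodPut, baseURL+"/latest/api/token",
		"X-aws-ec2-metadata-token-ttl-seconds", "60")
	if !ok || token == "" {
		return ""
	}
	// Step 2: read the instance type using that token.
	body, ok := imdsDo(ctx, http.MethodGet, baseURL+"/latest/meta-data/instance-type",
		"X-aws-ec2-metadata-token", token)
	if !ok {
		return ""
	}
	return body
}

// imdsDo performs one request and returns the trimmed body.
// Any error or non-200 status is reported as ok=false, never as a panic.
func imdsDo(ctx context.Context, method, url, headerKey, headerVal string) (string, bool) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return "", false
	}
	req.Header.Set(headerKey, headerVal)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", false
	}
	b, err := io.ReadAll(io.LimitReader(resp.Body, 256)) // assumed cap
	if err != nil {
		return "", false
	}
	return strings.TrimSpace(string(b)), true
}

func main() {
	it := FetchInstanceType(imdsBase)
	if it == "" {
		it = "Bare metal / unknown"
	}
	fmt.Println(it)
}
```

Passing the base URL as a parameter is only there so the function can be pointed at a fake server in tests.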

Validated

Live on inf2.8xlarge: IMDSv2 → inf2.8xlarge, glob → /dev/neuron0. Unit test for the neuron device regex. All changed Go packages build; Runners.tsx type-checks.
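
As an illustration of the glob-plus-regex approach, a Go sketch of the detection might look like the following. The actual regex and the real GPUStatus fields in gpudetect are not shown in this PR description, so `neuronDeviceRe`, `neuronIndex`, `DetectNeuron` and the three-field `GPUStatus` here are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
)

// GPUStatus is a hypothetical, trimmed-down stand-in for the real type.
type GPUStatus struct {
	Index  int
	Vendor string
	Device string
}

// neuronDeviceRe accepts only "neuron<digits>", so other entries that
// happen to match the /dev/neuron* glob are ignored. Assumed pattern.
var neuronDeviceRe = regexp.MustCompile(`^neuron(\d+)$`)

// neuronIndex extracts the chip index from a device path.
func neuronIndex(path string) (int, bool) {
	m := neuronDeviceRe.FindStringSubmatch(filepath.Base(path))
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

// DetectNeuron returns one GPUStatus per matching device, sorted by index.
// On hosts without Neuron devices the glob is empty and the result is nil.
func DetectNeuron(pattern string) []GPUStatus {
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil
	}
	var out []GPUStatus
	for _, p := range paths {
		if idx, ok := neuronIndex(p); ok {
			out = append(out, GPUStatus{Index: idx, Vendor: "neuron", Device: p})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Index < out[j].Index })
	return out
}

func main() {
	for _, g := range DetectNeuron("/dev/neuron*") {
		fmt.Printf("%d %s %s\n", g.Index, g.Vendor, g.Device)
	}
}
```

Sorting numerically rather than relying on glob order keeps neuron10 after neuron2 on larger instances.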

Next: Improvement 2 (explicit per-substrate runner placement).

🤖 Generated with Claude Code

chocobar and others added 2 commits June 25, 2026 16:25
…admin

Improvement 1: the admin Runners page now shows a selected runner's
architecture so an operator can pick a compatible profile.

- gpudetect: detect AWS Neuron devices (glob /dev/neuron*), one GPUStatus
  per chip with vendor=neuron. Returns nothing on non-neuron hosts.
- sandbox-heartbeat: detect cloud instance type via IMDSv2 (1.5s timeout,
  fail-silent). Empty on bare-metal (prime) and non-AWS - no hang.
- heartbeat payload + sandbox_instances row gain instance_type; GORM
  AutoMigrate adds the column.
- debug DTO surfaces per-runner instance_type, gpu_vendor and gpus.
- Runners.tsx: new Hardware card showing instance type + accelerator
  inventory (derives Neuron family + NeuronCore count from instance type;
  shows GPU model for NVIDIA/AMD; 'Bare metal / unknown' when no instance type).

Bare-metal-safe: NVIDIA path untouched; neuron glob + IMDS both no-op
cleanly off-AWS. Validated live on inf2.8xlarge (IMDS -> inf2.8xlarge,
/dev/neuron0 detected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review fixes:
- detectInstanceType: validate the IMDS body against an instance-type regex
  and cap the read, so a rogue HTTP 200 at the link-local IP (captive portal /
  on-prem metadata on a non-AWS host) can't store an oversized string that
  overflows varchar(100) and fails the whole heartbeat UPDATE. Also check the
  IMDSv2 token status. Adds a unit test for the validation.
- Cache the instance type after first success (it's immutable) so IMDS isn't
  hit every 30s heartbeat.
- Runners.tsx: prefer the per-accelerator vendor for Neuron (GPU_VENDOR env is
  empty on neuron hosts); fix 'AWS AWS Neuron' double-prefix when instance type
  is unavailable.
- Regenerate swagger for the new debug-DTO fields (swag v1.16.4).
@chocobar

Collaborator Author

Review pass (4 agents) + fixes applied. Verified good: no inference-router regression (neuron GPUs correctly fail NVIDIA constraints), NeuronCore counts fact-checked vs AWS docs, timeout cannot block the heartbeat loop (literal IP → no DNS; 1.5s cap; bounded by the existing 5s probe budget), data flow confirmed end-to-end.

Fixed from review:

  • IMDS body now validated + length-capped (regex ^[a-z][a-z0-9-]{0,30}\.[a-z0-9]+$) + token status checked — closes a real bug where a rogue HTTP 200 at the link-local IP on a non-AWS host could overflow varchar(100) and fail the whole heartbeat UPDATE (flapping the runner offline). Unit test added.
  • Instance type cached after first success (immutable) — no IMDS hit every 30s.
  • UI: prefer per-accelerator vendor for Neuron; fixed an 'AWS AWS Neuron' double-prefix.
  • Swagger regenerated for the new DTO fields.
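
The first two fixes can be sketched together in Go as below. The regex is the one quoted above; everything else (`ValidInstanceType`, `InstanceTypeCache`, the 64-byte cap) is a hypothetical shape for illustration and not necessarily how detectInstanceType is written.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// instanceTypeRe is the pattern quoted in the review notes.
var instanceTypeRe = regexp.MustCompile(`^[a-z][a-z0-9-]{0,30}\.[a-z0-9]+$`)

// maxInstanceTypeLen is an assumed cap, comfortably below varchar(100).
// It is needed because the suffix part of the regex is unbounded.
const maxInstanceTypeLen = 64

// ValidInstanceType trims the body and accepts it only if it is short
// and looks like an instance type such as "inf2.8xlarge".
func ValidInstanceType(body string) (string, bool) {
	s := strings.TrimSpace(body)
	if len(s) > maxInstanceTypeLen || !instanceTypeRe.MatchString(s) {
		return "", false
	}
	return s, true
}

// InstanceTypeCache remembers the first valid result. Failed or invalid
// lookups are not cached, so a later heartbeat can try again.
type InstanceTypeCache struct {
	mu    sync.Mutex
	value string
	Fetch func() string // e.g. an IMDS lookup; returns "" on failure
}

// Get returns the cached value, calling Fetch only while nothing valid
// has been stored yet.
func (c *InstanceTypeCache) Get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value != "" {
		return c.value
	}
	if v, ok := ValidInstanceType(c.Fetch()); ok {
		c.value = v
	}
	return c.value
}

func main() {
	calls := 0
	cache := &InstanceTypeCache{Fetch: func() string {
		calls++
		return "inf2.8xlarge\n"
	}}
	cache.Get()
	cache.Get()
	fmt.Println(cache.Get(), calls) // prints "inf2.8xlarge 1"
}
```

Caching only successes means a runner that boots before the metadata service is reachable still picks up its instance type on a later heartbeat.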

Deployment note: the detection runs in sandbox-heartbeat, baked into the sandbox image — so existing runners show 'Bare metal / unknown' until ./stack build-sandbox rebuilds and redeploys. The API/frontend half is hot-reloadable; the detection half is not.

@chocobar chocobar merged commit 094f846 into main Jun 25, 2026
5 checks passed
@chocobar chocobar deleted the feat/runner-hardware-visibility branch June 25, 2026 15:49