feat(runners): show runner hardware (instance type + accelerator) in admin #2734
Merged
…admin

Improvement 1: the admin Runners page now shows a selected runner's architecture so an operator can pick a compatible profile.

- gpudetect: detect AWS Neuron devices (glob /dev/neuron*), one GPUStatus per chip with vendor=neuron. Returns nothing on non-neuron hosts.
- sandbox-heartbeat: detect cloud instance type via IMDSv2 (1.5s timeout, fail-silent). Empty on bare-metal (prime) and non-AWS hosts, with no hang.
- heartbeat payload + sandbox_instances row gain instance_type; GORM AutoMigrate adds the column.
- debug DTO surfaces per-runner instance_type, gpu_vendor and gpus.
- Runners.tsx: new Hardware card showing instance type + accelerator inventory (derives Neuron family + NeuronCore count from instance type; shows GPU model for NVIDIA/AMD; 'Bare metal / unknown' when no instance type).

Bare-metal-safe: NVIDIA path untouched; neuron glob + IMDS both no-op cleanly off-AWS.

Validated live on inf2.8xlarge (IMDS -> inf2.8xlarge, /dev/neuron0 detected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
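The Neuron detection described in this commit could look roughly like the sketch below. It is not the code from the PR: the GPUStatus struct here is a pared-down hypothetical stand-in, and the function names and the exact device regex are assumptions. The glob finds candidates and the regex keeps only /dev/neuronN entries, so a host without Neuron devices yields an empty list.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
)

// GPUStatus is a hypothetical, minimal stand-in for the real struct.
type GPUStatus struct {
	Index  int
	Vendor string
	Device string
}

// neuronDevRe is an assumed pattern: /dev/neuron followed by digits only.
var neuronDevRe = regexp.MustCompile(`^/dev/neuron(\d+)$`)

// neuronChips turns candidate device paths into one GPUStatus per chip,
// skipping anything that does not look like /dev/neuronN.
func neuronChips(paths []string) []GPUStatus {
	var out []GPUStatus
	for _, p := range paths {
		m := neuronDevRe.FindStringSubmatch(p)
		if m == nil {
			continue
		}
		idx, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		out = append(out, GPUStatus{Index: idx, Vendor: "neuron", Device: p})
	}
	// Sort numerically so neuron10 comes after neuron2.
	sort.Slice(out, func(i, j int) bool { return out[i].Index < out[j].Index })
	return out
}

// detectNeuron globs the device directory; on hosts without Neuron
// devices the glob matches nothing and the result is empty.
func detectNeuron() []GPUStatus {
	paths, err := filepath.Glob("/dev/neuron*")
	if err != nil {
		return nil
	}
	return neuronChips(paths)
}

func main() {
	chips := detectNeuron()
	fmt.Printf("%d Neuron device(s): %+v\n", len(chips), chips)
}
```

Keeping the path filtering in a pure function separate from the glob makes the regex easy to unit test without real devices.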
Review fixes:

- detectInstanceType: validate the IMDS body against an instance-type regex and cap the read, so a rogue HTTP 200 at the link-local IP (captive portal / on-prem metadata on a non-AWS host) can't store an oversized string that overflows varchar(100) and fails the whole heartbeat UPDATE. Also check the IMDSv2 token status. Adds a unit test for the validation.
- Cache the instance type after first success (it's immutable) so IMDS isn't hit every 30s heartbeat.
- Runners.tsx: prefer the per-accelerator vendor for Neuron (GPU_VENDOR env is empty on neuron hosts); fix 'AWS AWS Neuron' double-prefix when instance type is unavailable.
- Regenerate swagger for the new debug-DTO fields (swag v1.16.4).
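As an illustration of the first two review fixes, here is a minimal sketch of body validation, a capped read, and cache-after-first-success. The regex, the helper names (validateInstanceType, readInstanceType, instanceTypeCache) and the choice to reuse 100 as the byte cap are assumptions for this example; the PR's actual pattern and limits are not shown on this page.

```go
package main

import (
	"fmt"
	"io"
	"regexp"
	"strings"
	"sync"
)

// maxInstanceTypeLen mirrors the varchar(100) column mentioned above.
const maxInstanceTypeLen = 100

// instanceTypeRe is an assumed pattern for names such as "inf2.8xlarge".
var instanceTypeRe = regexp.MustCompile(`^[a-z][a-z0-9-]*\.[a-z0-9-]+$`)

// validateInstanceType trims the raw body and accepts it only if it is
// short enough and shaped like an instance type.
func validateInstanceType(raw string) (string, bool) {
	s := strings.TrimSpace(raw)
	if len(s) > maxInstanceTypeLen || !instanceTypeRe.MatchString(s) {
		return "", false
	}
	return s, true
}

// readInstanceType reads one byte past the cap so an oversized body is
// detected and rejected instead of being silently truncated.
func readInstanceType(body io.Reader) string {
	raw, err := io.ReadAll(io.LimitReader(body, maxInstanceTypeLen+1))
	if err != nil {
		return ""
	}
	s, _ := validateInstanceType(string(raw))
	return s
}

// instanceTypeCache remembers the first non-empty result; failures are
// not cached, so a later heartbeat can retry.
type instanceTypeCache struct {
	mu    sync.Mutex
	value string
}

func (c *instanceTypeCache) get(detect func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value == "" {
		c.value = detect()
	}
	return c.value
}

func main() {
	fmt.Printf("%q\n", readInstanceType(strings.NewReader("inf2.8xlarge\n")))
	fmt.Printf("%q\n", readInstanceType(strings.NewReader("<html>captive portal</html>")))

	cache := &instanceTypeCache{}
	detect := func() string { return readInstanceType(strings.NewReader("inf2.8xlarge")) }
	fmt.Printf("%q\n", cache.get(detect))
}
```

A mutex-guarded value is used rather than sync.Once so that an empty result (for example, IMDS briefly unreachable) does not permanently pin the cache to "unknown".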
Review pass (4 agents) + fixes applied. Verified good: no inference-router regression (neuron GPUs correctly fail NVIDIA constraints), NeuronCore counts fact-checked vs AWS docs, timeout cannot block the heartbeat loop (literal IP → no DNS; 1.5s cap; bounded by the existing 5s probe budget), data flow confirmed end-to-end. Fixed from review:
Deployment note: the detection runs in |
Improvement 1 of 2 (runner hardware visibility). On the admin Runners page, selecting a runner now shows its architecture so you can pick a compatible profile.
What it adds
- gpudetect: AWS Neuron device detection (glob /dev/neuron* → one GPUStatus per chip, vendor=neuron). Today neuron runners report an empty GPU list.
- sandbox-heartbeat: cloud instance type detection via IMDSv2 (1.5s timeout, fail-silent).
- instance_type on the heartbeat payload + sandbox_instances row (GORM AutoMigrate adds the column); debug DTO surfaces per-runner instance_type / gpu_vendor / gpus.
- Runners.tsx: Hardware card with instance type + accelerator inventory (derives Neuron family + NeuronCore count from the instance type, e.g. "1 × AWS Inferentia2 — 2 NeuronCores"; GPU model for NVIDIA/AMD).

Bare-metal safe (e.g. prime)
The NVIDIA nvidia-smi path is untouched. Neuron glob returns nothing on non-neuron hosts; IMDS fails silently off-AWS (no hang) → instance type just shows "Bare metal / unknown" and the GPU model still renders as today.

Validated
Live on inf2.8xlarge: IMDSv2 → inf2.8xlarge, glob → /dev/neuron0. Unit test for the neuron device regex. All changed Go packages build; Runners.tsx type-checks.

Next: Improvement 2 (explicit per-substrate runner placement).
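For reference, the fail-silent IMDSv2 lookup described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: fetchInstanceType and imdsRequest are hypothetical names, the 1.5s budget is assumed to be shared by both requests, and the 60-second token TTL and 256-byte read cap are arbitrary choices. The endpoints and header names are the standard IMDSv2 ones. The demo runs against a local fake server so it works off-AWS; a real caller would pass http://169.254.169.254.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// imdsTimeout is the overall budget for the token and metadata requests.
const imdsTimeout = 1500 * time.Millisecond

// imdsClient bypasses any configured HTTP proxy: the metadata service is
// link-local and should be reached directly.
var imdsClient = &http.Client{Transport: &http.Transport{Proxy: nil}}

// imdsRequest performs one request with a single header and returns the
// trimmed body, or "" on any error or non-200 status.
func imdsRequest(ctx context.Context, method, url, header, value string) string {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return ""
	}
	req.Header.Set(header, value)
	resp, err := imdsClient.Do(req)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return ""
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 256))
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(body))
}

// fetchInstanceType runs the IMDSv2 flow: PUT for a session token, then
// GET the instance type with that token. Any failure yields "".
func fetchInstanceType(baseURL string) string {
	ctx, cancel := context.WithTimeout(context.Background(), imdsTimeout)
	defer cancel()

	token := imdsRequest(ctx, http.MethodPut, baseURL+"/latest/api/token",
		"X-aws-ec2-metadata-token-ttl-seconds", "60")
	if token == "" {
		return ""
	}
	return imdsRequest(ctx, http.MethodGet, baseURL+"/latest/meta-data/instance-type",
		"X-aws-ec2-metadata-token", token)
}

func main() {
	// Fake IMDS for demonstration only.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.Method == http.MethodPut && r.URL.Path == "/latest/api/token":
			fmt.Fprint(w, "demo-token")
		case r.URL.Path == "/latest/meta-data/instance-type" &&
			r.Header.Get("X-aws-ec2-metadata-token") == "demo-token":
			fmt.Fprint(w, "inf2.8xlarge")
		default:
			w.WriteHeader(http.StatusUnauthorized)
		}
	}))
	defer fake.Close()

	fmt.Printf("fake IMDS:   %q\n", fetchInstanceType(fake.URL))
	fmt.Printf("unreachable: %q\n", fetchInstanceType("http://127.0.0.1:1"))
}
```

Because every failure path returns an empty string within the context deadline, a bare-metal or non-AWS host simply reports no instance type instead of stalling the heartbeat.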
🤖 Generated with Claude Code