Commit c430fe2

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-gb200-sweep

# Conflicts:
#   .github/configs/nvidia-master.yaml
#   perf-changelog.yaml
2 parents 0448353 + 86e7761 commit c430fe2

45 files changed

Lines changed: 2567 additions & 90 deletions

.claude/commands/recover-failed-ingest.md

Lines changed: 70 additions & 10 deletions
@@ -1,6 +1,6 @@
 ---
 description: Recover a failed main-branch sweep ingest through the normal artifact-reuse path without rerunning GPU benchmarks
-argument-hint: <failed-run-or-job-url> [source-run-id]
+argument-hint: <failed-run-or-job-url | pr-number> [source-run-id]
 ---
 
 Recover the official database ingest for a failed or skipped InferenceX
@@ -13,6 +13,12 @@ Inputs from `$ARGUMENTS`:
 - Use the optional second argument as `SOURCE_RUN_ID`; treat it as a candidate
   until all source, ancestry, scope, and artifact checks pass.
 
+The most common invocation is a forgotten `/reuse-sweep-run` before merge, where
+you are handed the original PR number and/or its `pull_request` sweep run (the
+source) rather than a target URL. The failed target is then the push-to-main run
+on that PR's merge commit — derive it in step 1. `inspect-target` needs a
+run/job URL, not a bare ID.
+
 Run from a clean InferenceX checkout with authenticated `gh`, `git`, `jq`, and
 `python3`. Stop on any unexpected command failure.
 
@@ -50,10 +56,13 @@ Run from a clean InferenceX checkout with authenticated `gh`, `git`, `jq`, and
 
 ## 1. Inspect the target
 
-Install helper dependencies and inspect the target:
+Ensure `pydantic` and `pyyaml` are importable
+(`python3 -c 'import pydantic, yaml'`); they are usually already present. If not,
+install them — a plain `pip install` fails on PEP 668 managed Pythons, so use a
+venv or `--break-system-packages`. Then inspect the target:
 
 ```bash
-python3 -m pip install pydantic pyyaml
+python3 -m pip install pydantic pyyaml # only if the import check failed
 python3 utils/recover_failed_ingest.py inspect-target \
   "$FAILED_RUN_OR_JOB_URL" \
   --output /tmp/infx-recovery-target.json
@@ -83,9 +92,33 @@ ORIGINAL_PR=$(gh api \
   --jq 'if length == 1 then .[0].number else error("expected one PR") end')
 ```
 
-Require event `push`, status `completed`, workflow path
-`.github/workflows/run-sweep.yml`, branch `main`, and a failed or skipped state
-that explains the missing ingest. Record the original PR and root cause.
+If you were given the original PR number or the source sweep run instead of a
+target URL — the usual forgotten-`/reuse` case — derive the target push run from
+the PR's merge commit:
+
+```bash
+ORIGINAL_PR=<pr-number>
+ORIGINAL_MERGE_SHA=$(gh pr view "$ORIGINAL_PR" \
+  --repo SemiAnalysisAI/InferenceX \
+  --json mergeCommit --jq .mergeCommit.oid)
+gh run list --repo SemiAnalysisAI/InferenceX \
+  --workflow run-sweep.yml --event push \
+  --commit "$ORIGINAL_MERGE_SHA" --limit 5 \
+  --json databaseId,status,conclusion,createdAt
+TARGET_RUN_ID=<matching-run-id>
+```
+
+Require event `push`, workflow path `.github/workflows/run-sweep.yml`, and branch
+`main`; confirm the target is no longer running before recovering. The
+disqualifying state is broader than `failure`/`skipped`: when `/reuse-sweep-run`
+was forgotten before merge, `reuse-ingest-artifacts` is skipped, the GPU jobs run
+(often `cancelled` to save cost), and because `collect-results`/`collect-evals`
+are not skipped, `trigger-ingest` still fires `always()` and lands a *bogus*
+ingest under the target's own `run_id`. So a target showing
+`trigger-ingest=success` (and concluding `success` or `cancelled`) can still hold
+no valid benchmark data — recovery is required. That bogus row is keyed on the
+target `run_id` and is superseded by the recovery ingest under a new `run_id`;
+leave it alone. Record the original PR and root cause.
 
 Fetch history and inspect the exact original changelog delta:
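Editor's sketch for the hunk above: the added paragraph says a target can report `trigger-ingest=success` and still hold no valid data. The Python below is one hypothetical way to flag that state from job conclusions. It is not part of `utils/recover_failed_ingest.py`. The function name, the returned labels, and the assumption that job names match the runbook text exactly (no matrix suffixes) are all illustrative. The input shape is what `gh run view <run-id> --json jobs --jq .jobs` returns.

```python
from typing import Iterable, Mapping, Optional

# Job names are copied from the runbook text; exact-name matching is an assumption.
REUSE_JOB = "reuse-ingest-artifacts"
INGEST_JOB = "trigger-ingest"


def classify_target(jobs: Iterable[Mapping[str, Optional[str]]]) -> str:
    """Classify a finished push-to-main run from its job conclusions.

    Each job is a mapping with at least "name" and "conclusion".
    Returns one of the hypothetical labels:
    "missing-ingest", "suspect-ingest", or "ingested".
    """
    jobs = list(jobs)
    conclusions = {job["name"]: job.get("conclusion") for job in jobs}

    # Without a successful trigger-ingest, nothing was ingested for this run.
    if conclusions.get(INGEST_JOB) != "success":
        return "missing-ingest"

    # Ingest fired, reuse was skipped, and some other job was cancelled or
    # failed: the ingested row may contain no valid benchmark data.
    reuse_skipped = conclusions.get(REUSE_JOB) == "skipped"
    unfinished = any(
        job.get("conclusion") in {"cancelled", "failure"}
        for job in jobs
        if job["name"] not in {REUSE_JOB, INGEST_JOB}
    )
    if reuse_skipped and unfinished:
        return "suspect-ingest"

    return "ingested"
```

This is a heuristic only: a `suspect-ingest` label should prompt a manual look at the run, not an automatic recovery.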

@@ -249,7 +282,9 @@ Confirm the generated config contains only the intended recovery scope. Its row
 counts may differ from the source run.
 
 Download only the result artifacts needed for local validation. This avoids the
-large server-log artifacts retained in the official ingest bundle:
+large server-log artifacts retained in the official ingest bundle. Raw per-config
+`bmk_<model>_*` artifacts are intentionally not selected — they fall through the
+`case` below; the aggregate `results_bmk` is what the validator reads:
 
 ```bash
 rm -rf /tmp/source-artifacts
@@ -351,6 +386,21 @@ gh pr checks "$RECOVERY_PR" \
   --watch --fail-fast
 ```
 
+`reuse-sweep-gate` appears only once the `pull_request` `run-sweep.yml` run for
+the new head SHA registers; immediately after pushing, `gh pr checks` may list
+only CodeQL/`check-changelog`/`comment`. Confirm that run exists and carries
+`reuse-sweep-gate` before trusting a green result, or watch it directly:
+
+```bash
+gh run list --repo SemiAnalysisAI/InferenceX \
+  --workflow run-sweep.yml --event pull_request \
+  --branch "$BRANCH" --limit 5 \
+  --json databaseId,status,conclusion,headSha
+```
+
+On the PR (`pull_request`) gate, `setup` is itself skipped and `reuse-sweep-gate`
+does the validation; `setup` only runs on the push-to-main run in step 8.
+
 ## 8. Merge and verify official ingest
 
 Keep the verified carrier commit as the PR head through merge. This repository
@@ -398,18 +448,28 @@ The push-to-main `Run Sweep` must:
 - run `trigger-ingest`.
 
 Then locate the resulting `repository_dispatch` run in
-`SemiAnalysisAI/InferenceX-app`:
+`SemiAnalysisAI/InferenceX-app`. In the forgotten-`/reuse` case the target's
+bogus ingest is also a recent successful `ingest-results` run, so do not pick by
+recency — pick the run whose `Download artifacts from InferenceX run` step logs
+`RUN_ID: <RECOVERY_RUN_ID>`:
 
 ```bash
 gh run list --repo SemiAnalysisAI/InferenceX-app \
   --workflow "Ingest Benchmark Results" \
-  --event repository_dispatch --limit 10
+  --event repository_dispatch --limit 10 \
+  --json databaseId,status,conclusion,createdAt
+
+INGEST_RUN_ID=<candidate-run-id>
+gh run view "$INGEST_RUN_ID" --repo SemiAnalysisAI/InferenceX-app --log \
+  | grep -m1 "RUN_ID: $RECOVERY_RUN_ID" # must match before you trust this run
 
-INGEST_RUN_ID=<run-triggered-by-recovery>
 gh run watch "$INGEST_RUN_ID" \
   --repo SemiAnalysisAI/InferenceX-app --exit-status
 ```
 
+The ingest's first step is a `sleep 300` "wait for source run to finish", so the
+run idles ~5 minutes before doing work — that is normal, not a hang.
+
 Verify its logs identify `RECOVERY_RUN_ID` as the trigger and `SOURCE_RUN_ID`
 plus `SOURCE_RUN_ATTEMPT` as the reused source. Require successful artifact
 download, flattening, database ingest, run overrides, database verification,
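
Editor's sketch for the hunk above: the added `grep -m1 "RUN_ID: $RECOVERY_RUN_ID"` is a substring match, so it would also accept a line such as `SOURCE_RUN_ID: <same id>` or a longer ID that merely starts with the recovery ID. The Python below shows a stricter match under stated assumptions. The `RUN_ID: <id>` log format comes from the runbook text. Whether `SOURCE_RUN_ID:` lines share that format is an assumption, and both function names are hypothetical. The log text would come from `gh run view "$INGEST_RUN_ID" --log`.

```python
import re
from typing import Mapping, Union

# Match "RUN_ID: <digits>" only when RUN_ID is not the tail of a longer
# identifier such as SOURCE_RUN_ID.
_RUN_ID_LINE = re.compile(r"(?<![A-Za-z0-9_])RUN_ID: (\d+)")


def log_names_run(log_text: str, run_id: Union[int, str]) -> bool:
    """Return True if the log contains a RUN_ID line for exactly `run_id`."""
    wanted = str(run_id)
    return any(m.group(1) == wanted for m in _RUN_ID_LINE.finditer(log_text))


def pick_ingest_run(logs_by_run: Mapping[int, str], recovery_run_id: Union[int, str]) -> int:
    """Pick the single candidate ingest run whose log names the recovery run.

    `logs_by_run` maps candidate ingest run IDs to their full log text.
    Raises ValueError unless exactly one candidate matches, so recency is
    never used as a tie-breaker.
    """
    matches = [rid for rid, text in logs_by_run.items()
               if log_names_run(text, recovery_run_id)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching ingest run, got {matches}")
    return matches[0]
```

Raising on zero or multiple matches keeps the choice explicit, in line with the runbook's instruction not to pick by recency.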

.github/configs/amd-master.yaml

Lines changed: 19 additions & 0 deletions
@@ -2612,6 +2612,25 @@ minimaxm3-fp4-mi355x-atom:
     - { tp: 4, conc-start: 1, conc-end: 256 }
     - { tp: 8, conc-start: 1, conc-end: 2 }
 
+minimaxm3-fp8-mi355x-atom-mtp:
+  image: rocm/atom-dev:MiniMax-M3-20260622
+  model: MiniMaxAI/MiniMax-M3-MXFP8
+  model-prefix: minimaxm3
+  runner: mi355x
+  precision: fp8
+  framework: atom
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 4, conc-start: 1, conc-end: 256, spec-decoding: mtp }
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - { tp: 4, conc-start: 1, conc-end: 256, spec-decoding: mtp }
+
 # MiniMax-M3 MXFP8 MI300X day-zero recipe. Reuse the dedicated ROCm image and
 # MI355X serving shape, but retain the default BF16 KV cache because this
 # checkpoint lacks calibrated ROCm FP8 attention scales. Use the TP8-only H100
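
Editor's sketch for the hunk above: the new `minimaxm3-fp8-mi355x-atom-mtp` entry defines two fixed-sequence-length scenarios, each with one search-space row. The Python below walks a dict that mirrors the entry and lists the resulting points. The repository's real sweep generator is not shown in this diff, so `SearchPoint` and `iter_search_points` are hypothetical names, and the sketch makes no claim about how concurrency is stepped between `conc-start` and `conc-end`.

```python
from typing import Any, Iterator, Mapping, NamedTuple, Optional


class SearchPoint(NamedTuple):
    isl: int
    osl: int
    tp: int
    conc_start: int
    conc_end: int
    spec_decoding: Optional[str]


# Dict mirror of the added YAML entry (values copied from the diff).
ENTRY = {
    "image": "rocm/atom-dev:MiniMax-M3-20260622",
    "model": "MiniMaxAI/MiniMax-M3-MXFP8",
    "model-prefix": "minimaxm3",
    "runner": "mi355x",
    "precision": "fp8",
    "framework": "atom",
    "multinode": False,
    "scenarios": {
        "fixed-seq-len": [
            {"isl": 1024, "osl": 1024, "search-space": [
                {"tp": 4, "conc-start": 1, "conc-end": 256, "spec-decoding": "mtp"}]},
            {"isl": 8192, "osl": 1024, "search-space": [
                {"tp": 4, "conc-start": 1, "conc-end": 256, "spec-decoding": "mtp"}]},
        ]
    },
}


def iter_search_points(entry: Mapping[str, Any]) -> Iterator[SearchPoint]:
    """Yield one SearchPoint per (scenario, search-space row) pair."""
    for scenario in entry["scenarios"]["fixed-seq-len"]:
        for space in scenario["search-space"]:
            yield SearchPoint(
                isl=scenario["isl"],
                osl=scenario["osl"],
                tp=space["tp"],
                conc_start=space["conc-start"],
                conc_end=space["conc-end"],
                # Rows without the key (as in the non-MTP entry) yield None.
                spec_decoding=space.get("spec-decoding"),
            )
```

Listing the points this way makes it easy to confirm that every row of the MTP entry carries `spec-decoding: mtp`.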
