fix: re-enable Torch-TensorRT model generation for SM 12.1 #8860

Merged
Vinya567 merged 2 commits into main from vinyak/verify-torchtrt-spark-sm121-l0
Jun 27, 2026

Conversation

@Vinya567

Contributor

What does the PR do?

Re-enables Torch-TensorRT QA model generation on devices with compute capability
12.1 (NVIDIA GB10 / DGX Spark) by removing a temporary skip in
qa/common/gen_qa_model_repository that was added when TensorRT lacked the
required convolution kernels for SM 12.1.

The skip is no longer needed: TensorRT 10.16.1.11 (shipped in the current
pytorch:26.05 base image) generates the required kernels successfully, so
torchtrt_model_store/resnet50_libtorch/1/model.pt builds end-to-end on real
GB10 hardware. Without this change the downstream test
L0_libtorch_torchtrt_image_models--PyTorch--DGX-Spark fails because the
model store is empty.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging
  • All template sections are filled out.

Commit Type:

  • fix

Related PRs:

  • A follow-up MR will be opened on the internal GitLab to remove
    allow_failure: true from the DGX Spark L0 test once a master nightly
    confirms it is green.

Where should the reviewer start?

  • qa/common/gen_qa_model_repository — single-line change removing the
    nvidia-smi --query-gpu=compute_cap | grep -qz 12.1 && echo WARNING || ...
    guard that previously skipped Torch-TRT model generation on SM 12.1.
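
The effect of the removed guard can be sketched as follows. This is a minimal Python illustration, not the shell code in qa/common/gen_qa_model_repository: the function name `should_skip_torchtrt` and the sample strings are hypothetical, and the match is approximated with a plain substring test (grep treats `.` as a wildcard, so the original pattern was slightly looser).

```python
def should_skip_torchtrt(compute_cap_output: str) -> bool:
    """Return True if the removed guard would have skipped Torch-TRT generation.

    compute_cap_output stands in for the text printed by the nvidia-smi
    compute_cap query; it is passed in so the sketch runs without a GPU.
    """
    # Approximation of `grep -qz 12.1`: look for the SM 12.1 marker anywhere.
    return "12.1" in compute_cap_output


# Hypothetical sample outputs, for illustration only.
for sample in ("12.1\n", "9.0\n", "8.9\n"):
    action = "skip with warning" if should_skip_torchtrt(sample) else "build"
    print(f"{sample.strip()}: {action}")
```

With the guard removed, every compute capability takes the "build" path, which is why the change only alters behavior on SM 12.1.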

Test plan:

  1. Verified on a real DGX Spark CI runner (SM 12.1 / GB10) that
    gen_qa_torchtrt_models.py produces a valid resnet50_libtorch/1/model.pt.
  2. Ran the consuming test L0_libtorch_torchtrt_image_models--PyTorch--DGX-Spark
    end-to-end on the same hardware against the freshly generated model store.
  3. Tritonserver loaded resnet50_libtorch, image_client.py produced the
    expected classification, test output: *** Test Passed ***.
  • CI Pipeline ID: 55894843
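
A quick sanity check in the spirit of step 1 might look like the sketch below. It assumes that a minimal notion of "valid" is that the artifact exists and is non-empty; the helper name `model_artifact_present` is hypothetical, and the store root is supplied by the caller. It does not load the model and is not a substitute for the end-to-end test in steps 2 and 3.

```python
from pathlib import Path


def model_artifact_present(store_root: str,
                           model: str = "resnet50_libtorch",
                           version: str = "1") -> bool:
    """Check that <store_root>/<model>/<version>/model.pt exists and is non-empty."""
    artifact = Path(store_root) / model / version / "model.pt"
    return artifact.is_file() and artifact.stat().st_size > 0
```

An empty model store, which is what the downstream test saw while the skip was in place, makes this check return False.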

Caveats:

  • This change is a no-op on every compute capability other than SM 12.1
    (the guard only matched 12.1).
  • Only the DGX Spark GenModels-build job's behavior changes (from
    "skip with warning" to "build"); the only downstream consumer of the
    generated artifact is L0_libtorch_torchtrt_image_models.

Background

The SM 12.1 skip was introduced as a temporary workaround for an upstream
TensorRT kernel gap on Blackwell GB10. That gap has since been resolved in
the TensorRT version pulled by the current PyTorch container, so the
workaround is no longer needed.

Related Issues:

  • N/A (no public GitHub issue; tracked internally).

@Vinya567 Vinya567 added the PR: fix A bug fix label Jun 26, 2026
@Vinya567 Vinya567 requested review from mc-nv, whoisj and yinggeh June 26, 2026 16:54
@Vinya567 Vinya567 marked this pull request as ready for review June 26, 2026 17:36
mc-nv previously approved these changes Jun 26, 2026
whoisj previously approved these changes Jun 26, 2026
Comment thread qa/common/gen_qa_model_repository
@Vinya567 Vinya567 dismissed stale reviews from whoisj and mc-nv via 4369983 June 26, 2026 18:14
@Vinya567 Vinya567 requested review from whoisj and yinggeh June 26, 2026 18:24
@Vinya567 Vinya567 merged commit 5e885a1 into main Jun 27, 2026
3 checks passed
@Vinya567 Vinya567 deleted the vinyak/verify-torchtrt-spark-sm121-l0 branch June 27, 2026 17:38

Labels

PR: fix A bug fix

4 participants