[cuda] GGUF Q6_K real packed INT6 (W6A8 dp4a) + GGUF CI export by Gasoonjia · Pull Request #20229 · pytorch/executorch

Gasoonjia · 2026-06-12T04:49:56Z

Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend, parallel to the int4/int8 plain_mm paths:

- int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
- CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
- int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
- backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
- route GGUF Q6_K -> CudaPackedInt6Tensor (gguf_loader, pack_cuda, dequantize_weight)
- tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts
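The packing, W6A8 arithmetic and routing listed above can be modelled in plain Python. This is a minimal, hypothetical sketch, not the PR's kernel. The ql/qh bit layout (two 4-bit low nibbles per ql byte, four 2-bit high pairs per qh byte), the group size of 16, and every function name (`quantize_group`, `pack_int6`, `unpack_int6`, `dp4a`, `quantize_act_int8`, `w6a8_dot`, `choose_path`) are assumptions for illustration. `dp4a` emulates CUDA's `__dp4a` intrinsic (four int8 products added to an int32 accumulator) on Python integers, and the `M<=4` threshold mirrors the routing stated in the list.

```python
# Hypothetical sketch of a W6A8 int6 linear path (NOT the PR's implementation).
# Assumed layout: ql packs the low 4 bits of two consecutive weights per byte,
# qh packs the high 2 bits of four consecutive weights per byte, and every
# GROUP consecutive weights share one float scale (bf16 in the PR).

GROUP = 16   # assumed group size; Q6_K itself uses 16-weight sub-blocks
OFFSET = 32  # stored 6-bit code q in [0, 63] represents the signed value q - 32


def quantize_group(w):
    """Symmetric int6 quantization of one group: max |w| maps to 31."""
    amax = max(abs(v) for v in w)
    scale = amax / 31 if amax > 0 else 1.0
    codes = [max(-32, min(31, round(v / scale))) + OFFSET for v in w]
    return codes, scale


def pack_int6(codes):
    """Split 6-bit codes into a 4-bit ql plane and a 2-bit qh plane."""
    assert len(codes) % 4 == 0, "need a multiple of 4 codes"
    ql = bytearray(len(codes) // 2)
    qh = bytearray(len(codes) // 4)
    for i, q in enumerate(codes):
        ql[i // 2] |= (q & 0x0F) << (4 * (i % 2))
        qh[i // 4] |= ((q >> 4) & 0x03) << (2 * (i % 4))
    return bytes(ql), bytes(qh)


def unpack_int6(ql, qh, n):
    """Recombine both planes and remove the offset: values in [-32, 31]."""
    out = []
    for i in range(n):
        lo = (ql[i // 2] >> (4 * (i % 2))) & 0x0F
        hi = (qh[i // 4] >> (2 * (i % 4))) & 0x03
        out.append((lo | (hi << 4)) - OFFSET)
    return out


def dp4a(a4, b4, acc):
    """Emulate CUDA __dp4a: acc plus four int8 x int8 products, in integers."""
    return acc + sum(x * y for x, y in zip(a4, b4))


def quantize_act_int8(x):
    """Per-row symmetric int8 activation quantization (the 'A8' half)."""
    amax = max(abs(v) for v in x)
    s = amax / 127 if amax > 0 else 1.0
    return [max(-127, min(127, round(v / s))) for v in x], s


def w6a8_dot(ql, qh, scales, x):
    """One output element: integer dp4a accumulation per group, then rescale."""
    n = len(x)
    assert n % GROUP == 0 and GROUP % 4 == 0
    w = unpack_int6(ql, qh, n)
    xq, sx = quantize_act_int8(x)
    total = 0.0
    for g in range(n // GROUP):
        acc = 0  # stands in for the int32 accumulator on the GPU
        for k in range(g * GROUP, (g + 1) * GROUP, 4):
            acc = dp4a(w[k:k + 4], xq[k:k + 4], acc)
        total += acc * scales[g]  # per-group weight scale applied once
    return total * sx  # per-row activation scale applied once


def choose_path(m, threshold=4):
    """Routing as described in the list: small M uses the packed kernel."""
    return "int6_plain_mm" if m <= threshold else "dequant"


if __name__ == "__main__":
    w = [((k * 13) % 29 - 14) / 7.0 for k in range(64)]
    x = [((k * 7) % 11 - 5) / 5.0 for k in range(64)]
    codes, scales = [], []
    for g in range(0, len(w), GROUP):
        c, s = quantize_group(w[g:g + GROUP])
        codes += c
        scales.append(s)
    ql, qh = pack_int6(codes)
    print(len(ql), len(qh))  # 32 16: 64 weights in 48 bytes, i.e. 6 bits each
    print(sum(a * b for a, b in zip(w, x)), w6a8_dot(ql, qh, scales, x))
```

Removing the -32 offset before the dot product is what makes dp4a usable here: every weight then fits a signed int8 lane, and the float scales are applied once per group rather than once per weight.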

CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it), then run the inference sanity check and the export through the GGUF loader (--gguf) instead of from the prequantized HF checkpoint.

Signed-off-by: gasoonjia <gasoonjia@icloud.com>

pytorch-bot · 2026-06-12T04:50:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20229

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 19 Pending, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 7ad0a7e with merge base a79f3e4:

NEW FAILURES - The following jobs have failed:

Test CUDA Builds / unittest-cuda / linux-job (gh)
backends/cuda/tests/test_int4_dispatch.py::TestMultiLayer::test_two_layer_mlp
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh)
Process completed with exit code 1.
trunk / test-models-macos-coreml (mobilebert) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-mps / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

trunk / test-models-macos-cpu (llama3_2_vision_encoder, portable) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (mobilebert, xnnpack-quantization-delegation) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

trunk / test-models-macos-coreml (mv3) / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label Jun 12, 2026

Gasoonjia wants to merge 1 commit into g4-opt-prefill-window-sdpa from g4-int6-gguf