[cuda] GGUF Q6_K real packed INT6 (W6A8 dp4a) + GGUF CI export #20229

Draft
Gasoonjia wants to merge 1 commit into g4-opt-prefill-window-sdpa from g4-int6-gguf

@Gasoonjia (Contributor)

Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend, parallel to the int4/int8 plain_mm paths:

  • int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
  • CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
  • int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
  • backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
  • route GGUF Q6_K -> CudaPackedInt6Tensor (gguf_loader, pack_cuda, dequantize_weight)
  • tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts
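
The ql/qh split with a -32 offset named in the list above can be illustrated with a small pure-Python round trip. This is only a sketch under stated assumptions, not the layout used by `CudaPackedInt6Tensor`: the byte ordering (two low nibbles per `ql` byte, four 2-bit fields per `qh` byte), the group size, and the helper names `pack_int6`, `unpack_int6`, and `dequantize` are all hypothetical.

```python
from typing import List, Tuple

OFFSET = 32  # symmetric offset: stored 6-bit code = signed value + 32


def pack_int6(values: List[int]) -> Tuple[bytes, bytes]:
    """Pack signed values in [-32, 31] into a low-nibble plane (ql)
    and a high-2-bit plane (qh). Layout here is an assumption."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    ql = bytearray(len(values) // 2)  # 2 low nibbles per byte
    qh = bytearray(len(values) // 4)  # 4 two-bit fields per byte
    for i, v in enumerate(values):
        if not -32 <= v <= 31:
            raise ValueError(f"value {v} does not fit in signed 6 bits")
        code = v + OFFSET  # unsigned code in 0..63
        ql[i // 2] |= (code & 0x0F) << (4 * (i % 2))
        qh[i // 4] |= (code >> 4) << (2 * (i % 4))
    return bytes(ql), bytes(qh)


def unpack_int6(ql: bytes, qh: bytes) -> List[int]:
    """Inverse of pack_int6: rebuild each 6-bit code, then remove the offset."""
    out = []
    for i in range(len(ql) * 2):
        lo = (ql[i // 2] >> (4 * (i % 2))) & 0x0F
        hi = (qh[i // 4] >> (2 * (i % 4))) & 0x03
        out.append(((hi << 4) | lo) - OFFSET)
    return out


def dequantize(ql: bytes, qh: bytes, scales: List[float], group_size: int) -> List[float]:
    """Symmetric dequant: one scale per group, no zero-point tensor."""
    q = unpack_int6(ql, qh)
    return [scales[i // group_size] * v for i, v in enumerate(q)]
```

Eight weights occupy 4 bytes of `ql` plus 2 bytes of `qh`, i.e. 6 bits per weight, which is the storage saving a genuine packed path gives over widening each value to int8.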

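The M-based routing and the W6A8 integer dot product mentioned in the list can be emulated on the CPU to make the idea concrete. The sketch below is hedged: only the threshold (M<=4) and the op name come from the description, while the per-tensor int8 activation quantization, the group layout, and every function name are assumptions for illustration. `dp4a` mimics the semantics of the CUDA `__dp4a` intrinsic (a 4-way int8 dot product accumulated into a 32-bit integer).

```python
from typing import List, Tuple

DECODE_M_THRESHOLD = 4  # from the PR description: M<=4 uses the packed op


def choose_path(m: int) -> str:
    """Route small-M (decode-like) calls to the packed kernel, the rest to dequant."""
    return "executorch_cuda::int6_plain_mm" if m <= DECODE_M_THRESHOLD else "dequant"


def quantize_activations_int8(x: List[float]) -> Tuple[List[int], float]:
    """Assumed scheme: symmetric per-tensor int8 quantization of activations."""
    amax = max(abs(v) for v in x)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def dp4a(a4: List[int], b4: List[int], acc: int) -> int:
    """Emulate a 4-way int8 dot product with integer accumulate."""
    return acc + sum(a * b for a, b in zip(a4, b4))


def w6a8_dot(w_q: List[int], w_scales: List[float], group_size: int,
             x_q: List[int], x_scale: float) -> float:
    """Dot product of signed 6-bit weights with int8 activations.
    Integer accumulation per group, then one float multiply per group scale."""
    total = 0.0
    for g in range(0, len(w_q), group_size):
        acc = 0
        for k in range(g, g + group_size, 4):
            acc = dp4a(w_q[k:k + 4], x_q[k:k + 4], acc)
        total += acc * w_scales[g // group_size]
    return total * x_scale
```

Keeping the inner loop in integers and applying the scales once per group is what lets a small-M matmul stay on the packed weights, whereas for larger M the description falls back to dequantizing the weight first.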
CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it) and run the inference sanity check + export via the GGUF loader (--gguf) instead of the prequantized HF checkpoint.

Signed-off-by: gasoonjia <gasoonjia@icloud.com>
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20229

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 19 Pending, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 7ad0a7e with merge base a79f3e4 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 12, 2026