Commit ae4b0a4

[cuda] int4: stabilize two-layer decode test via CUDA-seeded init (#20196)
_make_int4_linear built the throwaway nn.Linear on CPU, so reset_parameters() drew from the CPU RNG between the two layer constructions and shifted the stream that seeds the quantized weights. That pushed test_two_layer_mlp's genuine INT4 error from 0.1405 to 0.1556, crossing the 0.15 bound. Build the module with device=cuda so init draws from the CUDA RNG, leaving the CPU stream (and the measured error) deterministic. Test-only; dequant math is unchanged.
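The mechanism described above can be checked in isolation. The following is a minimal sketch, not code from the test suite: the helper `draw_after` and the 8x8 layer size are hypothetical, and the layer uses the default float32 dtype rather than bfloat16 to keep the example portable. It assumes only that `nn.Linear` initializes its parameters with the RNG of the device it is constructed on, which is what the commit message relies on.

```python
import torch
from torch import nn


def draw_after(build):
    """Reseed all RNGs, run build(), then return the next CPU random draw.

    `draw_after` is a hypothetical helper for this illustration only.
    """
    torch.manual_seed(0)  # seeds the CPU generator (and CUDA generators, if any)
    build()
    return torch.randn(4)  # drawn from the CPU generator


# Reference: nothing happens between seeding and the draw.
baseline = draw_after(lambda: None)

# A CPU-constructed Linear runs reset_parameters() against the CPU generator,
# so the draw that follows comes from a shifted position in the stream.
after_cpu_init = draw_after(lambda: nn.Linear(8, 8))
cpu_stream_shifted = not torch.equal(baseline, after_cpu_init)

# A CUDA-constructed Linear draws its init values from the CUDA generator,
# so the CPU stream should be left where it was. Skipped without a GPU.
cpu_stream_preserved = None
if torch.cuda.is_available():
    after_cuda_init = draw_after(lambda: nn.Linear(8, 8, device="cuda"))
    cpu_stream_preserved = torch.equal(baseline, after_cuda_init)

print("CPU init shifts CPU stream:", cpu_stream_shifted)
print("CUDA init preserves CPU stream:", cpu_stream_preserved)
```

On a machine without CUDA the second check is skipped and `cpu_stream_preserved` stays `None`; the first check alone shows why an extra CPU-side construction between two seeded draws changes the values that follow.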
1 parent 4519036 commit ae4b0a4

1 file changed

Lines changed: 4 additions & 1 deletion

backends/cuda/tests/test_int4_dispatch.py

@@ -59,7 +59,10 @@ def _make_int4_linear(N, K, group_size=128, symmetric=False, bias=False):
     )
     int4_w = quantize_weight(w_bf16, config)
 
-    module = nn.Linear(K, N, bias=bias, dtype=torch.bfloat16)
+    # device="cuda" so the random init draws from the CUDA RNG to match the
+    # same random weight as regular int4 dispatch and fit the same numerical
+    # error tolerance.
+    module = nn.Linear(K, N, bias=bias, dtype=torch.bfloat16, device="cuda")
     pack_linear_for_cuda(module, {"weight": int4_w})
     module.cuda()
     return module, w_bf16.cuda()
