System Info
- TensorRT-LLM version: 1.2.0
- Model: LLaMA-3.1-8B-Instruct, LLaMA-3.2-3B-Instruct
- Checkpoint: produced by NVIDIA modelopt, kv_cache_quant_algo: "FP8" in config
- GPU: NVIDIA L4 (SM 89)
Who can help?
No response
Information
Tasks
Reproduction
Checkpoint config.json:
{ "quantization": { "quant_algo": "FP8", "kv_cache_quant_algo": "FP8" } }
Build:
trtllm-build --checkpoint_dir --kv_cache_type paged
Note: passing --use_fp8_context_fmha disable is silently overridden — builder.py forces it back to True whenever kv_cache_quant_algo=FP8 and use_paged_context_fmha=True, logging: "FP8 Context FMHA is enabled to support FP8 Paged Context FMHA."
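The override behavior described above can be sketched as follows. This is an illustrative reconstruction of the logic as reported, not the actual builder.py code; the function and parameter names are assumptions:

```python
# Sketch of the reported override: builder.py forces FP8 context FMHA
# back on whenever the KV cache is FP8-quantized and paged context FMHA
# is in use, even if the user passed --use_fp8_context_fmha disable.
# (Illustrative only; names here are assumptions, not TRT-LLM's API.)

def resolve_fp8_context_fmha(kv_cache_quant_algo: str,
                             use_paged_context_fmha: bool,
                             use_fp8_context_fmha: bool) -> bool:
    if kv_cache_quant_algo == "FP8" and use_paged_context_fmha:
        if not use_fp8_context_fmha:
            # Matches the log line quoted above.
            print("FP8 Context FMHA is enabled to support "
                  "FP8 Paged Context FMHA.")
        return True
    return use_fp8_context_fmha

# The user's `disable` is silently overridden:
assert resolve_fp8_context_fmha("FP8", True, False) is True
```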
Expected behavior
Correct model output, same as TRT-LLM 1.1.0 with the same checkpoint.
actual behavior
Most prompts produce garbage / degenerate output (e.g. ##.##.##.##..., repetitive loops). Some very short prompts happen to produce correct output.
additional notes
The relevant Python code in builder.py and modeling_utils.py is byte-for-byte identical between 1.1.0 and 1.2.0. Both versions force use_fp8_context_fmha=True and both synthesize attention_output_orig_quant_scale the same way:
scale = [1.0] / layer.dense.activation_scaling_factor.raw_value
layer.attention_output_orig_quant_scale = Parameter(value=scale.astype(np.float32), ...)
Modelopt checkpoints do not include attention_output_orig_quant_scale natively; TRT-LLM synthesizes it from dense.activation_scaling_factor. Since the Python path is identical, the regression must lie in the compiled attention kernel.
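The scale synthesis can be reproduced standalone with numpy. This is a minimal sketch mirroring the snippet above; the input value is a stand-in, not a real checkpoint tensor:

```python
# Standalone reproduction of the scale synthesis described above.
# The scaling-factor value is a made-up stand-in for
# layer.dense.activation_scaling_factor.raw_value.
import numpy as np

activation_scaling_factor = np.array([0.025], dtype=np.float32)

# attention_output_orig_quant_scale is the reciprocal of the
# dense-input activation scale (list / ndarray broadcasts in numpy):
scale = [1.0] / activation_scaling_factor
scale = scale.astype(np.float32)
```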
Before submitting a new issue...