System Info
- TensorRT-LLM version: 1.2.0
- Model: LLaMA-3.1-8B-Instruct, LLaMA-3.2-3B-Instruct
- Checkpoint: produced by NVIDIA modelopt, kv_cache_quant_algo: "FP8" in config
- GPU: NVIDIA L4 (SM 89)
Who can help?
No response
Information
Tasks
Reproduction
Checkpoint config.json:
{ "quantization": { "quant_algo": "FP8", "kv_cache_quant_algo": "FP8" } }
Build:
trtllm-build --checkpoint_dir --kv_cache_type paged
Note: passing --use_fp8_context_fmha disable is silently overridden — builder.py forces it back to True whenever kv_cache_quant_algo=FP8 and use_paged_context_fmha=True, logging: "FP8 Context FMHA is enabled to support FP8 Paged Context FMHA."
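The override behavior described above can be sketched as follows. This is an illustrative reconstruction of the logic as reported, not the actual builder.py code; the function and parameter names are assumptions:

```python
# Sketch of the reported override: builder.py forces FP8 context FMHA
# back on whenever the KV cache is FP8-quantized and paged context FMHA
# is in use, even if the user passed --use_fp8_context_fmha disable.
# (Illustrative only; names here are assumptions, not TRT-LLM's API.)

def resolve_fp8_context_fmha(kv_cache_quant_algo: str,
                             use_paged_context_fmha: bool,
                             use_fp8_context_fmha: bool) -> bool:
    if kv_cache_quant_algo == "FP8" and use_paged_context_fmha:
        if not use_fp8_context_fmha:
            # Matches the log line quoted above.
            print("FP8 Context FMHA is enabled to support "
                  "FP8 Paged Context FMHA.")
        return True
    return use_fp8_context_fmha

# The user's `disable` is silently overridden:
assert resolve_fp8_context_fmha("FP8", True, False) is True
```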
Expected behavior
Correct model output, same as TRT-LLM 1.1.0 with the same checkpoint.
actual behavior
Most prompts produce garbage / degenerate output (e.g. ##.##.##.##..., repetitive loops). Some very short prompts happen to produce correct output.
additional notes
The relevant Python code in builder.py and modeling_utils.py is byte-for-byte identical between 1.1.0 and 1.2.0. Both versions force use_fp8_context_fmha=True and both synthesize attention_output_orig_quant_scale the same way:
scale = [1.0] / layer.dense.activation_scaling_factor.raw_value
layer.attention_output_orig_quant_scale = Parameter(value=scale.astype(np.float32), ...)
Modelopt checkpoints do not include attention_output_orig_quant_scale natively; TRT-LLM synthesizes it from dense.activation_scaling_factor. Since the Python path is identical, the regression must lie in the compiled attention kernel.
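The scale synthesis can be reproduced standalone with numpy. This is a minimal sketch mirroring the snippet above; the input value is a stand-in, not a real checkpoint tensor:

```python
# Standalone reproduction of the scale synthesis described above.
# The scaling-factor value is a made-up stand-in for
# layer.dense.activation_scaling_factor.raw_value.
import numpy as np

activation_scaling_factor = np.array([0.025], dtype=np.float32)

# attention_output_orig_quant_scale is the reciprocal of the
# dense-input activation scale (list / ndarray broadcasts in numpy):
scale = [1.0] / activation_scaling_factor
scale = scale.astype(np.float32)
```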
Before submitting a new issue...