[Bug]: Garbage output with modelopt FP8 checkpoint in 1.2.0: use_fp8_context_fmha forced True, regression from 1.1.0 #12877

@langzhao-netizen

Description

System Info

  • TensorRT-LLM version: 1.2.0
  • Model: LLaMA-3.1-8B-Instruct, LLaMA-3.2-3B-Instruct
  • Checkpoint: produced by NVIDIA modelopt, kv_cache_quant_algo: "FP8" in config
  • GPU: NVIDIA L4 (SM 89)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Checkpoint config.json:
{
  "quantization": {
    "quant_algo": "FP8",
    "kv_cache_quant_algo": "FP8"
  }
}

Build:
trtllm-build --checkpoint_dir --kv_cache_type paged

Note: passing --use_fp8_context_fmha disable is silently overridden — builder.py forces it back to True whenever kv_cache_quant_algo=FP8 and use_paged_context_fmha=True, logging: "FP8 Context FMHA is enabled to support FP8 Paged Context FMHA."
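To make the reported override concrete, here is a minimal sketch of the decision logic as described above. The function name and signature are hypothetical (the real builder.py plumbs this through a config object); only the condition and the quoted log message come from the issue.

```python
# Hypothetical, simplified reconstruction of the builder.py override
# described in this issue; the real TensorRT-LLM code is structured
# differently, but the reported behavior is this condition.
import logging

logger = logging.getLogger("trtllm-build")

def resolve_fp8_context_fmha(use_fp8_context_fmha: bool,
                             kv_cache_quant_algo: str,
                             use_paged_context_fmha: bool) -> bool:
    """The user's `--use_fp8_context_fmha disable` is silently overridden
    whenever the KV cache is FP8 and paged context FMHA is enabled."""
    if kv_cache_quant_algo == "FP8" and use_paged_context_fmha:
        if not use_fp8_context_fmha:
            logger.warning("FP8 Context FMHA is enabled to support "
                           "FP8 Paged Context FMHA.")
        return True
    return use_fp8_context_fmha
```

Under this logic there is no supported build-time combination of an FP8 KV cache plus paged context FMHA that leaves FP8 context FMHA off, which is why the `disable` flag appears to have no effect.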

Expected behavior

Correct model output, same as TRT-LLM 1.1.0 with the same checkpoint.

Actual behavior

Most prompts produce garbage / degenerate output (e.g. ##.##.##.##..., repetitive loops). Some very short prompts accidentally produce correct output.

Additional notes

The relevant Python code in builder.py and modeling_utils.py is byte-for-byte identical between 1.1.0 and 1.2.0. Both versions force use_fp8_context_fmha=True and both synthesize attention_output_orig_quant_scale the same way:
scale = [1.0] / layer.dense.activation_scaling_factor.raw_value
layer.attention_output_orig_quant_scale = Parameter(value=scale.astype(np.float32), ...)

Modelopt checkpoints do not include attention_output_orig_quant_scale natively; TRT-LLM synthesizes it from
dense.activation_scaling_factor. Since the Python path is identical, the regression must lie in the compiled attention kernel.
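The synthesis quoted above can be reproduced in isolation. This is a standalone NumPy sketch, not the TRT-LLM code itself; the scaling-factor value is a made-up placeholder, and the `[1.0] / ndarray` expression relies on NumPy broadcasting exactly as in the quoted snippet.

```python
# Standalone sketch of the attention_output_orig_quant_scale synthesis.
# The checkpoint value below is a placeholder, not from a real checkpoint.
import numpy as np

# What dense.activation_scaling_factor.raw_value would hold (per-tensor scale).
activation_scaling_factor = np.array([0.25], dtype=np.float32)

# A Python list divided by an ndarray triggers ndarray.__rtruediv__,
# so [1.0] / arr yields the elementwise reciprocal, as in the quoted code.
scale = [1.0] / activation_scaling_factor
attention_output_orig_quant_scale = scale.astype(np.float32)
```

Because both 1.1.0 and 1.2.0 compute this identically, a wrong value here would corrupt output in both versions, which is the basis for the reporter's conclusion that the regression is elsewhere.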

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    Model optimization&lt;NV&gt; (Model-specific performance optimizations and tuning), bug (Something isn't working)
