Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
I am running the FMHA example on a Thor chip:
python /workspace/hupeng/cutlass-main/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py --in_dtype Float8E4M3FN --out_dtype Float8E4M3FN
The maximum error it reports reaches 0.6, which seems too large.
Steps/Code to reproduce bug
python /workspace/hupeng/cutlass-main/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py --in_dtype Float8E4M3FN --out_dtype Float8E4M3FN
Expected behavior
The error tolerance set in the code is 0.13 or 0.5, so I expected the maximum error to stay below these values. Could the larger error be caused by something specific to the Thor chip?
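For context on how coarse Float8E4M3FN is, here is a rough sketch of the gap between adjacent representable values at a given magnitude. It relies only on the format definition (4 exponent bits, 3 mantissa bits, bias 7, largest finite value 448). `e4m3_spacing` is a hypothetical helper written for this report; it is not part of CUTLASS or the fmha.py example, and it says nothing about how the example computes its error.

```python
import math

def e4m3_spacing(x: float) -> float:
    """Gap between adjacent Float8E4M3FN values near |x| (hypothetical helper)."""
    ax = abs(x)
    if ax > 448.0:
        raise ValueError("outside the finite Float8E4M3FN range")
    if ax < 2.0 ** -6:
        # Subnormals (and zero) share the minimum normal exponent, -6.
        exp = -6
    else:
        # frexp returns ax = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
        exp = math.frexp(ax)[1] - 1
    # 3 mantissa bits split each binade into 8 steps.
    return 2.0 ** (exp - 3)

if __name__ == "__main__":
    for magnitude in (0.5, 1.0, 4.0, 8.0):
        gap = e4m3_spacing(magnitude)
        print(f"|x| = {magnitude}: spacing {gap}, worst-case rounding error {gap / 2}")
```

If the output values lie in [8, 16), one rounding step alone can contribute up to 0.5 of absolute error, so 0.6 would be within one representable step. For outputs near 1, the spacing is 0.125 and an error of 0.6 would be several steps. I have not checked which range the outputs of this run fall in.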
Environment details (please complete the following information):
nvidia-cutlass-dsl 4.5.2
nvidia-cutlass-dsl-libs-base 4.5.2
cuda-12.8
Additional context
