I was analyzing a task that involves both audio and visual inputs.
To analyze the attention maps, I enabled the output_attentions = True option. However, the attention values for the last token (the one corresponding to assistant\n in the prompt <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<video>\nIs the man making sound in the audio?<|im_end|>\n<|im_start|>assistant\n) become NaN.
After checking, I noticed that enabling output_attentions = True forces attn_implementation to fall back to 'eager'. Indeed, explicitly setting attn_implementation = 'eager' reproduces the same issue.
Is there any known fix or information about this bug?
After further investigation, I observed that the attention maps remain normal up to the 27th layer, but at the 28th (final) layer, NaN values appear in the attention map.
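For anyone trying to reproduce the layer-by-layer check, here is a minimal sketch of how the NaN layers and token positions can be located. It assumes the attention maps arrive as a sequence of per-layer tensors of shape (batch, heads, query_len, key_len), which is how a forward pass with output_attentions = True usually returns them. The helper names nan_layers and nan_query_positions are my own, and the tensors below are synthetic stand-ins that only mimic the pattern I observed, not real model output.

```python
import torch


def nan_layers(attentions):
    """Return the 1-based indices of layers whose attention map contains NaN."""
    return [
        idx
        for idx, attn in enumerate(attentions, start=1)
        if torch.isnan(attn).any().item()
    ]


def nan_query_positions(attn):
    """Return [batch_index, query_position] pairs whose attention row has NaN.

    Assumes attn has shape (batch, heads, query_len, key_len).
    """
    # Collapse the key dimension, then the head dimension -> (batch, query_len).
    bad_rows = torch.isnan(attn).any(dim=-1).any(dim=1)
    return bad_rows.nonzero(as_tuple=False).tolist()


# Synthetic stand-in: 28 layers, 2 heads, 4 tokens, uniform attention weights.
attentions = [torch.full((1, 2, 4, 4), 0.25) for _ in range(28)]
# Mimic the reported symptom: NaN only in the last token's row of the last layer.
attentions[-1][0, :, -1, :] = float("nan")

bad_layers = nan_layers(attentions)
bad_positions = nan_query_positions(attentions[-1])
print(bad_layers)     # [28]
print(bad_positions)  # [[0, 3]]
```

With a real model, the attentions attribute of the forward output would take the place of the synthetic list. Outputs from generate() are nested per generation step, so they would need to be unpacked first.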