
gemm subgroup optimization breaks on adreno #6778

Description

@nihui
| soc | driver | works out of the box | subgroupMemoryBarrier hack | disable subgroup ops |
| --- | --- | --- | --- | --- |
| 855p | system driver 512.502.0 | no | works | works |
| 855p | mesa turnip 26.1.2 | no | no | works |
| 8elite | system driver 512.800.64 | yes | / | / |
| 8elite | mesa turnip 26.1.2 | yes | / | / |
| 8elitegen5 | system driver 512.842.19 | no | no | works |
| 8elitegen5 | mesa turnip 26.1.2 | yes | / | / |
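
For anyone who wants to act on these results programmatically, the table can be encoded as data. The following is only a sketch: `Report`, `Result`, `Workaround`, and `suggest_workaround` are hypothetical names, not part of ncnn, and reading "/" as "not tested because no workaround was needed" is an assumption. Preferring the barrier hack over disabling subgroup ops when both work is also just one possible policy.

```cpp
#include <string>
#include <vector>

// One cell of the reported table. NotTested stands for "/" (assumed meaning).
enum class Result { Yes, No, NotTested };

// Hypothetical set of actions a caller could take.
enum class Workaround { NoneNeeded, MemoryBarrierHack, DisableSubgroupOps, Unknown };

struct Report
{
    std::string soc;
    std::string driver;
    Result works_out_of_box;
    Result barrier_hack;
    Result disable_subgroup_ops;
};

// The six rows reported in this issue, copied verbatim.
static const std::vector<Report>& reported_results()
{
    static const std::vector<Report> table = {
        {"855p", "system driver 512.502.0", Result::No, Result::Yes, Result::Yes},
        {"855p", "mesa turnip 26.1.2", Result::No, Result::No, Result::Yes},
        {"8elite", "system driver 512.800.64", Result::Yes, Result::NotTested, Result::NotTested},
        {"8elite", "mesa turnip 26.1.2", Result::Yes, Result::NotTested, Result::NotTested},
        {"8elitegen5", "system driver 512.842.19", Result::No, Result::No, Result::Yes},
        {"8elitegen5", "mesa turnip 26.1.2", Result::Yes, Result::NotTested, Result::NotTested},
    };
    return table;
}

// Returns the least invasive workaround reported to work for an exact
// soc/driver match, or Unknown for combinations not listed in the table.
static Workaround suggest_workaround(const std::string& soc, const std::string& driver)
{
    for (const Report& r : reported_results())
    {
        if (r.soc != soc || r.driver != driver)
            continue;
        if (r.works_out_of_box == Result::Yes)
            return Workaround::NoneNeeded;
        if (r.barrier_hack == Result::Yes)
            return Workaround::MemoryBarrierHack;
        if (r.disable_subgroup_ops == Result::Yes)
            return Workaround::DisableSubgroupOps;
        return Workaround::Unknown;
    }
    return Workaround::Unknown;
}
```

The lookup matches exact strings only, so a different driver version on the same soc yields `Unknown` rather than a guess.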

steps to reproduce

  1. Compile the ncnn ppocrv5 example with the patch "query VK_KHR_shader_subgroup_extended_types features, sanitize fp16 subgroup path" (#6780) applied.
  2. Download the mobile model files from https://github.com/nihui/ncnn-android-ppocrv5/tree/master/app/src/main/assets
  3. Run ./ppocrv5 test.jpg

expected output

H 0.98702 at 517.01 1078.97 33.54 x 742.46  @ 89.44  =  执行标准:GB/T26701-2011、Q/MINISO341-2024

On some chips the recognized text is garbage, even though the box coordinates and score are nearly identical to the expected ones:

H 0.98622 at 517.02 1078.96 33.55 x 742.48  @ 89.44  =  砘毽斜荤敏天麔籗昈拧雗酯∖戍壼呓瘕|剺庂剜谭≇蠖親∓桅🐏ʎ古懻叁/M蝚盏忏偁パ₍偎刃蕗饍眃杄燬羵🌅负従嶙♋轨鶺垕Ã览蘁汰鍉⏝欵鬼鴀颟

subgroupMemoryBarrier hack

Edit src/layer/vulkan/shader/gemm_sg.comp to add an extra subgroupMemoryBarrier() before and after every subgroup shuffle block, then recompile:

for (int z = 0; z < UNROLL_SG_K; z++)
{
    subgroupMemoryBarrier(); // added: barrier before the shuffles
    afpvec4 aa = subgroupShuffle(a, smi * UNROLL_SG_K + z);
    afpvec4 bb = subgroupShuffle(b, sni * UNROLL_SG_K + z);
    subgroupMemoryBarrier(); // added: barrier after the shuffles

    sum0 += aa.r * bb;
    sum1 += aa.g * bb;
    sum2 += aa.b * bb;
    sum3 += aa.a * bb;
}

and

for (int z = 0; z < UNROLL_SG_K && k + z < psc(K); z++)
{
    subgroupMemoryBarrier(); // added: barrier before the shuffles
    afpvec4 aa = subgroupShuffle(a, smi * UNROLL_SG_K + z);
    afpvec4 bb = subgroupShuffle(b, sni * UNROLL_SG_K + z);
    subgroupMemoryBarrier(); // added: barrier after the shuffles

    sum0 += aa.r * bb;
    sum1 += aa.g * bb;
    sum2 += aa.b * bb;
    sum3 += aa.a * bb;
}
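
To make clear what the shuffle loops above are expected to compute, here is a CPU-side emulation in C++. It is a sketch under assumptions, not ncnn code: `Vec4`, `Sums`, `subgroup_shuffle`, and `accumulate_tile` are hypothetical names, and treating `smi` and `sni` as tile indices that select which lanes to read is inferred from the shader snippet only. It models the documented semantics of `subgroupShuffle(value, id)` (read the value held by lane `id`) and does not model barriers or the driver misbehavior.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec4 = std::array<float, 4>; // stand-in for afpvec4

// The four accumulators from the shader loop.
struct Sums
{
    Vec4 sum0, sum1, sum2, sum3;
};

// Emulates subgroupShuffle(value, id): `lanes[i]` is the register value
// held by lane i, and the caller receives the value held by lane `id`.
static Vec4 subgroup_shuffle(const std::vector<Vec4>& lanes, int id)
{
    assert(id >= 0 && id < static_cast<int>(lanes.size()));
    return lanes[id];
}

// What a single lane with tile indices (smi, sni) accumulates over one
// unrolled block of `unroll_k` steps, mirroring the first loop above.
static Sums accumulate_tile(const std::vector<Vec4>& a, const std::vector<Vec4>& b,
                            int smi, int sni, int unroll_k)
{
    Sums s{}; // all accumulators start at zero
    for (int z = 0; z < unroll_k; z++)
    {
        const Vec4 aa = subgroup_shuffle(a, smi * unroll_k + z);
        const Vec4 bb = subgroup_shuffle(b, sni * unroll_k + z);
        for (int i = 0; i < 4; i++)
        {
            s.sum0[i] += aa[0] * bb[i]; // aa.r * bb
            s.sum1[i] += aa[1] * bb[i]; // aa.g * bb
            s.sum2[i] += aa[2] * bb[i]; // aa.b * bb
            s.sum3[i] += aa[3] * bb[i]; // aa.a * bb
        }
    }
    return s;
}
```

A reference like this could be compared against values read back from the GPU to check whether the shuffles return data from the intended lanes on an affected device.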

disable subgroup ops

Edit examples/ppocrv5.cpp to set the option before loading the rec (recognition) model:

ppocrv5_rec.opt.use_subgroup_ops = false;

ppocrv5_rec.load_param("PP_OCRv5_mobile_rec.ncnn.param");
ppocrv5_rec.load_model("PP_OCRv5_mobile_rec.ncnn.bin");

Known ineffective workarounds/hacks

  • Disabling fp16 packed/storage/arithmetic while leaving use_subgroup_ops = true
