Feature Description
A clear and concise description of the feature you'd like to request.
gemm_v0 needs ping-pong (double-buffered) pipelining.
Use Case
Describe the use case or scenario where this feature would be helpful.
Proposed Solution
If you have a proposed solution or implementation idea, describe it here.
Feature Type
Operator Details (if applicable)
for i in T.serial(batch_iters):
    side = i % 2  # current slot (0 or 1)
    idx = k * num_stages + i
    # === Step 1: load K into L1 (overlaps with the previous compute) ===
    T.wait_flag("MTE1", "MTE2", SIG_K_L1)  # wait until the K load slot is free
    T.copy(K[..., idx * block_N : (idx + 1) * block_N, :], k_l1)  # load K
    T.set_flag("MTE2", "MTE1", SIG_K_L1)  # signal that the K load is done
    # === Step 2: load Q into L0A ===
    T.wait_flag("M", "MTE1", SIG_L0AB + side)  # wait until the L0A slot is free
    if i < 2:
        T.copy(q_l1, l0a[side, :, :])  # load Q (only needed once per slot)
    # === Step 3: load K into L0B (transposed) ===
    T.wait_flag("MTE2", "MTE1", SIG_K_L1)  # wait for the K load to finish
    T.copy(k_l1, l0b[side, :, :], transpose=True)  # transposed load
    T.set_flag("MTE1", "MTE2", SIG_K_L1)  # release the K load slot
    T.set_flag("MTE1", "M", SIG_L0AB + side)  # signal that L0A/L0B are ready
    # === Step 4: MMA compute ===
    T.wait_flag("MTE1", "M", SIG_L0AB + side)  # wait until L0A/L0B are ready
    T.wait_flag("FIX", "M", SIG_L0C + side)  # wait until the L0C slot is free
    T.mma(l0a[side, :, :], l0b[side, :, :], l0c[side, :, :], init=True)  # matrix multiply
    T.set_flag("M", "MTE1", SIG_L0AB + side)  # release L0A/L0B
    T.set_flag("M", "FIX", SIG_L0C + side)  # signal that L0C is ready
    # === Step 5: write back to workspace_1 ===
    T.wait_flag("M", "FIX", SIG_L0C + side)  # wait until L0C is ready
    T.copy(l0c[side, :, :], workspace_1[cid, i, :, :])  # write back
    T.set_flag("FIX", "M", SIG_L0C + side)  # release L0C
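To sanity-check the flag ordering in the loop above, here is a minimal pure-Python sketch that replays the same set/wait sequence on a toy counter model. It is not TileLang code and does not model real hardware overlap. The function name `simulate_ping_pong`, the concrete event ids, the "one wait consumes one earlier set" rule, and the priming and drain steps around the loop are all assumptions for illustration; the snippet above does not show how the flags are initialized or released.

```python
from collections import Counter


def simulate_ping_pong(batch_iters, sig_k_l1=0, sig_l0ab=1, sig_l0c=3):
    """Replay the loop's set/wait order on a toy counter model (hypothetical helper)."""
    flags = Counter()  # (src_pipe, dst_pipe, event_id) -> pending signals
    trace = []         # (iteration, slot) recorded where the MMA would be issued

    def set_flag(src, dst, eid):
        flags[(src, dst, eid)] += 1

    def wait_flag(src, dst, eid):
        # Assumption: a wait consumes exactly one earlier set on the same triple.
        if flags[(src, dst, eid)] <= 0:
            raise RuntimeError(f"wait before set: {src}->{dst} id={eid}")
        flags[(src, dst, eid)] -= 1

    # Assumed priming: mark every buffer slot as free before the loop starts.
    set_flag("MTE1", "MTE2", sig_k_l1)
    for side in (0, 1):
        set_flag("M", "MTE1", sig_l0ab + side)
        set_flag("FIX", "M", sig_l0c + side)

    for i in range(batch_iters):
        side = i % 2  # ping-pong slot, alternates 0, 1, 0, 1, ...
        # Step 1: K -> L1
        wait_flag("MTE1", "MTE2", sig_k_l1)
        set_flag("MTE2", "MTE1", sig_k_l1)
        # Steps 2 and 3: Q -> L0A, K -> L0B
        wait_flag("M", "MTE1", sig_l0ab + side)
        wait_flag("MTE2", "MTE1", sig_k_l1)
        set_flag("MTE1", "MTE2", sig_k_l1)
        set_flag("MTE1", "M", sig_l0ab + side)
        # Step 4: MMA
        wait_flag("MTE1", "M", sig_l0ab + side)
        wait_flag("FIX", "M", sig_l0c + side)
        trace.append((i, side))
        set_flag("M", "MTE1", sig_l0ab + side)
        set_flag("M", "FIX", sig_l0c + side)
        # Step 5: write back
        wait_flag("M", "FIX", sig_l0c + side)
        set_flag("FIX", "M", sig_l0c + side)

    # Assumed drain: consume the trailing "slot is free" signals after the loop.
    wait_flag("MTE1", "MTE2", sig_k_l1)
    for side in (0, 1):
        wait_flag("M", "MTE1", sig_l0ab + side)
        wait_flag("FIX", "M", sig_l0c + side)

    return trace, +flags  # unary plus drops zero-count entries


if __name__ == "__main__":
    trace, leftover = simulate_ping_pong(4)
    print(trace)
    print(dict(leftover))
```

Under this model every wait finds a matching set and no signal is left pending after the drain, which suggests the loop body is balanced as long as the free-slot flags are primed before the first iteration. Removing the priming block makes the first `wait_flag` raise, which is the kind of deadlock the real kernel would hit.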
Operator Type:
[e.g. GEMM, Flash Attention, Softmax, LayerNorm, Convolution, etc.]
Input/Output Specification:
Input shapes and dtypes:
- A: [M, K], float16
- B: [K, N], float16
Output shapes and dtypes:
- C: [M, N], float16
Performance Requirements:
[e.g. target throughput, latency, memory bandwidth]
Reference Implementation (if any):
[e.g. PyTorch implementation, CUDA kernel, paper reference]
Alternative Solutions
Describe any alternative solutions or features you've considered.
Implementation Constraints (if known)
Programming Mode Preference:
Memory Constraints:
[e.g. L1 buffer size, UB buffer size constraints]
Target Hardware:
Additional Context
Add any other context, references, or screenshots about the feature request here.
Willingness to Contribute