
[Feature] gemm_v0 needs ping-pong pipelining, to avoid having to use mma #1237

Description

@liuwenda4

Feature Description

A clear and concise description of the feature you'd like to request.
gemm_v0 needs ping-pong pipelining.

Use Case

Describe the use case or scenario where this feature would be helpful.

Proposed Solution

If you have a proposed solution or implementation idea, describe it here.

Feature Type

  • New operator/kernel (e.g., new GEMM variant, new attention mechanism)
  • New API/primitive (e.g., new T.xxx operation)
  • Compiler optimization (e.g., new pass, performance improvement)
  • New backend support (e.g., new hardware target)
  • Developer tooling (e.g., debugging, profiling)
  • Documentation improvement
  • Other: [specify]

Operator Details (if applicable)

for i in T.serial(batch_iters):
    side = i % 2  # current slot (0 or 1)
    idx = k * num_stages + i

    # === Step 1: load K into L1 (overlaps with the previous compute) ===
    T.wait_flag("MTE1", "MTE2", SIG_K_L1)     # wait until the K load slot is free
    T.copy(K[..., idx * block_N : (idx + 1) * block_N, :], k_l1)  # load K
    T.set_flag("MTE2", "MTE1", SIG_K_L1)       # signal that the K load is done

    # === Step 2: load Q into L0A ===
    T.wait_flag("M", "MTE1", SIG_L0AB + side)  # wait until the L0A slot is free
    if i < 2:
        T.copy(q_l1, l0a[side, :, :])          # load Q (once per slot)

    # === Step 3: load K into L0B (transposed) ===
    T.wait_flag("MTE2", "MTE1", SIG_K_L1)      # wait until the K load is done
    T.copy(k_l1, l0b[side, :, :], transpose=True)  # transposed load
    T.set_flag("MTE1", "MTE2", SIG_K_L1)       # release the K load slot
    T.set_flag("MTE1", "M", SIG_L0AB + side)   # signal that L0A/L0B are ready

    # === Step 4: MMA compute ===
    T.wait_flag("MTE1", "M", SIG_L0AB + side)  # wait until L0A/L0B are ready
    T.wait_flag("FIX", "M", SIG_L0C + side)    # wait until the L0C slot is free
    T.mma(l0a[side, :, :], l0b[side, :, :], l0c[side, :, :], init=True)  # matrix multiply
    T.set_flag("M", "MTE1", SIG_L0AB + side)   # release L0A/L0B
    T.set_flag("M", "FIX", SIG_L0C + side)     # signal that L0C is ready

    # === Step 5: write back to workspace_1 ===
    T.wait_flag("M", "FIX", SIG_L0C + side)    # wait until L0C is ready
    T.copy(l0c[side, :, :], workspace_1[cid, i, :, :])  # write back
    T.set_flag("FIX", "M", SIG_L0C + side)     # release L0C
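To make the intended overlap easier to see, here is a minimal plain-Python sketch that replays the flag protocol of the five steps above on four simulated pipes. It is not TileLang code and does not model real hardware timing. The event id values, the `build_queues` and `simulate` helpers, the one-tick cost per operation, and the prologue that marks every slot as free before the loop are all assumptions made for illustration. The sketch also assumes that `wait_flag(src, dst, id)` stalls the `dst` pipe until the `src` pipe has issued the matching `set_flag(src, dst, id)`.

```python
from collections import Counter, deque

# Hypothetical event ids; the issue does not state the real values.
SIG_K_L1, SIG_L0AB, SIG_L0C = 0, 1, 3


def build_queues(batch_iters):
    """Split the five steps into one in-order instruction queue per pipe."""
    queues = {name: deque() for name in ("MTE2", "MTE1", "M", "FIX")}
    for i in range(batch_iters):
        side = i % 2
        ab, c = SIG_L0AB + side, SIG_L0C + side
        queues["MTE2"].extend([  # step 1
            ("wait", ("MTE1", "MTE2", SIG_K_L1)),
            ("work", f"load K[{i}] into L1"),
            ("set", ("MTE2", "MTE1", SIG_K_L1)),
        ])
        queues["MTE1"].extend([  # steps 2 and 3, modelled as one unit of work
            ("wait", ("M", "MTE1", ab)),
            ("wait", ("MTE2", "MTE1", SIG_K_L1)),
            ("work", f"fill L0A/L0B slot {side}"),
            ("set", ("MTE1", "MTE2", SIG_K_L1)),
            ("set", ("MTE1", "M", ab)),
        ])
        queues["M"].extend([  # step 4
            ("wait", ("MTE1", "M", ab)),
            ("wait", ("FIX", "M", c)),
            ("work", f"mma on slot {side}"),
            ("set", ("M", "MTE1", ab)),
            ("set", ("M", "FIX", c)),
        ])
        queues["FIX"].extend([  # step 5
            ("wait", ("M", "FIX", c)),
            ("work", f"store L0C slot {side}"),
            ("set", ("FIX", "M", c)),
        ])
    return queues


def simulate(batch_iters):
    """Run the queues tick by tick; return the timeline and leftover flags."""
    queues = build_queues(batch_iters)
    flags = Counter()
    # Assumed prologue: every buffer slot starts out free.
    flags[("MTE1", "MTE2", SIG_K_L1)] = 1
    for side in (0, 1):
        flags[("M", "MTE1", SIG_L0AB + side)] = 1
        flags[("FIX", "M", SIG_L0C + side)] = 1

    timeline = []
    while any(queues.values()):
        pending, active, progressed = [], {}, False
        for pipe, queue in queues.items():
            while queue:
                op, arg = queue[0]
                if op == "wait":
                    if flags[arg] == 0:
                        break              # stall until the flag is set
                    flags[arg] -= 1
                elif op == "set":
                    pending.append(arg)    # becomes visible on the next tick
                elif pipe in active:
                    break                  # one unit of work per pipe per tick
                else:
                    active[pipe] = arg
                queue.popleft()
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every pipe is waiting on a flag")
        for key in pending:
            flags[key] += 1
        timeline.append(active)
    return timeline, +flags  # unary plus drops zero-count entries


if __name__ == "__main__":
    timeline, leftover = simulate(4)
    for tick, active in enumerate(timeline):
        print(tick, active)
    print("flags still set after the loop:", dict(leftover))
```

In this model, ticks where more than one pipe appears in `active` are where a load overlaps with compute or write-back. The flags still set at the end are the same "slot free" flags the assumed prologue created, which suggests that a real kernel would need matching `set_flag` calls before the loop and `wait_flag` calls after it. Because `k_l1` is a single buffer here, the MTE2 and MTE1 pipes alternate rather than run concurrently.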

Operator Type:
[e.g. GEMM, Flash Attention, Softmax, LayerNorm, Convolution, etc.]

Input/Output Specification:

Input shapes and dtypes:
- A: [M, K], float16
- B: [K, N], float16

Output shapes and dtypes:
- C: [M, N], float16
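For checking a pipelined kernel against the shapes and dtypes listed above, a small reference GEMM can be sketched in plain Python. The helper names `to_f16` and `gemm_reference` are hypothetical. Accumulating in Python floats and rounding only the final result to float16 is an assumption; the issue does not state the accumulation precision.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def gemm_reference(a, b):
    """Compute C = A @ B for A of shape [M, K] and B of shape [K, N].

    Inputs are rounded to float16 first; products are accumulated in
    Python floats (an assumption) and each output is rounded to float16.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    return [
        [
            to_f16(sum(to_f16(a[i][p]) * to_f16(b[p][j]) for p in range(k)))
            for j in range(n)
        ]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(gemm_reference(a, b))
```

A kernel output can then be compared element by element against `gemm_reference`, with a tolerance chosen to reflect float16 rounding.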

Performance Requirements:
[e.g. target throughput, latency, memory bandwidth]

Reference Implementation (if any):
[e.g. PyTorch implementation, CUDA kernel, paper reference]

Alternative Solutions

Describe any alternative solutions or features you've considered.

Implementation Constraints (if known)

Programming Mode Preference:

  • Developer mode (automatic)
  • Expert mode (manual control)
  • Both modes supported

Memory Constraints:
[e.g. L1 buffer size, UB buffer size constraints]

Target Hardware:

  • Ascend A2
  • Ascend A3
  • Other Ascend variants

Additional Context

Add any other context, references, or screenshots about the feature request here.

Willingness to Contribute

  • I'm willing to submit a PR for this feature
  • I can help with testing
  • I can provide reference implementations
  • I need guidance to get started

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
