[Feature] gemm_v0 needs ping-pong pipelining, to avoid using mma #1237

## Feature Description
A clear and concise description of the feature you'd like to request.
gemm_v0 needs ping-pong pipelining.
## Use Case
Describe the use case or scenario where this feature would be helpful.

## Proposed Solution
If you have a proposed solution or implementation idea, describe it here.

## Feature Type
- [ ] New operator/kernel (e.g., new GEMM variant, new attention mechanism)
- [ ] New API/primitive (e.g., new T.xxx operation)
- [ ] Compiler optimization (e.g., new pass, performance improvement)
- [ ] New backend support (e.g., new hardware target)
- [ ] Developer tooling (e.g., debugging, profiling)
- [ ] Documentation improvement
- [ ] Other: [specify]

## Operator Details (if applicable)
```python
for i in T.serial(batch_iters):
    side = i % 2  # current slot (0 or 1)
    idx = k * num_stages + i

    # === Step 1: load K into L1 (overlaps with the previous compute) ===
    T.wait_flag("MTE1", "MTE2", SIG_K_L1)     # wait until the K load slot is free
    T.copy(K[..., idx * block_N : (idx + 1) * block_N, :], k_l1)  # load K
    T.set_flag("MTE2", "MTE1", SIG_K_L1)       # signal that the K load is done

    # === Step 2: load Q into L0A ===
    T.wait_flag("M", "MTE1", SIG_L0AB + side)  # wait until the L0A slot is free
    if i < 2:
        T.copy(q_l1, l0a[side, :, :])          # load Q

    # === Step 3: load K into L0B (transposed) ===
    T.wait_flag("MTE2", "MTE1", SIG_K_L1)      # wait for the K load to finish
    T.copy(k_l1, l0b[side, :, :], transpose=True)  # transposed load
    T.set_flag("MTE1", "MTE2", SIG_K_L1)       # release the K load slot
    T.set_flag("MTE1", "M", SIG_L0AB + side)   # signal that L0A/B are ready

    # === Step 4: MMA compute ===
    T.wait_flag("MTE1", "M", SIG_L0AB + side)  # wait until L0A/B are ready
    T.wait_flag("FIX", "M", SIG_L0C + side)    # wait until the L0C slot is free
    T.mma(l0a[side, :, :], l0b[side, :, :], l0c[side, :, :], init=True)  # matrix multiply
    T.set_flag("M", "MTE1", SIG_L0AB + side)   # release L0A/B
    T.set_flag("M", "FIX", SIG_L0C + side)     # signal that L0C is ready

    # === Step 5: write back to workspace_1 ===
    T.wait_flag("M", "FIX", SIG_L0C + side)    # wait until L0C is ready
    T.copy(l0c[side, :, :], workspace_1[cid, i, :, :])  # write back
    T.set_flag("FIX", "M", SIG_L0C + side)     # release L0C
```
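For readers unfamiliar with the pattern, the slot alternation in the loop above (`side = i % 2`) can be illustrated with a small host-side sketch in plain Python. This is not TileLang code and does not model the `set_flag`/`wait_flag` handshakes; the function names `pingpong_qk` and `matmul_qkt` are hypothetical and exist only for this illustration. It runs sequentially, so it shows only the order in which loads and computes are issued against the two slots, not real overlap.

```python
def matmul_qkt(q, k_tile):
    """Plain-Python Q @ K_tile^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_tile]
            for q_row in q]


def pingpong_qk(q, k, block_n):
    """Compute Q @ K^T one K block at a time, alternating two tile slots.

    Hypothetical helper for illustration only (not a TileLang API).
    """
    num_blocks = len(k) // block_n
    slots = [None, None]   # slot 0 = "ping", slot 1 = "pong"
    workspace = []         # one result tile per K block
    trace = []             # (action, block index, slot) in issue order

    def load(i):
        side = i % 2
        slots[side] = k[i * block_n:(i + 1) * block_n]
        trace.append(("load", i, side))

    if num_blocks > 0:
        load(0)            # prologue: fill the first slot
    for i in range(num_blocks):
        side = i % 2
        if i + 1 < num_blocks:
            # Prefetch the next block into the other slot. On hardware this
            # load would run concurrently with the compute below.
            load(i + 1)
        workspace.append(matmul_qkt(q, slots[side]))
        trace.append(("compute", i, side))
    return workspace, trace


q = [[1, 2], [3, 4]]
k = [[1, 0], [0, 1], [1, 1], [2, 0]]
workspace, trace = pingpong_qk(q, k, block_n=2)
print(workspace)
print(trace)
```

The trace shows each block's load being issued before the previous block's compute, which is the ordering a built-in ping-pong pipeline in gemm_v0 would presumably need to produce without hand-written synchronization.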
**Operator Type:**
[e.g. GEMM, Flash Attention, Softmax, LayerNorm, Convolution, etc.]

**Input/Output Specification:**
```
Input shapes and dtypes:
- A: [M, K], float16
- B: [K, N], float16

Output shapes and dtypes:
- C: [M, N], float16
```

**Performance Requirements:**
[e.g. target throughput, latency, memory bandwidth]

**Reference Implementation (if any):**
[e.g. PyTorch implementation, CUDA kernel, paper reference]

## Alternative Solutions
Describe any alternative solutions or features you've considered.

## Implementation Constraints (if known)

**Programming Mode Preference:**
- [ ] Developer mode (automatic)
- [ ] Expert mode (manual control)
- [x] Both modes supported

**Memory Constraints:**
[e.g. L1 buffer size, UB buffer size constraints]

**Target Hardware:**
- [x] Ascend A2
- [x] Ascend A3
- [ ] Other Ascend variants

## Additional Context
Add any other context, references, or screenshots about the feature request here.

## Willingness to Contribute
- [ ] I'm willing to submit a PR for this feature
- [ ] I can help with testing
- [ ] I can provide reference implementations
- [ ] I need guidance to get started
