Commit 35cf6a0
[NPU]: optimize cross entropy kernel gradient computation (#1178)
- Hoist loop-invariant scalar computations (z_scale, one_minus_ls,
z_deriv) out of the inner loop to avoid redundant recalculation
- Fuse softmax, z-loss derivative, and smoothing term into a single
vector expression in the non-weighted branch
- Guard tl.where with block-range check (y >= i and y < i + BLOCK_SIZE)
to skip unnecessary vector operations when target index is not in the
current block
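
The three changes above can be illustrated with a plain-Python sketch of the per-row gradient in the non-weighted branch. This is not the kernel code from the commit: the function names, the exact meaning assigned to `z_scale` and `z_deriv`, and the gradient formula (softmax plus a z-loss derivative of `2 * lse_square_scale * lse * softmax`, minus a uniform smoothing term) are assumptions chosen for illustration, and the reduction scaling and ignore-index handling are left out.

```python
import math


def ce_grad_row(logits, y, label_smoothing=0.0, lse_square_scale=0.0, block_size=4):
    """Blockwise cross entropy gradient for one row (illustrative sketch only)."""
    n = len(logits)
    m = max(logits)
    d = sum(math.exp(x - m) for x in logits)
    lse = m + math.log(d)

    # Loop-invariant scalars, computed once before the block loop.
    eps = label_smoothing / n
    one_minus_ls = 1.0 - label_smoothing
    z_deriv = 2.0 * lse_square_scale * lse  # assumed form of the z-loss derivative
    z_scale = 1.0 + z_deriv                 # assumed: combined softmax coefficient

    grad = [0.0] * n
    for i in range(0, n, block_size):
        end = min(i + block_size, n)
        # Fused expression: softmax, z-loss derivative and smoothing term at once.
        block = [math.exp(logits[j] - m) / d * z_scale - eps for j in range(i, end)]
        # Block-range guard: adjust the target entry only if y lies in this block.
        if y >= i and y < i + block_size:
            block[y - i] -= one_minus_ls
        grad[i:end] = block
    return grad


def ce_grad_row_reference(logits, y, label_smoothing=0.0, lse_square_scale=0.0):
    """Unoptimized reference: every term recomputed per element."""
    n = len(logits)
    lse = math.log(sum(math.exp(x) for x in logits))
    grad = []
    for j, x in enumerate(logits):
        p = math.exp(x - lse)
        g = p + 2.0 * lse_square_scale * lse * p - label_smoothing / n
        if j == y:
            g -= 1.0 - label_smoothing
        grad.append(g)
    return grad
```

Under these assumptions both functions compute the same values; the blockwise version just moves the scalar work out of the loop and skips the target-index adjustment for blocks that cannot contain `y`.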
Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

1 parent d991472, commit 35cf6a0
1 file changed: 14 additions & 10 deletions
Diff (line contents were not preserved; only the line numbers survive):
- new lines 197-201 added after line 196
- old lines 210-215 replaced by new lines 215-217
- old line 217 replaced by new lines 219-220
- old line 225 replaced by new line 228
- old line 227 replaced by new lines 230-231
- old line 232 replaced by new line 236