NdIter + cpu `vec_add` and `vec_scalar_add` by ivarflakstad · Pull Request #3579 · huggingface/candle

ivarflakstad · 2026-06-04T21:02:53Z

Adds a multi-dimensional iterator, similar to NumPy's `nditer` and PyTorch's `TensorIterator`.
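To make the idea concrete, here is a minimal sketch of a strided multi-dimensional iterator. It is an illustration only: `StridedOffsets` and its fields are hypothetical names, not the `NdIter` type from this PR, and the real implementation is presumably more elaborate. It yields the flat storage offset of each element in row-major order, so a stride of 0 models a broadcast dimension.

```rust
/// Hypothetical illustration; NOT the `NdIter` type added by this PR.
/// Yields the flat storage offset of every element of a strided view,
/// in row-major order (last dimension moves fastest).
struct StridedOffsets {
    shape: Vec<usize>,
    strides: Vec<usize>,
    index: Vec<usize>, // current multi-dimensional index
    offset: usize,     // storage offset that corresponds to `index`
    remaining: usize,  // number of elements still to yield
}

impl StridedOffsets {
    fn new(shape: &[usize], strides: &[usize]) -> Self {
        assert_eq!(shape.len(), strides.len());
        Self {
            shape: shape.to_vec(),
            strides: strides.to_vec(),
            index: vec![0; shape.len()],
            offset: 0,
            remaining: shape.iter().product(),
        }
    }
}

impl Iterator for StridedOffsets {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.remaining == 0 {
            return None;
        }
        let current = self.offset;
        self.remaining -= 1;
        // Advance like an odometer: bump the last dimension, carry leftwards.
        for d in (0..self.shape.len()).rev() {
            self.index[d] += 1;
            self.offset += self.strides[d];
            if self.index[d] < self.shape[d] {
                break;
            }
            // Dimension `d` wrapped: rewind it and carry into dimension `d - 1`.
            self.offset -= self.strides[d] * self.shape[d];
            self.index[d] = 0;
        }
        Some(current)
    }
}
```

With shape `[2, 3]`, contiguous strides `[3, 1]` visit offsets 0 through 5 in order, while strides `[0, 1]` (a row broadcast over two rows) visit `0, 1, 2` twice.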

Initially this will be used to optimize binary paths, specifically for CPU in this PR, but with some additions we can use it to improve GPU performance as well.

No measurable impact on contiguous f32 binary ops, since Rust and LLVM already optimize the zip loop very well, but broadcasting (including scalar broadcast) is now much faster. In addition, f16/bf16 see a huge perf improvement (especially on NEON with this PR), the biggest change being bf16 broadcast throughput, which is up roughly 7000% 👀
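As a rough illustration of why broadcasting can reach contiguous-level throughput, the sketch below splits a 2-D broadcast add into per-row calls to two flat kernels. The names `vec_add` and `vec_scalar_add` are borrowed from the PR title, but their signatures here, the `Broadcast` enum, and `broadcast_add` are assumptions made for the example, not candle's API or the PR's actual code path.

```rust
/// Hypothetical kernels named after the PR title; the real signatures in
/// candle may differ.
fn vec_add(lhs: &[f32], rhs: &[f32], out: &mut [f32]) {
    assert_eq!(lhs.len(), out.len());
    assert_eq!(rhs.len(), out.len());
    // A plain zip loop over slices, which LLVM can usually auto-vectorize.
    for ((o, l), r) in out.iter_mut().zip(lhs).zip(rhs) {
        *o = *l + *r;
    }
}

fn vec_scalar_add(lhs: &[f32], rhs: f32, out: &mut [f32]) {
    assert_eq!(lhs.len(), out.len());
    for (o, l) in out.iter_mut().zip(lhs) {
        *o = *l + rhs;
    }
}

/// How the right-hand side is broadcast against a `rows x cols` matrix.
/// (Hypothetical helper type for this example.)
#[derive(Clone, Copy)]
enum Broadcast<'a> {
    Row(&'a [f32]), // shape (1, cols): the same row is reused for every row
    Col(&'a [f32]), // shape (rows, 1): one scalar per row
}

/// Adds a contiguous row-major `rows x cols` matrix to a broadcast operand
/// by handing each contiguous row to one of the flat kernels above.
fn broadcast_add(lhs: &[f32], rhs: Broadcast<'_>, rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(lhs.len(), rows * cols);
    match rhs {
        Broadcast::Row(r) => assert_eq!(r.len(), cols),
        Broadcast::Col(c) => assert_eq!(c.len(), rows),
    }
    let mut out = vec![0.0f32; rows * cols];
    if cols == 0 {
        return out; // nothing to do, and `chunks_exact(0)` would panic
    }
    let row_pairs = lhs.chunks_exact(cols).zip(out.chunks_exact_mut(cols));
    for (i, (l_row, o_row)) in row_pairs.enumerate() {
        match rhs {
            // Inner dimension of `rhs` is contiguous: vector + vector.
            Broadcast::Row(r) => vec_add(l_row, r, o_row),
            // Inner dimension of `rhs` has stride 0: vector + scalar.
            Broadcast::Col(c) => vec_scalar_add(l_row, c[i], o_row),
        }
    }
    out
}
```

The point of the decomposition is that the per-element stride bookkeeping moves out of the hot loop, leaving inner loops that look like the contiguous case.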

Benchmark results:

cpu_broadcast_add_contiguous_f32/iter
                        time:   [201.19 µs 205.62 µs 211.06 µs]
                        thrpt:  [107.60 GiB/s 110.45 GiB/s 112.88 GiB/s]
                 change:
                        time:   [-0.9185% -2.1972% -3.7325%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5982% +2.1500% +0.9102%]
                        Change within noise threshold.

cpu_broadcast_add_contiguous_f16/iter
                        time:   [90.170 µs 91.166 µs 92.177 µs]
                        thrpt:  [123.19 GiB/s 124.56 GiB/s 125.93 GiB/s]
                 change:
                        time:   [−81.504% −81.160% −80.844%] (p = 0.00 < 0.05)
                        thrpt:  [+422.04% +430.79% +440.67%]
                        Performance has improved.

cpu_broadcast_add_contiguous_bf16/iter
                        time:   [87.163 µs 89.211 µs 90.944 µs]
                        thrpt:  [124.86 GiB/s 127.29 GiB/s 130.28 GiB/s]
                 change:
                        time:   [−85.310% −84.980% −84.649%] (p = 0.00 < 0.05)
                        thrpt:  [+551.44% +565.77% +580.75%]
                        Performance has improved.

cpu_broadcast_add_f32/iter
                        time:   [66.682 µs 66.760 µs 66.847 µs]
                        thrpt:  [113.25 GiB/s 113.40 GiB/s 113.53 GiB/s]
                 change:
                        time:   [−97.975% −97.970% −97.964%] (p = 0.00 < 0.05)
                        thrpt:  [+4812.0% +4825.2% +4839.0%]
                        Performance has improved.

cpu_broadcast_add_f16/iter
                        time:   [47.606 µs 47.692 µs 47.789 µs]
                        thrpt:  [79.206 GiB/s 79.367 GiB/s 79.510 GiB/s]
                 change:
                        time:   [−98.470% −98.343% −98.141%] (p = 0.00 < 0.05)
                        thrpt:  [+5279.1% +5935.8% +6437.9%]
                        Performance has improved.

cpu_broadcast_add_bf16/iter
                        time:   [48.623 µs 48.703 µs 48.786 µs]
                        thrpt:  [77.587 GiB/s 77.718 GiB/s 77.847 GiB/s]
                 change:
                        time:   [−98.604% −98.593% −98.584%] (p = 0.00 < 0.05)
                        thrpt:  [+6961.2% +7005.8% +7062.5%]
                        Performance has improved.

cpu_broadcast_scalar_add_f32/iter
                        time:   [1.1965 µs 1.2166 µs 1.2368 µs]
                        thrpt:  [49.347 GiB/s 50.167 GiB/s 51.010 GiB/s]
                 change:
                        time:   [−92.107% −91.961% −91.826%] (p = 0.00 < 0.05)
                        thrpt:  [+1123.4% +1143.9% +1166.9%]
                        Performance has improved.

cpu_broadcast_scalar_add_f16/iter
                        time:   [684.61 ns 697.61 ns 711.07 ns]
                        thrpt:  [42.918 GiB/s 43.746 GiB/s 44.577 GiB/s]
                 change:
                        time:   [−95.644% −95.562% −95.472%] (p = 0.00 < 0.05)
                        thrpt:  [+2108.7% +2153.5% +2195.7%]
                        Performance has improved.

cpu_broadcast_scalar_add_bf16/iter
                        time:   [689.54 ns 706.61 ns 723.31 ns]
                        thrpt:  [42.192 GiB/s 43.189 GiB/s 44.258 GiB/s]
                 change:
                        time:   [−96.407% −96.310% −96.220%] (p = 0.00 < 0.05)
                        thrpt:  [+2545.3% +2610.1% +2683.4%]
                        Performance has improved.
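For reading the numbers above: throughput is inversely proportional to time, so the `thrpt` change follows from the `time` change. The helper below is a small hypothetical function written for this note (it is not part of criterion or candle) that makes the arithmetic explicit.

```rust
/// Relative throughput change implied by a relative time change.
/// Both values are fractions, so `-0.5` means "-50%".
/// Hypothetical helper for illustration; not part of criterion or candle.
fn throughput_change(time_change: f64) -> f64 {
    // new_time = old_time * (1 + time_change), and throughput ~ 1 / time.
    1.0 / (1.0 + time_change) - 1.0
}
```

For example, the bf16 broadcast time change of −98.593% gives `throughput_change(-0.98593)`, which is about 70.07, i.e. roughly +7007%, in line with the reported +7005.8% once rounding of the printed time figure is taken into account.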

EricLBuehler

Looks good!

ivarflakstad force-pushed the cpu-nditer branch from d31ae6b to efba6d7 Compare June 5, 2026 06:35

ivarflakstad added 5 commits June 5, 2026 13:04

Add contiguous binary add bench

e73b573

Add NdIter - efficient multidim iterator

36ee329

Update cpu unary and binary op traits. Add binary default vec and scalar vec impls. Remove const delegation flags

33d5733

Add more optimized cpu kernels for binary add and binary scalar add

99fa9d3

Wire up NdIter in cpu unary/binary. Also add binary scalar vec path.

b803e5b

ivarflakstad force-pushed the cpu-nditer branch from efba6d7 to b803e5b Compare June 5, 2026 11:04

ivarflakstad added 2 commits June 5, 2026 13:55

Move NdIter to its own file

cde9b42

clippy

fb1cce3

ivarflakstad marked this pull request as ready for review June 5, 2026 12:04

ivarflakstad requested a review from EricLBuehler June 5, 2026 12:04

ivarflakstad changed the title ~~NdIter~~ NdIter + cpu vec_add and vec_scalar_add Jun 5, 2026

EricLBuehler approved these changes Jun 10, 2026

EricLBuehler merged commit c848799 into main Jun 10, 2026
12 checks passed

EricLBuehler deleted the cpu-nditer branch June 10, 2026 08:26

EricLBuehler merged 7 commits into main from cpu-nditer
