NdIter + cpu vec_add and vec_scalar_add#3579

Merged
EricLBuehler merged 7 commits into main from cpu-nditer on Jun 10, 2026

@ivarflakstad ivarflakstad commented Jun 4, 2026

Member

A multi-dimensional iterator, similar to NumPy's nditer and PyTorch's TensorIterator.
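To illustrate the general idea behind this kind of iterator, here is a minimal sketch of stride-based broadcasting. It is not the NdIter implementation from this PR: `broadcast_strides` and `broadcast_binary` are hypothetical names, and the sketch assumes contiguous row-major inputs. Broadcast dimensions get a stride of 0, so the same element is re-read as the output index advances.

```rust
/// Hypothetical helper: strides (in elements) for viewing a contiguous,
/// row-major tensor of `shape` as `out_shape`. Size-1 and missing leading
/// dimensions get stride 0. Returns None if the shapes are incompatible.
fn broadcast_strides(shape: &[usize], out_shape: &[usize]) -> Option<Vec<usize>> {
    if shape.len() > out_shape.len() {
        return None;
    }
    let offset = out_shape.len() - shape.len();
    let mut strides = vec![0usize; out_shape.len()];
    let mut stride = 1usize;
    for i in (0..shape.len()).rev() {
        let dim = shape[i];
        if dim != out_shape[offset + i] && dim != 1 {
            return None;
        }
        strides[offset + i] = if dim == 1 { 0 } else { stride };
        stride *= dim;
    }
    Some(strides)
}

/// Hypothetical helper: applies `f` elementwise over `out_shape`, walking both
/// inputs with an odometer-style multi-dimensional index.
fn broadcast_binary<T: Copy>(
    lhs: &[T],
    lhs_strides: &[usize],
    rhs: &[T],
    rhs_strides: &[usize],
    out_shape: &[usize],
    f: impl Fn(T, T) -> T,
) -> Vec<T> {
    let n: usize = out_shape.iter().product();
    let mut out = Vec::with_capacity(n);
    let mut index = vec![0usize; out_shape.len()];
    let (mut lo, mut ro) = (0usize, 0usize); // current element offsets
    for _ in 0..n {
        out.push(f(lhs[lo], rhs[ro]));
        // Advance the innermost dimension; carry into outer ones on wrap.
        for d in (0..out_shape.len()).rev() {
            index[d] += 1;
            lo += lhs_strides[d];
            ro += rhs_strides[d];
            if index[d] < out_shape[d] {
                break;
            }
            // Rewind this dimension to its start before carrying.
            lo -= lhs_strides[d] * out_shape[d];
            ro -= rhs_strides[d] * out_shape[d];
            index[d] = 0;
        }
    }
    out
}
```

For example, adding a `[3]` row to a `[2, 3]` matrix uses strides `[3, 1]` for the matrix and `[0, 1]` for the row. A production iterator would typically also merge adjacent dimensions and hand contiguous inner runs to a vectorized kernel rather than computing one element at a time.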

Initially this will be used to optimize binary op paths (CPU only in this PR), but with some additions we can use it to improve GPU performance as well.

There is no measurable impact on contiguous f32 binary ops, as Rust and LLVM already optimize the zip loop very well, but broadcasting (including scalar broadcast) is now much faster. Additionally, f16/bf16 see a huge perf improvement (especially on NEON with this PR), with the biggest change being bf16 broadcast throughput, up roughly 7000% 👀
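For reference, the two loop shapes discussed here can be sketched as below. The names `vec_add` and `vec_scalar_add` come from the PR title, but the signatures and the generic bound are assumptions made for illustration and are not taken from the PR's code. The first function is the plain zip loop that the compiler can auto-vectorize for contiguous f32. The second is the scalar-broadcast case, where one operand stays fixed for the whole loop.

```rust
use std::ops::Add;

/// Elementwise add over contiguous slices (assumed signature).
/// A plain zip loop like this is what LLVM can auto-vectorize for f32.
fn vec_add<T: Copy + Add<Output = T>>(lhs: &[T], rhs: &[T], out: &mut [T]) {
    assert_eq!(lhs.len(), rhs.len());
    assert_eq!(lhs.len(), out.len());
    for ((o, &a), &b) in out.iter_mut().zip(lhs).zip(rhs) {
        *o = a + b;
    }
}

/// Adds a single scalar to every element (assumed signature).
/// The scalar is loop-invariant, so no per-element index math is needed.
fn vec_scalar_add<T: Copy + Add<Output = T>>(lhs: &[T], scalar: T, out: &mut [T]) {
    assert_eq!(lhs.len(), out.len());
    for (o, &a) in out.iter_mut().zip(lhs) {
        *o = a + scalar;
    }
}
```

Calling `vec_scalar_add(&[1.0f32, 2.0, 3.0], 0.5, &mut out)` with a three-element `out` fills it with `1.5, 2.5, 3.5`. Half-precision types would need their own `Add` implementation or a dedicated kernel, which this sketch does not cover.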

Benchmark results:
cpu_broadcast_add_contiguous_f32/iter
                        time:   [201.19 µs 205.62 µs 211.06 µs]
                        thrpt:  [107.60 GiB/s 110.45 GiB/s 112.88 GiB/s]
                 change:
                        time:   [-0.9185% -2.1972% -3.7325%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5982% +2.1500% +0.9102%]
                        Change within noise threshold.

cpu_broadcast_add_contiguous_f16/iter
                        time:   [90.170 µs 91.166 µs 92.177 µs]
                        thrpt:  [123.19 GiB/s 124.56 GiB/s 125.93 GiB/s]
                 change:
                        time:   [−81.504% −81.160% −80.844%] (p = 0.00 < 0.05)
                        thrpt:  [+422.04% +430.79% +440.67%]
                        Performance has improved.

cpu_broadcast_add_contiguous_bf16/iter
                        time:   [87.163 µs 89.211 µs 90.944 µs]
                        thrpt:  [124.86 GiB/s 127.29 GiB/s 130.28 GiB/s]
                 change:
                        time:   [−85.310% −84.980% −84.649%] (p = 0.00 < 0.05)
                        thrpt:  [+551.44% +565.77% +580.75%]
                        Performance has improved.

cpu_broadcast_add_f32/iter
                        time:   [66.682 µs 66.760 µs 66.847 µs]
                        thrpt:  [113.25 GiB/s 113.40 GiB/s 113.53 GiB/s]
                 change:
                        time:   [−97.975% −97.970% −97.964%] (p = 0.00 < 0.05)
                        thrpt:  [+4812.0% +4825.2% +4839.0%]
                        Performance has improved.

cpu_broadcast_add_f16/iter
                        time:   [47.606 µs 47.692 µs 47.789 µs]
                        thrpt:  [79.206 GiB/s 79.367 GiB/s 79.510 GiB/s]
                 change:
                        time:   [−98.470% −98.343% −98.141%] (p = 0.00 < 0.05)
                        thrpt:  [+5279.1% +5935.8% +6437.9%]
                        Performance has improved.

cpu_broadcast_add_bf16/iter
                        time:   [48.623 µs 48.703 µs 48.786 µs]
                        thrpt:  [77.587 GiB/s 77.718 GiB/s 77.847 GiB/s]
                 change:
                        time:   [−98.604% −98.593% −98.584%] (p = 0.00 < 0.05)
                        thrpt:  [+6961.2% +7005.8% +7062.5%]
                        Performance has improved.

cpu_broadcast_scalar_add_f32/iter
                        time:   [1.1965 µs 1.2166 µs 1.2368 µs]
                        thrpt:  [49.347 GiB/s 50.167 GiB/s 51.010 GiB/s]
                 change:
                        time:   [−92.107% −91.961% −91.826%] (p = 0.00 < 0.05)
                        thrpt:  [+1123.4% +1143.9% +1166.9%]
                        Performance has improved.

cpu_broadcast_scalar_add_f16/iter
                        time:   [684.61 ns 697.61 ns 711.07 ns]
                        thrpt:  [42.918 GiB/s 43.746 GiB/s 44.577 GiB/s]
                 change:
                        time:   [−95.644% −95.562% −95.472%] (p = 0.00 < 0.05)
                        thrpt:  [+2108.7% +2153.5% +2195.7%]
                        Performance has improved.

cpu_broadcast_scalar_add_bf16/iter
                        time:   [689.54 ns 706.61 ns 723.31 ns]
                        thrpt:  [42.192 GiB/s 43.189 GiB/s 44.258 GiB/s]
                 change:
                        time:   [−96.407% −96.310% −96.220%] (p = 0.00 < 0.05)
                        thrpt:  [+2545.3% +2610.1% +2683.4%]
                        Performance has improved.
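As a reading aid for the numbers above: for a fixed amount of data, throughput is inversely proportional to time, so a relative time change maps to a relative throughput change of 1 / (1 + Δt) − 1. The helper below is a hypothetical sketch of that arithmetic and is not part of the benchmark harness.

```rust
/// Hypothetical helper: converts a relative time change (in percent) into the
/// corresponding relative throughput change (in percent), assuming the amount
/// of data processed per iteration is unchanged.
fn throughput_change_pct(time_change_pct: f64) -> f64 {
    let time_ratio = 1.0 + time_change_pct / 100.0; // new_time / old_time
    (1.0 / time_ratio - 1.0) * 100.0
}
```

Applied to the contiguous f16 case, a time change of −81.160% gives roughly +430.79% throughput, matching the reported midpoint. For the largest improvements the reported percentages can differ slightly from this calculation, because the displayed time changes are rounded and the ratio is very sensitive near −100%.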

@ivarflakstad ivarflakstad marked this pull request as ready for review June 5, 2026 12:04
@ivarflakstad ivarflakstad requested a review from EricLBuehler June 5, 2026 12:04
@ivarflakstad ivarflakstad changed the title NdIter NdIter + cpu vec_add and vec_scalar_add Jun 5, 2026

@EricLBuehler EricLBuehler left a comment

Member

Looks good!

@EricLBuehler EricLBuehler merged commit c848799 into main Jun 10, 2026
12 checks passed
@EricLBuehler EricLBuehler deleted the cpu-nditer branch June 10, 2026 08:26

2 participants