fft_small/mpn_mul: Speed up CRT reconstruction by avoid split accumulator#2693

Open
user202729 wants to merge 1 commit into
flintlib:mainfrom
user202729:crt-register-reduce

@user202729 user202729 commented May 21, 2026

Contributor

Currently, in order to reduce dependency chain length, functions in crt_helpers.h use two parallel arrays r and t to represent a single multi-limb value.

Unfortunately, this means an N-limb value needs 2N registers. In the largest case, this leads to a massive slowdown, likely because of register spilling to memory.

This PR addresses that by changing the code for the largest case to use just N registers.
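
To make the trade-off concrete, here is a minimal sketch of the two representations. It is not the code from crt_helpers.h: the names `split_addmul`, `split_combine`, `single_addmul` and the fixed size `N` are hypothetical, and it assumes a compiler with `unsigned __int128` (GCC or Clang) and multipliers below 2^50.

```c
#include <assert.h>
#include <stdint.h>

#define N 4                      /* limbs per constant (illustrative choice) */
typedef unsigned __int128 u128;  /* assumption: GCC/Clang extension */

/* Split form: even-limb products go to r, odd-limb products go to t.
   Each (acc[k], acc[k+1]) pair is an independent 128-bit accumulator,
   so no carry crosses from one pair to the next while accumulating. */
static void split_addmul(uint64_t r[N + 1], uint64_t t[N + 1],
                         const uint64_t c[N], uint64_t y)
{
    assert(y < ((uint64_t) 1 << 50));  /* headroom assumption */
    for (int k = 0; k < N; k++) {
        uint64_t *acc = (k % 2 == 0) ? r : t;
        u128 p = ((u128) acc[k + 1] << 64) | acc[k];
        p += (u128) c[k] * y;
        acc[k] = (uint64_t) p;
        acc[k + 1] = (uint64_t) (p >> 64);
    }
}

/* One carry-propagating pass at the end merges r and t.
   Returns the carry out of the top limb (0 if the sum fits). */
static uint64_t split_combine(uint64_t out[N + 1],
                              const uint64_t r[N + 1],
                              const uint64_t t[N + 1])
{
    uint64_t carry = 0;
    for (int k = 0; k <= N; k++) {
        u128 s = (u128) r[k] + t[k] + carry;
        out[k] = (uint64_t) s;
        carry = (uint64_t) (s >> 64);
    }
    return carry;
}

/* Single form: N + 1 limbs only, but every limb waits for the carry
   produced by the limb below it. */
static void single_addmul(uint64_t acc[N + 1],
                          const uint64_t c[N], uint64_t y)
{
    uint64_t carry = 0;
    for (int k = 0; k < N; k++) {
        u128 p = (u128) c[k] * y + acc[k] + carry;  /* fits in 128 bits */
        acc[k] = (uint64_t) p;
        carry = (uint64_t) (p >> 64);
    }
    acc[N] += carry;
}
```

Both forms compute the same sum of products. The split form keeps roughly twice as many live limbs, which is the register pressure described above, while the single form has a longer dependency chain per accumulation.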

The speedup can be verified by running ./build/fft_small/profile/p-mul before and after the change: the .40 and .50 rows become roughly 2x faster, which is essentially tied with #2692. (Probably because of added instruction-level parallelism.)

Since the change here is much simpler than #2692, this PR is to be preferred.

Points for discussion: is it worth switching everything to use this one-accumulator version? It might lead to slower performance because of the need to propagate carry.

Benchmark result

`crt-register-reduce` (this PR)
 --- fft_small 1 thread  --- 
20.00:  2.883  2.827  2.924
20.10:  2.729  2.667  2.799
20.20:  2.599  2.526  2.705
20.30:  2.517  2.424  2.627
20.40:  2.599  2.524  2.691
20.50:  2.512  2.480  2.591
20.60:  2.736  2.686  2.773
20.70:  2.709  2.665  2.753
20.80:  2.664  2.604  2.738
20.90:  2.809  2.771  2.835
precomp: avg 0.207837, max 1.982478
21.00:  2.777  2.751  2.846
21.10:  2.728  2.679  2.766
21.20:  2.695  2.590  3.055
21.30:  2.640  2.570  2.828
21.40:  2.650  2.563  2.896
21.50:  2.563  2.532  2.622
21.60:  2.711  2.648  2.769
21.70:  2.701  2.648  2.778
21.80:  2.665  2.606  2.720
21.90:  2.753  2.673  2.824
precomp: avg 0.141137, max 2.508518
22.00:  2.688  2.646  2.752
22.10:  2.681  2.623  2.775
22.20:  2.618  2.518  2.680
22.30:  2.604  2.494  2.734
22.40:  2.628  2.537  2.761
22.50:  2.659  2.513  2.867
22.60:  2.730  2.664  2.827
22.70:  2.713  2.618  2.952
22.80:  2.710  2.659  2.750
22.90:  2.678  2.668  2.683
precomp: avg 0.072375, max 1.270793
23.00:  2.671  2.623  2.727
23.10:  2.638  2.579  2.749
23.20:  2.564  2.532  2.617
23.30:  2.630  2.518  2.670
23.40:  2.603  2.546  2.725
23.50:  2.722  2.655  2.879
23.60:  2.597  2.544  2.762
23.70:  2.577  2.505  2.735
23.80:  2.571  2.502  2.620
23.90:  2.743  2.696  2.815
precomp: avg 0.078434, max 1.254857
24.00:  2.718  2.630  2.769
24.10:  2.722  2.614  2.831
24.20:  2.636  2.547  2.723
24.30:  2.614  2.570  2.672
24.40:  2.695  2.618  2.782
24.50:  2.639  2.570  2.691
24.60:  2.674  2.608  2.727
24.70:  2.746  2.664  2.808
24.80:  2.682  2.642  2.716
24.90:  2.690  2.591  2.853
precomp: avg 0.062011, max 1.107677
25.00:  2.764  2.657  2.897
25.10:  2.663  2.636  2.690
25.20:  2.621  2.573  2.681
25.30:  2.665  2.618  2.785
25.40:  2.715  2.650  2.799
25.50:  2.680  2.631  2.816
25.60:  2.711  2.650  2.758
25.70:  2.765  2.696  2.930
25.80:  2.659  2.613  2.733
25.90:  2.740  2.638  2.825
precomp: avg 0.062647, max 1.382998
`mpn-mul-crt-asm` (the other PR, with a massive amount of inline assembly)
 --- fft_small 1 thread  --- 
20.00:  2.933  2.809  3.149
20.10:  2.817  2.762  2.888
20.20:  2.717  2.619  2.828
20.30:  2.561  2.502  2.618
20.40:  2.633  2.590  2.682
20.50:  2.581  2.543  2.637
20.60:  2.811  2.696  2.904
20.70:  2.816  2.677  2.956
20.80:  2.758  2.653  2.975
20.90:  2.861  2.756  2.938
precomp: avg 0.220455, max 2.087928
21.00:  2.782  2.732  2.835
21.10:  2.894  2.779  3.029
21.20:  2.758  2.610  2.938
21.30:  2.650  2.591  2.767
21.40:  2.746  2.659  2.826
21.50:  2.620  2.526  2.729
21.60:  2.835  2.734  2.942
21.70:  2.790  2.686  2.874
21.80:  2.739  2.623  2.839
21.90:  2.818  2.753  2.908
precomp: avg 0.118175, max 1.716897
22.00:  2.812  2.719  2.895
22.10:  2.684  2.629  2.788
22.20:  2.645  2.573  2.778
22.30:  2.658  2.536  2.775
22.40:  2.614  2.476  2.748
22.50:  2.540  2.450  2.622
22.60:  2.622  2.610  2.631
22.70:  2.611  2.593  2.661
22.80:  2.711  2.650  2.785
22.90:  2.719  2.658  2.770
precomp: avg 0.067073, max 1.119393
23.00:  2.741  2.668  2.894
23.10:  2.675  2.612  2.720
23.20:  2.627  2.572  2.718
23.30:  2.695  2.610  2.745
23.40:  2.666  2.558  2.820
23.50:  2.687  2.606  2.790
23.60:  2.639  2.583  2.754
23.70:  2.657  2.537  2.721
23.80:  2.681  2.599  2.771
23.90:  2.802  2.760  2.919
precomp: avg 0.066612, max 1.159107
24.00:  2.745  2.662  2.804
24.10:  2.785  2.696  2.894
24.20:  2.575  2.468  2.663
24.30:  2.638  2.559  2.716
24.40:  2.699  2.545  2.940
24.50:  2.588  2.532  2.664
24.60:  2.743  2.686  2.900
24.70:  2.746  2.638  2.838
24.80:  2.638  2.550  2.698
24.90:  2.639  2.603  2.706
precomp: avg 0.124808, max 1.258381
25.00:  2.662  2.631  2.730
25.10:  2.651  2.601  2.777
25.20:  2.566  2.517  2.646
25.30:  2.533  2.503  2.579
25.40:  2.594  2.544  2.637
25.50:  2.592  2.547  2.610
25.60:  2.663  2.637  2.718
25.70:  2.662  2.620  2.705
25.80:  2.684  2.619  2.922
25.90:  2.679  2.650  2.741
precomp: avg 0.060714, max 0.983494
baseline
 --- fft_small 1 thread  --- 
20.00:  2.807  2.705  3.108
20.10:  2.726  2.674  2.783
20.20:  2.584  2.514  2.647
20.30:  2.528  2.476  2.583
20.40:  6.412  6.275  6.567
20.50:  6.522  6.269  6.951
20.60:  2.780  2.747  2.815
20.70:  2.761  2.696  2.858
20.80:  2.745  2.615  2.962
20.90:  2.799  2.768  2.864
precomp: avg 0.234419, max 1.939602
21.00:  2.808  2.718  2.920
21.10:  2.813  2.667  2.962
21.20:  2.643  2.563  2.691
21.30:  2.677  2.581  2.776
21.40:  6.441  6.224  6.679
21.50:  6.440  6.219  6.596
21.60:  2.794  2.716  2.840
21.70:  2.774  2.750  2.807
21.80:  2.732  2.666  2.797
21.90:  2.798  2.725  2.879
precomp: avg 0.129088, max 1.642563
22.00:  2.764  2.743  2.798
22.10:  2.688  2.658  2.724
22.20:  2.643  2.597  2.713
22.30:  2.598  2.566  2.629
22.40:  6.130  5.934  6.329
22.50:  6.074  5.877  6.279
22.60:  2.700  2.603  2.815
22.70:  2.657  2.637  2.702
22.80:  2.701  2.665  2.754
22.90:  2.676  2.638  2.723
precomp: avg 0.095950, max 1.388836
23.00:  2.731  2.645  2.869
23.10:  2.679  2.576  2.761
23.20:  2.559  2.513  2.682
23.30:  2.652  2.579  2.709
23.40:  6.081  5.960  6.236
23.50:  6.056  5.957  6.255
23.60:  2.631  2.570  2.699
23.70:  2.607  2.525  2.833
23.80:  2.544  2.499  2.569
23.90:  2.690  2.641  2.708
precomp: avg 0.062193, max 1.320420
24.00:  2.692  2.615  2.763
24.10:  2.667  2.607  2.727
24.20:  2.509  2.475  2.560
24.30:  2.528  2.484  2.656
24.40:  5.762  5.708  5.799
24.50:  5.789  5.673  5.952
24.60:  2.626  2.526  2.732
24.70:  2.670  2.549  2.796
24.80:  2.651  2.578  2.719
24.90:  2.648  2.585  2.741
precomp: avg 0.030542, max 1.112313
25.00:  2.686  2.616  2.759
25.10:  2.651  2.602  2.681
25.20:  2.547  2.511  2.598
25.30:  2.540  2.495  2.603
25.40:  5.684  5.633  5.743
25.50:  5.665  5.578  5.737
25.60:  2.751  2.685  2.949
25.70:  2.692  2.636  2.821
25.80:  2.649  2.594  2.733
25.90:  2.678  2.632  2.717
precomp: avg 0.060969, max 1.070591

Benchmark instruction

make -j50
make ./build/fft_small/profile/p-mul
./build/fft_small/profile/p-mul
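
For comparing two runs row by row, a small helper along these lines may be convenient. It is only a sketch: `parse_row` and `row_ratio` are hypothetical names, the row layout is inferred from the logs pasted above, and since the thread does not say what the three numbers on each row mean, the helper simply compares the first one.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one p-mul row such as "20.40:  6.412  6.275  6.567".
   Returns 1 and stores the row label and the first number on success,
   0 for lines that are not rows (for example the "precomp:" lines). */
static int parse_row(const char *line, double *size, double *first)
{
    double second, third;
    assert(line != NULL);
    return sscanf(line, "%lf: %lf %lf %lf",
                  size, first, &second, &third) == 4;
}

/* Ratio before/after of the first number on two rows with the same label.
   Returns 0.0 if either line is not a row or the labels differ. */
static double row_ratio(const char *before, const char *after)
{
    double size_b, size_a, val_b, val_a;
    if (!parse_row(before, &size_b, &val_b) ||
        !parse_row(after, &size_a, &val_a))
        return 0.0;
    if (size_b != size_a || val_a == 0.0)
        return 0.0;
    return val_b / val_a;
}
```

Applied to the 20.40 rows above (baseline 6.412, this PR 2.599), the ratio is a little under 2.5, consistent with the "roughly 2x" claim.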

@user202729 user202729 marked this pull request as draft May 21, 2026 16:56
@user202729 user202729 force-pushed the crt-register-reduce branch from 15c1554 to d7a1633 Compare May 22, 2026 14:07
@user202729 user202729 marked this pull request as ready for review May 22, 2026 14:08
@user202729 user202729 force-pushed the crt-register-reduce branch from d7a1633 to de2d1ac Compare May 22, 2026 14:13
@fredrik-johansson

Collaborator

> Points for discussion: is it worth switching everything to use this one-accumulator version? It might lead to slower performance because of the need to propagate carry.

It would be good to profile both versions on a few different architectures first.

@user202729

Contributor Author

Apparently, this slows these two cases down on a Mac.

Anyway, I realized that the reason two arrays are faster than one only holds when the fixed multiplier has far fewer than 64 bits (50 bits in this case). It should be possible to rearrange the computation somewhat, though.
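
One plausible reading of the 50-bit remark is that a two-limb accumulator only has spare room when the multiplier is well below 64 bits: each product is then below 2^114, so several of them can be summed in 128 bits before any carry has to leave the pair. The sketch below works out that headroom. The function name is hypothetical and `unsigned __int128` is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* assumption: GCC/Clang extension */

/* Largest K such that K products limb * y, with limb < 2^64 and
   y < 2^bits, are guaranteed to sum to at most 2^128 - 1, i.e. can be
   accumulated in one two-limb pair without a carry out. */
static uint64_t max_terms_without_carry(unsigned bits)
{
    assert(bits >= 2 && bits <= 64);
    uint64_t ymax = (bits == 64) ? UINT64_MAX
                                 : (((uint64_t) 1 << bits) - 1);
    u128 worst = (u128) UINT64_MAX * ymax;  /* largest single product */
    return (uint64_t) (~(u128) 0 / worst);
}
```

With a 50-bit multiplier the bound is 16384 terms, while with a full 64-bit multiplier it drops to 1, so a second accumulation into the same pair could already overflow.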

@user202729

user202729 commented Jun 6, 2026

Contributor Author

Update: this does not slow down these two cases on a Mac (it is neutral, probably because the Mac has plenty of registers). The earlier result was mostly human error: I checked out an old build, forgot to pass --enable-assembly, and forgot to clean up the autogenerated src/config.h[.in], which the build system does not do automatically.

before
 --- fft_small 1 thread  --- 
20.00:  2.381  2.157  3.412
20.10:  2.132  2.117  2.159
20.20:  2.262  2.246  2.280
20.30:  2.221  2.201  2.245
20.40:  2.330  2.313  2.339
20.50:  2.289  2.274  2.303
20.60:  2.245  2.215  2.277
20.70:  2.234  2.201  2.268
20.80:  2.201  2.161  2.233
20.90:  2.236  2.221  2.259
precomp: avg 0.328439, max 10.001241
21.00:  2.234  2.204  2.270
21.10:  2.191  2.149  2.217
21.20:  2.343  2.297  2.374
21.30:  2.314  2.282  2.335
21.40:  2.341  2.322  2.355
21.50:  2.300  2.283  2.319
21.60:  2.261  2.223  2.292
21.70:  2.247  2.205  2.275
21.80:  2.211  2.168  2.238
21.90:  2.253  2.217  2.267
precomp: avg 0.095625, max 0.433954
22.00:  2.240  2.202  2.258
22.10:  2.205  2.171  2.222
22.20:  2.340  2.328  2.349
22.30:  2.305  2.280  2.321
22.40:  2.356  2.287  2.423
22.50:  2.324  2.248  2.399
22.60:  2.313  2.296  2.328
22.70:  2.297  2.291  2.303
22.80:  2.251  2.237  2.258
22.90:  2.285  2.275  2.291
precomp: avg 0.053925, max 0.303884
23.00:  2.290  2.267  2.335
23.10:  2.251  2.231  2.279
23.20:  2.367  2.356  2.371
23.30:  2.320  2.313  2.325
23.40:  2.444  2.410  2.473
23.50:  2.390  2.385  2.395
23.60:  2.335  2.321  2.344
23.70:  2.320  2.310  2.328
23.80:  2.318  2.288  2.383
23.90:  2.317  2.300  2.324
precomp: avg 0.032803, max 0.220024
24.00:  2.327  2.313  2.348
24.10:  2.283  2.270  2.317
24.20:  2.409  2.394  2.432
24.30:  2.363  2.355  2.367
24.40:  2.480  2.460  2.530
24.50:  2.442  2.423  2.508
24.60:  2.464  2.411  2.498
24.70:  2.422  2.377  2.445
24.80:  2.375  2.338  2.450
24.90:  2.441  2.395  2.482
precomp: avg 0.015412, max 0.218362
25.00:  2.377  2.361  2.390
25.10:  2.340  2.312  2.373
25.20:  2.462  2.447  2.473
25.30:  2.427  2.417  2.435
25.40:  2.581  2.557  2.601
25.50:  2.499  2.482  2.521
25.60:  2.533  2.500  2.557
25.70:  2.498  2.480  2.511
25.80:  2.429  2.407  2.444
25.90:  2.471  2.435  2.491
precomp: avg 0.016391, max 0.143278
after
 --- fft_small 1 thread  --- 
20.00:  2.416  2.159  3.615
20.10:  2.133  2.123  2.142
20.20:  2.265  2.250  2.285
20.30:  2.229  2.203  2.272
20.40:  2.351  2.332  2.369
20.50:  2.315  2.295  2.332
20.60:  2.244  2.215  2.273
20.70:  2.238  2.203  2.270
20.80:  2.196  2.157  2.232
20.90:  2.241  2.208  2.268
precomp: avg 0.336676, max 10.766291
21.00:  2.232  2.205  2.268
21.10:  2.192  2.155  2.230
21.20:  2.346  2.283  2.383
21.30:  2.317  2.285  2.337
21.40:  2.369  2.345  2.399
21.50:  2.331  2.309  2.353
21.60:  2.266  2.230  2.292
21.70:  2.248  2.206  2.274
21.80:  2.214  2.173  2.241
21.90:  2.254  2.215  2.263
precomp: avg 0.098097, max 0.420648
22.00:  2.241  2.220  2.256
22.10:  2.205  2.179  2.225
22.20:  2.335  2.321  2.348
22.30:  2.306  2.297  2.320
22.40:  2.394  2.318  2.471
22.50:  2.362  2.285  2.445
22.60:  2.315  2.300  2.326
22.70:  2.287  2.274  2.295
22.80:  2.254  2.244  2.260
22.90:  2.286  2.277  2.292
precomp: avg 0.052319, max 0.269028
23.00:  2.277  2.270  2.284
23.10:  2.231  2.228  2.240
23.20:  2.361  2.346  2.369
23.30:  2.322  2.316  2.327
23.40:  2.469  2.458  2.476
23.50:  2.437  2.434  2.440
23.60:  2.351  2.333  2.359
23.70:  2.319  2.304  2.327
23.80:  2.287  2.278  2.292
23.90:  2.327  2.311  2.340
precomp: avg 0.028490, max 0.223473
24.00:  2.305  2.294  2.313
24.10:  2.271  2.263  2.281
24.20:  2.388  2.375  2.399
24.30:  2.359  2.354  2.367
24.40:  2.494  2.487  2.505
24.50:  2.472  2.460  2.499
24.60:  2.440  2.401  2.472
24.70:  2.400  2.360  2.426
24.80:  2.350  2.330  2.370
24.90:  2.410  2.383  2.422
precomp: avg 0.019453, max 0.219641
25.00:  2.367  2.358  2.376
25.10:  2.297  2.266  2.326
25.20:  2.454  2.443  2.466
25.30:  2.410  2.403  2.417
25.40:  2.562  2.547  2.571
25.50:  2.521  2.519  2.523
25.60:  2.522  2.492  2.538
25.70:  2.484  2.473  2.494
25.80:  2.424  2.409  2.442
25.90:  2.470  2.435  2.497
precomp: avg 0.015258, max 0.158447

That said, it is true that umul_ppmm without --enable-assembly (which splits each 64-bit limb into two 32-bit halves) is much slower than _madd, which uses int128. It should be safe to have an implementation that checks __SIZEOF_INT128__ (or __GNUC__, as crt_helpers.h does; clang defines this too, despite the name).
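
A guard of the kind suggested here could look like the sketch below. The names `mul_wide` and `mul_wide_portable` are hypothetical and are not FLINT functions. The portable branch follows the usual four-partial-product scheme on 32-bit halves, which is in the spirit of the generic fallback, though not copied from it.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64x64 -> 128 multiply using four 32x32 -> 64 products. */
static void mul_wide_portable(uint64_t *hi, uint64_t *lo,
                              uint64_t x, uint64_t y)
{
    assert(hi != 0 && lo != 0);
    uint64_t x0 = x & 0xffffffffu, x1 = x >> 32;
    uint64_t y0 = y & 0xffffffffu, y1 = y >> 32;
    uint64_t p00 = x0 * y0, p01 = x0 * y1;
    uint64_t p10 = x1 * y0, p11 = x1 * y1;
    /* Sum of the three contributions to bits 32..63; at most 3 * (2^32 - 1). */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);
    *lo = (mid << 32) | (p00 & 0xffffffffu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Use the compiler's 128-bit integer type when it advertises one. */
static void mul_wide(uint64_t *hi, uint64_t *lo, uint64_t x, uint64_t y)
{
#if defined(__SIZEOF_INT128__)
    unsigned __int128 p = (unsigned __int128) x * y;
    *lo = (uint64_t) p;
    *hi = (uint64_t) (p >> 64);
#else
    mul_wide_portable(hi, lo, x, y);
#endif
}
```

On compilers that define `__SIZEOF_INT128__`, the first branch typically compiles down to a single widening multiply, which is why it can be much faster than the split version even without hand-written assembly.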
