fft_small/mpn_mul: Speed up CRT reconstruction by avoiding split accumulators #2693
user202729 wants to merge 1 commit into
…mul_7_6, _reduce_big_sum_7
It would be good to profile both versions on a few different architectures first.
Apparently, this slows these two cases down on a Mac. Anyway, I realize that the reason two arrays are faster than one only holds when the fixed multiplier has far fewer than 64 bits (50 bits in this case). It should be possible to rearrange the computation somewhat, though.
Update: this does not slow down these two cases on a Mac (it's neutral, probably because the Mac has so many registers). The earlier slowdown was mostly human error (checking out an old build, forgetting to …). That said, it's true that …
Currently, in order to reduce dependency chain length, functions in `crt_helpers.h` use two parallel arrays `r` and `t` to represent a multi-limb value.

Unfortunately, this means an N-limb value needs 2N registers. In the largest case, this leads to a massive slowdown, likely because of register spilling to memory.
This PR addresses that by changing the code for the largest case to use just N registers.
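To make the trade-off concrete, here is a minimal sketch of the two representations for a multi-limb by single-limb product. It is not the code in `crt_helpers.h`: the function names, `N_LIMBS`, and the use of `unsigned __int128` (a GCC/Clang extension) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_LIMBS 4 /* illustrative size, not taken from the PR */

typedef unsigned __int128 u128; /* GCC/Clang extension (assumed available) */

/* Two-array form: low product halves go to r, high halves to t (one limb up).
   The multiplications are independent of each other; one carry chain adds
   r and t at the end. Roughly 2N limbs are live at once. */
static void mul_split(uint64_t out[N_LIMBS + 1], const uint64_t a[N_LIMBS], uint64_t b)
{
    uint64_t r[N_LIMBS + 1] = {0};
    uint64_t t[N_LIMBS + 1] = {0};
    for (int i = 0; i < N_LIMBS; i++) {
        u128 p = (u128) a[i] * b;
        r[i] = (uint64_t) p;
        t[i + 1] = (uint64_t) (p >> 64);
    }
    uint64_t carry = 0;
    for (int i = 0; i <= N_LIMBS; i++) {
        u128 s = (u128) r[i] + t[i] + carry;
        out[i] = (uint64_t) s;
        carry = (uint64_t) (s >> 64);
    }
}

/* One-accumulator form: the high half of each product is carried straight
   into the next limb. Only about N limbs are live, but every step depends
   on the carry from the previous one. */
static void mul_single(uint64_t out[N_LIMBS + 1], const uint64_t a[N_LIMBS], uint64_t b)
{
    uint64_t carry = 0;
    for (int i = 0; i < N_LIMBS; i++) {
        u128 p = (u128) a[i] * b + carry; /* cannot overflow 128 bits */
        out[i] = (uint64_t) p;
        carry = (uint64_t) (p >> 64);
    }
    out[N_LIMBS] = carry;
}

/* Self-check: both forms must agree on every limb. */
static void check_agree(const uint64_t a[N_LIMBS], uint64_t b)
{
    uint64_t x[N_LIMBS + 1], y[N_LIMBS + 1];
    mul_split(x, a, b);
    mul_single(y, a, b);
    assert(memcmp(x, y, sizeof x) == 0);
}
```

Both functions compute the same result. The difference is only in how many limbs are live at once and how long the carry dependency chain is, which is the trade-off discussed in this PR.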
The speedup can be verified by running `./build/fft_small/profile/p-mul` before and after the change: the `.40` and `.50` rows are ~2x faster, which essentially ties with #2692. (Probably because of added instruction-level parallelism.)

Since the change here is obviously much simpler than #2692, this is to be preferred.
Points for discussion: is it worth switching everything to use this one-accumulator version? It might lead to slower performance because of the need to propagate carry.
Benchmark result
`crt-register-reduce` (this PR)
`mpn-mul-crt-asm` (the other PR with massive amount of inline assembly)
baseline
Benchmark instruction