Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981
XapaJIaMnu wants to merge 4 commits into
Conversation
…ement with beam size of 1
Apologies for the late reply, @hieuhoang. I have some benchmarks, finally. tl;dr: the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.
To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz
I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.
I'd like to do some more tests first.
I may do so when I have time, but your results are your own. I review PRs to protect results on my models.
Extra findings: I tested on a proper student model, and the conclusion is that my changes don't work when using LSH, but they do consistently offer better performance in all other cases. How different is the LSH output layer from a shortlisted output layer?
I am getting some misaligned memory accesses from the LSH scores, and they take more cycles to load. Could that be one of the issues?
LSH needs to be used with an output layer without bias; the shortlist doesn't have that restriction. I'm not sure where the misaligned memory is coming from. If the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, the memory should be aligned. If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab you need to search for the max is much smaller.
I see improvements with the word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist. How big are your output layers typically? I used a 50 50 shortlist for my previous test. I can't get an improvement with 100 1024 LSH. What settings do you use? When it comes to alignment, I'd expect the array to be 256-aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
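To make the alignment point concrete, here is a minimal sketch of the pattern described: the main loop works on full blocks and the overhang is handled by a plain scalar loop, so only the start of the array needs to be aligned. This is not the code from this PR. The names `is_aligned` and `argmax_blocked` are hypothetical, the block width of 8 floats assumes that "256" means 256 bits, and the inner lane loop is ordinary C++ that a compiler may or may not vectorise.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a `bytes` boundary
// (e.g. bytes = 32 for 256-bit registers; that width is an assumption).
inline bool is_aligned(const void* p, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Hypothetical sketch: index of the first maximum in scores[0..n).
// Assumes n > 0. Full blocks of kLanes elements are scanned lane by lane;
// the overhang (the last n % kLanes elements) is scanned by a scalar loop.
std::size_t argmax_blocked(const float* scores, std::size_t n) {
  constexpr std::size_t kLanes = 8;         // assumed: 8 floats = 256 bits
  const std::size_t full = n - n % kLanes;  // end of the last full block

  float best = scores[0];
  std::size_t bestIdx = 0;

  if (full > 0) {
    float laneBest[kLanes];
    std::size_t laneIdx[kLanes];
    for (std::size_t l = 0; l < kLanes; ++l) {
      laneBest[l] = scores[l];
      laneIdx[l] = l;
    }
    // Each lane keeps its own running maximum across blocks.
    for (std::size_t i = kLanes; i < full; i += kLanes) {
      for (std::size_t l = 0; l < kLanes; ++l) {
        if (scores[i + l] > laneBest[l]) {
          laneBest[l] = scores[i + l];
          laneIdx[l] = i + l;
        }
      }
    }
    // Reduce the lanes; on equal values prefer the smaller index.
    for (std::size_t l = 0; l < kLanes; ++l) {
      if (laneBest[l] > best || (laneBest[l] == best && laneIdx[l] < bestIdx)) {
        best = laneBest[l];
        bestIdx = laneIdx[l];
      }
    }
  }

  // Scalar pass over the overhang; no alignment requirement here.
  for (std::size_t i = full; i < n; ++i) {
    if (scores[i] > best) {
      best = scores[i];
      bestIdx = i;
    }
  }
  return bestIdx;
}
```

Calling `is_aligned(scores, 32)` on the score buffer before the search is one way to confirm whether the LSH path really hands over a misaligned pointer.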
Description
Add optimised max_element implementation as a specific case of nth_element with n = 1
Depending on the compiler used, this should speed up beam search by a factor of 2 to 10. A synthetic benchmark can be found here: https://github.com/XapaJIaMnu/maxelem_test
A summary:
Cascade Lake results
Ryzen 9 5900HS results
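For readers unfamiliar with the idea, the following is a minimal sketch of why n = 1 is a special case: a general top-n selection with n = 1 and a single linear scan return the same element. It uses only the standard library and is not the implementation in this PR; the function names are hypothetical, and both functions assume a non-empty vector with a unique maximum.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: top-1 through the general selection routine.
// Builds an index array and asks nth_element to put the index of the
// largest score at position 0. Assumes scores is non-empty.
std::size_t top1_via_nth_element(const std::vector<float>& scores) {
  std::vector<std::size_t> idx(scores.size());
  for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  std::nth_element(idx.begin(), idx.begin(), idx.end(),
                   [&](std::size_t a, std::size_t b) {
                     return scores[a] > scores[b];  // descending by score
                   });
  return idx[0];
}

// Hypothetical helper: top-1 through one linear scan, with no index
// array and no element reordering. Assumes scores is non-empty.
std::size_t top1_via_max_element(const std::vector<float>& scores) {
  return static_cast<std::size_t>(std::distance(
      scores.begin(), std::max_element(scores.begin(), scores.end())));
}
```

With scores `{0.1f, 2.5f, -3.0f, 1.7f}`, both functions return index 1. The linear scan avoids the extra index array and the partitioning work, which is the kind of saving a dedicated beam = 1 path can exploit.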
Added dependencies: none
How to test
Just load any model with the new code path and test it with a beam size of 1. In our testing this reduced runtime by about 1%.
I didn't run all regression tests because there's something broken in them right now.