Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference #776
Open
rhenry-nv wants to merge 18 commits into
…erence when processing the lemmas
…since the allocator manages its own memory pool, so the memory won't get released by a cuda free. Additionally, two kernels may get the same pointer, but they cannot execute concurrently because a single thread does not launch concurrent kernels. Since there is one allocator per thread, no two kernels can ever race on the same pointer (I think). I have not seen any issues after removing this sync.
…expose more parallelism when adding into the lemmas
FYI: I am currently requesting internally to remove the notices in each file and for NVIDIA to be added to the license file. I will take care of the licensing once I get confirmation.
rhenry-nv commented Dec 15, 2020
auto factorMaxima = max(logits_[g]->loss(), -1);
auto factorMasks = constant(getFactorMasks(g, shortlist ? shortlist->indices() : std::vector<WordIndex>()));
sel = sel + factorMaxima * factorMasks; // those lemmas that don't have a factor get multiplied with 0
if(numGroups > 1 && graph()->isInference() && graph()->getBackend()->getDeviceId().type == DeviceType::gpu) {
I wasn't sure how to remove this fork. It would be better placed under the expression operator, but moving it down makes the operator interface a bit ugly and introduces some code duplication. Feedback on this in particular would be greatly appreciated.
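For discussion, here is a minimal sketch of one way to keep the fork at the call site while making it read as a single decision: give the condition a name. `DeviceKind`, `DecodeContext`, and `useFusedLemmaOperator` are hypothetical stand-ins invented for this example, not Marian types or functions; the predicate only mirrors the three checks in the `if` quoted above.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the values the condition in the diff reads;
// these are NOT Marian's real types.
enum class DeviceKind { cpu, gpu };

struct DecodeContext {
  std::size_t numGroups;  // number of factor groups (assumed to mirror numGroups in the diff)
  bool inference;         // true when the graph runs in inference mode
  DeviceKind device;      // device the graph's backend runs on
};

// Same three-part test as the quoted `if`, given a name so the call site
// reads as one decision: more than one group, inference mode, GPU backend.
inline bool useFusedLemmaOperator(const DecodeContext& ctx) {
  return ctx.numGroups > 1 && ctx.inference && ctx.device == DeviceKind::gpu;
}
```

This does not remove the fork or the duplication concern; it only isolates the condition so that it could later move under the operator without touching callers.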
Description
This PR adds a new GPU inference operator for getting the lemma logits of a factored vocabulary. It demonstrates a significant speedup on GPU inference over PR #772.
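As a rough illustration of what "lemma logits" means here, the sketch below is a plain CPU reference for a single batch entry, modelled on the existing expression `sel = sel + factorMaxima * factorMasks`. It is not the GPU operator from this PR; the function name, argument names, and the flat `std::vector` layout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one batch entry (not Marian's API).
//   lemmaLogits : one logit per lemma
//   factorLogits: factorLogits[g] holds the logits of one factor group
//   factorMasks : factorMasks[g][l] is 1 if lemma l takes a factor from
//                 group g, and 0 otherwise
std::vector<float> lemmaLogitsWithFactorMaxima(
    const std::vector<float>& lemmaLogits,
    const std::vector<std::vector<float>>& factorLogits,
    const std::vector<std::vector<float>>& factorMasks) {
  assert(factorLogits.size() == factorMasks.size());
  std::vector<float> sel = lemmaLogits;
  for (std::size_t g = 0; g < factorLogits.size(); ++g) {
    assert(!factorLogits[g].empty());
    assert(factorMasks[g].size() == sel.size());
    // Best achievable logit within this factor group.
    const float factorMax =
        *std::max_element(factorLogits[g].begin(), factorLogits[g].end());
    // Lemmas that don't have a factor from this group get multiplied with 0.
    for (std::size_t l = 0; l < sel.size(); ++l)
      sel[l] += factorMax * factorMasks[g][l];
  }
  return sel;
}
```

Written this way, each factor group costs one reduction plus one masked add over all lemmas. In the expression-graph version those steps are separate operators for each group, which is presumably the overhead a dedicated operator is meant to avoid.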
Here are some perf numbers relative to PR #772:
Times from a proxy model with one stream, as measured on a Titan V.
Times from a proxy model with two streams, as measured on a Titan V.
List of changes:
Added dependencies: cub
How to test
I ran the regression tests and they all passed. I also manually tested on a proxy model and the outputs from master exactly match the outputs after this change was made.
CMake command: cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
Ubuntu - 18.04.3 LTS
nvcc - 10.1.243
gcc - 7.5.0
Checklist