Small optimizations#768
Conversation
…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This have issue each thread issue calls to their own stream instead of the global default stream when marian is compiled to use a default stream per thread
… used for better tensor core usage.
| @@ -1,3 +1,8 @@ | |||
| /* Part of this file was contributed by NVIDIA under license: | |||
| * Copyright (C) 2020 NVIDIA Corporation | |||
| * SPDX-License-Identifier: MIT | |||
There was a problem hiding this comment.
What is this for? It seems the only change is the rounding of maxDims. What NVidia contribution was made here?
In general, I would be opposed to changing the comment style for licence. Marian source code does not have license information in the source files directly, but rather in a separate license file. Please let's continue to follow that pattern.
There was a problem hiding this comment.
Yes that was the only contribution. However, I was told to include a notice even in files where I make one line changes. This is just me following instructions.
There was a problem hiding this comment.
Also, I just saw the second part of your comment. Is there a process for adding NVIDIA to the license file? I'm not sure what the solution is here.
There was a problem hiding this comment.
The one that hasn't been updated since 2016 and doesn't even name Microsoft? I guess add it to your PR and shame Marcin.
There was a problem hiding this comment.
I checked internally and I can add NVIDIA to the main license file and remove the notices in all the other files!
I will take care of this in all the PRs I have submitted.
Edit: Let me know if the license change looks ok.
| Expr indices; | ||
| // I think this doesn't work if model split among gpus but not sure if it matters | ||
|
|
||
| for (auto& state : states_) { |
There was a problem hiding this comment.
Please add a comment what this logic does, as it is not obvious why the old code is not working.
There was a problem hiding this comment.
There is nothing wrong with the old code. This just needs to check if we need to ship indices to the GPU. I will add a comment explaining what this does.
There was a problem hiding this comment.
I added a comment in a recent commit but I'm not sure if it's clear enough. If so, feel free to resolve this.
| if (state.output) { | ||
| indices = state.output->graph()->indices(selIdx); | ||
| break; | ||
| } |
There was a problem hiding this comment.
What happens if neither? Is that a valid condition? If not, let's change this to else
There was a problem hiding this comment.
I think neither is a valid condition
There was a problem hiding this comment.
Ah, because it's a loop, sorry. But what happens if it never matches any of the conditions? Then indices ends up being NULL. Is it worth to ABORT_IF for that?
There was a problem hiding this comment.
Good point. I think indices can only end up being NULL if all of the states' output and cell fields are null. In that case, we will return a vector of nulls which was the behavior of the original code. (Also, I think these values are NULL on the first run of a network so we want this behavior).
Description
This PR splits out some small optimizations from PR #743.
Performance improvements from this PR as measured on a Titan V using a proxy transformer model:
Times with 1 Stream
Times with two streams
List of changes:
How to test
I ran the regression tests. On Volta, some regression tests fail due to the additional use of tensor cores. This is because prior to CUDA 11, cublas does not use tensorcores if matrices fail to meet alignment requirements. This restriction is lifted in CUDA >= 11.
The differences in the float output from the failing regression tests are small. A list of them are provided here:
OS: Ubuntu 18.04.3 LTS
Compiler gcc 7.5.0
nvcc version: 10.1.243
cmake command:
cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
Checklist