Configure GPU index for 'astra_cuda', select GPU currently used by PyTorch in OperatorModule by jleuschn · Pull Request #1546 · odlgroup/odl

jleuschn · 2020-03-25T16:21:20Z

This pull request implements two new features:

Choice of GPU for 'astra_cuda' via settable property 'gpu_index' in RayTransformBase
Automatically select GPU currently used by PyTorch in OperatorModule by setting 'gpu_index'

The second feature feels a little hacky, since it assumes that a 'gpu_index' property, if present, plays this special role for any Operator instance wrapped by the OperatorModule. However, this seems to me to be the least invasive way to implement this behaviour, since otherwise RayTransformBase would probably have to know about torch, e.g. by offering gpu_index = 'torch_current'.
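To make the mechanism concrete, here is a minimal sketch of the idea, not the code from this PR. `ToyRayTransform` and `call_on_device` are hypothetical stand-ins; in real use the device id would presumably come from `torch.cuda.current_device()`.

```python
class ToyRayTransform:
    """Hypothetical stand-in for an operator with a settable ``gpu_index``."""

    def __init__(self, gpu_index=None):
        self._gpu_index = gpu_index
        self._cache = {}  # pretend backend state that is tied to one GPU

    @property
    def gpu_index(self):
        return self._gpu_index

    @gpu_index.setter
    def gpu_index(self, value):
        if value != self._gpu_index:
            # State built for the old GPU is no longer valid.
            self._cache.clear()
            self._gpu_index = value

    def __call__(self, x):
        # Lazily (re)create the fake backend state for the current GPU.
        self._cache.setdefault(
            'projector', 'projector on GPU {}'.format(self._gpu_index))
        return x


def call_on_device(op, x, current_device):
    """Redirect ``op`` to ``current_device`` if it exposes ``gpu_index``."""
    if hasattr(op, 'gpu_index'):
        op.gpu_index = current_device
    return op(x)
```

The `hasattr` check is the part that feels hacky: the wrapper gives meaning to an attribute name that the generic Operator interface knows nothing about.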


kohr-h · 2020-03-30T16:16:49Z

Hey @jleuschn, thanks a lot for your effort! From a coding point of view, it looks very good.

However, I'm afraid that this solution is too hacky and not future-proof. That's mostly due to circumstances and not your fault. Let me give my reasoning here.

1. We have already opted a couple of times against introducing writable properties in Operator (and the majority of other classes). It's a recipe for bugs and makes it harder to reason about a given instance. Instead, we have chosen to always generate a new instance from scratch with the relevant parts changed (as part of the input parameters). See for instance TensorSpace.astype. That comes with its own troubles, but in the worst case a new attribute won't be copied, which is usually not as bad as an old one not being cleared.
We don't have cloning semantics for Operator. For some operators it would be easy, but more than a few build up internal state, and what do we do with that? You also realized that you need to clear the cache of a ray transform to make it work, and any change to the caching in RayTransform would have to be mirrored in the cloning function. That's a maintenance burden I'd like to avoid.
Another reason for not having cloning semantics is the incurred cost. It looks like a cheap operation on the surface, but it can be very expensive numerically, e.g., MultiplyOperator on a large space. Not having cloning semantics is a hint that cloning might not be a good idea.
2. Having a gpu_index parameter in RayTransform directly is somewhat problematic since it's only valid as an option if a GPU-based implementation is chosen. In principle a call like RayTransform(..., impl='astra_cpu', gpu_index=0) doesn't make sense and should fail. That would be ugly but could be done.
3. What's also problematic is the prospect of a potentially conflicting property in the supplied spaces. There is a semi-complete attempt to support CuPy as backend for arrays (see ENH: add cupy tensor space #1231, has been dead for a long time but will be revived), and there the GPU index is tied to the tensor space. And since I don't anticipate any use for a ray transform that has input and output on GPU 0 but computes on GPU 1, those indices would be in conflict.
So I'd like to find a way to manage the compute devices in a compatible and conflict-free manner. One way I could imagine would be to mimic CuPy's Device API in a minimal way, i.e., just the id part and the context manager. Later on it would be easy to drop in CuPy's Device instead. We could make it an "unofficial" API somewhere in astra_cuda.py for now.
4. Finally, as you say, the solution for OperatorModule is quite hacky in itself and relies on more hacky things. That's a bit too much for my taste, even if we're in a contrib package.
The main trouble is that for an already existing operator, it's hard to find out how to re-create it, and even harder to figure out how to create a valid variant of it. But there's no reason why it has to be that way. Creating an ODL Operator outside and then passing it to OperatorModule for mere wrapping is in no way better than the "obvious" alternative, namely passing a constructor plus arguments to OperatorModule, with the instantiation happening inside. That way it would be much easier (not trivial but easier) to create a new version of the same operator when needed.
In principle, both modes could be supported, e.g. by a signature __init__(self, op_or_fact, *args, **kwargs). Or, what would be the cleaner solution IMO, by keeping the current __init__ and implementing an alternative constructor like
```
@classmethod
def from_op_factory(cls, op_fact, args=(), kwargs=None):
    if kwargs is None:
        kwargs = {}
    op = op_fact(*args, **kwargs)  # construct `Operator` instance
    instance = cls(op)  # construct `OperatorModule` instance
    # Add constructor details to instance (set to `None` in `__init__`)
    instance._op_fact = op_fact
    instance._op_args = args
    instance._op_kwargs = kwargs
    return instance
```
This would be used as OperatorModule.from_op_factory(RayTransform, ...), and in _replicate_for_data_parallel the operators would be created anew. And finally, to make that work as expected, each of these constructor calls would be run inside a with Device(dev_id) context so that they use the correct GPU ID.
The final piece to make that work is to support call parameters in RayTransform that are passed on to the backend. That's a more generic and flexible approach compared to an immediate gpu_index, and it doesn't leak backend details into RayTransform.
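The "minimal Device API" idea above could be sketched roughly as follows. This is an illustration under assumptions, not an existing ODL API: `Device` here only mimics the `id` attribute and context-manager behaviour of `cupy.cuda.Device`, while `current_device_id` and `make_op` are hypothetical helpers.

```python
import threading

_state = threading.local()  # per-thread stack of active device ids


def current_device_id():
    """Return the id of the innermost active ``Device``, defaulting to 0."""
    stack = getattr(_state, 'stack', None)
    return stack[-1] if stack else 0


class Device:
    """Minimal stand-in for ``cupy.cuda.Device``: an ``id`` and a context."""

    def __init__(self, device_id=0):
        self.id = int(device_id)

    def __enter__(self):
        if not hasattr(_state, 'stack'):
            _state.stack = []
        _state.stack.append(self.id)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        _state.stack.pop()
        return False  # do not swallow exceptions


def make_op(name):
    """Hypothetical factory that records the device active at creation."""
    return {'name': name, 'gpu_index': current_device_id()}


with Device(1):
    op_on_1 = make_op('ray_trafo')  # created while device 1 is active
op_default = make_op('ray_trafo')   # created outside any context
```

A backend would read `current_device_id()` when it allocates its GPU resources, so neither the operator nor its spaces need a writable GPU attribute.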

Okay, this has become way longer than I expected, and I realize it will not be a trivial change. I will look into it tonight and make a suggestion. But I have no way to test with multiple GPUs, so you would have to help me out @jleuschn. Does that sound okay?

jleuschn · 2020-03-30T18:08:28Z

Thanks, @kohr-h , for checking the request and pointing out the issues above!
I agree with your points. 1.--3. seem quite possible to overcome. Regarding 4., the factory-based approach seems nice. Selecting the "correct" GPU seems a little bit more difficult to me, though:

While parameters and buffers are broadcast to the GPUs in PyTorch's replicate call, modules are replicated using _replicate_for_data_parallel, which is agnostic of the target GPU ID (so we cannot trivially assign the correct GPU index there). Later, in parallel_apply, the call to the wrapped module is run inside a with torch.cuda.device(device) context. (This context looks quite similar to the CuPy one, BTW.)
Thus, we need to recreate the Operator in the forward and backward call, where we can get the current torch device from torch.cuda.current_device(). Of course one could introduce a cache of Operator instances for all GPUs, so the constructor is effectively called only once for each GPU.
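A rough sketch of such a per-GPU cache, under stated assumptions: `LazyPerDeviceOperator` and `fake_ray_transform` are hypothetical names, the `gpu_index` keyword is an assumed factory parameter, and `get_device` stands in for `torch.cuda.current_device`.

```python
class LazyPerDeviceOperator:
    """Create one operator per device on first use and reuse it afterwards."""

    def __init__(self, op_fact, args=(), kwargs=None, get_device=lambda: 0):
        self._op_fact = op_fact
        self._op_args = tuple(args)
        self._op_kwargs = dict(kwargs or {})
        self._get_device = get_device  # e.g. torch.cuda.current_device
        self._ops = {}                 # device id -> operator instance

    def _operator(self):
        dev = self._get_device()
        if dev not in self._ops:
            # The constructor runs only once per device.
            self._ops[dev] = self._op_fact(
                *self._op_args, gpu_index=dev, **self._op_kwargs)
        return self._ops[dev]

    def forward(self, x):
        return self._operator()(x)


calls = []  # records the device of every constructor call


def fake_ray_transform(scale, gpu_index):
    """Hypothetical factory; a real one would build a RayTransform."""
    calls.append(gpu_index)
    return lambda x: scale * x


device = {'id': 0}
module = LazyPerDeviceOperator(
    fake_ray_transform, args=(2,), get_device=lambda: device['id'])
module.forward(1.0)
module.forward(1.0)  # reuses the operator created for device 0
device['id'] = 1
module.forward(1.0)  # first call on device 1 creates a second operator
```

The backward call would look up the operator in the same way, so the cache also covers the adjoint.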

Yes, I can help out with testing on multiple GPUs!
However, if you feel this PR is not worth its time ATM because we are not using VRAM directly, I'm okay with that too. Otherwise I could contribute by implementing (parts of) it based on your suggestions :)

kohr-h · 2020-03-30T20:56:07Z

Good point about the replication thing! Hm, so the first call to forward is the first opportunity to create the operator with the appropriate GPU index. And with the reasonable expectation that the next call will be on the same device, we could cache the operator 🤔
Okay, that sounds slightly hacky but okay. We still need to special-case the ray transform to pass the right extra args to the constructor, but that's at least easy to extend.

Regarding the larger question of whether it's worth the effort: currently I have my doubts. The whole thing is quite inefficient anyway since each ray transform does the whole roundtrip CPU->GPU->CPU no matter what, and if it's wrapped into an OperatorModule on the (or a) GPU, we get the nice chain GPU->CPU->GPU->CPU->GPU. Multi-GPU support will, of course, be much more interesting once the intermediate transfers can be eliminated. But at the moment, the speedup from the parallelization of the very middle part is the only possible gain. It might not be worth it. Did you run any speed tests?

jleuschn · 2020-03-31T09:07:24Z

Yes, I ran some speed tests. It seems that it only makes a difference if the GPUs are already heavily used by the rest of the network. TBH, I don't fully understand why, considering the mentioned chain. Maybe the chains are not in sync between the different GPUs, so some ray transform runs in parallel to another layer?
So back to the testing: I artificially made a learned primal-dual model with stupidly many parameters in order to get a high GPU utilization, and then running 'astra_cuda' on different GPUs saved 1/3 of the computation time. That is notable, but not that essential IMHO. So maybe it would be better to wait with this.
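For orientation, a saving of one third of the runtime corresponds to a 1.5x speedup. The sketch below does that arithmetic and, assuming Amdahl's law applies, estimates which fraction of the runtime would have to scale perfectly. The GPU count of 2 is purely hypothetical, since the thread does not state how many GPUs were used.

```python
from fractions import Fraction


def speedup_from_saving(saved):
    """Overall speedup when a fraction ``saved`` of the runtime disappears."""
    return 1 / (1 - saved)


def parallel_fraction(saved, n_gpus):
    """Runtime fraction that must scale perfectly over ``n_gpus`` GPUs.

    Derived from Amdahl's law: saved = p * (1 - 1 / n_gpus).
    """
    return saved / (1 - Fraction(1, n_gpus))


saved = Fraction(1, 3)                # measured saving reported above
speedup = speedup_from_saving(saved)  # Fraction(3, 2), i.e. 1.5x
p_two = parallel_fraction(saved, 2)   # hypothetical 2-GPU setup
```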

kohr-h · 2020-03-31T10:59:18Z

Very good, thanks for doing the speed test! Indeed, the gain is not nothing, but certainly not what you would hope for when throwing N times the compute power at the problem. So I agree, for now it's not necessary to invest time, but it's good to know about this limitation and that we need to think about solutions at some point.
I suggest we keep both the issue and the PR open and come back to them later.


jleuschn added 3 commits March 24, 2020 15:50

introduce gpu_index for astra_cuda backend and RayTransform

193a06b

run RayTransform on torch.cuda.current_device() if wrapped by OperatorModule

17584ac

fix: add check if cuda is available in OperatorModule

45e4e59

jleuschn mentioned this pull request Mar 25, 2020

torch.nn.DataParallel with OperatorModule wrapping 'astra_cuda'-RayTransform malfunctioning #1545

Open

kohr-h added area: contrib status: postponed type: deficiency labels Apr 19, 2020

ozanoktem mentioned this pull request May 1, 2020

Pytorch and tensor flow backend pass through the CPU. #1558

Open

fix _resize_discr in case original discr has nodes_on_bdry=True on non-affected axis

a111e34

jleuschn wants to merge 4 commits into odlgroup:master from jleuschn:master
