Do evaluation with gloo backend, and only on process 0#6
Conversation
|
Oops, it seems It looks like the training code can be fairly easily adjusted to use my proposed Actually, It also will fail if using PCA during evaluation, but that should probably also only happen on process 0, and not on all processes. Sorry, will update PR to make it work for all places where |
|
Thanks a lot for your contribution! The large memory request of GPU and CPU seems a significant issue in this codebase. You provide quite a good solution. I will look into your code, and adjust the training part to match it. |
Thanks for publishing this great code base! In playing around with it we ran into some issues with being unable to evaluate on large databases because the
extract_features()function requires (because of the NCCL backend) that data is send via the GPU.Additionally, by relying on the NCCL backend (which doesn't support
gather()or point-to-point communication), the data was actually shared among all processes, which caused us issues even when the data did fit on GPU (but then redundant copies caused CPU memory issues). It also meant that the evaluation was actually performed N times (where N = world_size).To fix these issues I've rewritten the
extract_features()function to use the gloo backend for gathering all the data. This way it doesn't have to go via the GPU, but once extracted the data can stay in CPU memory, it also makes it possible to gather everything on process 0, and then do the evaluation only once.I didn't thoroughly compare, but I think its now always better to use --sync_gather. Either way both cases should be faster and use less memory than before.
(I didn't want to take too many GPUs with CVPR deadline so close, so I only tested with 2 GPUs, but it should scale to larger world_size.)