Environment:

Horovod version: 0.24.0, 0.24.1; worked okay in 0.23.0
CUDA version: CUDA 11.1 / CUDA 11.3
Running on GKE "Regular" channel 1.21.6-gke.1503, nvidia drivers 450.119.04, 4 x Tesla K80
Example docker image: `determinedai/environments-dev:cuda-11.3-pytorch-1.10-lightning-1.5-tf-2.8-gpu-3da66d1`, built on top of `nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04`.
Bug report:
Model weight updates are seemingly not averaged between workers but summed up instead: when running with 2 workers, updates are 2x what they are supposed to be; with 4 workers, 4x; and so on.

This is only reproducible when running on GKE. The exact same build runs okay on a plain instance with the same nvidia driver version (450.119.04) and the same GPU (Tesla K80).
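For a quick standalone check (a minimal sketch of my own, separate from the actual test script), allreducing a tensor of ones makes the sum-vs-average distinction directly visible: correct averaging returns 1.0 on every rank, while summing returns the world size.

```python
import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Every worker contributes a tensor of ones.
ones = torch.ones(1)
if torch.cuda.is_available():
    ones = ones.cuda()

reduced = hvd.allreduce(ones, op=hvd.Average)
# Correct averaging yields 1.0 on every rank; if allreduce is effectively
# summing (the behavior described above), this prints hvd.size() instead.
print(f"rank {hvd.rank()}: allreduce(ones, op=Average) = {reduced.item()} "
      f"(world size = {hvd.size()})")
```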
The test script is the same as in #3460.
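For reference, the relevant part of such a script is a loop along these lines (a paraphrased sketch, not the verbatim script from #3460): every worker starts from the same seed and data, so with correct gradient averaging the per-step weight delta should be independent of `-np`.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.manual_seed(42)  # identical init on every worker

model = torch.nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

x = torch.ones(4, 1)
y = torch.zeros(4, 1)
for step in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if hvd.rank() == 0:
        # With correct averaging the per-step weight delta is the same for
        # any -np; on the affected setup it scales with the worker count.
        print(f"step {step}: weight = {model.weight.item():.6f}")
```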
Actual behavior:
Running `horovodrun -np 2 python train.py` outputs:

Expected behavior:
Running the same on a plain setup works okay:
Notice that the weight difference between steps is 2x when running with `-np 2`, comparing the expected and actual behaviors.
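To make the arithmetic explicit: if every worker computes the same gradient g (same seed, same data) with learning rate lr, a correctly averaged update is Δw = -lr · (g + g) / 2 = -lr · g for 2 workers, whereas a summed update is Δw = -lr · (g + g) = -2 · lr · g, exactly the 2x per-step difference observed.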
Troubleshooting:

Rebuilding with `MAKEFLAGS=-j1` didn't help.

I also tried the `horovod/horovod:0.24.1` images and discovered #3460 (pytorch model weights aren't updated properly in the `horovod/horovod:0.24.1` docker image) in the process. What's especially weird: when running `horovod/horovod:0.24.1` on GKE, the weight updates exhibit both issues, this one and the non-updating weights from #3460. So with two workers, the weights are double what they're supposed to be and also not updating:

Any ideas what could be causing this? And how exactly could the changes from #3261 affect this at runtime?