Environment:

Horovod version: 0.24.0, 0.24.1; worked okay in 0.23.0
CUDA version: CUDA 11.1 / CUDA 11.3
Running on GKE "Regular" channel 1.21.6-gke.1503, nvidia drivers 450.119.04, 4 x Tesla K80
Example docker image: `determinedai/environments-dev:cuda-11.3-pytorch-1.10-lightning-1.5-tf-2.8-gpu-3da66d1`, built on top of `nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04`.
Bug report:
Model weight updates are seemingly not averaged between workers but summed up instead: when running with 2 workers, updates are 2x what they are supposed to be; with 4 workers, 4x; and so on.

This is only reproducible when running on GKE. The exact same build runs okay on a plain instance with the same nvidia driver version (450.119.04) and the same GPU (Tesla K80).
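For a quick standalone check (a minimal sketch of my own, separate from the actual test script), allreducing a tensor of ones makes the sum-vs-average distinction directly visible: correct averaging returns 1.0 on every rank, while summing returns the world size.

```python
import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Every worker contributes a tensor of ones.
ones = torch.ones(1)
if torch.cuda.is_available():
    ones = ones.cuda()

reduced = hvd.allreduce(ones, op=hvd.Average)
# Correct averaging yields 1.0 on every rank; if allreduce is effectively
# summing (the behavior described above), this prints hvd.size() instead.
print(f"rank {hvd.rank()}: allreduce(ones, op=Average) = {reduced.item()} "
      f"(world size = {hvd.size()})")
```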
The test script is the same as in #3460.
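For reference, the relevant part of such a script is a loop along these lines (a paraphrased sketch, not the verbatim script from #3460): every worker starts from the same seed and data, so with correct gradient averaging the per-step weight delta should be independent of `-np`.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.manual_seed(42)  # identical init on every worker

model = torch.nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

x = torch.ones(4, 1)
y = torch.zeros(4, 1)
for step in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if hvd.rank() == 0:
        # With correct averaging the per-step weight delta is the same for
        # any -np; on the affected setup it scales with the worker count.
        print(f"step {step}: weight = {model.weight.item():.6f}")
```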
Actual behavior:
Running `horovodrun -np 2 python train.py` outputs:

Expected behavior:
Running the same on a plain setup works okay:
Notice that the weight difference between steps is 2x when running with `-np 2`, comparing the expected and actual behaviors.
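To make the arithmetic explicit: if every worker computes the same gradient g (same seed, same data) with learning rate lr, a correctly averaged update is Δw = -lr · (g + g) / 2 = -lr · g for 2 workers, whereas a summed update is Δw = -lr · (g + g) = -2 · lr · g, exactly the 2x per-step difference observed.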
Troubleshooting:

Rebuilding with `MAKEFLAGS=-j1` didn't help.

I also tried the `horovod/horovod:0.24.1` images and discovered #3460 (pytorch model weights aren't updated properly in the `horovod/horovod:0.24.1` docker image) in the process. What's especially weird: when running `horovod/horovod:0.24.1` on GKE, the weight updates exhibit both issues, this one and the non-updating weights from #3460. So with two workers, the weights are double what they're supposed to be and also not updating:

Any ideas what could be causing this? And how exactly could the changes from #3261 affect this at runtime?