
pytorch model weight updates aren't averaged when running on GKE #3461

Closed
ioga opened this issue Mar 9, 2022 · 1 comment

ioga commented Mar 9, 2022

Environment:

  1. Framework: PyTorch
  2. Framework version: torch 1.9 / torch 1.10.2
  3. Horovod version: 0.24.0, 0.24.1; worked okay in 0.23.0
  4. CUDA version: cuda 11.1 / cuda 11.3

Running on GKE "Regular" channel 1.21.6-gke.1503, nvidia drivers 450.119.04, 4 x Tesla K80

Example docker image: determinedai/environments-dev:cuda-11.3-pytorch-1.10-lightning-1.5-tf-2.8-gpu-3da66d1, built on top of nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04.

Bug report:
Model weight updates are seemingly not averaged across workers but summed instead: with 2 workers the updates are 2x what they should be, with 4 workers 4x, and so on.

This is only reproducible when running on GKE. The exact same build runs fine on a plain instance with the same NVIDIA driver version (450.119.04) and the same GPU (Tesla K80).

The test script is the same as in #3460:

from typing import Tuple

import torch.utils.data
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

class OnesDataset(torch.utils.data.Dataset):
    def __len__(self) -> int:
        return 64

    def __getitem__(self, index: int) -> Tuple:
        return torch.Tensor([float(1)])

train_dataset = OnesDataset()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16)

model = torch.nn.Linear(1, 1, False)
model.weight.data.fill_(0)
model = model.cuda()

loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), 0.1)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for epoch in range(1):
    for batch_idx, data in enumerate(train_loader):
        data = data.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, data)
        loss.backward()
        optimizer.step()
        if hvd.rank() == 0:
            weight = model.weight.data.item()
            print('weight:', weight)

Actual behavior:

Running horovodrun -np 2 python train.py outputs:

[0]<stdout>:weight: 0.64000004529953
[0]<stdout>:weight: 0.784000039100647
[0]<stdout>:weight: 0.8704000115394592

Expected behavior:

Running the same command on a plain (non-GKE) setup produces the expected values:

[0]<stdout>:weight: 0.20000000298023224
[0]<stdout>:weight: 0.36000001430511475
[0]<stdout>:weight: 0.4880000054836273
[0]<stdout>:weight: 0.590399980545044

Notice that with -np 2, the per-step weight change in the actual output is 2x the expected per-step change.
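
For reference, both trajectories can be reproduced by hand (a quick sanity calculation, not part of the original runs): with all inputs and targets equal to 1 and MSE loss, the gradient on the single weight is 2(w - 1), so an averaged SGD step with lr 0.1 gives w <- 0.8*w + 0.2, while a summed update across 2 workers gives w <- 0.6*w + 0.4, which matches the actual output above.

# Hand-rolled SGD iteration for both cases (sketch; lr and worker count taken from the repro above).
w_avg, w_sum = 0.0, 0.0
lr, workers = 0.1, 2
for step in range(4):
    w_avg -= lr * 2 * (w_avg - 1.0)            # averaged gradients: 0.2, 0.36, 0.488, 0.5904
    w_sum -= lr * workers * 2 * (w_sum - 1.0)  # summed gradients:   0.4, 0.64, 0.784, 0.8704
    print(step, round(w_avg, 4), round(w_sum, 4))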

Troubleshooting:

[1,0]<stdout>:weight: 0.4000000059604645                                                                                 
[1,0]<stdout>:weight: 0.4000000059604645                                                                                 
[1,0]<stdout>:weight: 0.4000000059604645                                                                                 
[1,0]<stdout>:weight: 0.4000000059604645

Any ideas what could be causing this? And how exactly could the changes from #3261 be affecting this at runtime?
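
As a possible workaround while this is being investigated (untested on the GKE setup, just based on the documented DistributedOptimizer API), the reduction op could be requested explicitly:

optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Average,  # explicitly request averaging; this should already be the default
)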

ioga added the bug label Mar 9, 2022

nvcastet commented Mar 9, 2022

FYI @maxhgerlach ^^

ioga closed this as completed Mar 11, 2022