
DDP strategy doesn't work for on_validation_epoch_end, always hang #19783

Open
jzhanghzau opened this issue Apr 16, 2024 · 4 comments
Labels
logging Related to the `LoggerConnector` and `log()` question Further information is requested ver: 2.1.x

Comments

@jzhanghzau

jzhanghzau commented Apr 16, 2024

Bug description

My code looks like the snippet below. I want to compute a validation metric over the entire validation dataset, so I append each batch's results to a list and then compute the metric in the on_validation_epoch_end function.

It works fine with a single GPU, but when I use the DDP strategy to do the same thing, it always hangs during validation.

What version are you seeing the problem on?

v2.1, v2.2

How to reproduce the bug

# in __init__
self.validation_step_outputs = []
self.validation_step_clusters = []

def validation_step(self, batch, batch_idx):
    batch_tokens, clusters = batch
    projection = self._common_step(batch_tokens)

    self.validation_step_outputs.append(projection)
    self.validation_step_clusters.append(clusters)

def on_validation_epoch_end(self):
    if self.trainer.is_global_zero:  # only the rank 0 process enters this block
        all_preds = torch.cat(self.validation_step_outputs, dim=0)

        all_clusters = LabelEncoder().fit_transform(
            list(itertools.chain.from_iterable(self.validation_step_clusters)))
        all_clusters = torch.tensor(all_clusters)

        self.validation_step_outputs.clear()
        self.validation_step_clusters.clear()

        loss = loss_func(all_preds, all_clusters)
        accuracy = self._cal_accuracy(all_preds, all_clusters)

        self.log('validation_loss', loss, on_epoch=True, prog_bar=True)
        self.log('accuracy', accuracy, on_epoch=True, prog_bar=True)

    self.trainer.strategy.barrier()

Error messages and logs

[screenshot of the hang attached in the original issue; no text logs provided]

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @carmocca

@jzhanghzau jzhanghzau added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Apr 16, 2024
@awaelchli
Member

@jzhanghzau self.log() issues collective calls, so you can't just call it on "rank zero" only. If you want to do that, pass self.log(rank_zero_only=True).

Here are the relevant docs for this:
https://lightning.ai/docs/pytorch/stable/visualize/logging_advanced.html#rank-zero-only
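Applied to the snippet in the issue, the guarded branch would look roughly like the sketch below. It is written as a standalone function so the flow is visible outside a LightningModule: `log`, `is_global_zero`, and `barrier` stand in for `self.log`, `self.trainer.is_global_zero`, and `self.trainer.strategy.barrier`, and `compute_metrics` stands in for the loss/accuracy helpers — all hypothetical names, not Lightning API:

```python
import torch

def epoch_end_rank_zero(outputs, compute_metrics, log, is_global_zero, barrier):
    # Only rank 0 enters this branch, so every log() call inside it must
    # opt out of cross-rank synchronization via rank_zero_only=True;
    # otherwise rank 0 blocks waiting for collectives the other ranks
    # never issue, and the job hangs.
    if is_global_zero:
        all_preds = torch.cat(outputs, dim=0)
        loss, accuracy = compute_metrics(all_preds)
        log("validation_loss", loss, prog_bar=True, rank_zero_only=True)
        log("accuracy", accuracy, prog_bar=True, rank_zero_only=True)
    barrier()
```

The final barrier is still safe here because every rank reaches it regardless of the guard.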

@awaelchli awaelchli added question Further information is requested logging Related to the `LoggerConnector` and `log()` and removed bug Something isn't working labels Apr 16, 2024
@jzhanghzau
Author

@jzhanghzau self.log() issues collective calls, so you can't just call it on "rank zero" only. If you want to do that, pass self.log(rank_zero_only=True).

Here are the relevant docs for this: https://lightning.ai/docs/pytorch/stable/visualize/logging_advanced.html#rank-zero-only

Thanks for your quick reply! If I set self.log(rank_zero_only=True), it seems I'm not allowed to use callbacks that monitor the logged metric. What if I still want to use a callback (EarlyStopping) — how should I organize my code? Remove the if self.trainer.is_global_zero: guard and let every process go through the logic I presented above?
Thanks again.

@awaelchli
Member

Yes, that would be another option.
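A rough sketch of that option: every rank runs the epoch-end logic on its own shard, so the self.log() collectives match up across processes and EarlyStopping can monitor the metric. The Lightning-specific calls are shown as comments; `aggregate_epoch` is a hypothetical helper, and the tensor math assumes integer class labels (the original code ran LabelEncoder over string cluster ids first):

```python
import torch

def aggregate_epoch(outputs, clusters):
    # Concatenate the per-batch tensors collected in validation_step.
    all_preds = torch.cat(outputs, dim=0)      # (N, num_classes)
    all_clusters = torch.cat(clusters, dim=0)  # (N,)
    accuracy = (all_preds.argmax(dim=1) == all_clusters).float().mean()
    return all_preds, all_clusters, accuracy

# Inside on_validation_epoch_end, with the is_global_zero guard removed,
# every rank would execute roughly:
#
#     preds, labels, acc = aggregate_epoch(self.validation_step_outputs,
#                                          self.validation_step_clusters)
#     self.validation_step_outputs.clear()
#     self.validation_step_clusters.clear()
#     # sync_dist=True reduces the per-rank values across processes
#     self.log("accuracy", acc, on_epoch=True, prog_bar=True, sync_dist=True)
```

Note that with this approach each rank's metric only covers its shard of the validation set; the logged value is the cross-rank reduction, not a metric computed on the concatenated full dataset.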

@awaelchli awaelchli removed the needs triage Waiting to be triaged by maintainers label Apr 16, 2024