How to gather predict on ddp #5788
Replies: 25 comments 19 replies
-
You may have a look at
-
And for a specific usage, have a look at https://github.com/open-mmlab/mmdetection/blob/482f60fe55c364e50e4fc4b50893a25d8cc261b0/mmdet/apis/test.py#L160
-
@karlind my sample code follows as:

```python
def test_step(self, batch, batch_idx):
    init_ids = batch['init_ids']
    attention_mask = batch['attention_mask']
    token_type_ids = batch['token_type_ids']
    predictions = self.model(init_ids, attention_mask, token_type_ids)
```
-
With DDP training, each GPU sees only its partition of the dataset, so each process can only evaluate part of the dataset. You can use the metrics package to gather across all processes automatically.
-
@iamkucuk
-
Hi @MarsSu0618
-
Each process can predict part of the dataset; just predict as usual and gather all predicted results in
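The gathering itself can be done with `torch.distributed.all_gather`. A single-process stand-in (world_size=1, gloo backend; the address and port are arbitrary) shows the mechanics that apply per rank inside an epoch-end hook:

```python
import os
import torch
import torch.distributed as dist

# stand-in for a DDP run: one process, gloo backend
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29507")
dist.init_process_group("gloo", rank=0, world_size=1)

local_preds = torch.tensor([0.1, 0.9, 0.4])  # this rank's predictions
gathered = [torch.zeros_like(local_preds) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_preds)
all_preds = torch.cat(gathered)  # full prediction tensor, identical on every rank
dist.destroy_process_group()
print(all_preds)
```

With more than one process, each rank contributes its own `local_preds`, and every rank ends up holding the concatenation.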
-
@karlind
-
Sure you can. https://github.com/open-mmlab/mmdetection/blob/482f60fe55c364e50e4fc4b50893a25d8cc261b0/mmdet/apis/test.py#L160 This function is a good example of collecting results from multiple processes.
-
There is also a
-
@awaelchli BTW, I want to ask another question, about trainer.fit(). When I set trainer(gpus=-1, num_nodes=1, accelerator='ddp', ...) it shows
-
Is it the same with ddp_spawn? More details would be needed here. Make sure to run with the latest PL version.
-
@awaelchli Then, when I use
-
No, it's not supported currently. You can load the pytorch dump and then write it to a csv.
But at which point did this occur? In your original message you wrote that you obtained some outputs, so what changed since then? Some information on how to reproduce would be needed here.
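A minimal sketch of that workaround (the file names are illustrative, and it assumes the dump is a 1-D tensor of predictions saved with `torch.save`):

```python
import csv
import torch

# stand-in for the dump produced during testing
torch.save(torch.tensor([0.12, 0.87, 0.45]), "predictions.pt")

# load the pytorch dump and write each prediction as a csv row
preds = torch.load("predictions.pt")
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "prediction"])
    for i, p in enumerate(preds.tolist()):
        writer.writerow([i, p])
```

If the dump is a list of per-batch tensors instead, concatenate with `torch.cat` before writing.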
-
I have the same problem: using validation_epoch_end I get half of the results. I don't need to save them; I need to do some further processing on val_step_outputs, but I only get half of the results. How do I get all results from val_step_outputs so I can then do whatever I want with them?
-
@MarsSu0618 Have you solved your problem? Can you share the idea, please?
-
Any solutions to this problem?
-
Bump, also looking at this problem.
-
I made an issue here for a tutorial. Contributions welcome :) <3
-
Any update on this?
-
I am using ddp and a writer based on
Any tutorials or examples on how to properly write a predict_step or writer callback for multiple GPUs?
-
Hi, I wrote a `merge` helper. Here is the code:

```python
import torch.distributed as dist

## in lightning module
def merge(self, outputs):
    if dist.is_initialized():
        all_rank_outputs = [None for _ in range(dist.get_world_size())]
        dist.all_gather_object(all_rank_outputs, outputs)
        ## all_rank_outputs[i]: the list of batch outputs from rank i
        outputs = [x for y in all_rank_outputs for x in y]
    single_batch_output_cnt = len(outputs[0])
    ret = [[] for _ in range(single_batch_output_cnt)]
    for idx in range(single_batch_output_cnt):
        for batch in outputs:
            ret[idx].append(batch[idx])
    return ret
```

The common usage would be:

```python
def test_step(self, batch, batch_idx):
    ## calculate something here in multi-GPU
    return [object1, object2]
    ## or, for a single output:
    # return (object1,)

def test_epoch_end(self, outputs):
    ## here outputs is a list [batch1_output, batch2_output, ...] from test_step
    all_outputs = self.merge(outputs)
    ## if you return [object1, object2] in test_step:
    all_object1, all_object2 = all_outputs
    ## if you return (object1,) in test_step:
    # all_object1 = all_outputs[0]
    ## here all_object1 is a list [batch1_object1, batch2_object1, ...]
```

Feel free to ask if there is any problem.
-
Issue #16541 shows a clean example of how to use
The official PyTorch Lightning documentation on
-
Here is my simple example for your reference.

```python
from typing import Any

import lightning as L
import torch
from lightning.pytorch.callbacks import BasePredictionWriter


class PredWriter(BasePredictionWriter):
    def write_on_epoch_end(
        self,
        trainer: L.Trainer,
        pl_module: L.LightningModule,
        predictions: Any,  # complex variables are ok
        batch_indices: list[list[list[int]]],
    ) -> None:
        # gather every rank's prediction list onto all ranks
        gathered = [None] * torch.distributed.get_world_size()
        torch.distributed.all_gather_object(gathered, predictions)
        torch.distributed.barrier()
        if not trainer.is_global_zero:
            return
        # rank 0 flattens the per-rank lists into one list
        predictions = sum(gathered, [])
        ...
```
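Hooking the callback up would look roughly like this (a sketch; `model` and `predict_loader` are placeholders, and `write_interval="epoch"` is assumed so that `write_on_epoch_end` fires):

```python
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    callbacks=[PredWriter(write_interval="epoch")],
)
# return_predictions=False avoids also accumulating predictions in memory
trainer.predict(model, dataloaders=predict_loader, return_predictions=False)
```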
-
Problem
I encountered some questions about ddp, because I train my model with ddp on 2 GPUs.
When I test and predict over the test dataloader in test_step(), only half of the data gets predicted.
How do I solve this? Should I use all_gather()?
I hope someone can answer; thanks a lot.
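The halving itself comes from the DistributedSampler that DDP attaches to the dataloader: each rank iterates a disjoint shard of the dataset. A small stand-alone illustration (no process group is needed when `num_replicas`/`rank` are passed explicitly):

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    # each rank sees only every 2nd example, i.e. "half" of the dataset
    print(rank, list(sampler))
```

So to cover the whole dataset, either gather the results across ranks (as in the answers above) or run prediction on a single device.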