How to gather predict on ddp #5788
Replies: 25 comments 19 replies
-
You may have a look at
-
And for a specific usage, have a look at https://github.com/open-mmlab/mmdetection/blob/482f60fe55c364e50e4fc4b50893a25d8cc261b0/mmdet/apis/test.py#L160
-
@karlind my sample code follows as:

```python
def test_step(self, batch, batch_idx):
    init_ids = batch['init_ids']
    attention_mask = batch['attention_mask']
    token_type_ids = batch['token_type_ids']
    predictions = self.model(init_ids, attention_mask, token_type_ids)
```
-
With DDP training, each GPU sees only its partition of the dataset, so each process can only evaluate part of the dataset. You can use the metrics package to gather across all processes automatically.
-
@iamkucuk
-
Hi @MarsSu0618
-
Each process can predict part of the dataset; just predict as usual and gather all predicted results in
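The gathering itself can be done with `torch.distributed.all_gather`. A single-process stand-in (world_size=1, gloo backend; the address and port are arbitrary) shows the mechanics that apply per rank inside an epoch-end hook:

```python
import os
import torch
import torch.distributed as dist

# stand-in for a DDP run: one process, gloo backend
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29507")
dist.init_process_group("gloo", rank=0, world_size=1)

local_preds = torch.tensor([0.1, 0.9, 0.4])  # this rank's predictions
gathered = [torch.zeros_like(local_preds) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_preds)
all_preds = torch.cat(gathered)  # full prediction tensor, identical on every rank
dist.destroy_process_group()
print(all_preds)
```

With more than one process, each rank contributes its own `local_preds`, and every rank ends up holding the concatenation.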
-
@karlind
-
Sure you can. https://github.com/open-mmlab/mmdetection/blob/482f60fe55c364e50e4fc4b50893a25d8cc261b0/mmdet/apis/test.py#L160 This function is a good example of collecting results from multiple processes.
-
There is also a
-
@awaelchli BTW, I want to ask another question, about trainer.fit(). When I set trainer(gpus=-1, num_nodes=1, accelerator='ddp', ...) it shows
-
Is it the same with ddp_spawn? More details would be needed here. Make sure to run with the latest PL version.
-
@awaelchli Then, when I use
-
No, it's not supported currently. You can load the pytorch dump and then write it to a csv.
But at which point did this occur? In your original message you wrote that you obtained some outputs, so what changed since then? Some information on how to reproduce would be needed here.
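A minimal sketch of that workaround (the file names are illustrative, and it assumes the dump is a 1-D tensor of predictions saved with `torch.save`):

```python
import csv
import torch

# stand-in for the dump produced during testing
torch.save(torch.tensor([0.12, 0.87, 0.45]), "predictions.pt")

# load the pytorch dump and write each prediction as a csv row
preds = torch.load("predictions.pt")
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "prediction"])
    for i, p in enumerate(preds.tolist()):
        writer.writerow([i, p])
```

If the dump is a list of per-batch tensors instead, concatenate with `torch.cat` before writing.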
-
I have the same problem: using validation_epoch_end I get half of the results. I don't need to save them; I need to do some further processing on val_step_outputs, but I only get half of the results. How do I get all results from val_step_outputs so I can then do whatever I want with them?
-
@MarsSu0618 Have you solved your problem? Can you share the idea, please?
-
Any solutions to this problem?
-
Bump, also looking at this problem.
-
I made an issue here for a tutorial. Contributions welcome :) <3
-
Any update on this?
-
I am using ddp and a writer based on
Any tutorials or examples on how to properly write a predict_step or writer callback for multiple GPUs?
-
Hi, I wrote a `merge` helper. Here is the code:

```python
import torch.distributed as dist

## in lightning module
def merge(self, outputs):
    if dist.is_initialized():
        all_rank_outputs = [None for _ in range(dist.get_world_size())]
        dist.all_gather_object(all_rank_outputs, outputs)
        ## all_rank_outputs[i]: the list of batch outputs from rank i
        outputs = [x for y in all_rank_outputs for x in y]
    single_batch_output_cnt = len(outputs[0])
    ret = [[] for _ in range(single_batch_output_cnt)]
    for idx in range(single_batch_output_cnt):
        for batch in outputs:
            ret[idx].append(batch[idx])
    return ret
```

The common usage would be:

```python
def test_step(self, batch, batch_idx):
    ## calculate something here in multi-GPU
    return [object1, object2]
    ## or, for a single output:
    # return (object1,)

def test_epoch_end(self, outputs):
    ## here outputs is a list [batch1_output, batch2_output, ...] from test_step
    all_outputs = self.merge(outputs)
    ## if you return [object1, object2] in test_step:
    all_object1, all_object2 = all_outputs
    ## if you return (object1,) in test_step:
    # all_object1 = all_outputs[0]
    ## here all_object1 is a list [batch1_object1, batch2_object1, ...]
```

Feel free to ask if there is any problem.
-
Issue #16541 shows a clean example of how to use
The official PyTorch Lightning documentation on
-
Here is my simple example for your reference.

```python
from typing import Any

import lightning as L
import torch
from lightning.pytorch.callbacks import BasePredictionWriter


class PredWriter(BasePredictionWriter):
    def write_on_epoch_end(
        self,
        trainer: L.Trainer,
        pl_module: L.LightningModule,
        predictions: Any,  # complex variables are ok
        batch_indices: list[list[list[int]]],
    ) -> None:
        # gather every rank's prediction list onto all ranks
        gathered = [None] * torch.distributed.get_world_size()
        torch.distributed.all_gather_object(gathered, predictions)
        torch.distributed.barrier()
        if not trainer.is_global_zero:
            return
        # rank 0 flattens the per-rank lists into one list
        predictions = sum(gathered, [])
        ...
```
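Hooking the callback up would look roughly like this (a sketch; `model` and `predict_loader` are placeholders, and `write_interval="epoch"` is assumed so that `write_on_epoch_end` fires):

```python
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    callbacks=[PredWriter(write_interval="epoch")],
)
# return_predictions=False avoids also accumulating predictions in memory
trainer.predict(model, dataloaders=predict_loader, return_predictions=False)
```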
-
Problem
I encountered some questions about ddp, because I train my model with ddp on 2 GPUs.
When I test and predict over the test dataloader in test_step(), only half of the data gets predicted.
How do I solve this? Should I use all_gather()?
I hope someone can answer; thanks a lot.
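The halving itself comes from the DistributedSampler that DDP attaches to the dataloader: each rank iterates a disjoint shard of the dataset. A small stand-alone illustration (no process group is needed when `num_replicas`/`rank` are passed explicitly):

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    # each rank sees only every 2nd example, i.e. "half" of the dataset
    print(rank, list(sampler))
```

So to cover the whole dataset, either gather the results across ranks (as in the answers above) or run prediction on a single device.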