CacheDataset with DDP and Multi-GPUs #11763
-
We use MONAI's CacheDataset to speed up data loading. However, when combining the LightningModule's standard training code with the DDP strategy in a multi-GPU environment, the cached dataset does not behave as expected: if the CacheDataset is given the full dataset, the initial epoch takes forever to load because each GPU tries to read in and cache ALL of the data, which is unnecessary since under DDP each GPU only uses a portion of it.

A workaround is mentioned in this MONAI issue, which suggests partitioning the data before feeding it into the CacheDataset. However, if I do the partitioning in the setup() function, the trainer trains on total_data_length // num_gpus samples each epoch instead of total_data_length. And if I build the CacheDataset with the full data in prepare_data(), the subprocesses can't access the dataset instance (it would have to be saved in self.x, which is not recommended).

So what's the best practical way to handle this? My gut feeling is that I should use the partitioned dataset on each GPU and let the loader iterate over the full length of that partition instead of only part of it. Any suggestions?
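Concretely, what I have in mind looks roughly like this (a sketch only, assuming MONAI's partition_dataset helper and a LightningDataModule; the names and the sampler flag at the end are illustrative, and whether disabling Lightning's sampler is the right way to do it is exactly my question):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from monai.data import CacheDataset, partition_dataset


class PartitionedCacheDataModule(pl.LightningDataModule):
    def __init__(self, data_dicts, transforms, batch_size=2):
        super().__init__()
        self.data_dicts = data_dicts    # full list of data dicts
        self.transforms = transforms
        self.batch_size = batch_size

    def setup(self, stage=None):
        # setup() runs once per rank, so each GPU only reads and caches its own shard
        shards = partition_dataset(
            data=self.data_dicts,
            num_partitions=self.trainer.world_size,
            shuffle=True,
            seed=42,            # same seed on every rank so the shards don't overlap
            even_divisible=True,
        )
        self.train_ds = CacheDataset(
            data=shards[self.trainer.global_rank],
            transform=self.transforms,
            cache_rate=1.0,
        )

    def train_dataloader(self):
        # plain loader: the data is already sharded per rank, so no DistributedSampler
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)


# Presumably Lightning must then be told not to wrap the loader in another
# DistributedSampler, otherwise each rank only sees 1/world_size of its shard:
# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp",
#                      replace_sampler_ddp=False)  # use_distributed_sampler=False in PL >= 2.0
```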
-
hey @bill-yc-chen
since DDP executes scripts independently across devices, maybe try DDP_Spawn instead?
https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#sharing-datasets-across-process-boundaries
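Roughly, that would look like the sketch below (untested; data_dicts, transforms, and LitModel are placeholders defined elsewhere). The idea is that the CacheDataset is built once in the launching process, and with ddp_spawn the dataloader is pickled over to the spawned workers, so the heavy load-and-transform pass doesn't have to be repeated by independent script launches on every device:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from monai.data import CacheDataset

if __name__ == "__main__":
    # data_dicts, transforms, and LitModel are placeholders defined elsewhere
    train_ds = CacheDataset(data=data_dicts, transform=transforms, cache_rate=1.0)
    train_loader = DataLoader(train_ds, batch_size=2, shuffle=True)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp_spawn",  # workers are spawned from this process, not new script launches
    )
    trainer.fit(LitModel(), train_dataloaders=train_loader)
```

Note that each spawned worker still ends up with its own in-memory copy of the cached data, so this mainly saves the repeated disk reads and deterministic transforms rather than memory.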