Dataframe.compute hangs during iterating a dataset #9405

zxgx · 2022-08-19T06:20:22Z

I'm trying to build a dataloader based on NVTabular, which provides a set of utilities for reading data.
I need to integrate it with my PyTorch code, but none of their examples fit my requirements, so I used this example script as a starting point for my dataloader.
I ran into several problems and opened issues in their repo, but I haven't received any response.

Anyway, after some digging I traced the problem to this line, which I modified as follows:

    def __iter__(self):
        for epoch in range(self.epochs):
            for i in self.indices:
                part = self._ddf.get_partition(i)
                if self.columns:
                    yield part[self.columns].compute(scheduler="synchronous")
                else:
                    print("before compute")
                    sample = part.compute(scheduler="synchronous")
                    print("after compute")
                    # yield part.compute(scheduler="synchronous")

The problem is that, while iterating over my dataset, it hangs for several minutes at random iterations right after printing "before compute".
self._ddf is a dask.dataframe created by dask.dataframe.read_parquet, so I suspect the problem comes from dask somehow.
Besides, this situation is very similar to the one described in this issue: #2866.
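
To see where the process is stuck between "before compute" and "after compute", the call could be wrapped in a watchdog that dumps the Python stack of every thread if the block takes too long. This is only a minimal sketch using the standard library's faulthandler module; the helper name dump_stacks_if_stuck and the 60-second default are hypothetical and not part of dask or NVTabular, and it only shows Python-level frames, not what native code is doing.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s=60.0, file=sys.stderr):
    """Dump all thread tracebacks to `file` every `timeout_s` seconds
    for as long as the wrapped block is still running.

    `file` must be backed by a real file descriptor, because faulthandler
    writes to the descriptor directly.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Stop the watchdog as soon as the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the suspicious call:
#
#     with dump_stacks_if_stuck(timeout_s=60.0):
#         sample = part.compute(scheduler="synchronous")
```

If the hang reproduces, the dumped traceback should at least show which Python frame the compute call is blocked in, which may help decide whether the time is spent in dask, in the parquet reader, or elsewhere.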

Here's my sample code:

import time
import os
from tqdm import tqdm
import itertools

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr    # , DLDataLoader

INPUT_DATA_DIR = "/data/criteo/train/"
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", 16384))
CONTINUOUS_COLUMNS = ["int_" + str(x) for x in range(0, 13)]
CATEGORICAL_COLUMNS = ["cat_" + str(x) for x in range(0, 26)]
LABEL_COLUMNS = ["label"]


def run():
    os.environ["LOCAL_RANK"] = '0'

    dist.init_process_group(backend='nccl')

    fname = "part_{}.parquet"
    train_paths = [os.path.join(INPUT_DATA_DIR, fname.format(i)) for i in range(64)]

    print(f"{dist.get_rank()}/{dist.get_world_size()}: device: {torch.cuda.current_device()}")

    start = time.time()
    train_data = nvt.Dataset(train_paths, engine="parquet", part_size="128MB")
    print(f"nvdtaset: {time.time() - start}, is cpu: {train_data.cpu}")

    start = time.time()
    train_data_idrs = TorchAsyncItr(
        train_data,
        batch_size=BATCH_SIZE,
        cats=CATEGORICAL_COLUMNS,
        conts=CONTINUOUS_COLUMNS,
        labels=LABEL_COLUMNS,
        global_rank=0,
        global_size=1,
        drop_last=True,
        shuffle=True,
        seed_fn=lambda: 1,
    )
    print(f"TorchAsyncItr: {time.time() - start}, len: {len(train_data_idrs)}")

    start = time.time()
    train_dataloader = DataLoader(train_data_idrs,
                                  batch_size=None,
                                  pin_memory=False,
                                  num_workers=0)
    print(f"dataloader: {time.time() - start}, len: {len(train_dataloader)}")

    data_iter = iter(train_dataloader)
    for idx in tqdm(itertools.count(), desc=f"Rank {dist.get_rank()}", ncols=0,
                  total=len(train_dataloader) if hasattr(train_dataloader, "__len__") else None):
        batch = next(data_iter)

        if idx == 5:
            break
    torch.cuda.synchronize()


if __name__ == "__main__":
    os.environ["LIBCUDF_CUFILE_POLICY"] = "ALWAYS"
    run()

I launch the distributed environment with the shell script below:

if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}
else
    device_list=(${CUDA_VISIBLE_DEVICES//","/ })
    export CUDA_VISIBLE_DEVICES=${device_list[$LOCAL_RANK]}
fi
export NVT_TAG=1
exec "$@"

The command line is torchrun --nnode=1 --nproc_per_node=2 --no_python bash dist_wrapper.sh python test.py.

I know this is kind of awkward, but as explained in this tutorial, I have to launch it this way; otherwise, all the processes would occupy the same GPU.
So I'm also wondering whether the problem comes from some conflict in the distributed setup under the hood.
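
For reference, the device selection that the wrapper script performs can be sketched in Python as follows. This is only an illustration of the shell logic above; the function name select_visible_device is hypothetical, and it assumes LOCAL_RANK is set by the launcher (as torchrun does) and that CUDA_VISIBLE_DEVICES, when present, is a plain comma-separated list.

```python
from typing import Mapping


def select_visible_device(env: Mapping[str, str]) -> str:
    """Return the value the wrapper script would export as CUDA_VISIBLE_DEVICES.

    Mirrors the shell logic: if CUDA_VISIBLE_DEVICES is unset or empty,
    use LOCAL_RANK directly; otherwise pick the LOCAL_RANK-th entry of
    the comma-separated device list.
    """
    local_rank = env["LOCAL_RANK"]
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return local_rank
    devices = visible.split(",")
    return devices[int(local_rank)]


# Hypothetical usage; this would have to run before anything
# initializes CUDA in the process:
#
#     import os
#     os.environ["CUDA_VISIBLE_DEVICES"] = select_visible_device(os.environ)
```

With this masking, each process sees exactly one GPU, which is why the sample script sets LOCAL_RANK to '0' inside run().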

Anyway, this problem is quite confusing, and I have no idea how to track it down. I would really appreciate any hints.