xref: https://dask.discourse.group/t/dataframe-compute-hangs-during-iterating-a-dataset/1033
I'm trying to build a dataloader based on NVTabular, which provides a set of utilities for reading data.
I need to integrate it with my PyTorch code, but none of their examples fit my requirements, so I used this example script as a reference for building my dataloader.
I ran into several problems and opened issues in their repo, but I haven't received any response.
After some digging, I traced the problem to this line. I modified it into the form below:
The problem is that, while iterating over my dataset, it hangs for several minutes at random iterations after printing `before compute`. `self._ddf` is a dask.dataframe created by dask.dataframe.read_parquet, so I suspect the problem comes from dask somehow. Besides, the situation is very similar to the one described in this issue: #2866.
Here's my sample code:
I launch the distributed environment with the shell script below:
The command line is `torchrun --nnode=1 --nproc_per_node=2 --no_python bash dist_wrapper.sh python test.py`. I know this is kind of awkward, but as explained in this tutorial:
I have to launch it this way; otherwise, all the processes would occupy the same GPU.
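For illustration only, the per-process GPU pinning that a wrapper like this is typically used for can be sketched in Python. This is an assumption about the general technique, not the contents of dist_wrapper.sh; the function name `pin_gpu_from_local_rank` is hypothetical. It relies on the `LOCAL_RANK` environment variable that torchrun sets for each worker.

```python
import os


def pin_gpu_from_local_rank(env=None):
    """Restrict the process to the GPU whose index equals LOCAL_RANK.

    Hypothetical helper: `env` defaults to os.environ and is a parameter
    only so the logic can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    # torchrun exports LOCAL_RANK for every worker; fall back to "0"
    # when the script is run without a launcher.
    local_rank = env.get("LOCAL_RANK", "0")
    env["CUDA_VISIBLE_DEVICES"] = local_rank
    return local_rank
```

A call like this only takes effect if it runs before any library initializes CUDA in the process, which is one reason the same step is often done in a shell wrapper instead.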
So I'm also wondering whether the problem comes from a conflict between distributed settings under the hood.
Anyway, this problem is quite confusing, and I have no idea how to track it down. I would really appreciate any hints.
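One generic way to narrow down where a hang like this occurs is to time the suspect call and dump every thread's stack if it runs too long. The sketch below uses only Python's standard `faulthandler` module. It is not specific to dask or NVTabular, and the `watch_for_hang` name and its parameters are hypothetical.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def watch_for_hang(label, timeout_s=60.0, stream=sys.stderr):
    """Time a block and dump all thread stacks if it exceeds timeout_s.

    `stream` must be a real file object (it needs a file descriptor).
    """
    # Arm a watchdog that writes tracebacks every timeout_s seconds
    # for as long as the block is still running.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Disarm the watchdog and report how long the block took.
        faulthandler.cancel_dump_traceback_later()
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s", file=stream)


# Usage (ddf stands for any dask DataFrame; not taken from the script above):
# with watch_for_hang("compute", timeout_s=30):
#     df = ddf.compute()
```

If the block stalls, the dumped tracebacks show which frame each thread is blocked in, which may help tell a scheduler wait apart from I/O or a lock.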