
Known issue: setting forkserver mode in multiprocessing module is needed for parallel data load from HDFS #123

Open
belltailjp opened this issue May 7, 2020 · 3 comments


belltailjp commented May 7, 2020

Deep learning frameworks support multi-process data loading, e.g. the num_workers option of DataLoader in PyTorch and MultiprocessIterator in Chainer.
They use the multiprocessing module to launch worker processes, which on Linux uses fork by default.
When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied into the child processes. The copied connections are eventually destroyed when one of the workers finishes its work (in PyTorch's DataLoader this happens at the end of each epoch). However, the remaining worker processes still need to talk to HDFS, and since the connection has been closed unexpectedly and outside their control, they break.

As far as I know, the actual error message or symptom that users face differs depending on the situation (freezing, strange errors such as RuntimeError: threads can only be started once, etc.), which makes troubleshooting even more difficult.

The workaround for this issue is to set the multiprocessing start method to forkserver before accessing HDFS.
For a similar reason (preventing the MPI context from being broken after a fork), the ChainerCV and Chainer examples apply the same workaround, and it works for the PFIO+HDFS case too:
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
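A minimal sketch of the workaround (the comment about where HDFS access and the DataLoader would go reflects the ordering described above; any PFIO/HDFS calls are assumed to come only after the start method is set):

```python
import multiprocessing

def main():
    # Switch the start method away from the default "fork" BEFORE any
    # HDFS connection exists. Workers are then launched from a clean
    # forkserver helper process and inherit no HDFS connection state.
    # Note: forkserver is only available on Unix-like platforms.
    multiprocessing.set_start_method('forkserver')

    # Only after this point: open HDFS (e.g. via PFIO) and create the
    # worker processes, e.g. PyTorch's DataLoader(..., num_workers=N).

if __name__ == '__main__':
    main()
```

set_start_method() may be called at most once per process, so it belongs at the very top of the entry point, guarded by `if __name__ == '__main__':`.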

@belltailjp
Member Author

related issue: #81

@kuenishi
Member

kuenishi commented Mar 5, 2021

The V2 API introduced proactive fork detection before entering PyArrow functions by checking process IDs; when a fork is detected, it raises an exception by default. With the vanilla Hdfs() class, developers can now catch fork-after-HDFS-init as a bug, fix their code, and switch to forkserver. What do you think?
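The PID-check idea can be sketched as follows. This is an illustration of the mechanism described above, not PFIO's actual implementation; the class and method names are hypothetical:

```python
import os

class ForkDetector:
    """Illustrative PID-based fork detection (names are hypothetical)."""

    def __init__(self):
        # Record the PID of the process that initialized the connection.
        self._pid = os.getpid()

    def check(self):
        # A child created by fork() inherits this object unchanged but
        # runs under a new PID, so a mismatch means we are executing
        # after an unsafe fork.
        if os.getpid() != self._pid:
            raise RuntimeError(
                "fork detected after HDFS initialization; "
                "set the forkserver start method before opening HDFS"
            )
```

Calling check() at the top of each I/O entry point turns the previously silent connection breakage into an immediate, explainable error.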

