PyTorch distributed training, could not unlink the shared memory file #1483

Open
albertz opened this issue Dec 31, 2023 · 0 comments

albertz commented Dec 31, 2023

[2023-12-31 11:33:54,580] INFO: Start Job: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_100k-speedpertV2/train work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B> Task: run
...
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config']
Hostname: cn-237
...
ep 301 train, step 1177, ce 0.310, ctc_4 0.773, ctc_8 0.367, fer 0.043, mem_usage:cuda:3 7.6GB
ep 301 train, step 1178, ce 0.305, ctc_4 0.668, ctc_8 0.318, fer 0.052, mem_usage:cuda:2 7.6GB
ep 301 train, step 1178, ce 0.394, ctc_4 0.727, ctc_8 0.431, fer 0.067, mem_usage:cuda:0 7.6GB
ep 301 train, step 1178, ce 0.347, ctc_4 0.528, ctc_8 0.338, fer 0.044, mem_usage:cuda:1 7.5GB
ep 301 train, step 1178, ce 0.437, ctc_4 0.769, ctc_8 0.555, fer 0.078, mem_usage:cuda:3 7.6GB
Traceback (most recent call last):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 428, in reduce_storage
    fd, size = storage._share_fd_cpu_()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 330, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: could not unlink the shared memory file /torch_1871018_294927500_44734 : No such file or directory (2)
...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete 
[2023-12-31 13:51:47,355] INFO: Run time: 2:17:52 CPU: 0.80% RSS: 34.63GB VMS: 475.21GB 
[2023-12-31 13:51:52,378] INFO: Run time: 2:17:57 CPU: 0.60% RSS: 21.74GB VMS: 347.22GB
[2023-12-31 13:51:53,986] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1868635 closing signal SIGTERM 
[2023-12-31 13:51:54,211] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1868636) of binary: /work/tools
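For reference, a minimal sketch of the code path that fails in the traceback, assuming plain `torch.multiprocessing` with the default "file_descriptor" sharing strategy on Linux (the `worker` function and tensor size here are illustrative only, not taken from the job above):

```python
# Putting a CPU tensor on a torch.multiprocessing queue pickles it via
# reduce_storage -> storage._share_fd_cpu_(). Under the "file_descriptor"
# sharing strategy this creates and then unlinks a /torch_<pid>_<id> file in
# shared memory. The RuntimeError above means that unlink failed with
# "No such file or directory", i.e. the file was already gone at that point.
import torch
import torch.multiprocessing as mp


def worker(queue):
    # The child maps the same shared-memory storage that the parent created.
    t = queue.get()
    print("worker received tensor, sum =", t.sum().item())


if __name__ == "__main__":
    print("sharing strategy:", mp.get_sharing_strategy())  # "file_descriptor" on Linux
    # A sometimes-suggested workaround for shm-related failures (an assumption,
    # not a confirmed fix for this issue): switch to the "file_system" strategy.
    # mp.set_sharing_strategy("file_system")

    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    q.put(torch.ones(1024))  # this put() goes through storage._share_fd_cpu_()
    p.join()
```

The Gloo recv timeout further down is presumably a follow-on effect: once the rank that hit this error died, the remaining ranks waited out the 1800000 ms timeout before torchrun tore the job down.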