Training crashed mid-epoch with a shared-memory error in the data-loading pipeline, followed by a Gloo timeout on the other ranks. Relevant excerpt of the log:

```
[2023-12-31 11:33:54,580] INFO: Start Job: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_100k-speedpertV2/train work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B> Task: run
...
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config']
Hostname: cn-237
...
ep 301 train, step 1177, ce 0.310, ctc_4 0.773, ctc_8 0.367, fer 0.043, mem_usage:cuda:3 7.6GB
ep 301 train, step 1178, ce 0.305, ctc_4 0.668, ctc_8 0.318, fer 0.052, mem_usage:cuda:2 7.6GB
ep 301 train, step 1178, ce 0.394, ctc_4 0.727, ctc_8 0.431, fer 0.067, mem_usage:cuda:0 7.6GB
ep 301 train, step 1178, ce 0.347, ctc_4 0.528, ctc_8 0.338, fer 0.044, mem_usage:cuda:1 7.5GB
ep 301 train, step 1178, ce 0.437, ctc_4 0.769, ctc_8 0.555, fer 0.078, mem_usage:cuda:3 7.6GB
Traceback (most recent call last):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 428, in reduce_storage
    fd, size = storage._share_fd_cpu_()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 330, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: could not unlink the shared memory file /torch_1871018_294927500_44734 : No such file or directory (2)
...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[2023-12-31 13:51:47,355] INFO: Run time: 2:17:52 CPU: 0.80% RSS: 34.63GB VMS: 475.21GB
[2023-12-31 13:51:52,378] INFO: Run time: 2:17:57 CPU: 0.60% RSS: 21.74GB VMS: 347.22GB
[2023-12-31 13:51:53,986] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1868635 closing signal SIGTERM
[2023-12-31 13:51:54,211] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1868636) of binary: /work/tools
```
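The failing `storage._share_fd_cpu_()` call is part of PyTorch's default `file_descriptor` sharing strategy for CPU tensors, which the multiprocessing queue feeder uses when passing batches between the data-loader worker and the trainer process. A possible mitigation (an assumption based on the traceback, not verified against this setup) would be to switch to the `file_system` sharing strategy before any workers are spawned; a minimal sketch:

```python
# Untested sketch: avoid the fd-based /torch_* shm files implicated in the
# "could not unlink the shared memory file" error by using the file_system
# strategy instead. Trade-off: stale files can linger in /dev/shm if a
# process dies uncleanly. Must run before any worker processes start.
import torch.multiprocessing as mp

if __name__ == "__main__":
    print("current strategy:", mp.get_sharing_strategy())  # "file_descriptor" by default on Linux
    mp.set_sharing_strategy("file_system")
```

The later Gloo `recv` timeout looks like a downstream symptom rather than a separate bug: once the feeder on one rank died, the remaining ranks presumably blocked in the next collective until the 1800000 ms (30 min) timeout expired and torchelastic tore the job down.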