PyTorch distributed training CPU OOM with sync_on_cpu #1490

Open
albertz opened this issue Jan 6, 2024 · 1 comment

albertz commented Jan 6, 2024

It trains fine for a while, and then often I get a CPU OOM, which looks like:

[2024-01-04 11:41:05,662] INFO: Start Job: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> Task: run
...
RETURNN starting up, version 1.20240104.103023+git.a0453738, date/time 2024-01-04-11-41-18 (UTC+0000), pid 2544354, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8/output/returnn.config']
...
MEMORY: total (main 2544354, 2024-01-04, 11:42:31, 20 procs): pss=8.2GB uss=5.9GB
MEMORY: total (main 2544357, 2024-01-04, 11:42:31, 20 procs): pss=8.2GB uss=5.8GB
MEMORY: total (main 2544355, 2024-01-04, 11:42:31, 20 procs): pss=8.1GB uss=5.9GB
MEMORY: total (main 2544356, 2024-01-04, 11:42:31, 20 procs): pss=8.2GB uss=5.9GB
[2024-01-04 11:42:31,620] INFO: Run time: 0:01:25 CPU: 0.80% RSS: 49.03GB VMS: 1012.12GB
ep 1 train, step 0, acc 0.000e+00, loss 250.549, loss_att 128.992, loss_ctc 534.183, total 36.393, mem_usage:cuda:1 4.9GB, 8.291 sec/step
ep 1 train, step 0, acc 0.000e+00, loss 254.590, loss_att 128.702, loss_ctc 548.329, total 31.016, mem_usage:cuda:3 4.6GB, 8.113 sec/step
ep 1 train, step 0, acc 0.000e+00, loss 243.085, loss_att 112.959, loss_ctc 546.714, total 29.698, mem_usage:cuda:2 4.8GB, 8.239 sec/step
ep 1 train, step 0, acc 0.000e+00, loss 229.601, loss_att 104.556, loss_ctc 521.373, total 32.120, mem_usage:cuda:0 4.7GB, 8.980 sec/step
ep 1 train, step 1, acc 0.003, loss 282.783, loss_att 128.901, loss_ctc 641.842, total 24.830, mem_usage:cuda:0 5.7GB, 0.494 sec/step
ep 1 train, step 1, acc 0.000e+00, loss 269.690, loss_att 144.347, loss_ctc 562.155, total 28.895, mem_usage:cuda:1 5.7GB, 0.532 sec/step
...
ep 74 train, step 96, acc 0.762, loss 84.813, loss_att 74.792, loss_ctc 108.193, total 1.705, mem_usage:cuda:0 7.4GB, 0.681 sec/step
ep 74 train, step 96, acc 0.787, loss 78.256, loss_att 71.056, loss_ctc 95.056, total 1.488, mem_usage:cuda:2 7.4GB, 0.684 sec/step
ep 74 train, step 96, acc 0.693, loss 84.419, loss_att 77.122, loss_ctc 101.445, total 1.859, mem_usage:cuda:3 7.6GB, 0.653 sec/step
ep 74 train, step 97, acc 0.778, loss 73.275, loss_att 65.378, loss_ctc 91.699, total 1.389, mem_usage:cuda:0 7.4GB, 0.616 sec/step
ep 74 train, step 97, acc 0.746, loss 88.487, loss_att 76.140, loss_ctc 117.297, total 1.573, mem_usage:cuda:1 7.6GB, 0.671 sec/step
ep 74 train, step 97, acc 0.795, loss 53.753, loss_att 48.693, loss_ctc 65.560, total 1.064, mem_usage:cuda:2 7.4GB, 0.687 sec/step
ep 74 train, step 97, acc 0.753, loss 62.004, loss_att 58.497, loss_ctc 70.186, total 1.416, mem_usage:cuda:3 7.6GB, 0.677 sec/step
ep 74 train, step 98, acc 0.811, loss 66.990, loss_att 61.388, loss_ctc 80.060, total 1.294, mem_usage:cuda:0 7.4GB, 0.595 sec/step
ep 74 train, step 98, acc 0.754, loss 79.522, loss_att 73.001, loss_ctc 94.737, total 1.665, mem_usage:cuda:2 7.4GB, 0.568 sec/step
ep 74 train, step 98, acc 0.742, loss 81.392, loss_att 70.114, loss_ctc 107.708, total 1.604, mem_usage:cuda:1 7.6GB, 0.697 sec/step
ep 74 train, step 98, acc 0.823, loss 59.725, loss_att 52.744, loss_ctc 76.014, total 1.244, mem_usage:cuda:3 7.6GB, 0.684 sec/step
[2024-01-06 03:53:33,673] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2544354 closing signal SIGTERM
[2024-01-06 03:53:33,712] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2544356 closing signal SIGTERM
[2024-01-06 03:53:33,713] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2544357 closing signal SIGTERM
[2024-01-06 03:53:34,146] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 2544355) of binary: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/run.py", line 810, in <module> 
    main()
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-06_03:53:33
  host      : cn-235.informatik.rwth-aachen.de
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 2544355)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2544355
============================================================
...
slurmstepd-cn-235: error: Detected 1 oom-kill event(s) in StepId=3967412.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

This is with `"torch_distributed": {"reduce_type": "param", "param_sync_step": 100}`, and you can see in the log that it happens in step 99 (which was not logged anymore, as the crash occurs within that step), which is exactly the step where the param sync is done.

As you see, it ran for quite a while and trained 73 (sub)epochs.

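To make the step numbering concrete: with 0-indexed steps and `param_sync_step=100`, the sync falls on steps 99, 199, 299, etc., which is exactly where the ~10 sec/step spikes show up in the excerpts below. Roughly this gating logic (just an illustration, not the actual RETURNN code):

```python
def is_param_sync_step(step: int, param_sync_step: int = 100) -> bool:
    # With 0-indexed steps, the sync falls on steps 99, 199, 299, ...
    return (step + 1) % param_sync_step == 0

assert is_param_sync_step(99) and is_param_sync_step(199)
assert not is_param_sync_step(98) and not is_param_sync_step(100)
```
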
The memory logs:

MEMORY: total (main 2544357, 2024-01-04, 11:41:19, 3 procs): pss=0.8GB uss=769.5MB
MEMORY: total (main 2544355, 2024-01-04, 11:41:19, 3 procs): pss=0.8GB uss=770.0MB
MEMORY: total (main 2544356, 2024-01-04, 11:41:19, 3 procs): pss=0.8GB uss=769.1MB
MEMORY: total (main 2544354, 2024-01-04, 11:41:19, 3 procs): pss=0.8GB uss=769.5MB
MEMORY: total (main 2544357, 2024-01-04, 11:41:24, 8 procs): pss=1.2GB uss=1.1GB
MEMORY: total (main 2544356, 2024-01-04, 11:41:24, 8 procs): pss=1.2GB uss=1.1GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:24, 8 procs): pss=1.2GB uss=1.1GB
MEMORY: total (main 2544355, 2024-01-04, 11:41:24, 8 procs): pss=1.2GB uss=1.1GB
MEMORY: total (main 2544357, 2024-01-04, 11:41:29, 8 procs): pss=1.7GB uss=1.6GB
MEMORY: total (main 2544356, 2024-01-04, 11:41:29, 8 procs): pss=1.7GB uss=1.6GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:29, 8 procs): pss=1.7GB uss=1.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:41:29, 8 procs): pss=1.7GB uss=1.6GB
MEMORY: total (main 2544357, 2024-01-04, 11:41:35, 14 procs): pss=2.8GB uss=2.0GB
MEMORY: total (main 2544356, 2024-01-04, 11:41:35, 19 procs): pss=3.0GB uss=2.1GB
MEMORY: total (main 2544355, 2024-01-04, 11:41:35, 19 procs): pss=3.0GB uss=2.2GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:35, 19 procs): pss=2.6GB uss=2.1GB
MEMORY: total (main 2544357, 2024-01-04, 11:41:41, 19 procs): pss=3.4GB uss=2.6GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:41, 19 procs): pss=3.1GB uss=2.6GB
MEMORY: total (main 2544356, 2024-01-04, 11:41:41, 19 procs): pss=3.4GB uss=2.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:41:41, 19 procs): pss=3.5GB uss=2.6GB
MEMORY: total (main 2544357, 2024-01-04, 11:41:46, 19 procs): pss=3.8GB uss=3.0GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:46, 19 procs): pss=2.9GB uss=2.1GB
MEMORY: total (main 2544356, 2024-01-04, 11:41:46, 19 procs): pss=3.8GB uss=3.0GB
MEMORY: total (main 2544355, 2024-01-04, 11:41:46, 19 procs): pss=3.8GB uss=3.0GB
MEMORY: total (main 2544357, 2024-01-04, 11:41:52, 19 procs): pss=4.5GB uss=3.7GB
MEMORY: total (main 2544354, 2024-01-04, 11:41:52, 19 procs): pss=3.6GB uss=2.8GB
...
MEMORY: total (main 2544357, 2024-01-04, 11:42:05, 20 procs): pss=6.5GB uss=4.6GB
MEMORY: total (main 2544354, 2024-01-04, 11:42:05, 20 procs): pss=5.5GB uss=2.9GB
MEMORY: total (main 2544355, 2024-01-04, 11:42:06, 20 procs): pss=6.4GB uss=4.7GB
MEMORY: total (main 2544357, 2024-01-04, 11:42:12, 20 procs): pss=7.4GB uss=5.5GB
MEMORY: total (main 2544356, 2024-01-04, 11:42:12, 20 procs): pss=7.1GB uss=5.3GB
MEMORY: total (main 2544354, 2024-01-04, 11:42:12, 20 procs): pss=6.4GB uss=3.8GB
...
MEMORY: total (main 2544357, 2024-01-04, 11:42:36, 20 procs): pss=8.4GB uss=6.1GB
MEMORY: total (main 2544355, 2024-01-04, 11:42:36, 20 procs): pss=8.3GB uss=6.1GB
MEMORY: total (main 2544356, 2024-01-04, 11:42:37, 20 procs): pss=8.4GB uss=6.1GB
MEMORY: total (main 2544355, 2024-01-04, 11:42:49, 20 procs): pss=8.3GB uss=6.1GB
MEMORY: total (main 2544357, 2024-01-04, 11:42:49, 20 procs): pss=8.4GB uss=6.1GB
MEMORY: total (main 2544354, 2024-01-04, 11:42:49, 20 procs): pss=8.4GB uss=6.1GB
MEMORY: total (main 2544356, 2024-01-04, 11:42:49, 20 procs): pss=8.4GB uss=6.1GB
MEMORY: total (main 2544357, 2024-01-04, 11:42:55, 20 procs): pss=8.4GB uss=6.1GB
...
MEMORY: total (main 2544356, 2024-01-04, 11:44:16, 20 procs): pss=9.7GB uss=7.3GB
MEMORY: total (main 2544354, 2024-01-04, 11:44:53, 20 procs): pss=9.7GB uss=7.3GB
MEMORY: total (main 2544357, 2024-01-04, 11:44:58, 20 procs): pss=9.8GB uss=7.5GB
MEMORY: total (main 2544356, 2024-01-04, 11:46:12, 20 procs): pss=9.9GB uss=7.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:46:13, 20 procs): pss=9.8GB uss=7.6GB
MEMORY: total (main 2544354, 2024-01-04, 11:46:13, 20 procs): pss=9.8GB uss=7.5GB
MEMORY: total (main 2544355, 2024-01-04, 11:47:27, 20 procs): pss=9.9GB uss=7.7GB
MEMORY: total (main 2544354, 2024-01-04, 11:48:10, 20 procs): pss=9.9GB uss=7.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:48:15, 20 procs): pss=9.9GB uss=7.7GB
MEMORY: total (main 2544356, 2024-01-04, 11:48:44, 20 procs): pss=10.0GB uss=7.7GB
MEMORY: total (main 2544356, 2024-01-04, 11:51:04, 20 procs): pss=10.0GB uss=7.8GB
MEMORY: total (main 2544357, 2024-01-04, 11:51:06, 20 procs): pss=10.0GB uss=7.7GB
...
MEMORY: total (main 2544357, 2024-01-04, 11:57:56, 22 procs): pss=14.1GB uss=8.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:58:02, 22 procs): pss=14.7GB uss=9.3GB
MEMORY: total (main 2544354, 2024-01-04, 11:58:02, 22 procs): pss=14.9GB uss=9.3GB
MEMORY: total (main 2544356, 2024-01-04, 11:58:03, 22 procs): pss=14.8GB uss=9.2GB
MEMORY: total (main 2544357, 2024-01-04, 11:58:04, 22 procs): pss=15.1GB uss=9.6GB
MEMORY: total (main 2544355, 2024-01-04, 11:58:09, 22 procs): pss=15.8GB uss=10.4GB
MEMORY: total (main 2544354, 2024-01-04, 11:58:09, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544356, 2024-01-04, 11:58:09, 22 procs): pss=16.0GB uss=10.4GB
MEMORY: total (main 2544357, 2024-01-04, 11:58:10, 22 procs): pss=15.8GB uss=10.4GB
MEMORY: total (main 2544354, 2024-01-04, 11:58:15, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 11:58:22, 22 procs): pss=16.0GB uss=10.5GB
...
MEMORY: total (main 2544354, 2024-01-04, 11:58:28, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 11:58:47, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 12:31:59, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 13:23:08, 22 procs): pss=16.0GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 13:39:55, 22 procs): pss=16.1GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 13:40:01, 22 procs): pss=16.1GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 13:40:08, 22 procs): pss=16.1GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 14:13:13, 22 procs): pss=16.1GB uss=10.5GB
MEMORY: total (main 2544354, 2024-01-04, 22:06:46, 22 procs): pss=15.6GB uss=10.0GB
MEMORY: total (main 2544354, 2024-01-05, 03:02:28, 22 procs): pss=15.7GB uss=10.1GB
MEMORY: total (main 2544357, 2024-01-05, 03:59:34, 22 procs): pss=15.4GB uss=10.0GB
MEMORY: total (main 2544355, 2024-01-05, 04:16:00, 22 procs): pss=15.5GB uss=10.1GB
MEMORY: total (main 2544356, 2024-01-05, 04:47:07, 22 procs): pss=15.6GB uss=10.0GB
MEMORY: total (main 2544355, 2024-01-05, 11:04:54, 22 procs): pss=15.6GB uss=10.2GB
MEMORY: total (main 2544357, 2024-01-05, 11:12:50, 22 procs): pss=15.5GB uss=10.1GB
MEMORY: total (main 2544354, 2024-01-05, 11:46:53, 22 procs): pss=15.8GB uss=10.2GB
MEMORY: total (main 2544356, 2024-01-05, 13:41:56, 22 procs): pss=15.7GB uss=10.1GB
MEMORY: total (main 2544357, 2024-01-05, 19:55:04, 22 procs): pss=15.6GB uss=10.2GB
MEMORY: total (main 2544355, 2024-01-05, 21:10:55, 22 procs): pss=15.7GB uss=10.3GB
MEMORY: total (main 2544354, 2024-01-05, 21:25:48, 22 procs): pss=15.9GB uss=10.3GB
MEMORY: total (main 2544356, 2024-01-05, 23:17:33, 22 procs): pss=15.8GB uss=10.2GB
MEMORY: total (main 2544354, 2024-01-06, 02:41:35, 22 procs): pss=15.9GB uss=10.3GB
MEMORY: total (main 2544354, 2024-01-06, 02:41:42, 22 procs): pss=15.9GB uss=10.3GB
MEMORY: total (main 2544355, 2024-01-06, 02:46:02, 22 procs): pss=15.7GB uss=10.4GB
MEMORY: total (main 2544355, 2024-01-06, 03:53:34, 0 procs): pss=0B uss=0B
MEMORY: total (main 2544357, 2024-01-06, 03:53:37, 0 procs): pss=0B uss=0B
MEMORY: total (main 2544356, 2024-01-06, 03:53:37, 0 procs): pss=0B uss=0B
MEMORY: total (main 2544354, 2024-01-06, 03:53:38, 0 procs): pss=0B uss=0B

I'm not exactly sure whether this is a memory leak or whether the memory usage just saturates at some point.

The memory does indeed increase at the first sync:

ep 1 train, step 97, acc 0.159, loss 389.274, loss_att 392.819, loss_ctc 381.003, total 8.728, mem_usage:cuda:3 7.4GB, 0.686 sec/step
ep 1 train, step 97, acc 0.137, loss 403.538, loss_att 405.905, loss_ctc 398.017, total 8.660, mem_usage:cuda:2 7.4GB, 0.718 sec/step
MEMORY: sub proc watch memory(2544486) increased RSS: rss=52.5MB pss=31.0MB uss=30.8MB shared=21.7MB
ep 1 train, step 98, acc 0.081, loss 361.590, loss_att 359.820, loss_ctc 365.719, total 8.528, mem_usage:cuda:1 7.4GB, 0.672 sec/step
ep 1 train, step 98, acc 0.044, loss 321.847, loss_att 320.063, loss_ctc 326.011, total 8.381, mem_usage:cuda:0 7.4GB, 0.749 sec/step
ep 1 train, step 98, acc 0.162, loss 374.532, loss_att 377.639, loss_ctc 367.282, total 8.670, mem_usage:cuda:2 7.4GB, 0.706 sec/step
ep 1 train, step 98, acc 0.219, loss 363.171, loss_att 369.138, loss_ctc 349.246, total 9.034, mem_usage:cuda:3 7.4GB, 0.722 sec/step
MEMORY: total (main 2544357, 2024-01-04, 11:43:38, 20 procs): pss=8.5GB uss=6.1GB
[2024-01-04 11:43:42,166] INFO: Run time: 0:02:36 CPU: 0.80% RSS: 54.73GB VMS: 1.01TB
MEMORY: main proc python3.11(2544357) increased RSS: rss=5.1GB pss=3.4GB uss=2.5GB shared=2.5GB
MEMORY: main proc python3.11(2544355) increased RSS: rss=5.1GB pss=3.5GB uss=2.5GB shared=2.5GB
MEMORY: main proc python3.11(2544354) increased RSS: rss=5.1GB pss=3.5GB uss=2.5GB shared=2.5GB
MEMORY: main proc python3.11(2544356) increased RSS: rss=5.1GB pss=3.5GB uss=2.5GB shared=2.5GB
MEMORY: total (main 2544355, 2024-01-04, 11:43:45, 20 procs): pss=9.6GB uss=7.4GB
MEMORY: total (main 2544357, 2024-01-04, 11:43:45, 20 procs): pss=9.7GB uss=7.3GB
MEMORY: total (main 2544354, 2024-01-04, 11:43:45, 20 procs): pss=9.7GB uss=7.3GB
MEMORY: total (main 2544356, 2024-01-04, 11:43:45, 20 procs): pss=9.7GB uss=7.3GB
ep 1 train, step 99, acc 0.074, loss 335.256, loss_att 330.305, loss_ctc 346.806, total 8.466, mem_usage:cuda:1 7.4GB, 10.476 sec/step
ep 1 train, step 99, acc 0.253, loss 384.275, loss_att 394.287, loss_ctc 360.912, total 9.021, mem_usage:cuda:3 7.4GB, 10.336 sec/step
ep 1 train, step 99, acc 0.155, loss 386.506, loss_att 389.384, loss_ctc 379.793, total 8.744, mem_usage:cuda:0 7.4GB, 10.570 sec/step
ep 1 train, step 99, acc 0.163, loss 354.311, loss_att 357.119, loss_ctc 347.757, total 8.684, mem_usage:cuda:2 7.4GB, 10.487 sec/step
ep 1 train, step 100, acc 0.176, loss 366.819, loss_att 370.840, loss_ctc 357.436, total 8.818, mem_usage:cuda:0 7.4GB, 0.523 sec/step
ep 1 train, step 100, acc 0.212, loss 355.989, loss_att 363.577, loss_ctc 338.282, total 8.990, mem_usage:cuda:1 7.4GB, 0.699 sec/step

The next sync step:

ep 1 train, step 197, acc 0.175, loss 294.944, loss_att 291.001, loss_ctc 304.147, total 7.941, mem_usage:cuda:1 7.4GB, 0.691 sec/step
MEMORY: sub proc watch memory(2544487) increased RSS: rss=53.3MB pss=32.0MB uss=31.8MB shared=21.5MB
ep 1 train, step 198, acc 0.110, loss 305.260, loss_att 302.182, loss_ctc 312.441, total 7.794, mem_usage:cuda:2 7.4GB, 0.621 sec/step
ep 1 train, step 198, acc 0.184, loss 269.757, loss_att 268.161, loss_ctc 273.482, total 7.868, mem_usage:cuda:3 7.4GB, 0.681 sec/step
ep 1 train, step 198, acc 0.233, loss 319.210, loss_att 314.436, loss_ctc 330.348, total 7.849, mem_usage:cuda:0 7.4GB, 0.586 sec/step
ep 1 train, step 198, acc 0.202, loss 272.608, loss_att 269.419, loss_ctc 280.047, total 7.940, mem_usage:cuda:1 7.4GB, 0.599 sec/step
MEMORY: total (main 2544354, 2024-01-04, 11:44:53, 20 procs): pss=9.7GB uss=7.3GB
MEMORY: main proc python3.11(2544357) increased RSS: rss=5.2GB pss=3.6GB uss=2.7GB shared=2.4GB
MEMORY: total (main 2544357, 2024-01-04, 11:44:58, 20 procs): pss=9.8GB uss=7.5GB
ep 1 train, step 199, acc 0.149, loss 307.781, loss_att 303.996, loss_ctc 316.611, total 7.858, mem_usage:cuda:1 7.4GB, 10.062 sec/step
ep 1 train, step 199, acc 0.252, loss 293.949, loss_att 289.728, loss_ctc 303.798, total 8.017, mem_usage:cuda:2 7.4GB, 10.186 sec/step
ep 1 train, step 199, acc 0.180, loss 328.601, loss_att 325.397, loss_ctc 336.078, total 7.855, mem_usage:cuda:0 7.4GB, 10.115 sec/step

The next sync step after that:

ep 1 train, step 298, acc 0.236, loss 242.444, loss_att 237.714, loss_ctc 253.481, total 7.384, mem_usage:cuda:0 7.4GB, 0.619 sec/step
ep 1 train, step 298, acc 0.139, loss 226.474, loss_att 220.994, loss_ctc 239.262, total 7.306, mem_usage:cuda:2 7.4GB, 0.623 sec/step
ep 1 train, step 298, acc 0.140, loss 302.063, loss_att 296.391, loss_ctc 315.296, total 7.291, mem_usage:cuda:1 7.4GB, 0.649 sec/step
MEMORY: main proc python3.11(2544356) increased RSS: rss=5.2GB pss=3.7GB uss=2.8GB shared=2.4GB
MEMORY: total (main 2544356, 2024-01-04, 11:46:12, 20 procs): pss=9.9GB uss=7.6GB
MEMORY: main proc python3.11(2544355) increased RSS: rss=5.2GB pss=3.6GB uss=2.7GB shared=2.4GB
MEMORY: main proc python3.11(2544354) increased RSS: rss=5.2GB pss=3.6GB uss=2.7GB shared=2.4GB
MEMORY: total (main 2544355, 2024-01-04, 11:46:13, 20 procs): pss=9.8GB uss=7.6GB
MEMORY: total (main 2544354, 2024-01-04, 11:46:13, 20 procs): pss=9.8GB uss=7.5GB
ep 1 train, step 299, acc 0.231, loss 226.223, loss_att 219.746, loss_ctc 241.334, total 7.541, mem_usage:cuda:0 7.4GB, 9.971 sec/step
ep 1 train, step 299, acc 0.244, loss 214.304, loss_att 209.399, loss_ctc 225.749, total 7.463, mem_usage:cuda:2 7.4GB, 9.917 sec/step
ep 1 train, step 299, acc 0.219, loss 246.367, loss_att 242.469, loss_ctc 255.464, total 7.332, mem_usage:cuda:3 7.4GB, 10.173 sec/step
ep 1 train, step 299, acc 0.205, loss 258.888, loss_att 252.792, loss_ctc 273.113, total 7.313, mem_usage:cuda:1 7.4GB, 9.980 sec/step
ep 1 train, step 300, acc 0.167, loss 249.398, loss_att 242.883, loss_ctc 264.598, total 7.335, mem_usage:cuda:0 7.4GB, 0.706 sec/step

Some later sync step:

ep 72 train, step 198, acc 0.770, loss 61.103, loss_att 54.528, loss_ctc 76.445, total 1.540, mem_usage:cuda:1 7.6GB, 0.647 sec/step
ep 72 train, step 198, acc 0.804, loss 54.507, loss_att 47.772, loss_ctc 70.224, total 1.363, mem_usage:cuda:0 7.6GB, 0.587 sec/step
ep 72 train, step 198, acc 0.705, loss 84.371, loss_att 75.461, loss_ctc 105.162, total 1.932, mem_usage:cuda:2 7.6GB, 0.626 sec/step
MEMORY: main proc python3.11(2544355) increased RSS: rss=5.8GB pss=3.7GB uss=2.9GB shared=2.9GB
MEMORY: total (main 2544355, 2024-01-06, 02:46:02, 22 procs): pss=15.7GB uss=10.4GB
ep 72 train, step 199, acc 0.784, loss 54.270, loss_att 47.960, loss_ctc 68.992, total 1.243, mem_usage:cuda:1 7.6GB, 9.648 sec/step
ep 72 train, step 199, acc 0.789, loss 61.887, loss_att 57.086, loss_ctc 73.089, total 1.417, mem_usage:cuda:0 7.6GB, 9.584 sec/step
ep 72 train, step 199, acc 0.773, loss 47.976, loss_att 44.084, loss_ctc 57.057, total 1.182, mem_usage:cuda:3 7.6GB, 9.757 sec/step
ep 72 train, step 199, acc 0.800, loss 67.931, loss_att 59.203, loss_ctc 88.297, total 1.421, mem_usage:cuda:2 7.6GB, 9.630 sec/step

So it looks a bit like the param sync causes a memory leak, but I'm not sure. Also, this does not look like an issue on the RETURNN side, as the relevant code is very trivial.
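
For reference, the kind of per-sync work I mean looks roughly like this (a sketch of CPU-side parameter averaging, not the actual RETURNN implementation; the helper name and the `cpu_group` argument are made up for illustration). Each sync copies every parameter to a temporary CPU tensor, all-reduces it over the CPU (Gloo) process group, and copies the average back, so the interesting allocations happen inside PyTorch/Gloo rather than in the calling code:

```python
import torch
import torch.distributed as dist

def sync_params_on_cpu(model: torch.nn.Module, cpu_group: dist.ProcessGroup):
    # Illustration only: average all parameters across ranks via CPU tensors.
    world_size = dist.get_world_size(cpu_group)
    with torch.no_grad():
        for p in model.parameters():
            cpu_buf = p.data.to("cpu")  # temporary CPU copy per parameter per sync
            dist.all_reduce(cpu_buf, op=dist.ReduceOp.SUM, group=cpu_group)  # Gloo all-reduce on CPU
            cpu_buf /= world_size
            p.data.copy_(cpu_buf)  # write the averaged values back to the GPU parameter
```

If PyTorch or Gloo keeps some per-sync allocation or cache alive in a loop like this, the CPU memory would grow with every sync without any of it being visible in the RETURNN code.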


albertz commented Jan 6, 2024

I reported that upstream here: pytorch/pytorch#116923
