🐛 Bug
I've just tried using RichProgressBar and noticed that my training scripts started to hang randomly in the middle of training. I saw that there was an issue with RichProgressBar deadlocking during distributed training (#10362), and I'm seeing the same behavior with non-distributed, single-GPU training.
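For context, the bar is enabled the usual way; a minimal sketch of the setup (not my actual training code; the model and datamodule names are placeholders):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar

# Plain single-GPU trainer with the rich-based progress bar attached as a callback.
trainer = pl.Trainer(
    gpus=1,
    max_epochs=10,
    callbacks=[RichProgressBar()],
)
trainer.fit(model, datamodule=datamodule)  # placeholders for my LightningModule and data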
To Reproduce
The deadlock is tricky to reproduce with a variation of the BoringModel in Colab, and I won't be able to share my code as is. However, here's the relevant part of a stack trace, generated by the stacktracer.py utility from here, for the actual code that runs into the deadlock:
# ThreadID: 140057496631104
File: "dev/main.py", line 25, in main
  runner.train_and_evaluate(cfg)
File: "/home/maruan/dev/runner.py", line 80, in train_and_evaluate
  trainer.fit(model, benchmark)
File: "/home/maruan/dev/components/trainers.py", line 92, in fit
  pytorch_lightning.Trainer.fit(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
  self._call_and_handle_interrupt(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
  return trainer_fn(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
  self._run(model, ckpt_path=ckpt_path)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
  self._dispatch()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
  self.training_type_plugin.start_training(self)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
  self._results = trainer.run_stage()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
  return self._run_train()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
  self.fit_loop.run()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
  self.epoch_loop.run(data_fetcher)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 140, in run
  self.on_run_start(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 137, in on_run_start
  self.trainer.call_hook("on_train_epoch_start")
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1491, in call_hook
  callback_fx(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 88, in on_train_epoch_start
  callback.on_train_epoch_start(self, self.lightning_module)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 346, in on_train_epoch_start
  self._stop_progress()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 424, in _stop_progress
  self.progress.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/progress.py", line 647, in stop
  self.live.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/live.py", line 136, in stop
  self._refresh_thread.join()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1011, in join
  self._wait_for_tstate_lock()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
  elif lock.acquire(block, timeout):
Expected behavior
The training should complete and not hang randomly in the middle.
Additional context
Most of my setup is plain Lightning, nothing custom. The code typically runs into the deadlock during validation. I'm training multitask models, so I'm using a dictionary of training dataloaders and a CombinedLoader for validation and testing, roughly as sketched below (would this matter here?). Everything works fine with the standard tqdm-based progress bar.
cc @tchaton @SeanNaren @kaushikb11
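A minimal sketch of that dataloader wiring (not my actual code; task names, datasets, and batch sizes are placeholders, assuming the standard LightningModule hooks):

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from pytorch_lightning.trainer.supporters import CombinedLoader

class MultitaskModule(pl.LightningModule):
    # ... network, training_step, configure_optimizers, etc. omitted ...

    def train_dataloader(self):
        # Dict of per-task training dataloaders; Lightning combines them into one loop.
        return {
            "task_a": DataLoader(self.train_ds_a, batch_size=32),
            "task_b": DataLoader(self.train_ds_b, batch_size=32),
        }

    def val_dataloader(self):
        # CombinedLoader wraps the per-task loaders for validation (same idea for test).
        return CombinedLoader(
            {
                "task_a": DataLoader(self.val_ds_a, batch_size=32),
                "task_b": DataLoader(self.val_ds_b, batch_size=32),
            },
            mode="max_size_cycle",
        )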
A colleague of mine discovered that the issue was actually with rich==10.15.1, which had a deadlock (Textualize/rich#1734). It was fixed in rich==10.15.2. I recommend setting the rich requirement to >=10.15.2.
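Until the minimum requirement is bumped, upgrading rich in the local environment works around the hang (assuming a pip-managed environment):

pip install --upgrade "rich>=10.15.2"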