
RichProgressBar deadlocking during non-distributed training #11034

Closed
alshedivat opened this issue Dec 11, 2021 · 5 comments · Fixed by #12293
Labels: bug (Something isn't working), priority: 0 (High priority task), progress bar: rich

alshedivat commented Dec 11, 2021

🐛 Bug

I've just tried using RichProgressBar and noticed that my training scripts started to hang randomly in the middle of training from time to time. I saw that there was an issue with RichProgressBar deadlocking during distributed training (#10362), and I'm seeing the same behavior with non-distributed, single-GPU training.
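
For context, here is a minimal sketch of the kind of setup involved (toy model and data, not the reporter's actual code):

```python
# Minimal sketch: a single-GPU, non-distributed run that uses RichProgressBar
# instead of the default tqdm-based progress bar. Requires `rich` to be installed.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8
)

trainer = pl.Trainer(
    max_epochs=2,
    gpus=1 if torch.cuda.is_available() else 0,  # single GPU when available
    callbacks=[RichProgressBar()],               # swap in the rich progress bar
)
trainer.fit(ToyModel(), train_dataloaders=train_loader)
```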

To Reproduce

The deadlock is tricky to reproduce with a variation of the BoringModel in Colab, and I won't be able to share my code as-is. However, here is the relevant part of the stack trace generated by the stacktracer.py utility from here for the actual code that runs into the deadlock:

```
# ThreadID: 140057496631104
File: "dev/main.py", line 25, in main
  runner.train_and_evaluate(cfg)
File: "/home/maruan/dev/runner.py", line 80, in train_and_evaluate
  trainer.fit(model, benchmark)
File: "/home/maruan/dev/components/trainers.py", line 92, in fit
  pytorch_lightning.Trainer.fit(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
  self._call_and_handle_interrupt(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
  return trainer_fn(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
  self._run(model, ckpt_path=ckpt_path)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
  self._dispatch()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
  self.training_type_plugin.start_training(self)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
  self._results = trainer.run_stage()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
  return self._run_train()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
  self.fit_loop.run()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
  self.epoch_loop.run(data_fetcher)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 140, in run
  self.on_run_start(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 137, in on_run_start
  self.trainer.call_hook("on_train_epoch_start")
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1491, in call_hook
  callback_fx(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 88, in on_train_epoch_start
  callback.on_train_epoch_start(self, self.lightning_module)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 346, in on_train_epoch_start
  self._stop_progress()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 424, in _stop_progress
  self.progress.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/progress.py", line 647, in stop
  self.live.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/live.py", line 136, in stop
  self._refresh_thread.join()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1011, in join
  self._wait_for_tstate_lock()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
  elif lock.acquire(block, timeout):
```
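
An all-threads dump like the one above can be produced with just the standard library; here is a minimal sketch (the linked stacktracer.py utility may differ in its details):

```python
# Minimal sketch: print the current stack of every running thread, which shows
# where a hung process is blocked (e.g. the thread-join above).
import sys
import threading
import traceback


def dump_all_thread_stacks(file=sys.stderr):
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        print(f"# ThreadID: {thread_id} ({names.get(thread_id, 'unknown')})", file=file)
        traceback.print_stack(frame, file=file)
        print(file=file)
```

Calling something like this from a signal handler or a watchdog thread when the process appears stuck yields a dump of this form.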

Expected behavior

The training should complete and not hang randomly in the middle.

Environment

* CUDA:
        - GPU:
                - NVIDIA Tesla T4
                - NVIDIA Tesla T4
        - available:         True
        - version:           11.1
* Packages:
        - numpy:             1.20.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.9.0
        - pytorch-lightning: 1.5.5
        - tqdm:              4.50.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.4
        - version:           #61~18.04.1-Ubuntu SMP Thu Oct 28 04:16:28 UTC 2021

Additional context

Most of my setup is plain Lightning, nothing custom. The code typically runs into the deadlock during validation. I'm training multitask models, so I'm using a dictionary of training dataloaders and a CombinedLoader for validation and testing (would this matter here?). Everything works fine with the standard tqdm-based progress bar.
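
A rough sketch of that data layout (toy loaders as placeholders, not the actual code):

```python
# Rough sketch: a dict of training dataloaders plus a CombinedLoader for
# validation/testing, as described above.
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.trainer.supporters import CombinedLoader


def toy_loader(n=64):
    return DataLoader(TensorDataset(torch.randn(n, 32), torch.randn(n, 1)), batch_size=8)


train_dataloaders = {"task_a": toy_loader(), "task_b": toy_loader()}
val_dataloaders = CombinedLoader(
    {"task_a": toy_loader(), "task_b": toy_loader()},
    mode="max_size_cycle",  # cycle shorter loaders until the longest is exhausted
)

# With a model and trainer like the ones sketched earlier:
# trainer.fit(model, train_dataloaders=train_dataloaders, val_dataloaders=val_dataloaders)
```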

cc @tchaton @SeanNaren @kaushikb11

alshedivat added the bug label Dec 11, 2021
awaelchli added this to the 1.5.x milestone Dec 17, 2021

carmocca (Member) commented Mar 1, 2022

Thanks for the issue! That stacktrace is great.

@kaushikb11 This might just be the cause of the CI hangs


alshedivat (Author) commented:
A colleague of mine discovered that the issue was actually with rich==10.15.1, which had a deadlock bug (Textualize/rich#1734). It was fixed in rich==10.15.2. I recommend setting the rich requirement to >=10.15.2.
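
Until the requirement is pinned, a runtime guard along these lines (a sketch, not part of the original comment) can fail fast when one of the affected releases is installed:

```python
# Sketch of a runtime guard: raise early if one of the rich releases known to
# deadlock (10.15.0 and 10.15.1, per Textualize/rich#1734) is installed.
from importlib.metadata import PackageNotFoundError, version

BAD_RICH_VERSIONS = {"10.15.0", "10.15.1"}

try:
    rich_version = version("rich")
except PackageNotFoundError:
    rich_version = None

if rich_version in BAD_RICH_VERSIONS:
    raise RuntimeError(
        f"rich=={rich_version} is known to deadlock; upgrade to rich>=10.15.2"
    )
```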


akihironitta (Contributor) commented:
@alshedivat Thanks a lot for getting back to this issue!

We should avoid 10.15.0 and 10.15.1 as mentioned in Textualize/rich#1734.


akihironitta (Contributor) commented:

> Thanks for the issue! That stacktrace is great.
>
> @kaushikb11 This might just be the cause of the CI hangs

@carmocca @kaushikb11 In which of the CI jobs have you been seeing hangs?


carmocca (Member) commented:
I'm not sure. This was a while back, when we tried to make this progress bar the default implementation.
