🐛 Bug
I've just tried using RichProgressBar and noticed that my training scripts started to hang randomly in the middle of training. I saw that there was an issue with RichProgressBar deadlocking during distributed training (#10362), and I'm seeing the same behavior with non-distributed, single-GPU training.
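For context, the bar is enabled the usual way; a minimal sketch of the setup (not my actual training code; the model and datamodule names are placeholders):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar

# Plain single-GPU trainer with the rich-based progress bar attached as a callback.
trainer = pl.Trainer(
    gpus=1,
    max_epochs=10,
    callbacks=[RichProgressBar()],
)
trainer.fit(model, datamodule=datamodule)  # placeholders for my LightningModule and data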
To Reproduce
The deadlock is tricky to reproduce with a variation of the BoringModel in Colab, and I won't be able to share my code as is. However, here's the relevant part of a stack trace, generated by the stacktracer.py utility from here, for the actual code that runs into the deadlock:
# ThreadID: 140057496631104
File: "dev/main.py", line 25, in main
  runner.train_and_evaluate(cfg)
File: "/home/maruan/dev/runner.py", line 80, in train_and_evaluate
  trainer.fit(model, benchmark)
File: "/home/maruan/dev/components/trainers.py", line 92, in fit
  pytorch_lightning.Trainer.fit(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
  self._call_and_handle_interrupt(
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
  return trainer_fn(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
  self._run(model, ckpt_path=ckpt_path)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
  self._dispatch()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
  self.training_type_plugin.start_training(self)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
  self._results = trainer.run_stage()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
  return self._run_train()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
  self.fit_loop.run()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
  self.epoch_loop.run(data_fetcher)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 140, in run
  self.on_run_start(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 137, in on_run_start
  self.trainer.call_hook("on_train_epoch_start")
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1491, in call_hook
  callback_fx(*args, **kwargs)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 88, in on_train_epoch_start
  callback.on_train_epoch_start(self, self.lightning_module)
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 346, in on_train_epoch_start
  self._stop_progress()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 424, in _stop_progress
  self.progress.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/progress.py", line 647, in stop
  self.live.stop()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/site-packages/rich/live.py", line 136, in stop
  self._refresh_thread.join()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1011, in join
  self._wait_for_tstate_lock()
File: "/home/maruan/.conda/envs/main-env/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
  elif lock.acquire(block, timeout):
Expected behavior
The training should complete and not hang randomly in the middle.
Additional context
Most of my setup is plain Lightning, nothing custom. The code typically runs into the deadlock during validation. I'm training multitask models, so I'm using a dictionary of training dataloaders and a CombinedLoader for validation and testing, roughly as sketched below (would this matter here?). Everything works fine with the standard tqdm-based progress bar.
cc @tchaton @SeanNaren @kaushikb11
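A minimal sketch of that dataloader wiring (not my actual code; task names, datasets, and batch sizes are placeholders, assuming the standard LightningModule hooks):

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from pytorch_lightning.trainer.supporters import CombinedLoader

class MultitaskModule(pl.LightningModule):
    # ... network, training_step, configure_optimizers, etc. omitted ...

    def train_dataloader(self):
        # Dict of per-task training dataloaders; Lightning combines them into one loop.
        return {
            "task_a": DataLoader(self.train_ds_a, batch_size=32),
            "task_b": DataLoader(self.train_ds_b, batch_size=32),
        }

    def val_dataloader(self):
        # CombinedLoader wraps the per-task loaders for validation (same idea for test).
        return CombinedLoader(
            {
                "task_a": DataLoader(self.val_ds_a, batch_size=32),
                "task_b": DataLoader(self.val_ds_b, batch_size=32),
            },
            mode="max_size_cycle",
        )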
A colleague of mine discovered that the issue was actually with rich==10.15.1, which had a deadlock (Textualize/rich#1734). It was fixed in rich==10.15.2. I recommend setting the rich requirement to >=10.15.2.
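Until the minimum requirement is bumped, upgrading rich in the local environment works around the hang (assuming a pip-managed environment):

pip install --upgrade "rich>=10.15.2"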