Existing metric keys not moved to device after LearningRateFinder #19813

Open
clumsy opened this issue Apr 25, 2024 · 0 comments · May be fixed by #19814
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.2.x


clumsy commented Apr 25, 2024

Bug description

Running LearningRateFinder leads to teardown() on the training epoch loop's results, which moves them to "cpu" here.

The problem is that loop results are only moved to the device when a metric key is registered for the first time, here. This breaks the cumulated_batch_size reduction, which uses the device of the original value tensor from when it was first created. Because that tensor is still on the CPU once training starts for real after lr_find, we hit RuntimeError('No backend type associated with device type cpu').
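
To make the failure mode concrete, here is a hedged illustration (not taken from this report) of what the sync boils down to: with a process group that only has the NCCL backend, as Lightning's DDP strategy creates on GPUs, reducing a tensor that still lives on the CPU fails with exactly this error.

```python
# Hedged illustration, not from the report: NCCL-only process group + CPU tensor.
import torch
import torch.distributed as dist

# Assumes this is launched with torchrun on a multi-GPU node,
# mirroring what Lightning's DDP strategy sets up.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")

cumulated_batch_size = torch.tensor(1.0)  # still on the CPU, like after teardown()
dist.all_reduce(cumulated_batch_size)     # RuntimeError: No backend type associated
                                          # with device type cpu
```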

For example, the issue occurs when using 2 GPU devices (see the logs below).

I'll submit a fix for review shortly.

What version are you seeing the problem on?

master

How to reproduce the bug

No response
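
Since no reproduction was attached, the following is a hedged sketch of a setup that should exercise this code path. The model, the logged metric name, and the trainer settings are illustrative and not taken from the report.

```python
# Hedged reproduction sketch: a metric registered while LearningRateFinder's
# internal run is active should end up on the CPU after its teardown() and
# then fail to all_reduce once real DDP training starts.
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateFinder
from pytorch_lightning.demos.boring_classes import BoringModel


class LoggingModel(BoringModel):
    def __init__(self):
        super().__init__()
        self.learning_rate = 0.1  # attribute the finder tunes

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        # on_epoch=True creates the cumulated_batch_size tensor on the current
        # device; sync_dist=True routes its reduction through the DDP strategy.
        self.log("train_loss", out["loss"], on_epoch=True, sync_dist=True)
        return out


if __name__ == "__main__":
    trainer = Trainer(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        max_epochs=2,
        callbacks=[LearningRateFinder()],
    )
    trainer.fit(LoggingModel())
```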

Error messages and logs

train/0 [1]:-> s.trainer.fit(s.model, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(543)fit()
train/0 [1]:-> call._call_and_handle_interrupt(
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py(43)_call_and_handle_interrupt()
train/0 [1]:-> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py(105)launch()
train/0 [1]:-> return function(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(579)_fit_impl()
train/0 [1]:-> self._run(model, ckpt_path=ckpt_path)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(986)_run()
train/0 [1]:-> results = self._run_stage()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(1032)_run_stage()
train/0 [1]:-> self.fit_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(205)run()
train/0 [1]:-> self.advance()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(363)advance()
train/0 [1]:-> self.epoch_loop.run(self._data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(139)run()
train/0 [1]:-> self.on_advance_end(data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(287)on_advance_end()
train/0 [1]:-> self.val_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py(182)_decorator()
train/0 [1]:-> return loop_run(self, *args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(142)run()
train/0 [1]:-> return self.on_run_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(254)on_run_end()
train/0 [1]:-> self._on_evaluation_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(336)_on_evaluation_epoch_end()
train/0 [1]:-> trainer._logger_connector.on_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(195)on_epoch_end()
train/0 [1]:-> metrics = self.metrics
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(234)metrics()
train/0 [1]:-> return self.trainer._results.metrics(on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(483)metrics()
train/0 [1]:-> value = self._get_cache(result_metric, on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(447)_get_cache()
train/0 [1]:-> result_metric.compute()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(289)wrapped_func()
train/0 [1]:-> self._computed = compute(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(251)compute()
train/0 [1]:-> cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py(342)reduce()
train/0 [1]:-> return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(172)_sync_ddp_if_available()
train/0 [1]:-> return _sync_ddp(result, group=group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(222)_sync_ddp()
train/0 [1]:-> torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py(72)wrapper()
train/0 [1]:-> return func(*args, **kwargs)
train/0 [1]:> /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py(1996)all_reduce()
train/0 [0]:RuntimeError('No backend type associated with device type cpu')

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
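
Not part of the original report: one conceivable user-side workaround, shown only as a heavily hedged sketch. It is untested, relies on private Lightning internals that may change, and is not necessarily what the linked PR does.

```python
# Untested sketch: move any already-registered result collections back to the
# model's device when the real training run starts. _results and val_loop are
# private attributes and may change between Lightning versions.
from pytorch_lightning import Callback


class MoveResultsBackToDevice(Callback):
    def on_train_start(self, trainer, pl_module):
        epoch_loop = trainer.fit_loop.epoch_loop
        for results in (epoch_loop._results, epoch_loop.val_loop._results):
            if results is not None:
                results.to(pl_module.device)
```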

clumsy added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 25, 2024
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024
clumsy linked a pull request on Apr 25, 2024 that will close this issue
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024