Existing metric keys not moved to device after LearningRateFinder #19813

Open
clumsy opened this issue Apr 25, 2024 · 0 comments · May be fixed by #19814
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.2.x


clumsy commented Apr 25, 2024

Bug description

Running LearningRateFinder leads to teardown() on the training epoch loop's results, which moves them to "cpu" here.

The problem is that loop results are only moved to the device when a metric key is registered for the first time, here. This breaks the cumulated_batch_size reduction, which uses the device of the original value tensor from when it was first created. Because that tensor is still on the CPU once training starts for real after lr_find, we hit RuntimeError('No backend type associated with device type cpu').
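
To make the failure mode concrete, here is a hedged illustration (not taken from this report) of what the sync boils down to: with a process group that only has the NCCL backend, as Lightning's DDP strategy creates on GPUs, reducing a tensor that still lives on the CPU fails with exactly this error.

```python
# Hedged illustration, not from the report: NCCL-only process group + CPU tensor.
import torch
import torch.distributed as dist

# Assumes this is launched with torchrun on a multi-GPU node,
# mirroring what Lightning's DDP strategy sets up.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")

cumulated_batch_size = torch.tensor(1.0)  # still on the CPU, like after teardown()
dist.all_reduce(cumulated_batch_size)     # RuntimeError: No backend type associated
                                          # with device type cpu
```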

For example, the issue occurs when using 2 GPU devices (see the logs below).

I'll submit a fix for review shortly.

What version are you seeing the problem on?

master

How to reproduce the bug

No response
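
Since no reproduction was attached, the following is a hedged sketch of a setup that should exercise this code path. The model, the logged metric name, and the trainer settings are illustrative and not taken from the report.

```python
# Hedged reproduction sketch: a metric registered while LearningRateFinder's
# internal run is active should end up on the CPU after its teardown() and
# then fail to all_reduce once real DDP training starts.
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateFinder
from pytorch_lightning.demos.boring_classes import BoringModel


class LoggingModel(BoringModel):
    def __init__(self):
        super().__init__()
        self.learning_rate = 0.1  # attribute the finder tunes

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        # on_epoch=True creates the cumulated_batch_size tensor on the current
        # device; sync_dist=True routes its reduction through the DDP strategy.
        self.log("train_loss", out["loss"], on_epoch=True, sync_dist=True)
        return out


if __name__ == "__main__":
    trainer = Trainer(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        max_epochs=2,
        callbacks=[LearningRateFinder()],
    )
    trainer.fit(LoggingModel())
```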

Error messages and logs

train/0 [1]:-> s.trainer.fit(s.model, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(543)fit()
train/0 [1]:-> call._call_and_handle_interrupt(
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py(43)_call_and_handle_interrupt()
train/0 [1]:-> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py(105)launch()
train/0 [1]:-> return function(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(579)_fit_impl()
train/0 [1]:-> self._run(model, ckpt_path=ckpt_path)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(986)_run()
train/0 [1]:-> results = self._run_stage()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(1032)_run_stage()
train/0 [1]:-> self.fit_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(205)run()
train/0 [1]:-> self.advance()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(363)advance()
train/0 [1]:-> self.epoch_loop.run(self._data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(139)run()
train/0 [1]:-> self.on_advance_end(data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(287)on_advance_end()
train/0 [1]:-> self.val_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py(182)_decorator()
train/0 [1]:-> return loop_run(self, *args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(142)run()
train/0 [1]:-> return self.on_run_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(254)on_run_end()
train/0 [1]:-> self._on_evaluation_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(336)_on_evaluation_epoch_end()
train/0 [1]:-> trainer._logger_connector.on_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(195)on_epoch_end()
train/0 [1]:-> metrics = self.metrics
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(234)metrics()
train/0 [1]:-> return self.trainer._results.metrics(on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(483)metrics()
train/0 [1]:-> value = self._get_cache(result_metric, on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(447)_get_cache()
train/0 [1]:-> result_metric.compute()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(289)wrapped_func()
train/0 [1]:-> self._computed = compute(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(251)compute()
train/0 [1]:-> cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py(342)reduce()
train/0 [1]:-> return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(172)_sync_ddp_if_available()
train/0 [1]:-> return _sync_ddp(result, group=group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(222)_sync_ddp()
train/0 [1]:-> torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py(72)wrapper()
train/0 [1]:-> return func(*args, **kwargs)
train/0 [1]:> /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py(1996)all_reduce()
train/0 [0]:RuntimeError('No backend type associated with device type cpu')

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
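
Not part of the original report: one conceivable user-side workaround, shown only as a heavily hedged sketch. It is untested, relies on private Lightning internals that may change, and is not necessarily what the linked PR does.

```python
# Untested sketch: move any already-registered result collections back to the
# model's device when the real training run starts. _results and val_loop are
# private attributes and may change between Lightning versions.
from pytorch_lightning import Callback


class MoveResultsBackToDevice(Callback):
    def on_train_start(self, trainer, pl_module):
        epoch_loop = trainer.fit_loop.epoch_loop
        for results in (epoch_loop._results, epoch_loop.val_loop._results):
            if results is not None:
                results.to(pl_module.device)
```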

clumsy added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 25, 2024
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024
clumsy linked a pull request on Apr 25, 2024 that will close this issue
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024
clumsy pushed a commit to clumsy/pytorch-lightning that referenced this issue Apr 25, 2024