Bug description
Running `LearningRateFinder` leads to `teardown()` on the training epoch loop's results being moved to "cpu" here. The problem is that loop results are only moved to the device when they are registered for the first time, here. This causes an issue for the `cumulated_batch_size` reduction, which uses the device of the original `value` tensor from when it was first created. So when it is still on `cpu` once training starts for real after `lr_find`, we face `RuntimeError('No backend type associated with device type cpu')`. For example, the issue happens when using 2 GPU devices (see logs below).
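To make the failure mode concrete, here is a minimal sketch of the device-stickiness described above. `ResultMetric` here is a hypothetical stand-in, not Lightning's actual class; it only illustrates that a metric's tensors keep whatever device they were last moved to, so after `teardown()` moves them to `cpu`, a later NCCL reduction has no backend for that device:

```python
import torch

class ResultMetric:
    """Hypothetical stand-in for a logged loop result.

    The device of its tensors is fixed when the value is first
    registered, and only changed by an explicit .to() call.
    """

    def __init__(self, value: torch.Tensor):
        self.value = value.clone()
        # cumulated_batch_size is created on the same device as `value`
        self.cumulated_batch_size = torch.tensor(0, device=value.device)

    def update(self, value: torch.Tensor, batch_size: int):
        self.value = self.value + value.to(self.value.device)
        self.cumulated_batch_size = self.cumulated_batch_size + batch_size

    def to(self, device):
        # What teardown() effectively does to loop results
        self.value = self.value.to(device)
        self.cumulated_batch_size = self.cumulated_batch_size.to(device)
        return self

metric = ResultMetric(torch.tensor(1.0))
metric.to("cpu")  # teardown() after lr_find moves results to "cpu"
metric.update(torch.tensor(2.0), batch_size=4)

# The tensor is never moved back to the GPU, so a subsequent
# torch.distributed.all_reduce with the NCCL backend would raise:
# RuntimeError: No backend type associated with device type cpu
print(metric.cumulated_batch_size.device)
```

This is why the error only surfaces once real training resumes and the reduction across the 2 GPUs is attempted.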
I'll submit a fix for review shortly.
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response