When we try to perform data parallelism with `mlrun.frameworks.pytorch.train(cuda=True)`, training stops once execution reaches the `mlrun_interface.py` file: the iteration is aborted with an error saying that the `loss_function` we passed is `None`. Normally in PyTorch we would pass weights to the loss function when training a model, but with MLRun we pass the loss function without weights, as in `mlrun.frameworks.pytorch.train(loss=torch.nn.CrossEntropyLoss())`. If we try to pass weights to the loss function, we get an error saying that weights cannot be passed. When we pass the loss function without weights and set `cuda=False` (CPU), distributed training works perfectly; it fails only when `cuda=True`. Could you please help us resolve this error?
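
For reference, here is a minimal sketch of the call pattern we are describing. The model and data loader are stand-ins so the snippet is self-contained, and the keyword names (`loss`, `cuda`) follow the snippets quoted above rather than a verified MLRun signature; other arguments (optimizer, epochs, etc.) are omitted for brevity:

```python
import torch
import mlrun.frameworks.pytorch

# Stand-in model and data loader so the sketch is self-contained;
# in our real code these are our own network and dataset.
model = torch.nn.Linear(10, 2)
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 10), torch.randint(0, 2, (64,))
)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# Works: CPU path, loss function passed without weights.
mlrun.frameworks.pytorch.train(
    model,
    train_loader,
    loss=torch.nn.CrossEntropyLoss(),
    cuda=False,
)

# Fails: with cuda=True the iteration stops inside mlrun_interface.py,
# reporting that loss_function is None.
mlrun.frameworks.pytorch.train(
    model,
    train_loader,
    loss=torch.nn.CrossEntropyLoss(),
    cuda=True,
)

# Also fails: passing weights to the loss, as we would in plain PyTorch
# (e.g. torch.nn.CrossEntropyLoss(weight=class_weights)), raises an
# error saying that weights cannot be passed.
```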