When we try to perform data parallelism with `mlrun.frameworks.pytorch.train(cuda=True)`, training stops once execution reaches the `mlrun_interface.py` file: the iteration is aborted with an error saying that the `loss_function` we passed is `None`. Normally in PyTorch we would pass weights to the loss function when training a model, but with MLRun we pass the loss function without weights, as in `mlrun.frameworks.pytorch.train(loss=torch.nn.CrossEntropyLoss())`. If we try to pass weights to the loss function, we get an error saying that weights cannot be passed. When we pass the loss function without weights and set `cuda=False` (CPU), distributed training works perfectly; it fails only when `cuda=True`. Could you please help us resolve this error?
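
For reference, here is a minimal sketch of the call pattern we are describing. The model and data loader are stand-ins so the snippet is self-contained, and the keyword names (`loss`, `cuda`) follow the snippets quoted above rather than a verified MLRun signature; other arguments (optimizer, epochs, etc.) are omitted for brevity:

```python
import torch
import mlrun.frameworks.pytorch

# Stand-in model and data loader so the sketch is self-contained;
# in our real code these are our own network and dataset.
model = torch.nn.Linear(10, 2)
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 10), torch.randint(0, 2, (64,))
)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# Works: CPU path, loss function passed without weights.
mlrun.frameworks.pytorch.train(
    model,
    train_loader,
    loss=torch.nn.CrossEntropyLoss(),
    cuda=False,
)

# Fails: with cuda=True the iteration stops inside mlrun_interface.py,
# reporting that loss_function is None.
mlrun.frameworks.pytorch.train(
    model,
    train_loader,
    loss=torch.nn.CrossEntropyLoss(),
    cuda=True,
)

# Also fails: passing weights to the loss, as we would in plain PyTorch
# (e.g. torch.nn.CrossEntropyLoss(weight=class_weights)), raises an
# error saying that weights cannot be passed.
```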