""" File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parameter.py", line 59, in deepcopy
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in torch_function
return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU"""
when tried with
tune run --nproc_per_node 8 eleuther_eval --config evalconfig.yml
it's giving another error
tune run: error: Recipe eleuther_eval does not support distributed training.Please run without torchrun commands.
How to evaluate large models with torchtune?
The text was updated successfully, but these errors were encountered:
Awesome, and thank you for the reply! I am excited about the new feature, but in the meantime, I want to try the native LM harness. However, to do that, I need to convert TorchTune weights into HF weights. I am having issues with the conversion for the 70B model, so I have opened another issue for that. Please take a look when you have a chance. #922
I am a heavy user of Axolotl and TRL but am now switching to TorchTune. I anticipate encountering some bugs during this transition, so I will be opening issues as I come across them. :)
Additionally, I would be happy to contribute in any way that I can.
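Converting checkpoints between frameworks mostly comes down to renaming state-dict keys (and occasionally reshaping tensors such as fused QKV projections). The real torchtune-to-HF mapping lives in torchtune's checkpointing utilities; the key patterns below are purely illustrative placeholders, as a minimal sketch of the renaming step:

```python
import re

# Hypothetical key mapping: torchtune-style names -> HF-style names.
# These patterns are illustrative only; the actual mapping is defined
# inside torchtune's checkpointer.
KEY_PATTERNS = [
    (r"^layers\.(\d+)\.attn\.q_proj", r"model.layers.\1.self_attn.q_proj"),
    (r"^tok_embeddings", "model.embed_tokens"),
]

def remap_key(key: str) -> str:
    """Rename one state-dict key; leave unmatched keys untouched."""
    for pattern, repl in KEY_PATTERNS:
        new_key, n = re.subn(pattern, repl, key)
        if n:
            return new_key
    return key

def convert_state_dict(sd: dict) -> dict:
    """Apply the key remapping to a whole state dict."""
    return {remap_key(k): v for k, v in sd.items()}

print(remap_key("layers.0.attn.q_proj.weight"))
# model.layers.0.self_attn.q_proj.weight
```

For a 70B model you would typically stream shards from disk rather than load the whole state dict at once, but the renaming logic is the same.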
I am trying to evaluate a finetuned 70B model with torchrun and I'm getting an error.

Here is my config file:

When running with this command:

```
tune run eleuther_eval --config evalconfig.yml
```

I get this error:

```
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parameter.py", line 59, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU
```

When I try with

```
tune run --nproc_per_node 8 eleuther_eval --config evalconfig.yml
```

it gives a different error:

```
tune run: error: Recipe eleuther_eval does not support distributed training. Please run without torchrun commands.
```

How can I evaluate large models with torchtune?
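For context on the OOM: a rough back-of-the-envelope estimate (my own numbers, not anything torchtune reports) shows why a 70B model cannot fit on a single GPU for evaluation, which is why the single-process recipe fails even though the failing allocation is only 14 MiB:

```python
# Back-of-the-envelope check of why a 70B model OOMs on one GPU.
# These are rough estimates, not measurements from torchtune.

def model_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

params_70b = 70e9
bf16 = 2  # bytes per parameter in bfloat16

weights = model_memory_gib(params_70b, bf16)
print(f"70B weights in bf16: {weights:.0f} GiB")  # ~130 GiB

# A single 80 GiB A100/H100 cannot hold ~130 GiB of weights, so the
# process runs out of memory while still materializing the model;
# the tiny 14 MiB allocation is just the one that tipped it over.
```

So the OOM is expected on one GPU; the weights need to be sharded across devices (or quantized) before evaluation can run.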