lm harness distributed evaluation? #930

Open
monk1337 opened this issue May 3, 2024 · 2 comments

Comments

@monk1337
monk1337 commented May 3, 2024

I am trying to evaluate my fine-tuned 70B model with torchrun and I am getting an error.

Here is my config file:

model:
  _component_: torchtune.models.llama3.lora_llama3_70b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
  lora_rank: 256
  lora_alpha: 512

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-70B-Instruct/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir:  /tmp/Meta-Llama-3-70B-Instruct
  checkpoint_files: [
    model-00001-of-00030.safetensors,
    model-00002-of-00030.safetensors,
    model-00003-of-00030.safetensors,
    model-00004-of-00030.safetensors,
    model-00005-of-00030.safetensors,
    model-00006-of-00030.safetensors,
    model-00007-of-00030.safetensors,
    model-00008-of-00030.safetensors,
    model-00009-of-00030.safetensors,
    model-00010-of-00030.safetensors,
    model-00011-of-00030.safetensors,
    model-00012-of-00030.safetensors,
    model-00013-of-00030.safetensors,
    model-00014-of-00030.safetensors,
    model-00015-of-00030.safetensors,
    model-00016-of-00030.safetensors,
    model-00017-of-00030.safetensors,
    model-00018-of-00030.safetensors,
    model-00019-of-00030.safetensors,
    model-00020-of-00030.safetensors,
    model-00021-of-00030.safetensors,
    model-00022-of-00030.safetensors,
    model-00023-of-00030.safetensors,
    model-00024-of-00030.safetensors,
    model-00025-of-00030.safetensors,
    model-00026-of-00030.safetensors,
    model-00027-of-00030.safetensors,
    model-00028-of-00030.safetensors,
    model-00029-of-00030.safetensors,
    model-00030-of-00030.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-70B-Instruct
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  source: personal_data/data
  train_on_input: False
  max_seq_len: 8000
seed: 42
shuffle: True
batch_size: 10

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 2e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 10
max_steps_per_epoch: null
gradient_accumulation_steps: 32
compile: False

# Logging
output_dir: /tmp/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  project: torchtune
log_every_n_steps: 1
log_peak_memory_stats: False

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True

When I run this command:

tune run eleuther_eval --config evalconfig.yml

I get this error:

""" File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parameter.py", line 59, in deepcopy
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in torch_function
return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU"""

When I try:

tune run --nproc_per_node 8 eleuther_eval --config evalconfig.yml

it gives a different error:

tune run: error: Recipe eleuther_eval does not support distributed training. Please run without torchrun commands.

How can I evaluate large models with torchtune?

@joecummings
Contributor

This is something we're working on closely with the EleutherAI team and hope to provide soon. In the meantime, if you have enough RAM (and patience) you can try running on CPU - this will likely take a looooong time. You can also try the accelerate library by following the instructions here: https://github.com/EleutherAI/lm-evaluation-harness#multi-gpu-evaluation-with-hugging-face-accelerate.
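
Roughly, those two workarounds could look something like the commands below. All paths and task names are placeholders, the lm_eval invocations assume you have already converted the fine-tuned weights into a Hugging Face checkpoint, and the flags follow the linked harness README:

# CPU fallback via a torchtune CLI override (very slow; the bf16 70B weights alone are ~140 GB)
tune run eleuther_eval --config evalconfig.yml device=cpu

# Shard one copy of the model across all visible GPUs (needed for 70B, which won't fit on a single GPU)
lm_eval --model hf --model_args pretrained=/path/to/converted-hf-checkpoint,parallelize=True,dtype=bfloat16 --tasks hellaswag --batch_size 4

# Data-parallel evaluation, one full replica per GPU (only for models small enough to fit on one device)
accelerate launch -m lm_eval --model hf --model_args pretrained=/path/to/converted-hf-checkpoint --tasks hellaswag --batch_size 16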

Stay tuned for a torchtune native multi-GPU evaluation feature soon!

@monk1337
Author

monk1337 commented May 3, 2024

Awesome, and thank you for the reply! I am excited about the new feature, but in the meantime, I want to try the native LM harness. However, to do that, I need to convert TorchTune weights into HF weights. I am having issues with the conversion for the 70B model, so I have opened another issue for that. Please take a look when you have a chance. #922

I am a heavy user of Axolotl and TRL but am now switching to TorchTune. I anticipate encountering some bugs during this transition, so I will be opening issues as I come across them. :)
Additionally, I would be happy to contribute in any way that I can.
