
How can I find all the checkpoints and merge them manually? (LoRA) #922

Open
monk1337 opened this issue May 2, 2024 · 3 comments

Comments


monk1337 commented May 2, 2024

Great job, guys, on this awesome tool. I have just started using it and am loving it already. I have one question: I am fine-tuning for 6 epochs and want to store each checkpoint separately. Later I would like to evaluate each checkpoint; how can I do that?

@ebsmothers
Contributor

Hi @monk1337, thanks for the issue! Glad to hear you're finding the library useful. To clarify: are you interested in storing just the LoRA weights from the end of each epoch so that you can compare evaluations across different epochs?

To give a bit more info: we output two checkpoints at the end of each epoch to your output directory. For epoch i these are adapter_i.pt and {prefix}_model_i.pt (the value of prefix depends on which checkpoint format you're loading in). adapter_i.pt is a smaller checkpoint containing only the LoRA weights, while {prefix}_model_i.pt contains the LoRA weights merged back into the original model. So if you want to evaluate how your fine-tuned checkpoints are doing after each epoch, you can use the latter, as it contains the updated versions of the original model's params from your fine-tune.
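To make the layout concrete, here is an illustrative sketch (not torchtune's actual code) of collecting the per-epoch merged checkpoints described above for evaluation. The files are faked with empty placeholders, and "meta" is a stand-in for the format-dependent prefix:

```python
# Illustrative sketch: given the per-epoch layout described above
# (adapter_i.pt plus {prefix}_model_i.pt, with "meta" as a stand-in
# prefix), collect the merged checkpoints in epoch order so each one
# can be evaluated in turn.
import pathlib
import re
import tempfile

# Fake output directory with three epochs' worth of checkpoints.
out_dir = pathlib.Path(tempfile.mkdtemp())
for i in range(3):
    (out_dir / f"adapter_{i}.pt").touch()
    (out_dir / f"meta_model_{i}.pt").touch()

# The merged checkpoints (LoRA weights folded back in) are the ones to
# evaluate; sort them numerically by the epoch index in the filename.
merged = sorted(
    out_dir.glob("*_model_*.pt"),
    key=lambda p: int(re.search(r"_model_(\d+)\.pt$", p.name).group(1)),
)
epochs = [p.name for p in merged]
print(epochs)  # ['meta_model_0.pt', 'meta_model_1.pt', 'meta_model_2.pt']
```

The adapter_i.pt files are deliberately excluded by the glob, since only the merged model checkpoints are standalone for evaluation.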

For evaluation, we also have an integration with EleutherAI's eval harness so you are welcome to use that if you like. If you want more details on how to do this you can check out this section of our end-to-end tutorial.

Let me know if this makes sense or if there's something else you're looking for here, happy to address any follow-ups you may have.


monk1337 commented May 3, 2024

@ebsmothers Thank you for your detailed reply. I have one follow-up question: how can I convert this merged model folder, which contains multiple {prefix}_model_i.pt files, to Hugging Face format and upload it? I am trying to use the native lm harness, but it's giving errors due to the .pt format:

  1. How can I convert the .pt format of torchtune to HF so I can use other tools easily?
  2. How can I use distributed GPUs during inference with the lm harness in torchtune?

@ebsmothers
Contributor

Hi @monk1337, this is a good question. The checkpoints we output should generally adhere to the same format as the inputs (i.e. the logic for distributing weights across files should line up exactly), so in this case the output should still match the HF format.

The main difference would be the use of safetensors (as you pointed out, we write out to .pt format). Can you share a stack trace for (1) so I can see where the error is coming from? I'll need to figure out whether the issue is that we aren't writing out to safetensors format, or whether something else is happening here.
