
How can I find all the checkpoints and merge them manually? (LoRA) #922

Open
monk1337 opened this issue May 2, 2024 · 3 comments

Comments


monk1337 commented May 2, 2024

Great job, guys, on this awesome tool. I have just started using it and am loving it already. I have one question: I am fine-tuning for 6 epochs and want to store each checkpoint separately. Later I would like to evaluate each checkpoint; how can I do that?

@ebsmothers
Contributor

Hi @monk1337, thanks for the issue! Glad to hear you're finding the library useful. To clarify: are you interested in storing just the LoRA weights from the end of each epoch so that you can compare evaluations across different epochs?

To give a bit more info: we output two checkpoints at the end of each epoch to your output directory. For epoch i these are adapter_i.pt and {prefix}_model_i.pt (the value of prefix depends on which checkpoint format you're loading in). adapter_i.pt is a smaller checkpoint containing only the LoRA weights, while {prefix}_model_i.pt contains the LoRA weights merged back into the original model. So if you want to evaluate how your fine-tuned checkpoints are doing after each epoch, you can use the latter, as it contains the updated versions of the original model's params from your fine-tune.
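To make the layout concrete, here is an illustrative sketch (not torchtune's actual code) of collecting the per-epoch merged checkpoints described above for evaluation. The files are faked with empty placeholders, and "meta" is a stand-in for the format-dependent prefix:

```python
# Illustrative sketch: given the per-epoch layout described above
# (adapter_i.pt plus {prefix}_model_i.pt, with "meta" as a stand-in
# prefix), collect the merged checkpoints in epoch order so each one
# can be evaluated in turn.
import pathlib
import re
import tempfile

# Fake output directory with three epochs' worth of checkpoints.
out_dir = pathlib.Path(tempfile.mkdtemp())
for i in range(3):
    (out_dir / f"adapter_{i}.pt").touch()
    (out_dir / f"meta_model_{i}.pt").touch()

# The merged checkpoints (LoRA weights folded back in) are the ones to
# evaluate; sort them numerically by the epoch index in the filename.
merged = sorted(
    out_dir.glob("*_model_*.pt"),
    key=lambda p: int(re.search(r"_model_(\d+)\.pt$", p.name).group(1)),
)
epochs = [p.name for p in merged]
print(epochs)  # ['meta_model_0.pt', 'meta_model_1.pt', 'meta_model_2.pt']
```

The adapter_i.pt files are deliberately excluded by the glob, since only the merged model checkpoints are standalone for evaluation.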

For evaluation, we also have an integration with EleutherAI's eval harness so you are welcome to use that if you like. If you want more details on how to do this you can check out this section of our end-to-end tutorial.

Let me know if this makes sense or if there's something else you're looking for here, happy to address any follow-ups you may have.


monk1337 commented May 3, 2024

@ebsmothers Thank you for your detailed reply. I have one follow-up question: how can I convert this merged model folder, which contains multiple {prefix}_model_i.pt files, to Hugging Face format and upload it? I am trying to use the native lm harness, but it's giving errors due to the .pt format:

  1. How can I convert the .pt format of torchtune to HF so I can use other tools easily?
  2. How can I use distributed GPUs during inference with the lm harness in torchtune?

@ebsmothers
Contributor

Hi @monk1337, this is a good question. The checkpoints we output should generally adhere to the same format as the inputs (i.e. the logic for distributing weights across files should line up exactly), so in this case the output should still match the HF format.

The main difference would be the use of safetensors (as you pointed out, we write out to .pt format). Can you share a stack trace for (1) so I can see where the error is coming from? I'll need to figure out whether the issue is that we aren't writing out to safetensors format, or whether something else is happening here.
