[WIP] Simplified preparation of pretraining datasets #1057

Draft · wants to merge 4 commits into main

Conversation

awaelchli (Member) commented on Mar 7, 2024

The idea is that data modules exposing prepare_data can be invoked ahead of time to prepare their data. For in-memory datasets (e.g. finetuning) this is a no-op and not required. But for pretraining datasets (terabytes of raw text), it is very useful because the preparation can be scaled out to a large cluster with a single command:

litgpt prepare --data TinyLlama --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf
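
A rough sketch of the contract this implies is shown below; the class names, attributes, and bodies are made up for illustration and are not litgpt's actual data modules:

```python
# Illustrative sketch only: invented names to show the prepare_data contract
# described above, not litgpt's API.
from pathlib import Path
from typing import Optional


class PretrainingData:
    """Terabyte-scale corpus: prepare_data does the expensive one-off work."""

    def __init__(self, data_path: Path = Path("data/pretrain"), tokenizer_dir: Optional[Path] = None):
        self.data_path = data_path
        self.tokenizer_dir = tokenizer_dir

    def prepare_data(self) -> None:
        # Tokenize and shard the raw corpus to disk. This is the step that
        # `litgpt prepare` would fan out across a cluster ahead of training.
        ...

    def train_dataloader(self):
        # At fit time, only stream the already-prepared shards from data_path.
        ...


class FinetuningData:
    """Small in-memory dataset: nothing to do ahead of time."""

    def prepare_data(self) -> None:
        pass  # no-op; the data is loaded directly when training starts
```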

awaelchli added the enhancement label on Mar 7, 2024
carmocca added this to the Configurability milestone on Mar 13, 2024
carmocca (Contributor) commented:

This is blocked by not being able to run two optimize calls together. Maybe the tutorials should suggest python -m litgpt.data.prepare_* in the meantime for people who use this externally.
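
For context, the blocker looks roughly like the sketch below, assuming the optimize in question is litdata's optimize() (which litgpt's pretraining data preparation builds on); the paths, tokenizer function, and chunk settings are placeholders, not the real recipe:

```python
# Sketch only: placeholder paths and tokenize_fn; litdata.optimize is the assumed API.
from pathlib import Path

from litdata import optimize


def tokenize_fn(filepath: Path):
    # Placeholder: read one raw shard and yield tokenized sequences.
    yield from ()


if __name__ == "__main__":
    # A single `litgpt prepare` run for a TinyLlama-style recipe would want to call
    # optimize once per source corpus ...
    optimize(
        fn=tokenize_fn,
        inputs=sorted(Path("data/raw/slimpajama").rglob("*.jsonl")),
        output_dir="data/slimpajama",
        chunk_bytes="200MB",
        num_workers=8,
    )
    # ... and a second call within the same command is what currently does not work,
    # hence the interim suggestion to run the per-dataset prepare_* modules separately.
    optimize(
        fn=tokenize_fn,
        inputs=sorted(Path("data/raw/starcoder").rglob("*.parquet")),
        output_dir="data/starcoder",
        chunk_bytes="200MB",
        num_workers=8,
    )
```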

carmocca removed this from the Configurability milestone on Mar 14, 2024
awaelchli changed the base branch from wip to main on April 8, 2024
Labels: enhancement

2 participants