
litdata with huggingface instead of S3 #64

Open
ehartford opened this issue Mar 8, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@ehartford

🚀 Feature

I want to use litdata to stream the Hugging Face dataset cerebras/SlimPajama-627B instead of S3.

Motivation

How can I stream a Hugging Face dataset instead of one hosted on S3?

Pitch

litdata should be able to stream Hugging Face datasets directly, not only S3.

Alternatives

None; the goal is simply to stream the Hugging Face dataset instead of S3.

Additional context

I want to use the Hugging Face dataset, not S3.

@ehartford ehartford added enhancement New feature or request help wanted Extra attention is needed labels Mar 8, 2024

github-actions bot commented Mar 8, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Hey @ehartford. I have already prepared a version of SlimPajama. It is ready to use on the platform.

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Here is the code:

from litdata import StreamingDataset, CombinedStreamingDataset
from litdata.streaming.item_loader import TokensLoader
from tqdm import tqdm
import os
from torch.utils.data import DataLoader

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs 
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs 
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
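
A note on TokensLoader(block_size=2048 + 1): each item carries one extra token so that a training loop (not shown above) can split every block into inputs and shifted next-token targets. A common pattern, assuming batch has shape (batch_size, 2049):

input_ids = batch[:, :-1]  # first 2048 tokens, fed to the model
targets   = batch[:, 1:]   # same tokens shifted by one, used as labels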

@ehartford
Author

OK, but wouldn't it be better to support Hugging Face directly instead of having to copy the dataset to S3? AWS charges for ingress and egress.

@Borda
Member

Borda commented Mar 8, 2024

> OK, but wouldn't it be better to support Hugging Face directly instead of having to copy the dataset to S3?

We used to have some issues with the stability and reachability of HF models and datasets in the past, so I'd say S3 is the more reliable alternative...

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Hey @ehartford. In order to stream datasets, we need to optimize the dataset first. We could add an auto-optimize version for HF datasets, but it would still require downloading the dataset and converting it; a rough sketch of that path follows below.
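
For illustration, here is a minimal sketch of that manual download-and-convert path. Everything beyond litdata's optimize function is an assumption for the example's sake: the datasets and transformers packages, the GPT-2 tokenizer, the output directory, and the chunk sizing are all placeholders.

import torch
from datasets import load_dataset
from litdata import optimize
from transformers import AutoTokenizer

# The download happens here: load_dataset pulls the full dataset into the
# local Hugging Face cache (SlimPajama is large, so this takes a while).
# Kept at module level so litdata's worker processes can access it.
dataset = load_dataset("cerebras/SlimPajama-627B", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer choice

def tokenize_fn(index):
    # Yield one tensor of token ids per document.
    yield torch.tensor(tokenizer.encode(dataset[index]["text"]), dtype=torch.int)

if __name__ == "__main__":
    optimize(
        fn=tokenize_fn,
        inputs=list(range(len(dataset))),
        output_dir="slimpajama_optimized",  # stream locally, or upload to a bucket
        chunk_size=(2048 + 1) * 1024,       # tokens per chunk; sized for block_size=2049
        num_workers=4,
    )

The resulting output_dir can then be passed to StreamingDataset exactly like the S3 paths in the snippet above.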

HF supports some streaming with the webdataset backend, but I gave up on it as it was too unreliable for anything serious: the pipe breaks, it doesn't support multi-node, etc.
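
For reference, the native HF streaming mode referred to above looks like this (standard datasets API, shown only for comparison):

from datasets import load_dataset

# Iterates over the remote dataset without downloading it up front.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for example in stream:
    print(example["text"][:80])
    break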

If you are interested in using any particular dataset, I recommend trying out the Lightning AI platform.

Here is an example where I prepare Wikipedia Swedish: https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles

And another one where I prepared SlimPajama & StarCoder: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset.

Don't hesitate to ask any other questions :)

@Borda Borda removed the help wanted Extra attention is needed label Apr 18, 2024