Allow a StreamingDataset to wrap around when running in a CombinedStreamingDataset #74

lantiga · 2024-03-14T21:52:07Z

🚀 Feature

Consider adding the ability for a StreamingDataset to wrap around, restarting from the beginning instead of raising StopIteration, when it is combined with other datasets.

This is something we haven't ported from PackedDataset yet: https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py#L190

Motivation

This is useful when combining a smaller dataset with a very large one: we can make sure certain batches of data make it into the training process frequently enough, without the smaller dataset running out and invalidating the epoch while the other datasets are multi-billion tokens in size.

Pitch

Add a wrap property to StreamingDataset so that, instead of raising StopIteration, it keeps looping through the data.
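To make the pitch concrete, here is a rough sketch of the wrap behavior as a plain-Python wrapper. `WrapAround` and `num_wraps` are hypothetical names for illustration only, not part of the existing StreamingDataset API, and any re-iterable stands in for the dataset.

```python
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class WrapAround(Generic[T]):
    """Hypothetical wrapper: restart the underlying iterable when it is
    exhausted instead of letting iteration end."""

    def __init__(self, dataset: Iterable[T], wrap: bool = True) -> None:
        self.dataset = dataset
        self.wrap = wrap
        # Number of restarts so far; one possible way to track "epoch".
        self.num_wraps = 0

    def __iter__(self) -> Iterator[T]:
        while True:
            empty = True
            for item in self.dataset:
                empty = False
                yield item
            # Stop if wrapping is disabled, or if the dataset yielded
            # nothing (otherwise we would loop forever on empty data).
            if not self.wrap or empty:
                return
            self.num_wraps += 1
```

With `wrap=True` the iterator never ends on its own, so the consumer (for example the combined dataset or the training loop) has to decide when to stop.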

Alternatives

Handle this at the CombinedStreamingDataset level instead, so that each dataset raises StopIteration when it has to, but one dataset finishing doesn't invalidate the others. In both cases we need to decide what happens to the epoch within the dataset.
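A rough sketch of this alternative, again in plain Python rather than against the real CombinedStreamingDataset: `combine_with_wrap` and its `wrap` flags are hypothetical. Each dataset still raises StopIteration normally; the combining loop restarts the ones marked as wrapping and ends only when a non-wrapping dataset runs out.

```python
import random
from typing import Any, Iterable, Iterator, Sequence


def combine_with_wrap(
    datasets: Sequence[Iterable[Any]],
    weights: Sequence[float],
    wrap: Sequence[bool],
    seed: int = 42,
) -> Iterator[Any]:
    """Hypothetical combiner: sample a dataset by weight for each item and
    restart exhausted datasets whose wrap flag is set."""
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    indices = list(range(len(datasets)))
    while True:
        i = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            if not wrap[i]:
                # A non-wrapping dataset is exhausted: end the epoch.
                return
            # Restart the wrapping dataset and take its first item.
            iterators[i] = iter(datasets[i])
            try:
                yield next(iterators[i])
            except StopIteration:
                # The dataset is empty; nothing to wrap around to.
                return
```

For example, `combine_with_wrap([small, large], weights=[0.1, 0.9], wrap=[True, False])` would keep cycling through `small` until `large` is exhausted. The epoch question remains open here too: the sketch restarts `small` silently and does not count its passes.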

github-actions · 2024-03-14T21:52:32Z

Hi! Thanks for your contribution! Great first issue!

lantiga added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Mar 14, 2024

Borda removed the help wanted (Extra attention is needed) label on Apr 18, 2024
