🚀 Feature
Consider adding the ability to wrap around a StreamingDataset without raising a StopIteration when combining datasets.
This is something we haven't ported from PackedDataset https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py#L190
Motivation
This is useful when combining a smaller dataset with a very large one: we can make sure certain batches of data appear in the training process frequently enough, without invalidating the epoch when the other datasets are multiple billions of tokens in size.
Pitch
Add a `wrap` property to StreamingDataset so that, instead of raising StopIteration at the end of the data, it keeps looping through it.

Alternatives
Add handling of this at the CombinedStreamingDataset level, so that each dataset raises StopIteration when it has to, but we don't invalidate the others. In both cases we need to decide what happens to `epoch` within the dataset.
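To make the pitch concrete, here is a minimal sketch of the wrap-around behavior over a plain iterable. Note that `WrappedDataset`, its `wrap` flag, and its `epoch` counter are all hypothetical names for illustration; this is not the actual StreamingDataset API, and it sidesteps the sharding/resumption concerns a real implementation would have to handle.

```python
class WrappedDataset:
    """Hypothetical sketch: re-iterate an underlying iterable indefinitely.

    With wrap=True the iterator never raises StopIteration; it restarts
    the underlying dataset and bumps a per-dataset epoch counter instead.
    """

    def __init__(self, dataset, wrap=True):
        self.dataset = dataset
        self.wrap = wrap
        self.epoch = 0

    def __iter__(self):
        while True:
            for item in self.dataset:
                yield item
            if not self.wrap:
                return  # normal end-of-epoch behavior
            # Wrap around: count the completed pass and start over.
            self.epoch += 1
```

For example, drawing seven samples from a three-item dataset cycles through it, and `epoch` records how many full passes completed. A combined-dataset variant would do the same per constituent dataset, letting only the non-wrapped datasets terminate the overall iteration.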