Oversampling strategy for iterable datasets in interleave_datasets
#4893
Comments
Hi @lhoestq, Also, setting
Great @ylacombe, thanks! I'm assigning you this issue.

Hi @ylacombe :) Is there anything I can do to help? Feel free to ping me if you have any questions :)
Hi @lhoestq, I've actually already written the code in this commit, but I still have to update the docs and write some tests. I'm working on it. However, I still need your advice on one matter. EDIT:
Hi! Awesome :) Maybe you can pre-load the next sample to know if the dataset is empty or not?
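The pre-loading idea suggested above could be sketched like this (a hypothetical helper, not part of the `datasets` API; the name `peek` is mine):

```python
from itertools import chain

def peek(iterator):
    # Pre-load the first sample to learn whether the iterator is empty,
    # then return (is_empty, iterator) where the returned iterator
    # replays that first sample before the rest.
    try:
        first = next(iterator)
    except StopIteration:
        return True, iter(())
    return False, chain([first], iterator)
```

This way a source can be checked for emptiness without losing its first example.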
Hi @ylacombe, let us know if we can help with anything :)
Hi @lhoestq, I've finally made some progress on this. I've modified the
Thanks @ylacombe :) Using the

In our case it can be nice to define a
Resolved via #5036.
In #4831 @ylacombe added an oversampling strategy for `interleave_datasets`. However, right now it doesn't work for datasets loaded using `load_dataset(..., streaming=True)`, which are `IterableDataset` objects.

It would be nice to expand `interleave_datasets` to iterable datasets as well to support this oversampling strategy.

This can be implemented by adding the strategy to both `CyclingMultiSourcesExamplesIterable` and `RandomlyCyclingMultiSourcesExamplesIterable`, used in `_interleave_iterable_datasets` in `iterable_dataset.py`.
I would be happy to share some guidance if anyone would like to give it a shot :)
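The oversampling ("all exhausted") cycling behavior can be sketched in plain Python. This is a minimal illustration of the strategy, not the library's implementation; the function name is mine, and it assumes every source is non-empty and re-iterable:

```python
def interleave_all_exhausted(sources):
    # Alternate over the sources in order, restarting any source that
    # runs out, and stop only once every source has been exhausted at
    # least once. Assumes each source is non-empty and re-iterable.
    iterators = [iter(source) for source in sources]
    exhausted = [False] * len(sources)
    while not all(exhausted):
        for i in range(len(iterators)):
            try:
                yield next(iterators[i])
            except StopIteration:
                exhausted[i] = True
                if all(exhausted):
                    return
                # Restart the exhausted source so it gets oversampled.
                iterators[i] = iter(sources[i])
                yield next(iterators[i])
```

For example, interleaving `[1, 2]` with `[10, 11, 12]` restarts the shorter source until the longer one is exhausted, yielding `1, 10, 2, 11, 1, 12, 2`.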