Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oversampling strategy for iterable datasets in interleave_datasets #4893

Closed
lhoestq opened this issue Aug 25, 2022 · 9 comments
Closed

Oversampling strategy for iterable datasets in interleave_datasets #4893

lhoestq opened this issue Aug 25, 2022 · 9 comments
Assignees
Labels
good second issue Issues a bit more difficult than "Good First" issues

Comments

@lhoestq
Copy link
Member

lhoestq commented Aug 25, 2022

In #4831 @ylacombe added an oversampling strategy for interleave_datasets. However right now it doesn't work for datasets loaded using load_dataset(..., streaming=True), which are IterableDataset objects.

It would be nice to expand interleave_datasets for iterable datasets as well to support this oversampling strategy

>>> from datasets.iterable_dataset import IterableDataset, ExamplesIterable
>>> d1 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [0, 1, 2]], {}))
>>> d2 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [10, 11, 12, 13]], {}))
>>> d3 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [20, 21, 22, 23, 24]], {}))
>>> dataset = interleave_datasets([d1, d2, d3])  # is supported
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")  # is not supported yet
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]

This can be implemented by adding the strategy to both CyclingMultiSourcesExamplesIterable and RandomlyCyclingMultiSourcesExamplesIterable used in _interleave_iterable_datasets in iterable_dataset.py

I would be happy to share some guidance if anyone would like to give it a shot :)

@lhoestq lhoestq added the good second issue Issues a bit more difficult than "Good First" issues label Aug 25, 2022
@ylacombe
Copy link
Contributor

Hi @lhoestq,
I plunged into the code and it should be manageable for me to work on it!
#take

Also, setting d1, d2 and d3 as you did raised a SyntaxError: 'yield' inside list comprehension for me, on Python 3.8.10.
The following snippet works for me though:

d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))

@lhoestq
Copy link
Member Author

lhoestq commented Aug 25, 2022

Great @ylacombe thanks ! I'm assigning you this issue

@lhoestq
Copy link
Member Author

lhoestq commented Sep 6, 2022

Hi @ylacombe :) Is there anything I can do to help ? Feel free to ping me if you have any question :)

@ylacombe
Copy link
Contributor

ylacombe commented Sep 12, 2022

Hi @lhoestq,

I actually have already wrote the code last time on this commit but I still have to change the docs and write some tests though. I'm working on it.

However, I still your advice on one matter.
In #4831, when using a Dataset list with probabilities, I had change the original behavior so that it stops as soon as one or all datasets are out of samples. By nature, this behavior can't be applied with an IterableDataset because one only knows an iterable dataset is out of sample when receiving a StopIteration error after calling the iterator once again.
To sum up, as it is right know, the behavior is not consistent with an IterableDataset list or a Dataset list, when using probabilities.
To be honest, I think that the current behavior with a Dataset list is desirable and avoid having too many samples, so I would recommand keeping that as it is, but I can understand the desire to have the same behavior for both classes.
What do you think ? Please let me know if you need more details.

EDIT:
Here is an example:

>>> from tests.test_iterable_dataset import *
>>> d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
>>> d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
>>> d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> [x["a"] for x in dataset]
[10, 0, 11, 1, 2, 20, 12, 13]
>>> from tests.test_arrow_dataset import *
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)["a"]
[10, 0, 11, 1, 2]
[10, 0, 11, 1, 2]

@lhoestq
Copy link
Member Author

lhoestq commented Sep 13, 2022

Hi ! Awesome :)

Maybe you can pre-load the next sample to know if the dataset is empty or not ?
This way it should be possible to have the same behavior for IterableDataset

@lhoestq
Copy link
Member Author

lhoestq commented Sep 22, 2022

Hi @ylacombe let us know if we can help with anything :)

@ylacombe
Copy link
Contributor

Hi @lhoestq, I've finally made some advances in the matter. I've modified the IterableDataset behavior so that it aligns with the Dataset behavior as we have discussed. The documentation has been dealt with too.
It works as expected on my examples. However I'm having trouble figuring out how to test interleave_datasets on test_iterable_datasets.py as I have never worked with pytest. Could you help me on that or give me some indications?

@lhoestq
Copy link
Member Author

lhoestq commented Sep 27, 2022

Thanks @ylacombe :)

Using the pytest command, you can run all the functions in a python file that start with "test_*" and make sure they return not errors:

pytest tests/test_iterable_dataset.py

In our case it can be nice to define a test_interleave_datasets_with_oversampling function. This function can contain the code example that we mentioned earlier in this github issue to make sure it works as expected.

@mariosasko
Copy link
Collaborator

Resolved via #5036.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good second issue Issues a bit more difficult than "Good First" issues
Projects
None yet
Development

No branches or pull requests

3 participants