Oversampling strategy for iterable datasets in `interleave_datasets` #4893

lhoestq · 2022-08-25T10:06:55Z

In #4831 @ylacombe added an oversampling strategy for interleave_datasets. However right now it doesn't work for datasets loaded using load_dataset(..., streaming=True), which are IterableDataset objects.

It would be nice to expand interleave_datasets for iterable datasets as well to support this oversampling strategy

>>> from datasets.iterable_dataset import IterableDataset, ExamplesIterable
>>> d1 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [0, 1, 2]], {}))
>>> d2 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [10, 11, 12, 13]], {}))
>>> d3 = IterableDataset(ExamplesIterable(lambda: [(yield i, {"a": i}) for i in [20, 21, 22, 23, 24]], {}))
>>> dataset = interleave_datasets([d1, d2, d3])  # is supported
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")  # is not supported yet
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]

This can be implemented by adding the strategy to both CyclingMultiSourcesExamplesIterable and RandomlyCyclingMultiSourcesExamplesIterable used in _interleave_iterable_datasets in iterable_dataset.py

I would be happy to share some guidance if anyone would like to give it a shot :)

The text was updated successfully, but these errors were encountered:

ylacombe · 2022-08-25T13:23:57Z

Hi @lhoestq,
I plunged into the code and it should be manageable for me to work on it!
#take

Also, setting d1, d2 and d3 as you did raised a SyntaxError: 'yield' inside list comprehension for me, on Python 3.8.10.
The following snippet works for me though:

d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))

lhoestq · 2022-08-25T15:42:07Z

Great @ylacombe thanks ! I'm assigning you this issue

lhoestq · 2022-09-06T10:19:29Z

Hi @ylacombe :) Is there anything I can do to help ? Feel free to ping me if you have any question :)

ylacombe · 2022-09-12T16:00:54Z

Hi @lhoestq,

I actually have already wrote the code last time on this commit but I still have to change the docs and write some tests though. I'm working on it.

However, I still your advice on one matter.
In #4831, when using a Dataset list with probabilities, I had change the original behavior so that it stops as soon as one or all datasets are out of samples. By nature, this behavior can't be applied with an IterableDataset because one only knows an iterable dataset is out of sample when receiving a StopIteration error after calling the iterator once again.
To sum up, as it is right know, the behavior is not consistent with an IterableDataset list or a Dataset list, when using probabilities.
To be honest, I think that the current behavior with a Dataset list is desirable and avoid having too many samples, so I would recommand keeping that as it is, but I can understand the desire to have the same behavior for both classes.
What do you think ? Please let me know if you need more details.

EDIT:
Here is an example:

>>> from tests.test_iterable_dataset import *
>>> d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
>>> d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
>>> d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> [x["a"] for x in dataset]
[10, 0, 11, 1, 2, 20, 12, 13]
>>> from tests.test_arrow_dataset import *
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)["a"]
[10, 0, 11, 1, 2]
[10, 0, 11, 1, 2]

lhoestq · 2022-09-13T11:50:39Z

Hi ! Awesome :)

Maybe you can pre-load the next sample to know if the dataset is empty or not ?
This way it should be possible to have the same behavior for IterableDataset

lhoestq · 2022-09-22T13:20:57Z

Hi @ylacombe let us know if we can help with anything :)

ylacombe · 2022-09-25T14:50:23Z

Hi @lhoestq, I've finally made some advances in the matter. I've modified the IterableDataset behavior so that it aligns with the Dataset behavior as we have discussed. The documentation has been dealt with too.
It works as expected on my examples. However I'm having trouble figuring out how to test interleave_datasets on test_iterable_datasets.py as I have never worked with pytest. Could you help me on that or give me some indications?

lhoestq · 2022-09-27T15:54:44Z

Thanks @ylacombe :)

Using the pytest command, you can run all the functions in a python file that start with "test_*" and make sure they return not errors:

pytest tests/test_iterable_dataset.py

In our case it can be nice to define a test_interleave_datasets_with_oversampling function. This function can contain the code example that we mentioned earlier in this github issue to make sure it works as expected.

mariosasko · 2022-10-03T12:37:46Z

Resolved via #5036.

lhoestq added the good second issue Issues a bit more difficult than "Good First" issues label Aug 25, 2022

lhoestq assigned ylacombe Aug 25, 2022

ylacombe mentioned this issue Sep 28, 2022

Add oversampling strategy iterable datasets interleave #5036

Merged

mariosasko closed this as completed Oct 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oversampling strategy for iterable datasets in `interleave_datasets` #4893

Oversampling strategy for iterable datasets in `interleave_datasets` #4893

lhoestq commented Aug 25, 2022

ylacombe commented Aug 25, 2022

lhoestq commented Aug 25, 2022

lhoestq commented Sep 6, 2022

ylacombe commented Sep 12, 2022 •

edited

lhoestq commented Sep 13, 2022

lhoestq commented Sep 22, 2022

ylacombe commented Sep 25, 2022

lhoestq commented Sep 27, 2022

mariosasko commented Oct 3, 2022

Oversampling strategy for iterable datasets in interleave_datasets #4893

Oversampling strategy for iterable datasets in interleave_datasets #4893

Comments

lhoestq commented Aug 25, 2022

ylacombe commented Aug 25, 2022

lhoestq commented Aug 25, 2022

lhoestq commented Sep 6, 2022

ylacombe commented Sep 12, 2022 • edited

lhoestq commented Sep 13, 2022

lhoestq commented Sep 22, 2022

ylacombe commented Sep 25, 2022

lhoestq commented Sep 27, 2022

mariosasko commented Oct 3, 2022

Oversampling strategy for iterable datasets in `interleave_datasets` #4893

Oversampling strategy for iterable datasets in `interleave_datasets` #4893

ylacombe commented Sep 12, 2022 •

edited