
Add oversampling strategy iterable datasets interleave #5036

Conversation

ylacombe (Contributor)
Hello everyone,
Following issue #4893 and PR #4831, I propose here an oversampling strategy for a list of IterableDataset objects.
The all_exhausted strategy stops building the new dataset as soon as every sample from each dataset has been added at least once.
It roughly follows the same logic as #4831, namely:

  • if probabilities is None and the strategy is all_exhausted, it simply performs a round-robin interleaving that stops when the longest dataset runs out of samples. The length of the new dataset is then $maxLengthDataset \times nbDataset$.
  • if probabilities is not None and the strategy is all_exhausted, it keeps track of the datasets that have already run out of samples but keeps drawing from them (restarting them), and stops as soon as every dataset has run out of samples at least once (see the sketch after this list).

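As a quick illustration of how the new strategy is meant to be used from the public API, here is a minimal sketch (the dataset names, probabilities and seed below are placeholders chosen for the example, not taken from this PR):

>>> from datasets import load_dataset, interleave_datasets
>>> # Two streaming datasets of different lengths; the names are hypothetical.
>>> d1 = load_dataset("dataset_A", split="train", streaming=True)
>>> d2 = load_dataset("dataset_B", split="train", streaming=True)
>>> # With all_exhausted, exhausted datasets are restarted and sampling only
>>> # stops once every dataset has been fully seen at least once.
>>> mixed = interleave_datasets(
...     [d1, d2],
...     probabilities=[0.7, 0.3],
...     seed=42,
...     stopping_strategy="all_exhausted",
... )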
To be consistent and to align with the Dataset behavior, please note that the behavior of the default strategy (first_exhausted) has also been changed: it now stops as soon as a dataset runs out of samples, whereas it used to stop only upon receiving a StopIteration error.
To illustrate this change, consider the following snippet:

>>> from tests.test_iterable_dataset import *
>>> d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
>>> d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
>>> d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))
>>> dataset = interleave_datasets([d1, d2, d3])
>>> [x["a"] for x in dataset]

With this change, the result is [10, 0, 11, 1, 2] instead of [10, 0, 11, 1, 2, 20, 12, 13].
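For comparison, here is a sketch of how the same three datasets could be exercised under the new all_exhausted strategy; the exact interleaving order is an implementation detail, so only the coverage property is asserted:

>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> values = [x["a"] for x in dataset]
>>> # The shorter datasets are restarted until the longest one (d3) is exhausted,
>>> # so every sample from d1, d2 and d3 appears at least once.
>>> assert set(values) == {0, 1, 2, 10, 11, 12, 13, 20, 21, 22, 23, 24}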

I modified this behavior because it is consistent with the under/oversampling approach and because it unifies the undersampling and oversampling code, but I remain open to suggestions.

HuggingFaceDocBuilderDev commented Sep 28, 2022

The documentation is not available anymore as the PR was closed or merged.

lhoestq (Member) left a comment
Awesome thanks ! Good idea to have HasNextIterator :)

I just have one comment:

src/datasets/iterable_dataset.py (review thread, resolved)
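For context, HasNextIterator refers to an iterator wrapper that can report whether another element is available before consuming the stream, which lets the interleaving loop detect exhausted sources without catching StopIteration mid-iteration. A rough illustrative sketch of that general pattern (not the PR's actual implementation) could look like this:

class HasNextIterator:
    """Illustrative sketch: wraps an iterator and prefetches one element
    so that hasnext() can be queried without consuming the stream twice."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._prefetch()

    def _prefetch(self):
        try:
            self._next_item = next(self._it)
            self._has_next = True
        except StopIteration:
            self._next_item = None
            self._has_next = False

    def hasnext(self):
        # True while the underlying iterator still has elements to yield.
        return self._has_next

    def __iter__(self):
        return self

    def __next__(self):
        if not self._has_next:
            raise StopIteration
        item = self._next_item
        self._prefetch()
        return item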
Commit: Remove resetting of empty iterators
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
lhoestq (Member) left a comment

LGTM ! Thank you :D

lhoestq merged commit 1529bdc into huggingface:main Sep 30, 2022
ylacombe deleted the add-oversampling-strategy-iterable-datasets-interleave branch September 30, 2022 12:30