
Add oversampling strategies to interleave datasets #4831

Conversation

ylacombe
Contributor

Hello everyone,
Here is a proposal to improve the interleave_datasets function.
Following Issue #3064 and @lhoestq's comment, I propose code that performs oversampling when interleaving a list of Dataset objects.

I encountered this problem myself while implementing training on a multilingual dataset, following a training strategy similar to that of the XL-Sum paper, a multilingual abstractive summarization dataset whose multilingual training set is built by sampling from the languages according to a smoothing strategy. The main idea is to sample languages that have few samples more frequently than the other languages.

As described in Issue #3064, the current default strategy is an undersampling strategy, which stops as soon as one dataset runs out of samples. The new all_exhausted strategy stops building the new dataset only once every sample of every dataset has been added at least once.

How does it work in practice:

  • If probabilities is None and the strategy is all_exhausted, it simply performs a round-robin interleaving that stops when the longest dataset runs out of samples. The new dataset length is then $maxLengthDataset \times nbDataset$.
  • If probabilities is not None and the strategy is all_exhausted, it keeps track of the datasets that have run out of samples but keeps drawing from them, and stops as soon as every dataset has run out of samples at least once.
  • In the other cases, the behaviour is the same as before, except that when probabilities are specified, interleaving now really stops AS SOON AS a dataset runs out of samples.
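The round-robin variant of all_exhausted can be sketched in plain Python. This is only an illustration of the behaviour described above, not the actual implementation; the exact wrap-around order of the real code may differ slightly:

```python
# Illustration only (NOT the library's implementation): each dataset is
# cycled until the longest one has been fully consumed once, so shorter
# datasets wrap around, i.e. they get oversampled.

def round_robin_all_exhausted(datasets):
    max_length = max(len(d) for d in datasets)
    out = []
    for i in range(max_length):
        for d in datasets:
            out.append(d[i % len(d)])  # wrap around for shorter datasets
    return out

d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]
result = round_robin_all_exhausted([d1, d2, d3])
print(len(result))  # max_length * nb_datasets = 5 * 3 = 15
```

The length formula from the first bullet falls out directly: the outer loop runs max_length times and appends one sample per dataset on each pass.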

More on the last sentence:
The previous example of interleave_datasets was:

    >>> from datasets import Dataset, interleave_datasets
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12]

With my implementation, dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42) gives:

    >>> dataset["a"]
    [10, 0, 11, 1, 2]

because d1 already runs out of samples right after 2 is added.

Example of the results of applying the different strategies:

    >>> from datasets import Dataset, interleave_datasets
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2]
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]

Final note: I've been using this code for a research project involving a large-scale multilingual dataset. One should be careful when using oversampling to avoid exploding the size of the dataset. For example, if a very large dataset has a low probability of being sampled, the final dataset may end up several times the size of that large dataset.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@lhoestq (Member) left a comment

Awesome, thank you! This is amazing :)

Could you add a test case in test_arrow_dataset.py to test this stopping strategy, please?

I think we can also mention it in the documentation in process.mdx (rendered at https://hf.co/docs/datasets/process): what about a section on Interleave right after the Concatenate section?

Currently there's just a "Tip" that redirects to the streaming/interleave docs

    if (iterable and stopping_strategy != "first_exhausted") or (
        stopping_strategy not in ["first_exhausted", "all_exhausted"]
    ):
        raise ValueError(
@lhoestq (Member):

I think it should be a ValueError if stopping_strategy not in ["first_exhausted", "all_exhausted"] and a NotImplementedError if iterable and stopping_strategy != "first_exhausted"

@ylacombe
Contributor Author

Hi @lhoestq,
Thanks for your review! I've added the requested mention in the documentation and corrected the error type in interleave_datasets.
I've also added test cases in test_arrow_dataset.py, which proved useful since it allowed me to detect an error in the case of an oversampling strategy with no sampling probabilities.
Could you double-check this part? I've commented the code to explain the approach.
Thanks!

@lhoestq (Member) left a comment

It looks all good to me! Thanks again, and for the tests and docs as well :)

Merging!


    # Reasoning behind the following operation: for each dataset's indices (i.e. each
    # column), repeat the indices so that every column has max_length entries per dataset.
    # For example, if max_length is 5 and the i-th dataset has 3 samples,
    # the i-th column will be [0, 1, 2, 0, 1]
    indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
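The indices trick quoted above can be seen directly by running it on toy lengths; broadcasting the (max_length, 1) column against the (1, nb_datasets) row of lengths produces one wrapped index column per dataset:

```python
import numpy as np

# Toy illustration of the quoted line: for datasets of lengths [3, 4, 5],
# build a (max_length, nb_datasets) matrix where column i holds row indices
# into dataset i, wrapping around for the shorter datasets (oversampling).
lengths = [3, 4, 5]
indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
print(indices)
# column 0 (dataset of length 3) reads [0, 1, 2, 0, 1] top to bottom,
# exactly as the code comment describes
```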
@lhoestq (Member): nice trick!

@ylacombe (Contributor, Author): Thanks for that comment and for the review!

@lhoestq lhoestq merged commit dc5cb17 into huggingface:main Aug 24, 2022
@ylacombe ylacombe deleted the add-oversampling-strategies-to-interleave-datasets branch August 25, 2022 08:21
@ymoslem

ymoslem commented Jul 1, 2023

@ylacombe Thanks for your effort!

Final note: I've been using that code for a research project involving a large-scale multilingual dataset. One should be careful when using oversampling to avoid exploding the size of the dataset. For example, if a very large data set has a low probability of being sampled, the final dataset may be several times the size of that large data set.

May I ask why that is, and how to solve it? In some scenarios, such as domain adaptation with limited resources, it is normal to have a big generic dataset and a small in-domain dataset.

Here is an example with data sizes 8:2 and oversampling ratios 0.2:0.8

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [1, 2, 3, 4, 5, 6, 7, 8]})
d2 = Dataset.from_dict({"a": [9, 10]})

new_d = interleave_datasets([d1, d2], probabilities=[0.2, 0.8], seed=42, stopping_strategy="all_exhausted")
print(len(new_d))
print(new_d["a"])

37
[9, 10, 9, 10, 1, 9, 10, 9, 2, 10, 9, 10, 9, 10, 9, 10, 9, 3, 10, 9, 10, 9, 10, 9, 10, 4, 9, 5, 6, 10, 9, 10, 9, 10, 9, 7, 8]

The ratios sampled from the two original datasets to the output dataset are correct. However, the length of the output dataset is 37, which is too big. I think it should be only large enough to make the smaller dataset similar in size to the bigger dataset. Any solution for this? Many thanks!

@ylacombe
Contributor Author

ylacombe commented Jul 3, 2023

Hi @ymoslem, it's a great question and yes, it's normal to have two different-sized datasets to interleave!

My recommendation here would be to either use probabilities more biased towards the large dataset (e.g. [0.8, 0.2]) so that the big dataset is exhausted more quickly, or not to use probabilities at all; in that case, new_d's length will be 16 (nb_datasets * len(largest_dataset)).

Let me know if I need to be clearer!
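The size blow-up being discussed can be sketched without the library at all. The helper below (a hypothetical illustration, not datasets code) counts how many draws it takes until every dataset has been fully seen at least once: a large dataset drawn with a low probability takes many draws to exhaust, and every one of those draws adds an (oversampled) row to the output.

```python
import random

# Hedged sketch of the "all_exhausted" stopping rule with probabilities:
# keep sampling a dataset index according to the weights, wrapping around
# exhausted datasets, and stop once every dataset has been seen in full.
# Illustration only; the real interleave_datasets logic may differ.

def simulate_all_exhausted_length(sizes, probabilities, seed=42):
    rng = random.Random(seed)
    positions = [0] * len(sizes)      # next row to read in each dataset
    exhausted = [False] * len(sizes)  # has each dataset been fully seen once?
    total = 0
    while not all(exhausted):
        i = rng.choices(range(len(sizes)), weights=probabilities)[0]
        positions[i] += 1
        total += 1
        if positions[i] >= sizes[i]:
            exhausted[i] = True
            positions[i] = 0          # wrap around: keep oversampling it
    return total

# ymoslem's setting: an 8-row dataset drawn with probability 0.2 takes many
# draws to exhaust, inflating the output; flipping the weights shrinks it.
print(simulate_all_exhausted_length([8, 2], [0.2, 0.8]))
print(simulate_all_exhausted_length([8, 2], [0.8, 0.2]))
```

With [0.2, 0.8] the loop needs roughly 8 / 0.2 = 40 draws on average to see all 8 rows of the large dataset, which matches the order of magnitude of the length-37 output above.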

@ymoslem

ymoslem commented Jul 11, 2023

@ylacombe Many thanks for your prompt response! As we needed to implement certain oversampling experiments, we ended up using Pandas.

Treating each dataset as a class with a distinct "label" value:

import pandas as pd

def oversample(df):
  # rows per label, sorted descending, so the majority class comes first
  classes = df.label.value_counts().to_dict()
  most = max(classes.values())
  classes_list = []
  for key in classes:
    classes_list.append(df[df['label'] == key])
  # resample every minority class WITH replacement up to the majority size
  classes_sample = []
  for i in range(1, len(classes_list)):
    classes_sample.append(classes_list[i].sample(most, replace=True))
  df_oversampled = pd.concat(classes_sample)
  final_df = pd.concat([df_oversampled, classes_list[0]], axis=0)
  final_df = final_df.reset_index(drop=True)
  return final_df
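For completeness, here is a self-contained run of that helper on a toy 8:2 frame (the "big"/"small" labels and the fixed random_state are illustrative additions, not part of the original snippet); the minority class is resampled up to the majority size, so the result has exactly 2 * len(majority) rows rather than the 37 produced by all_exhausted above:

```python
import pandas as pd

def oversample(df):
    # rows per label, sorted descending (majority class first)
    classes = df.label.value_counts().to_dict()
    most = max(classes.values())
    classes_list = [df[df["label"] == key] for key in classes]
    # resample every minority class WITH replacement up to the majority size;
    # random_state fixed here only so the example is reproducible
    classes_sample = [c.sample(most, replace=True, random_state=0)
                      for c in classes_list[1:]]
    final_df = pd.concat(classes_sample + [classes_list[0]], axis=0)
    return final_df.reset_index(drop=True)

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": ["big"] * 8 + ["small"] * 2,
})
balanced = oversample(df)
print(balanced.label.value_counts().to_dict())  # both classes end up with 8 rows
```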

Reference
