
Add oversampling strategies to interleave datasets #4831

Conversation

ylacombe
Contributor

Hello everyone,
Here is a proposal to improve the interleave_datasets function.
Following Issue #3064 and @lhoestq's comment, I propose code that performs oversampling when interleaving a list of Dataset objects.

I encountered this problem myself while implementing training on a multilingual dataset, following a training strategy similar to that of the XL-Sum paper, a multilingual abstractive summarization dataset whose multilingual training set is built by sampling from the languages according to a smoothing strategy. The main idea is to sample languages that have few samples more frequently than the other languages.

As described in Issue #3064, the current default strategy is an undersampling strategy, which stops as soon as one dataset runs out of samples. The new all_exhausted strategy stops building the new dataset only once every sample of every dataset has been added at least once.

How does it work in practice:

  • If probabilities is None and the strategy is all_exhausted, it simply performs a round-robin interleaving that stops when the longest dataset runs out of samples. The new dataset length is then $maxLengthDataset \times nbDataset$.
  • If probabilities is not None and the strategy is all_exhausted, it keeps track of the datasets that have run out of samples but keeps drawing from them, and stops as soon as every dataset has run out of samples at least once.
  • In the other cases, the behaviour is the same as before, except that when probabilities are specified, interleaving now really stops AS SOON AS a dataset runs out of samples.
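The round-robin variant of all_exhausted can be sketched in plain Python. This is only an illustration of the behaviour described above, not the actual implementation; the exact wrap-around order of the real code may differ slightly:

```python
# Illustration only (NOT the library's implementation): each dataset is
# cycled until the longest one has been fully consumed once, so shorter
# datasets wrap around, i.e. they get oversampled.

def round_robin_all_exhausted(datasets):
    max_length = max(len(d) for d in datasets)
    out = []
    for i in range(max_length):
        for d in datasets:
            out.append(d[i % len(d)])  # wrap around for shorter datasets
    return out

d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]
result = round_robin_all_exhausted([d1, d2, d3])
print(len(result))  # max_length * nb_datasets = 5 * 3 = 15
```

The length formula from the first bullet falls out directly: the outer loop runs max_length times and appends one sample per dataset on each pass.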

More on the last sentence:
The previous example of interleave_datasets was:

    >>> from datasets import Dataset, interleave_datasets
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12]

With my implementation, dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42) gives:

    >>> dataset["a"]
    [10, 0, 11, 1, 2]

because d1 already runs out of samples right after 2 is added.

Example of the results of applying the different strategies:

    >>> from datasets import Dataset, interleave_datasets
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2]
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
    >>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
    >>> dataset = interleave_datasets([d1, d2, d3])
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
    >>> dataset["a"]
    [10, 0, 11, 1, 2]
    >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
    >>> dataset["a"]
    [10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]

Final note: I've been using this code for a research project involving a large-scale multilingual dataset. One should be careful when using oversampling to avoid exploding the size of the dataset. For example, if a very large dataset has a low probability of being sampled, the final dataset may end up several times the size of that large dataset.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@lhoestq (Member) left a comment

Awesome, thank you! This is amazing :)

Could you add a test case in test_arrow_dataset.py to test this stopping strategy, please?

I think we can also mention it in the documentation in process.mdx (rendered at https://hf.co/docs/datasets/process): what about a section on Interleave right after the Concatenate section?

Currently there's just a "Tip" that redirects to the streaming/interleave docs

    if (iterable and stopping_strategy != "first_exhausted") or (
        stopping_strategy not in ["first_exhausted", "all_exhausted"]
    ):
        raise ValueError(
@lhoestq (Member):

I think it should be a ValueError if stopping_strategy not in ["first_exhausted", "all_exhausted"] and a NotImplementedError if iterable and stopping_strategy != "first_exhausted"

@ylacombe
Contributor Author

Hi @lhoestq,
Thanks for your review! I've added the requested mention in the documentation and corrected the error type in interleave_datasets.
I've also added test cases in test_arrow_dataset.py, which proved useful since it allowed me to detect an error in the case of an oversampling strategy with no sampling probabilities.
Could you double-check this part? I've commented the code to explain the approach.
Thanks!

@lhoestq (Member) left a comment

It looks all good to me! Thanks again, and for the tests and docs as well :)

Merging!


    # Reasoning behind the following operation: for each dataset's indices (i.e. each
    # column), repeat the indices so that every column has max_length entries per dataset.
    # For example, if max_length is 5 and the i-th dataset has 3 samples,
    # the i-th column will be [0, 1, 2, 0, 1]
    indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
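The indices trick quoted above can be seen directly by running it on toy lengths; broadcasting the (max_length, 1) column against the (1, nb_datasets) row of lengths produces one wrapped index column per dataset:

```python
import numpy as np

# Toy illustration of the quoted line: for datasets of lengths [3, 4, 5],
# build a (max_length, nb_datasets) matrix where column i holds row indices
# into dataset i, wrapping around for the shorter datasets (oversampling).
lengths = [3, 4, 5]
indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
print(indices)
# column 0 (dataset of length 3) reads [0, 1, 2, 0, 1] top to bottom,
# exactly as the code comment describes
```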
@lhoestq (Member): nice trick!

@ylacombe (Contributor, Author): Thanks for that comment and for the review!

@lhoestq lhoestq merged commit dc5cb17 into huggingface:main Aug 24, 2022
@ylacombe ylacombe deleted the add-oversampling-strategies-to-interleave-datasets branch August 25, 2022 08:21
@ymoslem

ymoslem commented Jul 1, 2023

@ylacombe Thanks for your effort!

Final note: I've been using that code for a research project involving a large-scale multilingual dataset. One should be careful when using oversampling to avoid exploding the size of the dataset. For example, if a very large data set has a low probability of being sampled, the final dataset may be several times the size of that large data set.

May I ask why that is, and how to solve it? In some scenarios, such as domain adaptation with limited resources, it is normal to have a big generic dataset and a small in-domain dataset.

Here is an example with data sizes 8:2 and oversampling ratios 0.2:0.8

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [1, 2, 3, 4, 5, 6, 7, 8]})
d2 = Dataset.from_dict({"a": [9, 10]})

new_d = interleave_datasets([d1, d2], probabilities=[0.2, 0.8], seed=42, stopping_strategy="all_exhausted")
print(len(new_d))
print(new_d["a"])

37
[9, 10, 9, 10, 1, 9, 10, 9, 2, 10, 9, 10, 9, 10, 9, 10, 9, 3, 10, 9, 10, 9, 10, 9, 10, 4, 9, 5, 6, 10, 9, 10, 9, 10, 9, 7, 8]

The ratios sampled from the two original datasets to the output dataset are correct. However, the length of the output dataset is 37, which is too big. I think it should be only large enough to make the smaller dataset similar in size to the bigger dataset. Any solution for this? Many thanks!

@ylacombe
Contributor Author

ylacombe commented Jul 3, 2023

Hi @ymoslem, it's a great question and yes, it's normal to have two different-sized datasets to interleave!

My recommendation here would be to either use probabilities more biased towards the large dataset (e.g. [0.8, 0.2]) so that the big dataset is exhausted more quickly, or not to use probabilities at all; in that case, new_d's length will be 16 (nb_datasets * len(largest_dataset)).

Let me know if I need to be clearer!
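The size blow-up being discussed can be sketched without the library at all. The helper below (a hypothetical illustration, not datasets code) counts how many draws it takes until every dataset has been fully seen at least once: a large dataset drawn with a low probability takes many draws to exhaust, and every one of those draws adds an (oversampled) row to the output.

```python
import random

# Hedged sketch of the "all_exhausted" stopping rule with probabilities:
# keep sampling a dataset index according to the weights, wrapping around
# exhausted datasets, and stop once every dataset has been seen in full.
# Illustration only; the real interleave_datasets logic may differ.

def simulate_all_exhausted_length(sizes, probabilities, seed=42):
    rng = random.Random(seed)
    positions = [0] * len(sizes)      # next row to read in each dataset
    exhausted = [False] * len(sizes)  # has each dataset been fully seen once?
    total = 0
    while not all(exhausted):
        i = rng.choices(range(len(sizes)), weights=probabilities)[0]
        positions[i] += 1
        total += 1
        if positions[i] >= sizes[i]:
            exhausted[i] = True
            positions[i] = 0          # wrap around: keep oversampling it
    return total

# ymoslem's setting: an 8-row dataset drawn with probability 0.2 takes many
# draws to exhaust, inflating the output; flipping the weights shrinks it.
print(simulate_all_exhausted_length([8, 2], [0.2, 0.8]))
print(simulate_all_exhausted_length([8, 2], [0.8, 0.2]))
```

With [0.2, 0.8] the loop needs roughly 8 / 0.2 = 40 draws on average to see all 8 rows of the large dataset, which matches the order of magnitude of the length-37 output above.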

@ymoslem

ymoslem commented Jul 11, 2023

@ylacombe Many thanks for your prompt response! As we needed to implement certain oversampling experiments, we ended up using Pandas.

Treating each dataset as a class with a distinct "label" value:

import pandas as pd

def oversample(df):
  # rows per label, sorted descending, so the majority class comes first
  classes = df.label.value_counts().to_dict()
  most = max(classes.values())
  classes_list = []
  for key in classes:
    classes_list.append(df[df['label'] == key])
  # resample every minority class WITH replacement up to the majority size
  classes_sample = []
  for i in range(1, len(classes_list)):
    classes_sample.append(classes_list[i].sample(most, replace=True))
  df_oversampled = pd.concat(classes_sample)
  final_df = pd.concat([df_oversampled, classes_list[0]], axis=0)
  final_df = final_df.reset_index(drop=True)
  return final_df
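For completeness, here is a self-contained run of that helper on a toy 8:2 frame (the "big"/"small" labels and the fixed random_state are illustrative additions, not part of the original snippet); the minority class is resampled up to the majority size, so the result has exactly 2 * len(majority) rows rather than the 37 produced by all_exhausted above:

```python
import pandas as pd

def oversample(df):
    # rows per label, sorted descending (majority class first)
    classes = df.label.value_counts().to_dict()
    most = max(classes.values())
    classes_list = [df[df["label"] == key] for key in classes]
    # resample every minority class WITH replacement up to the majority size;
    # random_state fixed here only so the example is reproducible
    classes_sample = [c.sample(most, replace=True, random_state=0)
                      for c in classes_list[1:]]
    final_df = pd.concat(classes_sample + [classes_list[0]], axis=0)
    return final_df.reset_index(drop=True)

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": ["big"] * 8 + ["small"] * 2,
})
balanced = oversample(df)
print(balanced.label.value_counts().to_dict())  # both classes end up with 8 rows
```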

Reference
