Add oversampling strategies to interleave datasets (#4831)
* add a new strategy for interleave_datasets (oversampling strat)

* format code according to the library style

* update interleave_datasets description

* Add correct Error type for a non implemented strategy in interleave_datasets

* correcting an example in the comments

* adding comment to the default case of _interleave_map_style_datasets

* correct the case of oversampling strategy with no probabilities of _interleave_map_style_datasets and add comments

* reformat with datasets's style

* add tests for oversampling strategy in interleave_datasets

* mention of the sampling strategy of interleave_datasets in the documentation of process.mdx
ylacombe committed Aug 24, 2022
1 parent 9e9cf48 commit dc5cb17
Showing 4 changed files with 161 additions and 14 deletions.
38 changes: 32 additions & 6 deletions docs/source/process.mdx
@@ -489,12 +489,6 @@ Separate datasets can be concatenated if they share the same column types. Conca
>>> bert_dataset = concatenate_datasets([bookcorpus, wiki])
```

<Tip>

You can also mix several datasets together by taking alternating examples from each one to create a new dataset. This is known as *interleaving*, which is enabled by the [`interleave_datasets`] function. Both [`interleave_datasets`] and [`concatenate_datasets`] work with regular [`Dataset`] and [`IterableDataset`] objects. Refer to the [Stream](./stream#interleave) guide for an example of how to interleave datasets.

</Tip>

You can also concatenate two datasets horizontally by setting `axis=1` as long as the datasets have the same number of rows:

```py
@@ -503,6 +497,38 @@ You can also concatenate two datasets horizontally by setting `axis=1` as long a
>>> bookcorpus_with_ids = concatenate_datasets([bookcorpus, bookcorpus_ids], axis=1)
```

### Interleave

You can also mix several datasets together by taking alternating examples from each one to create a new dataset. This is known as *interleaving*, which is enabled by the [`interleave_datasets`] function. Both [`interleave_datasets`] and [`concatenate_datasets`] work with regular [`Dataset`] and [`IterableDataset`] objects.
Refer to the [Stream](./stream#interleave) guide for an example of how to interleave [`IterableDataset`] objects.

You can define sampling probabilities for each of the original datasets to specify how to interleave them.
In this case, the new dataset is constructed by drawing examples one at a time from a randomly chosen source dataset until one of the datasets runs out of samples.

```py
>>> from datasets import Dataset, interleave_datasets
>>> seed = 42
>>> probabilities = [0.3, 0.5, 0.2]
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=probabilities, seed=seed)
>>> dataset["a"]
[10, 11, 20, 12, 0, 21, 13]
```

In the case of [`Dataset`] objects, you can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e. dataset construction stops as soon as one of the datasets runs out of samples.
You can specify `stopping_strategy="all_exhausted"` to execute an oversampling strategy. In this case, dataset construction stops as soon as every sample in every dataset has been added at least once. In practice, this means that once a dataset is exhausted, sampling returns to the beginning of that dataset until the stopping criterion is reached.
Note that if no sampling probabilities are specified, the new dataset will have `max_length_datasets * nb_dataset` samples, i.e. the length of the longest dataset multiplied by the number of datasets:

```py
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 20]
```
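
As a quick illustrative check (reusing `d1`, `d2`, `d3`, and `dataset` from the example above), the size relation described earlier can be verified directly:

```py
>>> # With no probabilities and `stopping_strategy="all_exhausted"`, the interleaved dataset
>>> # contains len(longest dataset) * number of datasets samples.
>>> len(dataset) == max(len(d1), len(d2), len(d3)) * 3
True
```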

## Format

The [`~Dataset.set_format`] function changes the format of a column to be compatible with some common data formats. Specify the output you'd like in the `type` parameter and the columns you want to format. Formatting is applied on-the-fly.
54 changes: 51 additions & 3 deletions src/datasets/arrow_dataset.py
@@ -4851,6 +4851,7 @@ def _interleave_map_style_datasets(
seed: Optional[int] = None,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
stopping_strategy: Optional[str] = "first_exhausted",
**kwargs,
) -> "Dataset":
"""
@@ -4866,24 +4867,62 @@ def _interleave_map_style_datasets(
seed (:obj:`int`, optional, default None): The random seed used to choose a source for each example.
info (:class:`DatasetInfo`, optional): Dataset information, like description, citation, etc.
split (:class:`NamedSplit`, optional): Name of the dataset split.
stopping_strategy (Optional :obj:`str`, defaults to `first_exhausted`):
Two strategies are proposed right now.
By default, `first_exhausted` is an undersampling strategy, i.e. dataset construction stops as soon as one dataset has run out of samples.
If the strategy is `all_exhausted`, we use an oversampling strategy, i.e. dataset construction stops as soon as every sample of every dataset has been added at least once.
Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous:
- with no probabilities, the resulting dataset will have `max_length_datasets * nb_dataset` samples.
- with given probabilities, the resulting dataset will have more samples if some datasets have a very low probability of being visited.
**kwargs (additional keyword arguments): Keyword arguments to be passed to :meth:`datasets.Dataset.select` when selecting the indices used to interleave the datasets.
Output:
:class:`datasets.Dataset`
"""
if stopping_strategy not in ["first_exhausted", "all_exhausted"]:
raise ValueError(
f"{stopping_strategy} stopping strategy in `interleave_datasets` is not implemented yet with a list of {type(datasets[0])}"
)

# To interleave the datasets, we concatenate them and then we re-order the indices
concatenated_datasets = _concatenate_map_style_datasets(datasets, info=info, split=split)

# Let's now build the indices to pass to .select()
lengths = [len(dset) for dset in datasets]
offsets = np.cumsum([0] + lengths[:-1])
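# e.g. for lengths [3, 4, 5], offsets is [0, 3, 7]: the start index of each dataset in the concatenated dataset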
if probabilities is None:

# "first_exhausted" corresponds to an undersampling situation, whereas "all_exhausted" corresponds to an oversampling situation
oversampling = stopping_strategy == "all_exhausted"

if probabilities is None and not oversampling:
# Undersampling situation with cycling through each source
# Example: if the lengths of the datasets are [3, 4, 5]
# then the resulting indices should be [0, 3, 7, 1, 4, 8, 2, 5, 9]
# Note that we only have 3 examples per dataset since the first dataset ran out of examples

# Reasoning behind the following operation: keep the first min_length indices of each dataset,
# offset them so that they correspond to the right indices of the concatenated dataset,
# and flatten to effectively interleave the datasets
indices = (offsets.reshape(1, -1) + np.arange(min(lengths)).reshape(-1, 1)).flatten().tolist()
elif probabilities is None:
# Oversampling situation with cycling through each source
# With the same example lengths [3, 4, 5], the resulting indices should be [0, 3, 7, 1, 4, 8, 2, 5, 9, 0, 6, 10, 1, 3, 11]
# Note that we have 5 examples per dataset with a rolling window since the longest dataset has 5 samples

# Reasoning behind the following operation: for each dataset's indices (i.e. each column), repeat the indices so that there are max_length indices per dataset
# For example, if the max_length is 5 and the i-th dataset has 3 samples, the i-th column will be [0,1,2,0,1]
indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))

# We then shift the indices by their respective dataset offsets and flatten to effectively interleave the datasets
indices = (indices + offsets).flatten().tolist()

else:
# boolean array indicating whether the dataset at index i has been fully exhausted
is_exhausted = np.full(len(lengths), False)

# if undersampling ("first_exhausted"), we stop as soon as one dataset is exhausted
# if oversampling ("all_exhausted"), we stop as soon as every dataset is exhausted, i.e. as soon as every sample of every dataset has been visited at least once
bool_strategy_func = np.all if oversampling else np.any
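# e.g. if is_exhausted == [True, False, True]: np.any returns True (stop under "first_exhausted"), np.all returns False (keep going under "all_exhausted")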

def iter_random_indices():
"""Get an infinite iterator that randomly samples the index of the source to pick examples from."""
@@ -4894,12 +4933,21 @@ def iter_random_indices():
current_index = [0] * len(datasets)
indices = []
for source_idx in iter_random_indices():
# we ran out of examples, let's stop
if current_index[source_idx] >= lengths[source_idx]:
# If no oversampling, we stop as soon as a dataset has run out of examples (np.any)
# Otherwise, we stop as soon as every dataset has run out of examples (np.all)
if bool_strategy_func(is_exhausted):
# the stopping condition was reached, let's stop
break

# let's add the example at the current index of the `source_idx`-th dataset
indices.append(current_index[source_idx] + offsets[source_idx])
current_index[source_idx] += 1

# we have run out of examples for the current dataset: update the boolean array and reset current_index to 0
if current_index[source_idx] >= lengths[source_idx]:
is_exhausted[source_idx] = True
current_index[source_idx] = 0

return concatenated_datasets.select(indices, **kwargs)
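
For readers following the index arithmetic above, here is a minimal standalone NumPy sketch (using the same toy lengths `[3, 4, 5]` as in the comments; it is not part of the library code) that reproduces the two deterministic index layouts:

```py
import numpy as np

lengths = [3, 4, 5]                      # toy dataset lengths, as in the comments above
offsets = np.cumsum([0] + lengths[:-1])  # [0, 3, 7]: start of each dataset in the concatenation

# Undersampling ("first_exhausted"): keep only min(lengths) rounds of cycling
under = (offsets.reshape(1, -1) + np.arange(min(lengths)).reshape(-1, 1)).flatten().tolist()
print(under)  # [0, 3, 7, 1, 4, 8, 2, 5, 9]

# Oversampling ("all_exhausted"): cycle each dataset for max(lengths) rounds, wrapping around
over = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
over = (over + offsets).flatten().tolist()
print(over)  # [0, 3, 7, 1, 4, 8, 2, 5, 9, 0, 6, 10, 1, 3, 11]
```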


48 changes: 43 additions & 5 deletions src/datasets/combine.py
@@ -19,6 +19,7 @@ def interleave_datasets(
seed: Optional[int] = None,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
stopping_strategy: Optional[str] = "first_exhausted",
) -> DatasetType:
"""
Interleave several datasets (sources) into a single dataset.
@@ -29,7 +30,8 @@
If ``probabilities`` is ``None`` (default) the new dataset is constructed by cycling between each source to get the examples.
If ``probabilities`` is not ``None``, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.
The resulting dataset ends when one of the source datasets runs out of examples.
The resulting dataset ends when one of the source datasets runs out of examples, except when ``stopping_strategy`` is set to ``all_exhausted`` and :class:`Dataset` objects are used,
in which case the resulting dataset ends when all datasets have run out of examples at least once.
Args:
datasets (:obj:`List[Dataset]` or :obj:`List[IterableDataset]`): list of datasets to interleave
Expand All @@ -40,7 +42,14 @@ def interleave_datasets(
<Added version="2.4.0"/>
split ([`NamedSplit`], *optional*): Name of the dataset split.
<Added version="2.4.0"/>
stopping_strategy (Optional :obj:`str`, defaults to `first_exhausted`):
Two strategies are proposed right now for :class:`Dataset` objects.
For :class:`IterableDataset` objects, only `first_exhausted` is proposed right now.
By default, `first_exhausted` is an undersampling strategy, i.e. dataset construction stops as soon as one dataset has run out of samples.
If the strategy is `all_exhausted`, we use an oversampling strategy, i.e. dataset construction stops as soon as every sample of every dataset has been added at least once.
Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous:
- with no probabilities, the resulting dataset will have `max_length_datasets * nb_dataset` samples.
- with given probabilities, the resulting dataset will have more samples if some datasets have a very low probability of being visited.
Returns:
:class:`Dataset` or :class:`IterableDataset`: Return type depends on the input `datasets`
parameter. `Dataset` if the input is a list of `Dataset`, `IterableDataset` if the input is a list of
@@ -50,17 +59,38 @@
For regular datasets (map-style):
>>> from datasets import Dataset, interleave_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> dataset["a"]
[10, 0, 11, 1, 2]
>>> dataset = interleave_datasets([d1, d2, d3])
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
>>> dataset = interleave_datasets([d1, d2, d3])
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12]
[10, 0, 11, 1, 2]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]
For datasets in streaming mode (iterable):
>>> from datasets import load_dataset, interleave_datasets
@@ -89,8 +119,16 @@
raise ValueError(
f"Unable to interleave a {type(datasets[0])} with a {type(dataset)}. Expected a list of Dataset objects or a list of IterableDataset objects."
)
if iterable and stopping_strategy != "first_exhausted":
raise NotImplementedError(
f"{stopping_strategy} stopping strategy in `interleave_datasets` is not implemented yet with a list of {type(datasets[0])}."
)
if stopping_strategy not in ["first_exhausted", "all_exhausted"]:
raise ValueError(f"{stopping_strategy} is not supported. Please enter a valid stopping_strategy.")
if map_style:
return _interleave_map_style_datasets(datasets, probabilities, seed, info=info, split=split)
return _interleave_map_style_datasets(
datasets, probabilities, seed, info=info, split=split, stopping_strategy=stopping_strategy
)
else:
return _interleave_iterable_datasets(datasets, probabilities, seed, info=info, split=split)
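
As a small illustrative sketch of the validation above (the dataset contents are placeholders), passing an unknown strategy with map-style datasets fails fast with the `ValueError` shown:

```py
>>> from datasets import Dataset, interleave_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> interleave_datasets([d1, d2], stopping_strategy="undefined_strategy")
Traceback (most recent call last):
    ...
ValueError: undefined_strategy is not supported. Please enter a valid stopping_strategy.
```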

35 changes: 35 additions & 0 deletions tests/test_arrow_dataset.py
@@ -2691,6 +2691,41 @@ def test_interleave_datasets_probabilities():
)


def test_interleave_datasets_oversampling_strategy():
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
d3 = Dataset.from_dict({"a": [22, 21, 20]}).select([2, 1, 0])
dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
expected_length = 3 * max(len(d1), len(d2), len(d3))
expected_values = [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 20] # hardcoded
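# i.e. cycle d1, d2, d3 in turn for max(len) = 4 rounds, wrapping exhausted datasets back to their first samples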
assert isinstance(dataset, Dataset)
assert len(dataset) == expected_length
assert dataset["a"] == expected_values
assert dataset._fingerprint == interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")._fingerprint


def test_interleave_datasets_probabilities_oversampling_strategy():
seed = 42
probabilities = [0.3, 0.5, 0.2]
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
d3 = Dataset.from_dict({"a": [22, 21, 20]}).select([2, 1, 0])
dataset = interleave_datasets(
[d1, d2, d3], stopping_strategy="all_exhausted", probabilities=probabilities, seed=seed
)
expected_length = 16 # hardcoded
expected_values = [10, 11, 20, 12, 0, 21, 13, 10, 1, 11, 12, 22, 13, 20, 10, 2] # hardcoded
assert isinstance(dataset, Dataset)
assert len(dataset) == expected_length
assert dataset["a"] == expected_values
assert (
dataset._fingerprint
== interleave_datasets(
[d1, d2, d3], stopping_strategy="all_exhausted", probabilities=probabilities, seed=seed
)._fingerprint
)


@pytest.mark.parametrize(
"column, expected_dtype",
[(["a", "b", "c", "d"], "string"), ([1, 2, 3, 4], "int64"), ([1.0, 2.0, 3.0, 4.0], "float64")],

1 comment on commit dc5cb17

@github-actions posted automated benchmark comparisons (new vs. old) for PyArrow==6.0.0 and PyArrow==latest across the benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter suites.