map and filter not working properly in multiprocessing with the new release 2.6.0 #5111

Closed
loubnabnl opened this issue Oct 13, 2022 · 14 comments · Fixed by #5115

loubnabnl commented Oct 13, 2022

Describe the bug

When map is used on a dataset with more than one process, filter then behaves strangely: it's as if only the samples from one worker are retrieved. One needs to specify the same num_proc in filter for it to work properly. This doesn't happen with datasets version 2.5.2.

In the code below, the data is filtered differently when we increase the num_proc used in map, although the datasets before and after mapping have identical elements.

Steps to reproduce the bug

import datasets
from datasets import load_dataset

def preprocess(example):
    return example

ds = load_dataset("codeparrot/codeparrot-clean-valid", split="train").select([i for i in range(10)])
ds1 = ds.map(preprocess, num_proc=2)
ds2 = ds.map(preprocess)

# the datasets elements are the same
for i in range(len(ds1)):
    assert ds1[i]==ds2[i]

print(f'Target column before filtering {ds1["autogenerated"]}')
print(f'Target column before filtering {ds2["autogenerated"]}')
print(f"datasets version {datasets.__version__}")

ds_filtered_1 = ds1.filter(lambda x: not x["autogenerated"])
ds_filtered_2 = ds2.filter(lambda x: not x["autogenerated"])

# all elements in the target column are False, so they should all be kept,
# but for ds1 (mapped with num_proc=2) only the first 5 = num_samples/num_proc rows are kept
print(ds_filtered_1)
print(ds_filtered_2)
Output:

Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Target column before filtering [False, False, False, False, False, False, False, False, False, False]

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 5
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})
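
As described above, a workaround is to pass the same num_proc to filter that was used in map. A minimal sketch of that workaround, reusing ds1 from the snippet above:

# hedged workaround sketch: passing the same num_proc that was used in map
# appears to make filter behave correctly on datasets 2.6.0
ds_filtered_workaround = ds1.filter(lambda x: not x["autogenerated"], num_proc=2)
print(ds_filtered_workaround)  # expected: num_rows: 10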

Expected results

Increasing num_proc in map shouldn't alter the result of a subsequent filter. With the previous version, 2.5.2, this doesn't happen.

Actual results

Filtering doesn't work properly when num_proc is increased in map but not also passed to filter.

Environment info

  • datasets version: 2.6.0
  • Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
  • Python version: 3.9.13
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
loubnabnl added the bug label Oct 13, 2022

m-rph commented Oct 13, 2022

The same bug exists with num_proc=1 on Colab, Python 3.7.14 (default, Sep 8 2022, 00:06:44) [GCC 7.5.0].

@albertvillanova (Member)

Thanks for reporting, @loubnabnl, and for the additional information, @partiallytyped.

However, I'm not able to reproduce this issue, either locally or on Colab:

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})

CC @huggingface/datasets: can anybody reproduce this?


m-rph commented Oct 14, 2022

Here is a minimal reproducible example. I ran this on a premium Colab instance.

# !pip install datasets
import datasets
from datasets import load_dataset
ds = load_dataset("copenlu/answerable_tydiqa").filter("english".__eq__, input_columns="language")
assert all(map("english".__eq__, ds["train"]["language"]))

In my case the number of samples is correct; however, the samples selected when indexing are wrong.

DatasetDict({
    validation: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 990
    })
    train: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 7389
    })
})

The number of rows is indeed correct, and I have checked it against a version that works.


loubnabnl commented Oct 14, 2022

I can reproduce the issue on my Mac too:

- `datasets` version: 2.6.0
- Platform: macOS-12.2.1-arm64-arm-64bit
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.4.3

But not on Colab with Python 3.7; maybe it is related to the Python version? (I didn't manage to install Python 3.9.)

- `datasets` version: 2.6.0
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.14
- PyArrow version: 9.0.0
- Pandas version: 1.3.5

@Muennighoff (Contributor)

I have the same issue, here's a simple notebook to reproduce: https://colab.research.google.com/drive/1Lvo9fg5DSpGUUgXW5JAutZ0bFsR-WV--?usp=sharing

@albertvillanova (Member)

I think there are 2 different issues here:

  • the one reported by @loubnabnl is related to multiprocessing in map followed by filter; we should reproduce it first. I have tried with Python 3.9.7 and I can't reproduce it either; maybe it is related to the version of PyArrow? To be checked.
  • the issue reported by @partiallytyped is related just to filter (without multiprocessing), and I can reproduce it.


lhoestq commented Oct 14, 2022

Could you create another issue for the @partiallytyped one, please?

Regarding the OP's issue, I also tried on Colab and locally with Python 3.7 and 3.10, but couldn't reproduce it.

@albertvillanova (Member)

I have created another issue for the one reported by @partiallytyped.


lhoestq commented Oct 14, 2022

I managed to reproduce your issue, @loubnabnl, on Colab by upgrading PyArrow to 9.0.0 instead of 6.0.1.

lhoestq self-assigned this Oct 14, 2022

lhoestq commented Oct 14, 2022

I managed to come up with a super minimal reproducible example:

from datasets import Dataset, concatenate_datasets

# concatenate 10 single-row datasets, so the underlying Arrow table is made of many small chunks
ds = concatenate_datasets([Dataset.from_dict({"a": [i]}) for i in range(10)])
# an identity batched map should leave the data unchanged...
ds2 = ds.map(lambda _: {}, batched=True)
# ...but on datasets 2.6.0 with a recent PyArrow this assertion fails
assert list(ds2) == list(ds)

(filter uses a batched map under the hood)
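
To make that relationship concrete, here is a rough sketch of filter as a batched map plus a select. This is an illustrative approximation, not the actual datasets internals, and filter_like/predicate are hypothetical names:

from datasets import Dataset

def filter_like(ds: Dataset, predicate):
    # a batched map computes a boolean mask over the rows...
    def add_mask(batch):
        n = len(next(iter(batch.values())))  # number of rows in this batch
        return {"keep": [predicate({k: batch[k][i] for k in batch}) for i in range(n)]}
    mask = ds.map(add_mask, batched=True)["keep"]
    # ...and the rows whose mask entry is True are then selected by index
    return ds.select([i for i, keep in enumerate(mask) if keep])

So if the batched map yields misaligned batches, the mask gets attached to the wrong rows and the selected indices are wrong, even when the predicate itself is correct.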


albertvillanova commented Oct 14, 2022

> the one reported by @loubnabnl is related to multiprocessing in map followed by filter; we should reproduce it first. I have tried with Python 3.9.7 and I can't reproduce it either; maybe it is related to the version of PyArrow? To be checked.

So it was indeed related to the PyArrow version! 👍

lhoestq mentioned this issue Oct 14, 2022

lhoestq commented Oct 14, 2022

Doing a patch release asap :)


lhoestq commented Oct 15, 2022

Did the patch release yesterday, let me know if you still have issues.
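
A quick check after upgrading (assuming the patch landed in 2.6.1, the next release after 2.6.0):

import datasets
print(datasets.__version__)  # should report 2.6.1 or later for the fix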

@loubnabnl (Author)

It works now, thanks!
