map and filter not working properly in multiprocessing with the new release 2.6.0 #5111

Closed
loubnabnl opened this issue Oct 13, 2022 · 14 comments · Fixed by #5115

loubnabnl commented Oct 13, 2022

Describe the bug

When map is used on a dataset with more than one process, filter then behaves strangely: it's as if only the samples from one worker are retrieved. One needs to specify the same num_proc in filter for it to work properly. This doesn't happen with datasets version 2.5.2.

In the code below, the data is filtered differently when we increase the num_proc used in map, although the datasets before and after mapping have identical elements.

Steps to reproduce the bug

import datasets
from datasets import load_dataset

def preprocess(example):
    return example

ds = load_dataset("codeparrot/codeparrot-clean-valid", split="train").select([i for i in range(10)])
ds1 = ds.map(preprocess, num_proc=2)
ds2 = ds.map(preprocess)

# the datasets elements are the same
for i in range(len(ds1)):
    assert ds1[i]==ds2[i]

print(f'Target column before filtering {ds1["autogenerated"]}')
print(f'Target column before filtering {ds2["autogenerated"]}')
print(f"datasets version {datasets.__version__}")

ds_filtered_1 = ds1.filter(lambda x: not x["autogenerated"])
ds_filtered_2 = ds2.filter(lambda x: not x["autogenerated"])

# all elements in the target column are False, so they should all be kept,
# but for ds1 (mapped with num_proc=2) only the first 5 = num_samples/num_proc rows are kept
print(ds_filtered_1)
print(ds_filtered_2)
Output:

Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Target column before filtering [False, False, False, False, False, False, False, False, False, False]

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 5
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})
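
As described above, a workaround is to pass the same num_proc to filter that was used in map. A minimal sketch of that workaround, reusing ds1 from the snippet above:

# hedged workaround sketch: passing the same num_proc that was used in map
# appears to make filter behave correctly on datasets 2.6.0
ds_filtered_workaround = ds1.filter(lambda x: not x["autogenerated"], num_proc=2)
print(ds_filtered_workaround)  # expected: num_rows: 10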

Expected results

Increasing num_proc in map shouldn't alter the result of a subsequent filter. With the previous version, 2.5.2, this doesn't happen.

Actual results

Filtering doesn't work properly when num_proc is increased in map but not also passed to filter.

Environment info

  • datasets version: 2.6.0
  • Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
  • Python version: 3.9.13
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
loubnabnl added the bug label Oct 13, 2022

m-rph commented Oct 13, 2022

The same bug exists with num_proc=1 on Colab, Python 3.7.14 (default, Sep 8 2022, 00:06:44) [GCC 7.5.0].

@albertvillanova (Member)

Thanks for reporting, @loubnabnl, and for the additional information, @partiallytyped.

However, I'm not able to reproduce this issue, either locally or on Colab:

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})

CC @huggingface/datasets: can anybody reproduce this?


m-rph commented Oct 14, 2022

Here is a minimal reproducible example. I ran this on a premium Colab instance.

# !pip install datasets
import datasets
from datasets import load_dataset
ds = load_dataset("copenlu/answerable_tydiqa").filter("english".__eq__, input_columns="language")
assert all(map("english".__eq__, ds["train"]["language"]))

In my case the number of samples is correct; however, the samples selected when indexing are wrong.

DatasetDict({
    validation: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 990
    })
    train: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 7389
    })
})

The number of rows is indeed correct, and I have checked it against a version that works.


loubnabnl commented Oct 14, 2022

I can reproduce the issue on my Mac too:

- `datasets` version: 2.6.0
- Platform: macOS-12.2.1-arm64-arm-64bit
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.4.3

But not on Colab with Python 3.7; maybe it is related to the Python version? (I didn't manage to install Python 3.9.)

- `datasets` version: 2.6.0
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.14
- PyArrow version: 9.0.0
- Pandas version: 1.3.5

@Muennighoff (Contributor)

I have the same issue, here's a simple notebook to reproduce: https://colab.research.google.com/drive/1Lvo9fg5DSpGUUgXW5JAutZ0bFsR-WV--?usp=sharing

@albertvillanova (Member)

I think there are 2 different issues here:

  • the one reported by @loubnabnl is related to multiprocessing in map followed by filter; we should reproduce it first. I have tried with Python 3.9.7 and I can't reproduce it either; maybe it is related to the version of PyArrow? To be checked.
  • the issue reported by @partiallytyped is related just to filter (without multiprocessing), and I can reproduce it.


lhoestq commented Oct 14, 2022

Could you create another issue for the @partiallytyped one, please?

Regarding the OP's issue, I also tried on Colab and locally with Python 3.7 and 3.10, but couldn't reproduce it.

@albertvillanova (Member)

I have created another issue for the one reported by @partiallytyped.


lhoestq commented Oct 14, 2022

I managed to reproduce your issue, @loubnabnl, on Colab by upgrading PyArrow to 9.0.0 instead of 6.0.1.

lhoestq self-assigned this Oct 14, 2022

lhoestq commented Oct 14, 2022

I managed to come up with a super minimal reproducible example:

from datasets import Dataset, concatenate_datasets

# concatenate 10 single-row datasets, so the underlying Arrow table is made of many small chunks
ds = concatenate_datasets([Dataset.from_dict({"a": [i]}) for i in range(10)])
# an identity batched map should leave the data unchanged...
ds2 = ds.map(lambda _: {}, batched=True)
# ...but on datasets 2.6.0 with a recent PyArrow this assertion fails
assert list(ds2) == list(ds)

(filter uses a batched map under the hood)
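
To make that relationship concrete, here is a rough sketch of filter as a batched map plus a select. This is an illustrative approximation, not the actual datasets internals, and filter_like/predicate are hypothetical names:

from datasets import Dataset

def filter_like(ds: Dataset, predicate):
    # a batched map computes a boolean mask over the rows...
    def add_mask(batch):
        n = len(next(iter(batch.values())))  # number of rows in this batch
        return {"keep": [predicate({k: batch[k][i] for k in batch}) for i in range(n)]}
    mask = ds.map(add_mask, batched=True)["keep"]
    # ...and the rows whose mask entry is True are then selected by index
    return ds.select([i for i, keep in enumerate(mask) if keep])

So if the batched map yields misaligned batches, the mask gets attached to the wrong rows and the selected indices are wrong, even when the predicate itself is correct.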


albertvillanova commented Oct 14, 2022

> the one reported by @loubnabnl is related to multiprocessing in map followed by filter; we should reproduce it first. I have tried with Python 3.9.7 and I can't reproduce it either; maybe it is related to the version of PyArrow? To be checked.

So it was indeed related to the PyArrow version! 👍

lhoestq mentioned this issue Oct 14, 2022

lhoestq commented Oct 14, 2022

Doing a patch release asap :)


lhoestq commented Oct 15, 2022

Did the patch release yesterday, let me know if you still have issues.
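
A quick check after upgrading (assuming the patch landed in 2.6.1, the next release after 2.6.0):

import datasets
print(datasets.__version__)  # should report 2.6.1 or later for the fix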

@loubnabnl (Author)

It works now, thanks!
