map and filter not working properly in multiprocessing with the new release 2.6.0 #5111
Same bug exists with
Thanks for reporting, @loubnabnl and for the additional information, @partiallytyped. However, I'm not able to reproduce this issue, neither locally nor on Colab:
CC @huggingface/datasets: can anybody reproduce this?
This is the minimal reproducible example; I ran it on a premium Colab instance. In my case, the number of samples is correct, but the samples selected when indexing are wrong:

```
DatasetDict({
    validation: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 990
    })
    train: Dataset({
        features: ['question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
        num_rows: 7389
    })
})
```

The number of rows is indeed correct; I have checked it against a version that works.
I can reproduce the issue on my Mac too, but not on Colab with Python 3.7; maybe it's related to the Python version? (I didn't manage to install Python 3.9.)
I have the same issue, here's a simple notebook to reproduce: https://colab.research.google.com/drive/1Lvo9fg5DSpGUUgXW5JAutZ0bFsR-WV--?usp=sharing
I think there are 2 different issues here:
- the OP issue: `filter` returning the wrong samples after a multiprocessed `map`
- the one reported by @partiallytyped: wrong samples selected when indexing
Could you create another issue for the @partiallytyped one, please? Regarding the OP issue, I also tried on Colab and locally on py3.7 and py3.10, but didn't reproduce it.
I have created another issue for the one reported by @partiallytyped:
I managed to reproduce your issue, @loubnabnl, on Colab by upgrading pyarrow to 9.0.0 instead of 6.0.1.
I managed to put together a super minimal reproducible example:

```python
from datasets import Dataset, concatenate_datasets

ds = concatenate_datasets([Dataset.from_dict({"a": [i]}) for i in range(10)])
ds2 = ds.map(lambda _: {}, batched=True)
assert list(ds2) == list(ds)
```

(`filter` uses a batched `map` under the hood.)
So finally it was related to the PyArrow version! 👍
Doing a patch release ASAP :)
Did the patch release yesterday, let me know if you still have issues.
It works now, thanks!
Describe the bug
When `map` is used on a dataset with more than one process, `filter` behaves strangely: it looks as if only the samples from one worker are retrieved, and one needs to specify the same `num_proc` in `filter` for it to work properly. This doesn't happen with `datasets` version 2.5.2. In the code below, the data is filtered differently when we increase the `num_proc` used in `map`, although the datasets before and after mapping have identical elements.
Steps to reproduce the bug
Expected results
Increasing `num_proc` in mapping shouldn't alter filtering. With the previous version 2.5.2 this doesn't happen.
Actual results
Filtering doesn't work properly when we increase `num_proc` in `map` but not when calling `filter`.
Environment info
`datasets` version: 2.6.0