Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.map() cache is wrongfully reused - only happens when the mapping function is imported #3297

Open
eladsegal opened this issue Nov 19, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@eladsegal
Copy link
Contributor

eladsegal commented Nov 19, 2021

Describe the bug

When .map is used with a mapping function that is imported, the cache is reused even if the mapping function has been modified.
The reason for this is that dill that is used for creating the fingerprint pickles imported functions by reference.

I guess it is not a widespread case, but it can still lead to unwanted results unnoticeably.

Steps to reproduce the bug

Create files a.py and b.py:

# a.py
from datasets import load_dataset

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

def mapping_func(examples):
    ID_LENGTH = 4
    examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
    return examples

if __name__ == "__main__":
    main()
# b.py
from datasets import load_dataset
from a import mapping_func

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

if __name__ == "__main__":
    main()

Run python b.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...".
Now change ID_LENGTH to another number in order to change the mapping function, and run python b.py again. You'll see that .map loads from the cache the result of the previous mapping function.

Expected results

Run python a.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...".
Now change ID_LENGTH to another number in order to change the mapping function, and run python a.py again. You'll see that the dataset is being processed and that there's no reuse of the previous mapping function result.

Workaround

Put the mapping function inside a dummy class as a static method:

# a.py
class MappingFuncClass:
    @staticmethod
    def mapping_func(examples):
        ID_LENGTH = 4
        examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
        return examples
# b.py
from datasets import load_dataset
from a import MappingFuncClass

def main():
    squad = load_dataset("squad")
    squad.map(MappingFuncClass.mapping_func, batched=True)

if __name__ == "__main__":
    main()

Environment info

  • datasets version: 1.15.1
  • Platform: Linux-4.4.0-19041-Microsoft-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • PyArrow version: 4.0.1
@eladsegal eladsegal added the bug Something isn't working label Nov 19, 2021
@lhoestq
Copy link
Member

lhoestq commented Dec 6, 2021

Hi ! Thanks for reporting. Indeed this is a current limitation of the usage we have of dill in datasets. I'd suggest you use your workaround for now until we find a way to fix this. Maybe functions that are not coming from a module not installed with pip should be dumped completely, rather than only taking their locations into account

@eladsegal
Copy link
Contributor Author

I agree. Sounds like a solution for it would be pretty dirty, even cloudpickle doesn't help in this case.
In the meanwhile I think that adding a warning and the workaround somewhere in the documentation can be helpful.

@eladsegal
Copy link
Contributor Author

For anyone interested, I see that with dill==0.3.6 the workaround I suggested doesn't work anymore.
I opened an issue about it: uqfoundation/dill#572.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants