
Map cache does not detect function changes in another module #6184

Closed
jonathanasdf opened this issue Aug 25, 2023 · 2 comments
Labels
duplicate This issue or pull request already exists

Comments

@jonathanasdf

jonathanasdf commented Aug 25, 2023

# dataset.py
import os
import datasets

if not os.path.exists('/tmp/test.json'):
  with open('/tmp/test.json', 'w') as file:
    file.write('[{"text": "hello"}]')

def transform(example):
  text = example['text']
  # text += ' world'
  return {'text': text}

data = datasets.load_dataset('json', data_files=['/tmp/test.json'], split='train')
data = data.map(transform)
# test.py
import dataset
print(next(iter(dataset.data)))

Initialize the cache:

python3 test.py
# {'text': 'hello'}

Edit dataset.py to uncomment the commented line, then run again:

python3 test.py
# {'text': 'hello'}
# expected: {'text': 'hello world'}

Clear the cache and run again:

rm -rf ~/.cache/huggingface/datasets/*
python3 test.py
# {'text': 'hello world'}

If the two files are instead combined into one, changes to the function are detected correctly. But in any realistic codebase, code will be modularized into separate files.

@mariosasko mariosasko added the duplicate This issue or pull request already exists label Aug 29, 2023
@mariosasko
Collaborator

This issue is a duplicate of #3297. This is a limitation of dill, a package we use for caching (non-__main__ module objects are serialized by reference). You can find more info about it here: uqfoundation/dill#424.

In your case, moving

data = datasets.load_dataset('json', data_files=['/tmp/test.json'], split='train')
data = data.map(transform)

to test.py and setting transform.__module__ = None at the end of dataset.py should fix the issue.
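The by-reference vs. by-value distinction behind this can be illustrated with the standard library alone (this is an illustrative sketch, not datasets' actual fingerprinting code; `by_value`/`by_reference` are hypothetical helper names):

```python
import hashlib

def transform_v1(example):
    text = example['text']
    return {'text': text}

def transform_v2(example):
    text = example['text']
    text += ' world'  # the edited version
    return {'text': text}

def by_value(fn):
    # Fingerprint tied to the compiled function body: edits change it.
    return hashlib.sha256(fn.__code__.co_code).hexdigest()

def by_reference(fn, name='transform'):
    # Fingerprint tied only to module + name, which is how dill treats
    # non-__main__ functions: edits to the body are invisible.
    return f"{fn.__module__}.{name}"

assert by_value(transform_v1) != by_value(transform_v2)
assert by_reference(transform_v1) == by_reference(transform_v2)
```

Because the cached fingerprint of a function imported from another module behaves like `by_reference`, editing its body produces the same fingerprint and the stale cache entry is reused.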

@jonathanasdf
Author

jonathanasdf commented Aug 29, 2023

I understand this may be a limitation of an upstream tool, but for a user of datasets this is very annoying: when you have dozens of different datasets with different preprocessing functions, you can't really move them all into the same file. It may be worth seeing if there is a way to specialize the dependency (e.g. subclass it) and enforce behaviors that make sense for your product.

I was able to work around this for now by setting __module__ = None. If such workarounds are required, it may be better to document them somewhere more visible than a single obscure issue from long ago.
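With many preprocessing modules, the workaround can be applied in bulk. A minimal sketch (the helper name `mark_for_value_pickling` is hypothetical, not a datasets API) that clears `__module__` on every function defined in a module:

```python
import inspect
import sys
import types

def mark_for_value_pickling(module):
    # Hypothetical helper: clear __module__ on every function defined in
    # this module, so dill serializes it by value instead of by reference
    # and edits to the function body change the cache fingerprint.
    for name, obj in vars(module).items():
        if inspect.isfunction(obj) and obj.__module__ == module.__name__:
            obj.__module__ = None

# Self-contained demonstration with a fake module standing in for dataset.py:
fake = types.ModuleType('fake_dataset')
exec("def transform(example):\n    return {'text': example['text']}", fake.__dict__)
mark_for_value_pickling(fake)
assert fake.transform.__module__ is None
```

In practice the call would sit at the bottom of the preprocessing module, e.g. `mark_for_value_pickling(sys.modules[__name__])`, instead of setting `__module__ = None` on each function by hand.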

As this is a duplicate issue I'm closing it.

I have another issue with the cache (#6179); can you take a look?
