
Map cache does not detect function changes in another module #6184

Closed
jonathanasdf opened this issue Aug 25, 2023 · 2 comments
Labels
duplicate This issue or pull request already exists

Comments

@jonathanasdf

jonathanasdf commented Aug 25, 2023

# dataset.py
import os
import datasets

if not os.path.exists('/tmp/test.json'):
  with open('/tmp/test.json', 'w') as file:
    file.write('[{"text": "hello"}]')

def transform(example):
  text = example['text']
  # text += ' world'
  return {'text': text}

data = datasets.load_dataset('json', data_files=['/tmp/test.json'], split='train')
data = data.map(transform)
# test.py
import dataset
print(next(iter(dataset.data)))

Initialize the cache:

python3 test.py
# {'text': 'hello'}

Edit dataset.py to uncomment the commented line, then run again:

python3 test.py
# {'text': 'hello'}
# expected: {'text': 'hello world'}

Clear the cache and run again:

rm -rf ~/.cache/huggingface/datasets/*
python3 test.py
# {'text': 'hello world'}

If the two files are instead combined into one, changes to the function are detected correctly. But in any realistic codebase, code will be modularized into separate files.

@mariosasko mariosasko added the duplicate This issue or pull request already exists label Aug 29, 2023
@mariosasko
Collaborator

This issue is a duplicate of #3297. This is a limitation of dill, a package we use for caching (non-__main__ module objects are serialized by reference). You can find more info about it here: uqfoundation/dill#424.

In your case, moving

data = datasets.load_dataset('json', data_files=['/tmp/test.json'], split='train')
data = data.map(transform)

to test.py and setting transform.__module__ = None at the end of dataset.py should fix the issue.
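The by-reference vs. by-value distinction behind this can be illustrated with the standard library alone (this is an illustrative sketch, not datasets' actual fingerprinting code; `by_value`/`by_reference` are hypothetical helper names):

```python
import hashlib

def transform_v1(example):
    text = example['text']
    return {'text': text}

def transform_v2(example):
    text = example['text']
    text += ' world'  # the edited version
    return {'text': text}

def by_value(fn):
    # Fingerprint tied to the compiled function body: edits change it.
    return hashlib.sha256(fn.__code__.co_code).hexdigest()

def by_reference(fn, name='transform'):
    # Fingerprint tied only to module + name, which is how dill treats
    # non-__main__ functions: edits to the body are invisible.
    return f"{fn.__module__}.{name}"

assert by_value(transform_v1) != by_value(transform_v2)
assert by_reference(transform_v1) == by_reference(transform_v2)
```

Because the cached fingerprint of a function imported from another module behaves like `by_reference`, editing its body produces the same fingerprint and the stale cache entry is reused.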

@jonathanasdf
Author

jonathanasdf commented Aug 29, 2023

I understand this may be a limitation of an upstream tool, but for a user of datasets this is very annoying: when you have dozens of different datasets with different preprocessing functions, you can't really move them all into the same file. It may be worth seeing if there is a way to specialize the dependency (e.g. subclass it) and enforce behaviors that make sense for your product.

I was able to work around this for now by setting __module__ = None. If such workarounds are required, it may be better to document them somewhere more visible than a single obscure issue from long ago.
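With many preprocessing modules, the workaround can be applied in bulk. A minimal sketch (the helper name `mark_for_value_pickling` is hypothetical, not a datasets API) that clears `__module__` on every function defined in a module:

```python
import inspect
import sys
import types

def mark_for_value_pickling(module):
    # Hypothetical helper: clear __module__ on every function defined in
    # this module, so dill serializes it by value instead of by reference
    # and edits to the function body change the cache fingerprint.
    for name, obj in vars(module).items():
        if inspect.isfunction(obj) and obj.__module__ == module.__name__:
            obj.__module__ = None

# Self-contained demonstration with a fake module standing in for dataset.py:
fake = types.ModuleType('fake_dataset')
exec("def transform(example):\n    return {'text': example['text']}", fake.__dict__)
mark_for_value_pickling(fake)
assert fake.transform.__module__ is None
```

In practice the call would sit at the bottom of the preprocessing module, e.g. `mark_for_value_pickling(sys.modules[__name__])`, instead of setting `__module__ = None` on each function by hand.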

As this is a duplicate issue I'm closing it.

I have another issue with the cache (#6179); can you take a look?
