Describe the bug
When .map is used with an imported mapping function, the cache is reused even if the mapping function has been modified.
The reason is that dill, which is used to create the fingerprint, pickles imported functions by reference.
I guess it is not a widespread case, but it can still lead to wrong results that go unnoticed.
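To illustrate what "by reference" means here, the following is a minimal sketch using the standard library's pickle, which also serializes an importable function as just its module and name. It is a stand-in for the dill-based fingerprint, not a reproduction of it; the module name a_demo and the function process are hypothetical.

```python
import pickle
import sys
import types


def load_module(name, source):
    """Build a module from source text and register it, as an import would."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


# Hypothetical stand-in for a.py: the mapping function depends on ID_LENGTH.
SOURCE_V1 = "ID_LENGTH = 4\n\ndef process(text):\n    return text[:ID_LENGTH]\n"
SOURCE_V2 = SOURCE_V1.replace("ID_LENGTH = 4", "ID_LENGTH = 8")

module_v1 = load_module("a_demo", SOURCE_V1)
payload_v1 = pickle.dumps(module_v1.process)

module_v2 = load_module("a_demo", SOURCE_V2)
payload_v2 = pickle.dumps(module_v2.process)

# The two versions behave differently...
behavior_changed = module_v1.process("abcdefghij") != module_v2.process("abcdefghij")
# ...but each payload only names the module and the function.
payloads_identical = payload_v1 == payload_v2
print(behavior_changed, payloads_identical)
```

Anything hashed from such a payload cannot tell the two versions apart, which matches the stale cache hit described in this report.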
Steps to reproduce the bug
Create files a.py and b.py, then run python b.py twice: in the first run you will see tqdm bars showing that the data is being processed, and in the second run you will see "Loading cached processed dataset at...".
Now change ID_LENGTH to another number in order to change the mapping function, and run python b.py again. You'll see that .map loads the result of the previous mapping function from the cache.
Expected results
Run python a.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...".
Now change ID_LENGTH to another number in order to change the mapping function, and run python a.py again. You'll see that the dataset is being processed and that there's no reuse of the previous mapping function result.
Workaround
Put the mapping function inside a dummy class as a static method:
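A minimal sketch of what such a wrapper could look like. The class name MapFunctions, the "id" field, and the placement of ID_LENGTH as a class attribute are all assumptions made for illustration, not the reporter's exact code.

```python
class MapFunctions:
    """Hypothetical dummy holder class; the name is illustrative only."""

    ID_LENGTH = 4

    @staticmethod
    def process(example):
        # Truncate the (assumed) "id" field to ID_LENGTH characters.
        return {"id": example["id"][: MapFunctions.ID_LENGTH]}


# With a datasets.Dataset named `dataset`, the call would look like:
#     dataset.map(MapFunctions.process)
print(MapFunctions.process({"id": "abcdefgh"}))
```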
Hi! Thanks for reporting. Indeed, this is a current limitation of how we use dill in datasets. I'd suggest you use your workaround for now, until we find a way to fix this. Maybe functions coming from a module that is not installed with pip should be dumped completely, rather than only taking their locations into account.
I agree. It sounds like a solution for this would be pretty dirty; even cloudpickle doesn't help in this case.
In the meantime, I think that adding a warning and the workaround somewhere in the documentation would be helpful.
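The idea raised above, dumping a function completely instead of recording only its location, could be sketched roughly as follows. This is not how datasets computes fingerprints; content_fingerprint and make_process are hypothetical helpers, and the sketch assumes the referenced globals are simple values with a stable repr.

```python
import hashlib


def content_fingerprint(fn):
    """Hash what a function does rather than where it lives (hypothetical helper)."""
    code = fn.__code__
    hasher = hashlib.sha256()
    hasher.update(code.co_code)                   # compiled bytecode
    hasher.update(repr(code.co_consts).encode())  # literal constants
    # Names used by the code; keep only those that resolve to module globals.
    for name in code.co_names:
        if name in fn.__globals__:
            hasher.update(f"{name}={fn.__globals__[name]!r}".encode())
    return hasher.hexdigest()


def make_process(id_length):
    """Build the same function against a different module-level ID_LENGTH."""
    namespace = {"ID_LENGTH": id_length}
    exec("def process(text):\n    return text[:ID_LENGTH]\n", namespace)
    return namespace["process"]


same = content_fingerprint(make_process(4)) == content_fingerprint(make_process(4))
changed = content_fingerprint(make_process(4)) != content_fingerprint(make_process(8))
print(same, changed)
```

A real implementation would need more care: globals that are themselves functions or objects have a repr that may include memory addresses, and their own dependencies would have to be followed recursively, which is likely part of why the fix is described as "pretty dirty".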
Environment info
datasets version: 1.15.1