Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with numpy.load and allow_pickle #85

Open
swkeemink opened this issue Mar 6, 2020 · 9 comments
Open

Problems with numpy.load and allow_pickle #85

swkeemink opened this issue Mar 6, 2020 · 9 comments
Labels

Comments

@swkeemink
Copy link
Member

Expected behavior
Whenever an experiment is defined, it should check if a previous analysis was run and if there is data to be loaded. It should then show the message 'Reloading previously prepared data...'.

Actual behaviour
Even if previous analysis data exists, defining a new experiment now always redoes all of the analysis.

Reason
Currently if numpy.load is called, if allow_pickle is not set to True it will throw an error. In our current code this means that whenever an experiment is defined, all of the analysis will be re-done instead of loaded from the previously generated files. The easy fix is to just add 'allow_pickle=True' as an option.

This seems to be a new feature, and it's there for security reasons:

Allow loading pickled object arrays stored in npy files. Reasons for disallowing pickles include security, as loading pickled data can execute arbitrary code. If pickles are disallowed, loading object arrays will fail. Default: False

Changed in version 1.16.3: Made default False in response to CVE-2019-6446.

So we perhaps should think about fixing it in another way, without somehow breaking backwards compatibility with previously stored files!

@swkeemink
Copy link
Member Author

To avoid this type of bug we should probably avoid using the try and except structure, as well as write a test for them. This also applies to #82 in general.

@scottclowe
Copy link
Member

I find it very peculiar that they changed the default for np.load to be allow_pickle=False, but left the default for np.save as allow_pickle=True. This will obviously cause a lot of problems for people using the default options!

@swkeemink
Copy link
Member Author

Indeed. And it is very problematic for backwards compatibility... (It does sound like it's a good idea not to allow_pickle, if it would be able to run arbitrary code!)

@scottclowe
Copy link
Member

I think the advantage of saving the files with pickle enabled is it compresses them, which is still desirable.

@scottclowe
Copy link
Member

So I'm not sure what we should do long term. I think it would be good to add an option so the user can pick, and for the moment the default behaviour should be backwards compatible. Maybe we should change the default behaviour in the future though.

@swkeemink
Copy link
Member Author

Long term we should use HDF5 by default, with the option for numpy to preserve backward compatibility (this depends on finishing #82).

@scottclowe
Copy link
Member

scottclowe commented May 4, 2020

The bug was fixed by #111, restoring intended behaviour, but we may still need to implement a long term solution.

@swkeemink
Copy link
Member Author

@scottclowe , we can close this now with your changes to the caching backend right?

@scottclowe
Copy link
Member

I have not worked on anything which fixes the potential pickle vulnerability.

If we are happy to allow this vulnerability to persist on the basis that users are only unpickling their own cache so they are not exposing themselves to any external threats, you can close this on the basis that it was resolved by #111.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants