
__globals__ of function in deleted module restored inconsistently #532

Open
dfremont opened this issue Aug 1, 2022 · 11 comments · May be fixed by #534 or #536

Comments

@dfremont

dfremont commented Aug 1, 2022

My application sometimes pickles functions after the module they were imported from has been removed from sys.modules (for reasons we hopefully don't have to get into). Usually dill handles this just fine, but there seems to be a bug when the pickled function is used in a closure which also maintains a reference to the pickled function's __globals__ dictionary. The following code yields an assertion failure with dill 0.3.5.1 (and also the latest version on master; in each case I was using Python 3.9):

import dill
import sys

import other    # another module; see below

def make():
    func = other.blah
    globs = func.__globals__
    def inner():
        assert globs is func.__globals__
    return inner

f = make()

del sys.modules['other']

f()     # assertion passes here
d = dill.dumps(f)
f2 = dill.loads(d)
f2()    # assertion fails here

where other.py contains the code:

def blah():
    return 42

The assertion fails because in the unpickled version of inner, the free variable func is bound to the unpickled version of blah but the free variable globs is bound to a dict which is not the globals dict of func (nor is it the globals dict of other.blah, in fact).

What seems to be happening is that when blah is unpickled, dill calls _create_function with the argument fglobals set to an empty dictionary, so that the line

func = FunctionType(fcode, fglobals or dict(), fname, fdefaults, fclosure)

creates a new dict to be the globals dict of the new function. Then when the "post-processing" updates the cells of inner, it sets the cell for globs to the old dictionary instead of this newly-created one. To confirm this diagnosis, I removed the or dict() from the line above, and changed the "F1" case of save_function to always set globs = globs_copy (i.e. removing the else branch where it sets globs = {'__name__': obj.__module__}), and indeed the problem went away. I'm not sure how to properly fix the issue, though.
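As a minimal illustration of this mechanism (using types.FunctionType directly, not dill's internals): when an empty dict is passed for fglobals, the or dict() fallback builds a brand-new dictionary, so the reconstructed function's __globals__ cannot be the dictionary that the post-processing step later restores into the closure cell.

from types import FunctionType

def blah():
    return 42

fglobals = {}   # dill passes an empty dict here in the scenario above

# the `or dict()` fallback kicks in and creates a fresh dictionary
func = FunctionType(blah.__code__, fglobals or dict(), 'blah')

assert func.__globals__ is not fglobals   # two distinct dict objects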

@dfremont
Author

dfremont commented Aug 1, 2022

Actually, you can trigger this problem without closures: if you pickle several imported functions as above, each will get its own global dict instead of sharing a single one. For example:

import dill
import sys

from other import a, b   # import any two functions

del sys.modules['other']

assert a.__globals__ is b.__globals__   # works

d = dill.dumps((a, b))
a2, b2 = dill.loads(d)

assert a2.__globals__ is b2.__globals__   # fails

@mmckerns
Member

mmckerns commented Aug 1, 2022

@dfremont, thanks for reporting. @leogama: you may want to have a look at this, given your current work on saving module dicts. @anivegesana: you should probably look at this too, given the recursive post-processing code.

@anivegesana
Contributor

anivegesana commented Aug 1, 2022

First section of code you mentioned:

I think it is doing this because deleting the module makes the functions and their global dictionaries inaccessible, so dill pickles a copy of the dictionary. Because the global dictionary of other can no longer be reached via sys.modules['other'].__dict__, dill cannot pickle a reference to that dictionary and must instead pickle it in its entirety, by value.

Unfortunately, I am unaware of a fix to this. I am open to suggestions, but I don't think there is a way to pickle the reference to a global dictionary of a module that has been deleted.
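As a sketch of that behavior, based on the original report's other module: once the module is removed from sys.modules, the unpickled function comes back with a copied globals dictionary rather than a reference to the original one.

import dill
import sys

import other            # the module from the original report

func = other.blah
del sys.modules['other']

func2 = dill.loads(dill.dumps(func))

# the globals dict was pickled by value, so the restored dict is a distinct object
assert func2.__globals__ is not func.__globals__
assert func2() == 42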

Second section where both functions are pickled together in the same file:

This is a duplicate of #466. I wasn't working on it because it would have been a pain to squeeze it in around the more pressing software engineering changes that @leogama and @mmckerns were working on (removing dead Python 2 code), so I just left it open. Until now it was just a theoretical issue that could cause problems. Good to know that someone will actually use that feature when it is done.

@mmckerns
Member

mmckerns commented Aug 1, 2022

It's not a good idea to delete from sys.modules. However, dill has code that handles changes to the set of imported modules... but it caused a big speed hit. See the code here:

# memorise all already imported modules. This implies that this must be

@dfremont
Author

dfremont commented Aug 1, 2022

Thanks to both of you for your fast response (and all your work on dill)!

Unfortunately, I am unaware of a fix to this. I am open to suggestions, but I don't think there is a way to pickle the reference to a global dictionary of a module that has been deleted.

Is there a reason why the namespace of a module can't be treated like any ordinary dict in this case? Currently, dill is correctly pickling that dictionary by value: the problem is just that two references to that dictionary end up pointing to separate objects after unpickling. If the code can be modified somehow so that both of those references point to the same object, it would fix the problem I'm having. Of course the object they point to would not be the same one obtained by re-importing the missing module, but that's to be expected given the semantics of module deletion in Python (the old module object is kept alive as long as there are references to it, and importing the module again creates a new module object).

It was a duplicate of #466.

Ah, yes, sorry for missing that! That case does indeed matter for me, since I expect the unpickled functions to share the same global namespace dict (as they did before pickling) so that any mutations they make affect each other.

It's not a good idea to delete from sys.modules.

Agreed! But it is allowed, and unfortunately difficult to avoid in my application, so if there's a relatively easy way to fix this on the dill end that would be great.

However, dill has code that handles changes to the set of imported modules... but it caused a big speed hit.

I'm not understanding why it would be necessary to do anything complicated here. dill is already correctly recognizing that the dict in question can't be pickled as a global reference, and so deciding to pickle it by value. I'd be perfectly happy with the current behavior as long as all references to the old dictionary unpickled as pointing to the same pickled-by-value dict. But maybe I'm missing something?
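For comparison, an ordinary dict already behaves the way I'm hoping for: pickle's memo ensures that two references to the same dictionary come back as one object within a single pickle. A minimal sketch (not tied to the module case):

import dill

d = {'x': 1}
pair = (d, d)                     # two references to one dict

d1, d2 = dill.loads(dill.dumps(pair))
assert d1 is d2                   # shared identity is preserved within a single pickle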

@anivegesana
Contributor

anivegesana commented Aug 4, 2022

The reason that the globals dictionary cannot be treated like a normal dictionary is a bit nuanced. The __globals__ attribute of functions is read-only, which means the globals dictionary needs to exist before the function can be created. This is fine in a .py file, because the module is created before its functions, but pickle tries to pickle the function first and the globals afterwards, since the globals dict is an attribute of the function. We had to add special logic to copy the dictionary first and then pickle the function, but that results in a copy, not the original dictionary.
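A quick sketch of that constraint: the attribute cannot be assigned after the fact, so the mapping has to be supplied when the function object is constructed.

from types import FunctionType

def f():
    return 42

try:
    f.__globals__ = {}            # read-only attribute
except AttributeError as exc:
    print(exc)

# the globals mapping therefore has to be passed to the constructor
g = {'__name__': 'example'}
f2 = FunctionType(f.__code__, g, 'f2')
assert f2.__globals__ is g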

A future fix would keep track of all copies of global dictionaries created so that no global dictionary is copied more than once. This is a bit more complicated and would require some minor changes (not the significant rewrite of the function pickling code I originally expected). It seems like I made the major changes months ago, but just forgot about this issue because I didn't think it was too important.

I will open a PR tomorrow night.

@anivegesana anivegesana linked a pull request Aug 4, 2022 that will close this issue
@leogama
Contributor

leogama commented Aug 7, 2022

First section of code you mentioned:

I think it is doing this because deleting the module makes the functions and their global dictionaries inaccessible, so dill pickles a copy of the dictionary. Because the global dictionary of other can no longer be reached via sys.modules['other'].__dict__, dill cannot pickle a reference to that dictionary and must instead pickle it in its entirety, by value.

Unfortunately, I am unaware of a fix to this. I am open to suggestions, but I don't think there is a way to pickle the reference to a global dictionary of a module that has been deleted.

@anivegesana, actually, under the hood, what happens is a bit more complex. Consider a function func whose module of origin was removed from sys.modules. In save_function(func), _locate_function(func) is called, which in turn reloads func.__module__ and tries to find it. I've found two possible outcomes so far:

If the module is built-in, the reloaded module is different from the one that was removed from sys.modules, but at least the functions (or all objects?) in their namespaces are the same:

>>> import sys
>>> import math as math1
>>> del sys.modules['math']
>>> import math as math2
>>> math1 == math2
False
>>> math1.__dict__ is math2.__dict__
False
>>> math1.__dict__ == math2.__dict__
True
>>> math1.sin is math2.sin
True

If the module is not built-in, however, the module's code is re-executed on re-import, so the function objects don't match:

>>> import sys
>>> import statistics as stats1
>>> del sys.modules['statistics']
>>> import statistics as stats2
>>> stats1 == stats2
False
>>> stats1.__dict__ == stats2.__dict__
False
>>> stats1.mean == stats2.mean
False
>>> stats1.mean.__code__ is stats2.mean.__code__
False
>>> # But the function's code *is* identical:
>>> stats1.mean.__code__ == stats2.mean.__code__
True

I'd argue that, if the function (or class, etc.) would normally be saved by reference, i.e. "as global", dill should be able to identify this edge case and still save it the same way.

@anivegesana
Contributor

I think this issue is about whether all functions that shared the same globals dictionary before pickling still share the same globals dictionary as each other after unpickling, not whether they end up with the exact dictionary that was used during pickling. The latter seems impossible, because there is no way to access the old globals dictionary if the pickle is loaded in a different interpreter.

I fail to see why standard library modules would cause issues. If you have a case where this condition fails, I'd be happy to fix it.

>>> import dill, sys, math
>>> mdict = math.__dict__
>>> md1 = dill.copy(mdict)
>>> del sys.modules['math']
>>> import math
>>> mdict2 = math.__dict__
>>> md2 = dill.copy(mdict2)
>>> md1 is mdict
True
>>> md2 is mdict2
True
>>> md2 is md1
False

@leogama
Contributor

leogama commented Aug 8, 2022

It's not a good idea to delete from sys.modules. However, dill has code that handles changes to the set of imported modules... but it caused a big speed hit.

Disclaimer: none of the following comments deal with the shared namespace problem. Anyway...

I've just opened a PR that would likely solve the problem for OP's specific examples, but there's yet another approach to unloaded modules: patching __import__ to add an import hook. A dictionary would cache the last version of every module that was imported normally. I find it ugly, and others call it "monkey patching", but it has some history and may be effective.

This would be inside dill._dill for example:

import sys
from functools import wraps
import builtins as __builtin__   # imports needed if this is run outside dill._dill

_modules_cache = sys.modules.copy()  # initialization
__import__ = __import__  # copy the builtin into this module's __dict__

@wraps(__import__)
def _import_hook(*args, **kwds):
    module = __import__(*args, **kwds)  # calls the original builtin
    _modules_cache.update(sys.modules)
    return module

__builtin__.__import__ = _import_hook  # original __import__ is _import_hook.__wrapped__

It doesn't break anything in principle, it's like a decorator. All tests passed on both CPython and PyPy with this snippet included. Idea stolen from: https://stackoverflow.com/questions/14778407/do-something-every-time-a-module-is-imported/14778568#14778568
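A possible (purely hypothetical, names invented here) way the cache could then be consulted when a module is missing from sys.modules:

def _lookup_module(name):
    # fall back to the cached module object if it has been deleted
    module = sys.modules.get(name)
    if module is None:
        module = _modules_cache.get(name)
    return module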

@mmckerns
Member

mmckerns commented Aug 9, 2022

@leogama: I'm not sure if you looked at the code I was referencing (in __diff), but the idea there is similar.

@leogama
Contributor

leogama commented Aug 9, 2022

@leogama: I'm not sure if you looked at the code I was referencing (in __diff), but the idea there is similar.

I saw the memorise code but didn't notice the __import__ patch at the bottom... 😅 You are right, that part is nearly identical.
