Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to force by-value serialization of classes? #424

Open
benesch opened this issue Aug 8, 2021 · 14 comments
Open

Ability to force by-value serialization of classes? #424

benesch opened this issue Aug 8, 2021 · 14 comments

Comments

@benesch
Copy link
Contributor

benesch commented Aug 8, 2021

@mmckerns, would you be up for adding a feature to dill that allows forcing a particular class to be serialized by value rather than by reference? Specifying byref=False only applies to a limited set of cases at the moment.

The specific motivation is pulumi/pulumi#7453, in which it is desired to have the pickled class always serialized by value, regardless of whether it is in __main__ or a different module.

cloudpickle recently added a feature that allows this sort of control: cloudpipe/cloudpickle#417. (But there are other reasons that cloudpickle doesn't work for Pulumi.)

@mmckerns
Copy link
Member

mmckerns commented Aug 8, 2021

@benesch: Does this thread have any relevance for what you are looking for? #47
If so, I believe there's been a solution that was added years ago, but it's been "commented out" due to causing a measurable speed hit. Let me know if you see exactly what you are looking for in the issue/PR thread.

@benesch
Copy link
Contributor Author

benesch commented Aug 8, 2021

Thanks for the pointer. Squinting at that code, I don't think that's quite what we're after. (Though note I am not an expert on pickling, and am still relatively confused about what save_reduce does.) AFAICT #47 will still cause the module containing the type to be imported? Whereas I think what I'm after is the ability to extend this check

elif obj.__module__ == '__main__':

to modules beside __main__.

lukehoban added a commit to pulumi/pulumi that referenced this issue Aug 12, 2021
…`__main__`

The underlying library `dill` that we use for serializing dynamic providers into Pulumi state for Python dynamic providers serializes classes differently depending on whether they are in `__main__` or in another module.  We need the by-value serialization to be applied in all cases.

uqfoundation/dill#424 is tracking adding the ability into `dill` to specify this by-value serialization explicitly, but until then, we will temporarily re-write the `__module__` of a provder class prior to serialization, so that `dill` behaves as we need for the dynamic provider use case.

Fixes #7453.
lukehoban added a commit to pulumi/pulumi that referenced this issue Aug 13, 2021
…`__main__` (#7755)

The underlying library `dill` that we use for serializing dynamic providers into Pulumi state for Python dynamic providers serializes classes differently depending on whether they are in `__main__` or in another module.  We need the by-value serialization to be applied in all cases.

uqfoundation/dill#424 is tracking adding the ability into `dill` to specify this by-value serialization explicitly, but until then, we will temporarily re-write the `__module__` of a provder class prior to serialization, so that `dill` behaves as we need for the dynamic provider use case.

Fixes #7453.
@kostaleonard
Copy link

I'm having similar issues. I've left a couple minimal examples to reproduce the issue in #375 which appears to be a duplicate.

@anivegesana
Copy link
Contributor

anivegesana commented May 31, 2022

@mmckerns In #463, @leogama made a breaking change to _locate_function that dropped the session keyword argument and added a pickler one. Since no issues have been opened concerning this, I think it is safe to say that no downstream library requires _locate_function and we can change it as we please.

It might be worthwhile moving _locate_function into dill._dill._Pickler. That way, there is a customizable way to tell dill whether or not to include a function or class in a pickle.

@mmckerns
Copy link
Member

mmckerns commented May 31, 2022

I remember seeing the breaking change in #463, and meant to complain about it but apparently forgot...

I disagree that it means that no downstream library depends on _location_function. I can only assume that the use of the session keyword was sparse at best. I'll be happy to look at any proposals you have with regard to the above.

@anivegesana
Copy link
Contributor

anivegesana commented May 31, 2022

Leaving the function there but moving the code over would be a way to get this to work while not breaking code that uses _locate_function.

def _locate_function(obj, pickler=None):
    if pickler and is_dill(pickler, obj):
        return pickler._byref_selector(obj)
    return Pickler._byref_selector(pickler, obj)

class Pickler(StockPickler):
    ...
    def _byref_selector(pickler, obj):
        if obj.__module__ in ['__main__', None] or \
                pickler and pickler._session and obj.__module__ == pickler._main.__name__:
            return False

        found = _import_module(obj.__module__ + '.' + obj.__name__, safe=True)
        return found is obj

Although a cleaner interface might be desired, or otherwise this will become a common boilerplate pattern:

class SelectivePickler(dill.Pickler):
    def __init__(self, *args, **kwargs):
        super().__init__(self, *args, **kwargs)
        self._byref_names = collections.defaultdict(list)
    def _byref_selector(self, obj):
        for names in self._byref_names[obj.__module__]:
            if names is None:
                return False
            for name in names:
                if obj.__qualname__ == name or obj.__qualname__.startswith(name + '.'):
                    return False
        return super()._byref_selector(obj)
    def register_byref(self, obj):
        if isinstance(obj, types.ModuleType):
            self._byref_names[obj.__name__] = None
        else:
            self._byref_names[obj.__module__].append(obj.__qualname__)

@leogama
Copy link
Contributor

leogama commented Jun 3, 2022

@anivegesana, in my defense... Just kidding. 😄 It's my first experience on tweaking the internals of a broadly used open source project (and Python's ultra-dynamical nature doesn't help). It never occurred to me that changing such a simple private function could possibly break a downstream library —and I hope it didn't.

Is anything really private in dill? Where does the line get drawn?

@anivegesana
Copy link
Contributor

It isn't fully clear to me either. I just go out of my way to not remove any parameters and make minimal changes to the code, even if there are ways that would be easier to understand, just so that whatever was there originally stays there (hence why I didn't notice that #466 was a bug; I had simply left it like I found it.)

This is why I think we should have a clear standard about which functions are public and which ones are private (preferably following PEP 8) so that it would be easier to remove cruft and improve the library. We should expose any functions that the public can use in the main dill package or the other non-_dill packages without _ prefixes and make it clear that the internals of _dill are unstable unless they can be imported from somewhere else in dill.

@mmckerns
Copy link
Member

mmckerns commented Jun 4, 2022

Well, I think it's more complex than that. When dill serializes an object, it also can require the reducing function (found in _dill) to be serialized, which means it can also then require other elements of the dill module to be serialized. People shouldn't use anything in the internal module dill._dill, they should be using what's exposed in dill proper. Even then, changing internals can break the ability to read stored pickles.

@leogama
Copy link
Contributor

leogama commented Jun 4, 2022

Edited

It's like having three four API levels. Here is how I'm making sense of it (correct if I'm wrong):

  1. Public external API — objects exposed either via imports in the dill main module, or via inclusion in __all__ for the public submodules. When evaluating potential changes, new features or usability may win over compatibility. Bug fixes always win over compatibility (workarounds shall be sought to avoid this);
  2. Public internal API, or "unpickling" API — internal objects used by constructors, must keep backward compatibility with pickles generated by older python+dill combinations. Compatibility always wins over code tidiness, but may lose to critical new features;
  3. Private class API (PEP 8) — class methods and attributes starting with _ (single underscore) and special methods (named like __<method>__). Can change more often than Public external API;
  4. Private API — everything else, including class methods and attributes starting with __ (double underscore) that are not special methods. Could change at any time (in theory, downstream package devs may squeak).

By the way, I indeed remember checking if _locate_function() wasn't used for unpickling before changing it.

@anivegesana
Copy link
Contributor

I would add that under Pickler, Unpickler and all other publicly available classes, all methods and attributes starting with _ are also protected (technically public, but should only be used by subclasses,) but ones starting with __ are private to conform to PEP 8.

@leogama
Copy link
Contributor

leogama commented Jun 5, 2022

@anivegesana and @mmckerns: I was thinking about how to solve another issue yesterday and realized that I probably could have changed the function _locate_function() without braking backward compatibility, by converting this:

dill/dill/_dill.py

Lines 1063 to 1067 in 5bd56a8

def _locate_function(obj, session=False):
if obj.__module__ in ['__main__', None]: # and session:
return False
found = _import_module(obj.__module__ + '.' + obj.__name__, safe=True)
return found is obj

to this:

class Pickler(StockPickler):
    # ...
    def _locate_function(self, obj, session=False):
         if obj.__module__ in ['__main__', None] or \
                self is not None and self._session and obj.__module__ == self._main.__name__:
            return False
        found = _import_module(obj.__module__ + '.' + obj.__name__, safe=True)
        return found is obj

_locate_function = functools.partial(Pickler._locate_function, None)

and then simply prepending self. to the calls:

self._locate_function(obj)

I leave the suggestion to implement new private functions used only for pickling xor unpickling, that at first don't depend on external state, as @staticmethods, but always call them using self. This way, if at any moment they need to be modified to inspect an attribute of the pickler or the unpickler, it's easy to convert them to instance methods without changing the API. Or would it be better to make them instance methods from the beginning, to force subclasses to call them with self —even if they end up being instance methods that completely ignore the instance forever? Anyway...

Edit:

Actually the above suggestion is a bad idea, as it would expose such private functions to subclasses (check the updated API hierarchy in my previous comment, thanks to @anivegesana's comment).

Edit 2:

@anivegesana, just now I read your comment above with a similar solution for not breaking _locate_function() 🤦🏼‍♂️

@leogama
Copy link
Contributor

leogama commented Jun 5, 2022

dill's API privacy hierarchy

  1. Public external API — objects exposed either via imports in the dill main module, or via inclusion in __all__ for the public submodules. When evaluating potential changes, new features or usability may win over compatibility. Bug fixes always win over compatibility (workarounds shall be sought to avoid this);

  2. Public internal API, or "unpickling" API — internal objects used by constructors, must keep backward compatibility with pickles generated by older python+dill combinations. Compatibility always wins over code tidiness, but may lose to critical new features;

  3. Private class API (PEP 8) — class methods and attributes starting with _ (single underscore) and special methods (named like __<method>__). Can change more often than Public external API;

  4. Private API — everything else, including class methods and attributes starting with __ (double underscore) that are not special methods. Could change at any time (in theory, downstream package devs may squeak).

@mmckerns, would you mind to add this, with any modifications, to the DEV_NOTES file? This division could also be used to stratify the changes in release notes.

@miraculixx
Copy link

miraculixx commented Jun 20, 2023

would you be up for adding a feature to dill that allows forcing a particular class to be serialized by value rather than by reference?

I found two ways to achieve this:

  1. "mainify" approach, as documented in [1]
  2. in the module where the class is defined, prefix the following line of code (first line of code)
    __name__ = '__code__' if __name__ != '__main__' else __name__

Both approaches effectively trigger dill to serialize the object by value, as if it were defined in __main__. Note I have chosen "__code__" as a name because it is unlikely to be an existing module. In my tests as long as __name__ is defined so that the module does not actually exist at the time of calling dill.dump(), dill will serialize by value [2].

Word of caution: While I have not found any downsides to resetting __name__ it is probably not a smart idea in general. In particular it may interfere with other parts of your code and could have unexpected side effects. Also I'm not sure this is intentional behavior by dill. What do you think @mmckerns?

Tested versions:

  • dill==0.3.6
  • dill==0.3.5.1

Tested objects:

  • class
  • class instance
  • function
  • lambda
  • staticmethod

Python versions:

  • 3.9.16
  • 3.10.11
  • 3.11.4

[1] https://oegedijk.github.io/blog/pickle/dill/python/2020/11/10/serializing-dill-references.html
[2] the reason dill serializes by value in this case is because dill._dill._locate_function() returns False when the object cannot be imported from its __module__, which is the same return value as for objects defined in __main__.

found = _import_module(obj.__module__ + '.' + obj.__name__, safe=True)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants