Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP refactor func globals to make external delegation possible #430

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ogrisel
Copy link
Contributor

@ogrisel ogrisel commented Jul 1, 2021

Prototype to tackle #411 without calling self.persistent_id on a object that would not otherwise be part of the pickle stream.

@codecov
Copy link

codecov bot commented Jul 1, 2021

Codecov Report

Merging #430 (e5a18e3) into master (0c62ae0) will increase coverage by 0.18%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #430      +/-   ##
==========================================
+ Coverage   91.99%   92.17%   +0.18%     
==========================================
  Files           4        4              
  Lines         687      703      +16     
  Branches      141      140       -1     
==========================================
+ Hits          632      648      +16     
  Misses         34       34              
  Partials       21       21              
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 87.90% <100.00%> (+0.68%) ⬆️
cloudpickle/cloudpickle_fast.py 96.83% <100.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c62ae0...e5a18e3. Read the comment docs.

@staticmethod
def persistent_id(obj):
if (
isinstance(obj, _FuncMetadataGlobals)
Copy link
Contributor Author

@ogrisel ogrisel Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_FuncMetadataGlobals and _FuncCodeGlobals are still private API so the subclassing pickler would still be impacted by changes in those constructs. However it feels maybe less brittle than overriding the save_function or _dynamic_function_reduce methods.

Copy link

@faucct faucct Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is worse than my solution:

  • it is less intuitive as it forces you to use private API
  • it also brings unnecessary wrapping of all dicts, so you will have to maintain these wrappers forever

My solution does not bring any more APIs as it follows official python docs. It also does not complicate serialization for people who don't use persistent_id for func.__globals__.

Copy link
Contributor Author

@ogrisel ogrisel Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't like your solution because it's calling persistent_id with the global dict at a point where it's not supposed to be called: the full func globals dict is never pickled when persistent_id is not overridden, only the derived base_globals and f_globals are. Therefore your solution is a breach/abuse of the persistent_id protocol as defined in the pickle module documentation.

If we don't like the wrapping, we could instead inject a specific key inside the dict with the id of the full func.__globals__. But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).

Copy link

@faucct faucct Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the full func globals dict is never pickled when persistent_id is not overridden, only the derived base_globals and f_globals are.

Are you talking about cloudpickle or pickle? Because pickle pickles full func globals dict and we want to be able to tell cloudpickle that we want the same behaviour here.

Therefore your solution is a breach/abuse of the persistent_id protocol as defined in the pickle module documentation.

I don't see how it is. When you replace cloudpickle.CloudPickler with pickle.Pickler in my example you receive the desired behavior:

import io
import pickle

__globals__ = {"a": "foo"}


class Pickler(pickle.Pickler):
    @staticmethod
    def persistent_id(obj):
        if id(obj) == id(__globals__):
            return "__globals__"


class Unpickler(pickle.Unpickler):
    @staticmethod
    def persistent_load(pid):
        return {"__globals__": __globals__}[pid]


exec("""
def get():
    return a
""", __globals__)
get = __globals__['get']
file = io.BytesIO()
Pickler(file).dump(get)
dumped = file.getvalue()
if b'foo' in dumped:
    raise Exception(repr(dumped))
get = Unpickler(io.BytesIO(dumped)).load()
if id(__globals__) != id(get.__globals__):
    raise Exception(f"id(__globals__) != id(get.__globals__)")
print(f"get() == {get()}")
__globals__['a'] = 'bar'
print(f"get() == {get()}")

But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).

That's right, users don't need that info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you replace cloudpickle.CloudPickler with pickle.Pickler in my example you receive the desired behavior

Your example here works without defining persistent_id and persistent_load, so I am not sure it is relevant to why your changes are "abusing" the persistent_id protocol.

In [9]: import io
   ...: import pickle
   ...:
   ...: __globals__ = {"a": "foo"}
   ...:
   ...:
   ...: class Pickler(pickle.Pickler):
   ...:     pass
   ...:
   ...: class Unpickler(pickle.Unpickler):
   ...:     pass
   ...:
   ...: exec("""
   ...: def get():
   ...:     return a
   ...: """, __globals__)
   ...: get = __globals__['get']
   ...: file = io.BytesIO()
   ...: Pickler(file).dump(get)
   ...: dumped = file.getvalue()
   ...: if b'foo' in dumped:
   ...:     raise Exception(repr(dumped))
   ...: get = Unpickler(io.BytesIO(dumped)).load()
   ...: if id(__globals__) != id(get.__globals__):
   ...:     raise Exception(f"id(__globals__) != id(get.__globals__)")
   ...: print(f"get() == {get()}")
   ...: __globals__['a'] = 'bar'
   ...: print(f"get() == {get()}")

I'll try to explain one more time why your PR does not use persistent_id as it is intended to be used. According to the pickle documentation:

To pickle objects that have an external persistent ID, the pickler must have a custom persistent_id() method that takes an object as an argument and returns either None or the persistent ID for that object. When None is returned, the pickler simply pickles the object as normal. When a persistent ID string is returned, the pickler will pickle that object, along with a marker so that the unpickler will recognize it as a persistent ID.

In other words, persistent_id(obj) only has an impact on which obj is pickled. In your PR, persistent_id(func.__globals__) has an impact on the content of two different objects that are NOT func.__globals__: f_globals and base_globals. Such a case is never advertised by the pickle docs, and thus is an abuse of the persistent_id protocol.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because pickle pickles full func globals dict and we want to be able to tell cloudpickle that we want the same behaviour here

"pickles full globals dict" in which setting?

import pickle


global_var = 1


def f():
    return global_var


pickle.dumps(f)

does not pickle any global variable, or any dictionary containing global variables. It pickles f by reference.

@@ -49,6 +49,7 @@
from cloudpickle.cloudpickle import _make_empty_cell, cell_set
from cloudpickle.cloudpickle import _extract_class_dict, _whichmodule
from cloudpickle.cloudpickle import _lookup_module_and_qualname
from cloudpickle.cloudpickle import _FuncMetadataGlobals, _FuncCodeGlobals
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make those directly importable from the top level cloudpickle package even if they are private API?

@ogrisel ogrisel changed the title WIP refactor func globals to external delagation possible WIP refactor func globals to external delegation possible Jul 2, 2021
@ogrisel ogrisel changed the title WIP refactor func globals to external delegation possible WIP refactor func globals to make external delegation possible Jul 2, 2021
@pierreglaser
Copy link
Member

@ogrisel feel free to ping me when this is ready for review.

@ogrisel
Copy link
Contributor Author

ogrisel commented Jul 6, 2021

@ogrisel feel free to ping me when this is ready for review.

I am not sure if we want merge this PR. If it does not solve anybodies problem, then it's not worth the added complexity. Maybe @faucct is best addressed by just overriding _dynamic_function_reduce as suggested in #411 (comment).

@faucct
Copy link

faucct commented Jul 6, 2021

@ogrisel feel free to ping me when this is ready for review.

I am not sure if we want merge this PR. If it does not solve anybodies problem, then it's not worth the added complexity. Maybe @faucct is best addressed by just overriding _dynamic_function_reduce as suggested in #411 (comment).

I don't like this solution. I will check if overriding _dynamic_function_reduce fits our needs and return.

@faucct
Copy link

faucct commented Jul 6, 2021

Also, what about putting .globals substituion behind a flag?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants