Add ability to register modules to be deeply serialized (#417)
Co-authored-by: Pierre Glaser <pierreglaser@msn.com>
Co-authored-by: Samuel Hinton <sh@arenko.group>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
4 people committed Aug 1, 2021
1 parent 0c62ae0 commit 343da11
Showing 9 changed files with 503 additions and 29 deletions.
6 changes: 5 additions & 1 deletion CHANGES.md
@@ -6,9 +6,13 @@ dev

- Python 3.5 is no longer supported.

- Support for registering modules to be serialised by value. This will
allow for code defined in local modules to be serialised and executed
remotely without those local modules installed on the remote machine.
([PR #417](https://github.com/cloudpipe/cloudpickle/pull/417))

- Fix a side effect altering dynamic modules at pickling time.
([PR #426](https://github.com/cloudpipe/cloudpickle/pull/426))

- Support for pickling type annotations on Python 3.10 as per [PEP 563](
https://www.python.org/dev/peps/pep-0563/)
([PR #400](https://github.com/cloudpipe/cloudpickle/pull/400))
53 changes: 53 additions & 0 deletions README.md
@@ -66,6 +66,59 @@ Pickling a function interactively defined in a Python shell session
85
```


Overriding pickle's serialization mechanism for importable constructs:
----------------------------------------------------------------------

An important difference between `cloudpickle` and `pickle` is that
`cloudpickle` can serialize a function or class **by value**, whereas `pickle`
can only serialize it **by reference**. Serialization by reference treats
functions and classes as attributes of modules, and pickles them through
instructions that trigger the import of their module at load time.
Serialization by reference is thus limited in that it assumes that the module
containing the function or class is available/importable in the unpickling
environment. This assumption breaks when pickling constructs defined in an
interactive session, a case that `cloudpickle` detects automatically and
handles by pickling such constructs **by value**.
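
For example, a minimal sketch (the `add_one` name is only illustrative): a
function defined at the prompt lives in `__main__`, so a plain `pickle` payload
is just a reference that cannot be resolved in another process, whereas the
`cloudpickle` payload carries the function's code itself:

```python
>>> import pickle, cloudpickle
>>> def add_one(x):  # defined interactively, so it belongs to __main__
...     return x + 1
>>> pickle.dumps(add_one)       # stores only a reference to __main__.add_one
>>> cloudpickle.dumps(add_one)  # embeds the function's code in the payload
```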

Another case where the importability assumption is expected to break is when
developing a module in a distributed execution environment: the worker
processes may not have access to that module, for example if they live on a
different machine than the process in which the module is being developed.
By itself, `cloudpickle` cannot detect such "locally importable" modules and
switch to serialization by value; instead, it relies on its default mode,
which is serialization by reference. However, since `cloudpickle 1.7.0`, one
can explicitly specify modules for which serialization by value should be used,
using the `register_pickle_by_value(module)` / `unregister_pickle_by_value(module)` API:

```python
>>> import cloudpickle
>>> import my_module
>>> cloudpickle.register_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function) # my_function is pickled by value
>>> cloudpickle.unregister_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function) # my_function is pickled by reference
```

Using this API, there is no need to re-install the new version of the module on
all the worker nodes nor to restart the workers: restarting the client Python
process with the new source code is enough.
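
As a rough sketch of this workflow (the `my_module` / `my_function` names are
the same placeholders as above, and `cloudpickle` itself must still be
importable on the worker), the payload produced on the client can be loaded
with the standard `pickle` module on a worker where `my_module` is not
installed:

```python
>>> # client process: ship my_function by value
>>> import cloudpickle, my_module
>>> cloudpickle.register_pickle_by_value(my_module)
>>> payload = cloudpickle.dumps(my_module.my_function)

>>> # worker process: my_module does not need to be installed here
>>> import pickle
>>> my_function = pickle.loads(payload)  # rebuilds the function from the payload
>>> my_function()  # runs the freshly shipped code (assuming no arguments)
```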

Note that this feature is still **experimental**, and may fail in the following
situations:

- If the body of a function/class pickled by value contains an `import` statement:
```python
>>> def f():
...     from another_module import g
...     # calling f in the unpickling environment may fail if another_module
...     # is unavailable
...     return g() + 1
```

- If a function pickled by reference uses a function pickled by value during its execution.
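
A sketch of this second situation, with illustrative module names:
`helper_module` is importable on the workers, while `local_module` has been
registered for pickling by value.

```python
>>> import cloudpickle, helper_module, local_module
>>> cloudpickle.register_pickle_by_value(local_module)
>>> # helper_module.run is pickled by reference: only an import instruction is
>>> # stored, so nothing from local_module travels with the payload.
>>> payload = cloudpickle.dumps(helper_module.run)
>>> # On a worker, helper_module is re-imported; if run() calls a function from
>>> # local_module during execution, it uses whatever copy of local_module is
>>> # (or is not) installed there, not the registered-by-value code.
```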


Running the tests
-----------------

116 changes: 106 additions & 10 deletions cloudpickle/cloudpickle.py
@@ -88,6 +88,9 @@ def g():
# communication speed over compatibility:
DEFAULT_PROTOCOL = pickle.HIGHEST_PROTOCOL

# Names of modules whose resources should be treated as dynamic.
_PICKLE_BY_VALUE_MODULES = set()

# Track the provenance of reconstructed dynamic classes to make it possible to
# reconstruct instances from the matching singleton class definition when
# appropriate and preserve the usual "isinstance" semantics of Python objects.
@@ -124,6 +127,77 @@ def _lookup_class_or_track(class_tracker_id, class_def):
return class_def


def register_pickle_by_value(module):
"""Register a module to make it functions and classes picklable by value.
By default, functions and classes that are attributes of an importable
module are to be pickled by reference, that is relying on re-importing
the attribute from the module at load time.
If `register_pickle_by_value(module)` is called, all its functions and
classes are subsequently to be pickled by value, meaning that they can
be loaded in Python processes where the module is not importable.
This is especially useful when developing a module in a distributed
execution environment: restarting the client Python process with the new
source code is enough: there is no need to re-install the new version
of the module on all the worker nodes nor to restart the workers.
Note: this feature is considered experimental. See the cloudpickle
README.md file for more details and limitations.
"""
if not isinstance(module, types.ModuleType):
raise ValueError(
f"Input should be a module object, got {str(module)} instead"
)
# In the future, cloudpickle may need a way to access any module registered
# for pickling by value in order to introspect relative imports inside
# functions pickled by value. (see
# https://github.com/cloudpipe/cloudpickle/pull/417#issuecomment-873684633).
# This access can be ensured by checking that module is present in
# sys.modules at registering time and assuming that it will still be in
# there when accessed during pickling. Another alternative would be to
# store a weakref to the module. Even though cloudpickle does not implement
# this introspection yet, in order to avoid a possible breaking change
# later, we still enforce the presence of module inside sys.modules.
if module.__name__ not in sys.modules:
raise ValueError(
f"{module} was not imported correctly, have you used an "
f"`import` statement to access it?"
)
_PICKLE_BY_VALUE_MODULES.add(module.__name__)


def unregister_pickle_by_value(module):
"""Unregister that the input module should be pickled by value."""
if not isinstance(module, types.ModuleType):
raise ValueError(
f"Input should be a module object, got {str(module)} instead"
)
if module.__name__ not in _PICKLE_BY_VALUE_MODULES:
raise ValueError(f"{module} is not registered for pickle by value")
else:
_PICKLE_BY_VALUE_MODULES.remove(module.__name__)


def list_registry_pickle_by_value():
"""Return a copy of the set of module names registered for pickling by value."""
return _PICKLE_BY_VALUE_MODULES.copy()


def _is_registered_pickle_by_value(module):
module_name = module.__name__
if module_name in _PICKLE_BY_VALUE_MODULES:
return True
while True:
parent_name = module_name.rsplit(".", 1)[0]
if parent_name == module_name:
break
if parent_name in _PICKLE_BY_VALUE_MODULES:
return True
module_name = parent_name
return False
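# Note (hypothetical package names): since the lookup above walks up the dotted
# module name, registering a top-level package also covers its submodules. For
# example, after register_pickle_by_value(mypkg),
# _is_registered_pickle_by_value(mypkg.utils) returns True because the parent
# name "mypkg" is found in _PICKLE_BY_VALUE_MODULES.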


def _whichmodule(obj, name):
"""Find the module an object belongs to.
@@ -170,18 +244,35 @@ def _whichmodule(obj, name):
return None


def _is_importable(obj, name=None):
"""Dispatcher utility to test the importability of various constructs."""
if isinstance(obj, types.FunctionType):
return _lookup_module_and_qualname(obj, name=name) is not None
elif issubclass(type(obj), type):
return _lookup_module_and_qualname(obj, name=name) is not None
def _should_pickle_by_reference(obj, name=None):
"""Test whether an function or a class should be pickled by reference
Pickling by reference means by that the object (typically a function or a
class) is an attribute of a module that is assumed to be importable in the
target Python environment. Loading will therefore rely on importing the
module and then calling `getattr` on it to access the function or class.
Pickling by reference is the only option to pickle functions and classes
in the standard library. In cloudpickle the alternative option is to
pickle by value (for instance for interactively or locally defined
functions and classes or for attributes of modules that have been
explicitly registered to be pickled by value.
"""
if isinstance(obj, types.FunctionType) or issubclass(type(obj), type):
module_and_name = _lookup_module_and_qualname(obj, name=name)
if module_and_name is None:
return False
module, name = module_and_name
return not _is_registered_pickle_by_value(module)

elif isinstance(obj, types.ModuleType):
# We assume that sys.modules is primarily used as a cache mechanism for
# the Python import machinery. Checking if a module has been added to
# sys.modules is therefore a cheap and simple heuristic to tell us
# whether we can assume that a given module could be imported by name
# in another Python process.
if _is_registered_pickle_by_value(obj):
return False
return obj.__name__ in sys.modules
else:
raise TypeError(
@@ -839,10 +930,15 @@ def _decompose_typevar(obj):


def _typevar_reduce(obj):
# TypeVar instances have no __qualname__, hence we pass the name explicitly.
# They also require the module information, which is why we do not use
# _should_pickle_by_reference directly here.
module_and_name = _lookup_module_and_qualname(obj, name=obj.__name__)

if module_and_name is None:
return (_make_typevar, _decompose_typevar(obj))
elif _is_registered_pickle_by_value(module_and_name[0]):
return (_make_typevar, _decompose_typevar(obj))

return (getattr, module_and_name)


12 changes: 6 additions & 6 deletions cloudpickle/cloudpickle_fast.py
@@ -28,7 +28,7 @@
from .compat import pickle, Pickler
from .cloudpickle import (
_extract_code_globals, _BUILTIN_TYPE_NAMES, DEFAULT_PROTOCOL,
_find_imported_submodules, _get_cell_contents, _is_importable,
_find_imported_submodules, _get_cell_contents, _should_pickle_by_reference,
_builtin_type, _get_or_create_tracker_id, _make_skeleton_class,
_make_skeleton_enum, _extract_class_dict, dynamic_subimport, subimport,
_typevar_reduce, _get_bases, _make_cell, _make_empty_cell, CellType,
@@ -352,7 +352,7 @@ def _memoryview_reduce(obj):


def _module_reduce(obj):
if _is_importable(obj):
if _should_pickle_by_reference(obj):
return subimport, (obj.__name__,)
else:
# Some external libraries can populate the "__builtins__" entry of a
@@ -414,7 +414,7 @@ def _class_reduce(obj):
return type, (NotImplemented,)
elif obj in _BUILTIN_TYPE_NAMES:
return _builtin_type, (_BUILTIN_TYPE_NAMES[obj],)
elif not _is_importable(obj):
elif not _should_pickle_by_reference(obj):
return _dynamic_class_reduce(obj)
return NotImplemented

@@ -559,7 +559,7 @@ def _function_reduce(self, obj):
As opposed to cloudpickle.py, there is no special handling for builtin
pypy functions because cloudpickle_fast is CPython-specific.
"""
if _is_importable(obj):
if _should_pickle_by_reference(obj):
return NotImplemented
else:
return self._dynamic_function_reduce(obj)
@@ -763,7 +763,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
)
elif name is not None:
Pickler.save_global(self, obj, name=name)
elif not _is_importable(obj, name=name):
elif not _should_pickle_by_reference(obj, name=name):
self._save_reduce_pickle5(*_dynamic_class_reduce(obj), obj=obj)
else:
Pickler.save_global(self, obj, name=name)
@@ -775,7 +775,7 @@ def save_function(self, obj, name=None):
Determines what kind of function obj is (e.g. lambda, defined at
interactive prompt, etc) and handles the pickling appropriately.
"""
if _is_importable(obj, name=name):
if _should_pickle_by_reference(obj, name=name):
return Pickler.save_global(self, obj, name=name)
elif PYPY and isinstance(obj.__code__, builtin_code_type):
return self.save_pypy_builtin_func(obj)
