Skip to content

Latest commit

 

History

History
146 lines (106 loc) · 4.92 KB

README.md

File metadata and controls

146 lines (106 loc) · 4.92 KB

cloudpickle

Automated Tests codecov.io

cloudpickle makes it possible to serialize Python constructs not supported by the default pickle module from the Python standard library.

cloudpickle is especially useful for cluster computing where Python code is shipped over the network to execute on remote hosts, possibly close to the data.

Among other things, cloudpickle supports pickling for lambda functions along with functions and classes defined interactively in the __main__ module (for instance in a script, a shell or a Jupyter notebook).

Cloudpickle can only be used to send objects between the exact same version of Python.

Using cloudpickle for long-term object storage is not supported and strongly discouraged.

Security notice: one should only load pickle data from trusted sources as otherwise pickle.load can lead to arbitrary code execution resulting in a critical security vulnerability.

Installation

The latest release of cloudpickle is available from pypi:

pip install cloudpickle

Examples

Pickling a lambda expression:

>>> import cloudpickle
>>> squared = lambda x: x ** 2
>>> pickled_lambda = cloudpickle.dumps(squared)

>>> import pickle
>>> new_squared = pickle.loads(pickled_lambda)
>>> new_squared(2)
4

Pickling a function interactively defined in a Python shell session (in the __main__ module):

>>> CONSTANT = 42
>>> def my_function(data: int) -> int:
...     return data + CONSTANT
...
>>> pickled_function = cloudpickle.dumps(my_function)
>>> depickled_function = pickle.loads(pickled_function)
>>> depickled_function
<function __main__.my_function(data:int) -> int>
>>> depickled_function(43)
85

Overriding pickle's serialization mechanism for importable constructs:

An important difference between cloudpickle and pickle is that cloudpickle can serialize a function or class by value, whereas pickle can only serialize it by reference, e.g. by serializing its module attribute path (such as my_module.my_function).

By default, cloudpickle only uses serialization by value in cases where serialization by reference is usually ineffective, for example when the function/class to be pickled was constructed in an interactive Python session.

Since cloudpickle 1.7.0, it is possible to extend the use of serialization by value to functions or classes coming from any pure Python module. This feature is useful when the said module is unavailable in the unpickling environment (making traditional serialization by reference ineffective). To this end, cloudpickle exposes the register_pickle_by_value/unregister_pickle_by_value functions:

>>> import cloudpickle
>>> import my_module
>>> cloudpickle.register_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function)  # my_function is pickled by value
>>> cloudpickle.unregister_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function)  # my_function is pickled by reference

Note that this feature is still experimental, and may fail in the following situations:

  • If the body of a function/class pickled by value contains an import statement:

    >>> def f():
    >>> ... from another_module import g
    >>> ... # calling f in the unpickling environment may fail if another_module
    >>> ... # is unavailable
    >>> ... return g() + 1
  • If a function pickled by reference uses a function pickled by value during its execution.

Running the tests

  • With tox, to test run the tests for all the supported versions of Python and PyPy:

    pip install tox
    tox
    

    or alternatively for a specific environment:

    tox -e py37
    
  • With py.test to only run the tests for your current version of Python:

    pip install -r dev-requirements.txt
    PYTHONPATH='.:tests' py.test
    

History

cloudpickle was initially developed by picloud.com and shipped as part of the client SDK.

A copy of cloudpickle.py was included as part of PySpark, the Python interface to Apache Spark. Davies Liu, Josh Rosen, Thom Neale and other Apache Spark developers improved it significantly, most notably to add support for PyPy and Python 3.

The aim of the cloudpickle project is to make that work available to a wider audience outside of the Spark ecosystem and to make it easier to improve it further notably with the help of a dedicated non-regression test suite.