cloudpickle
makes it possible to serialize Python constructs not supported
by the default pickle
module from the Python standard library.
cloudpickle
is especially useful for cluster computing where Python
code is shipped over the network to execute on remote hosts, possibly close
to the data.
Among other things, cloudpickle
supports pickling for lambda functions
along with functions and classes defined interactively in the
__main__
module (for instance in a script, a shell or a Jupyter notebook).
Cloudpickle can only be used to send objects between the exact same version of Python.
Using cloudpickle
for long-term object storage is not supported and
strongly discouraged.
Security notice: one should only load pickle data from trusted sources as
otherwise pickle.load
can lead to arbitrary code execution resulting in a critical
security vulnerability.
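One common way to establish that a payload comes from a trusted source is to authenticate it before unpickling. The sketch below is an illustration only and is not part of cloudpickle: it uses the standard library's hmac module, and SECRET_KEY, sign and verified_loads are hypothetical names introduced here. It assumes both sides already share a secret key.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice it would come from secure configuration.
SECRET_KEY = b"replace-with-a-real-shared-secret"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign(payload: bytes) -> bytes:
    """Prefix the pickled payload with an HMAC-SHA256 digest."""
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def verified_loads(message: bytes):
    """Unpickle only if the digest matches; raise ValueError otherwise."""
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest performs a constant-time comparison
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


message = sign(pickle.dumps({"answer": 42}))
restored = verified_loads(message)
```

A signature only shows that the sender knew the key. It does not make the pickled content itself safe to execute.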
The latest release of cloudpickle is available from PyPI:
pip install cloudpickle
Pickling a lambda expression:
>>> import cloudpickle
>>> squared = lambda x: x ** 2
>>> pickled_lambda = cloudpickle.dumps(squared)
>>> import pickle
>>> new_squared = pickle.loads(pickled_lambda)
>>> new_squared(2)
4
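For contrast, the standard pickle module rejects the same lambda, because it tries to store a reference to a module attribute and a lambda has no importable name. A minimal sketch using only the standard library (the variable error_name is introduced here for illustration):

```python
import pickle

squared = lambda x: x ** 2

try:
    # The standard pickler looks the function up by name in its module,
    # which cannot succeed for an anonymous function.
    pickle.dumps(squared)
    error_name = None
except (pickle.PicklingError, AttributeError) as exc:
    error_name = type(exc).__name__
```

After this runs, error_name holds the name of the exception raised by pickle.dumps.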
Pickling a function interactively defined in a Python shell session
(in the __main__
module):
>>> CONSTANT = 42
>>> def my_function(data: int) -> int:
...     return data + CONSTANT
...
>>> pickled_function = cloudpickle.dumps(my_function)
>>> depickled_function = pickle.loads(pickled_function)
>>> depickled_function
<function __main__.my_function(data:int) -> int>
>>> depickled_function(43)
85
An important difference between cloudpickle and pickle is that cloudpickle
can serialize a function or class by value, whereas pickle can only serialize
it by reference, e.g. by serializing its module attribute path (such as
my_module.my_function).
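Serialization by reference can be observed with the standard library alone. The sketch below pickles json.dumps (chosen here only as a convenient importable function) and checks that the payload carries the module and attribute names, and that unpickling returns the very same function object:

```python
import json
import pickle

# Standard pickle stores an importable function as a reference:
# its module name and qualified name, not its code.
payload = pickle.dumps(json.dumps)

contains_module = b"json" in payload
contains_name = b"dumps" in payload

# Unpickling imports the module and looks the attribute up again,
# so the result is the original function object.
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

This is why a by-reference pickle only works when the same module can be imported in the unpickling environment.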
By default, cloudpickle
only uses serialization by value in cases where
serialization by reference is usually ineffective, for example when the
function/class to be pickled was constructed in an interactive Python session.
Since cloudpickle 1.7.0, it is possible to extend the use of serialization by
value to functions or classes coming from any pure Python module. This feature
is useful when that module is unavailable in the unpickling environment
(making traditional serialization by reference ineffective). To this end,
cloudpickle exposes the register_pickle_by_value/unregister_pickle_by_value
functions:
>>> import cloudpickle
>>> import my_module
>>> cloudpickle.register_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function) # my_function is pickled by value
>>> cloudpickle.unregister_pickle_by_value(my_module)
>>> cloudpickle.dumps(my_module.my_function) # my_function is pickled by reference
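The effect of registration can be sketched end to end with a throwaway module. In the example below, demo_module and its double function are hypothetical names created in a temporary directory. The module is removed again before unpickling to imitate an environment where it is not installed. This assumes a cloudpickle version that provides register_pickle_by_value.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

import cloudpickle

with tempfile.TemporaryDirectory() as tmp:
    # Create a small pure Python module on disk (hypothetical example module).
    Path(tmp, "demo_module.py").write_text("def double(x):\n    return 2 * x\n")
    sys.path.insert(0, tmp)
    try:
        demo_module = importlib.import_module("demo_module")
        cloudpickle.register_pickle_by_value(demo_module)
        by_value = cloudpickle.dumps(demo_module.double)
        cloudpickle.unregister_pickle_by_value(demo_module)
    finally:
        # Make the module unavailable, as it would be on a remote host
        # where it was never installed.
        sys.path.remove(tmp)
        sys.modules.pop("demo_module", None)

# The payload carries the function's code, so no import of demo_module is needed.
restored = pickle.loads(by_value)
result = restored(21)
```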
Note that this feature is still experimental, and may fail in the following situations:
- If the body of a function/class pickled by value contains an import statement:
    >>> def f():
    ...     from another_module import g
    ...     # calling f in the unpickling environment may fail if another_module
    ...     # is unavailable
    ...     return g() + 1
- If a function pickled by reference uses a function pickled by value during its execution.
The test suite can be run in two ways:
- With tox, to run the tests for all the supported versions of Python and PyPy:
    pip install tox
    tox
  or alternatively for a specific environment:
    tox -e py37
- With py.test, to only run the tests for your current version of Python:
    pip install -r dev-requirements.txt
    PYTHONPATH='.:tests' py.test
cloudpickle
was initially developed by picloud.com and shipped as part of
the client SDK.
A copy of cloudpickle.py
was included as part of PySpark, the Python
interface to Apache Spark. Davies Liu, Josh
Rosen, Thom Neale and other Apache Spark developers improved it significantly,
most notably to add support for PyPy and Python 3.
The aim of the cloudpickle
project is to make that work available to a wider
audience outside of the Spark ecosystem and to make it easier to improve it
further, notably with the help of a dedicated non-regression test suite.