
cythonized pydantic objects in __main__ cannot be pickled #408

Closed
marco-neumann-by opened this issue Jan 21, 2021 · 16 comments

marco-neumann-by commented Jan 21, 2021

Abstract

The following code snippet fails with cloudpickle but works with stock pickle if pydantic is cythonized (either via a platform-specific wheel or by having Cython installed when running setup.py):

# bug.py
import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

pickle.loads(pickle.dumps(Bar(a=1))) # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below

When running the file directly as __main__:

$ python bug.py

The error message is:

_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x7fc6808f1040>: attribute lookup lambda12 on pydantic.validators failed

Note that the issue does NOT appear when a non-cythonized pydantic version is used.

Also note that the issue does NOT appear when the file is not __main__, for example:

$ python -c "import bug"

Environment

  • Linux x64
  • Python 3.8.6
  • cloudpickle 1.6.0
  • pydantic 1.7.3 w/ cython enabled

Technical Background

In contrast to pickle, cloudpickle pickles the actual class when it resides in __main__, see the following note in the README:

Among other things, cloudpickle supports pickling for lambda functions
along with functions and classes defined interactively in the
__main__ module (for instance in a script, a shell or a Jupyter notebook).

I THINK that might be the reason why this happens. What's somewhat weird is that the object in question is pydantic.validators.int_validator which CAN actually be pickled:

from pydantic.validators import int_validator
import cloudpickle
import pickle

# both work:
pickle.dumps(int_validator)
cloudpickle.dumps(int_validator)
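The asymmetry is consistent with the by-value behavior quoted above: stock pickle records only a name reference to a class, while cloudpickle serializes the class body for __main__ classes, so every attribute attached to the class (including pydantic's cyfunction validators) must itself be picklable. A minimal sketch with stock pickle alone (the Foo class is illustrative) shows that only a name reference is stored:

```python
import pickle

class Foo:
    pass

payload = pickle.dumps(Foo())

# Stock pickle stores only the qualified name of the class; the class
# body is never serialized, so unpickling in another process requires
# that the class be importable there by that name.
print(b"Foo" in payload)
```

This is why pickling Bar succeeds with stock pickle (the class is looked up by name at load time) but fails under cloudpickle when the class carries cythonized attributes.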

References

This was first reported in #403 here.

@marco-neumann-by marco-neumann-by changed the title cythonized pydantic objects cannot be pickled cythonized pydantic objects in __main__ cannot be pickled Feb 26, 2021

ogrisel commented Mar 23, 2021

Could you please edit the bug report to include the full traceback?


ogrisel commented Mar 23, 2021

Also is this problem happening with the current master branch of cloudpickle?


ogrisel commented Mar 23, 2021

I believe this was fixed by #409 as I cannot reproduce anymore. We still need to release though.

@ogrisel ogrisel closed this as completed Mar 23, 2021
@LukasMasuch

I still get the same error using the cloudpickle version from master in Python 3.8.5:

(screenshot of the same PicklingError traceback omitted)

The fix from #409 only seems to target Python version < 3.7.


kylebarron commented Oct 4, 2021

(Edited to use cloudpickle from master.)

This issue should be reopened.

The likely reason @ogrisel was unable to reproduce this is that pydantic can be installed with or without Cython support. The Cython build of pydantic is unsurprisingly significantly faster than the pure-Python build, and it is also the default install (at least on platforms for which wheels exist).

Here are two examples using virtualenv that should be reproducible, using the same script as @marco-neumann-by defined initially:

# example.py
import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

pickle.loads(pickle.dumps(Bar(a=1))) # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below

Non-cython Pydantic

Note that --no-binary pydantic tells pip to install without any compiled Cython extensions.

virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic --no-binary pydantic

Here you can see that there are no compiled Cython files:

> ls ./.venv/lib/python3.8/site-packages/pydantic/
__init__.py           datetime_parse.py     json.py               tools.py
__pycache__           decorator.py          main.py               types.py
_hypothesis_plugin.py env_settings.py       mypy.py               typing.py
annotated_types.py    error_wrappers.py     networks.py           utils.py
class_validators.py   errors.py             parse.py              validators.py
color.py              fields.py             py.typed              version.py
dataclasses.py        generics.py           schema.py

And the example passes without issue:

> python example.py
> echo $?
0

Cython-based Pydantic

Now we install pydantic without the --no-binary pydantic flag.

deactivate
rm -rf .venv
virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic

Now you can see that there are built C libraries included with Pydantic:

> ls ./.venv/lib/python3.8/site-packages/pydantic/
__init__.cpython-38-darwin.so           json.cpython-38-darwin.so
__init__.py                             json.py
__pycache__                             main.cpython-38-darwin.so
_hypothesis_plugin.cpython-38-darwin.so main.py
_hypothesis_plugin.py                   mypy.cpython-38-darwin.so
annotated_types.cpython-38-darwin.so    mypy.py
annotated_types.py                      networks.cpython-38-darwin.so
class_validators.cpython-38-darwin.so   networks.py
class_validators.py                     parse.cpython-38-darwin.so
color.cpython-38-darwin.so              parse.py
color.py                                py.typed
dataclasses.cpython-38-darwin.so        schema.cpython-38-darwin.so
dataclasses.py                          schema.py
datetime_parse.cpython-38-darwin.so     tools.cpython-38-darwin.so
datetime_parse.py                       tools.py
decorator.cpython-38-darwin.so          types.cpython-38-darwin.so
decorator.py                            types.py
env_settings.cpython-38-darwin.so       typing.cpython-38-darwin.so
env_settings.py                         typing.py
error_wrappers.cpython-38-darwin.so     utils.cpython-38-darwin.so
error_wrappers.py                       utils.py
errors.cpython-38-darwin.so             validators.cpython-38-darwin.so
errors.py                               validators.py
fields.cpython-38-darwin.so             version.cpython-38-darwin.so
fields.py                               version.py
generics.py

And running our example again, we can see that it fails:

> python example.py
Traceback (most recent call last):
  File "example.py", line 9, in <module>
    cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below
  File "/Users/kbarron/tmp/.venv/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/kbarron/tmp/.venv/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x101cf62b0>: attribute lookup lambda12 on pydantic.validators failed

@kylebarron

Also note that the issue does NOT appear when the file is not __main__, for example:

I can confirm this as well; moving the model out of __main__ avoids the error:

# example.py
import cloudpickle
import pickle
from models import Bar

pickle.loads(pickle.dumps(Bar(a=1)))             # works
cloudpickle.loads(cloudpickle.dumps(Bar(a=1)))   # also works here

# models.py
import pydantic

class Bar(pydantic.BaseModel):
    a: int

This works fine, so a quick workaround is to always define Pydantic models in a separate file.

@ericman93

I'm still having this issue in cloudpickle 2.0.0. It only works with non-Cython pydantic and with my pydantic models declared in a separate file.


crclark commented Apr 13, 2022

@ogrisel I am also still seeing this issue in 2.0.0. The workaround in #408 (comment) works for me, but I believe this issue should be reopened.


rjurney commented Aug 19, 2022

I have this issue with pydantic and pyspark.

../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/pandas/map_ops.py:91: in mapInPandas
    udf_column = udf(*[self[col] for col in self.columns])
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:276: in wrapper
    return self(*args)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:249: in __call__
    judf = self._judf
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:215: in _judf
    self._judf_placeholder = self._create_judf(self.func)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:224: in _create_judf
    wrapped_func = _wrap_function(sc, func, self.returnType)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:50: in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/rdd.py:3345: in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/serializers.py:458: in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:73: in dumps
    cp.dump(obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyspark.cloudpickle.cloudpickle_fast.CloudPickler object at 0x7ff5f0410700>
obj = (<function test_graphlet_etl.<locals>.horror_to_movie at 0x7ff5d0e81480>, StructType([StructField('entity_id', StringT...ld('length', LongType(), False), StructField('gross', LongType(), False), StructField('rating', StringType(), False)]))

    def dump(self, obj):
        try:
>           return Pickler.dump(self, obj)
E           _pickle.PicklingError: Can't pickle <cyfunction str_validator at 0x7ff5b0461220>: it's not the same object as pydantic.validators.str_validator

../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:602: PicklingError


brettc commented Nov 4, 2022

I've just been bitten by this. @ogrisel, can we reopen this issue? The workaround is not an option if you are defining your objects inside a jupyter notebook.


simon-mo commented Nov 4, 2022

@brettc as a workaround, you can define custom serializers to pack and unpack pydantic objects. This might help your use case.

https://github.com/ray-project/ray/blob/eed90495cedad0dc2fb6ea6d430df61e4eac24f4/python/ray/util/serialization_addons.py#L10-L35
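The linked Ray helper registers custom reducers so that pydantic models round-trip as plain data instead of through their (unpicklable) class attributes. A generic sketch of the same pack/unpack idea using the standard library's copyreg (the Bar class here is a plain stand-in for a pydantic model; with real pydantic v1 you would pull the field values from obj.dict() instead of hard-coding them):

```python
import copyreg
import pickle

# Stand-in for a pydantic.BaseModel subclass.
class Bar:
    def __init__(self, a: int):
        self.a = a

def _unpack_model(cls, field_values):
    # Rebuild the object from plain data on the receiving side.
    return cls(**field_values)

def _pack_model(obj):
    # Reduce to (callable, args): only the class reference and the field
    # values are pickled, never the cyfunction validators hanging off a
    # cythonized pydantic class.
    return _unpack_model, (type(obj), {"a": obj.a})

# Register the custom reducer for this type.
copyreg.pickle(Bar, _pack_model)

restored = pickle.loads(pickle.dumps(Bar(a=1)))
print(restored.a)
```

Since cloudpickle's Pickler consults the same dispatch table, registering a reducer like this keeps the problematic class attributes out of the payload entirely.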


brettc commented Nov 4, 2022

@simon-mo thanks for the tip -- this looks very promising! The error occurs for me when I'm using dask, so I guess you had the same issues in ray. (BTW, ray is amazing. I chose dask for this job because ray seemed like overkill).

@zero1zero

I'm still struggling to find a workaround for this issue. My code is not directly defining any pydantic types (although it is used by dependent libraries).

Is there a version upgrade/downgrade that might be the cause? It is unclear where the actual issue is occurring. In my case it looks to be in the chain of uvicorn and kserve:

Traceback (most recent call last):
  File "/.asdf/installs/python/3.9.11/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Library/Caches/pypoetry/virtualenvs/truss-FUoNelHr-py3.9/lib/python3.9/site-packages/kserve/model_server.py", line 275, in servers_task
    await asyncio.gather(*servers)
  File "/Library/Caches/pypoetry/virtualenvs/truss-FUoNelHr-py3.9/lib/python3.9/site-packages/kserve/model_server.py", line 269, in serve
    server.start()
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <cyfunction str_validator at 0x16b57c790>: it's not the same object as pydantic.validators.str_validator

@dumitrescustefan

This still happens. I have to define pydantic models in another file, otherwise I get this error. Even in a simple file where I define a pydantic param class and a Ray actor with a single method, this happens. Using the latest ray, pydantic, etc.


lesteve commented Dec 5, 2023

I agree this issue existed, but I believe it is actually fixed in pydantic 2.5 (see issue and PR) if you run your script with plain Python. An issue still remains inside Jupyter/IPython: pydantic/pydantic#8232.

If you get an error similar to the one below, it likely means you are using pydantic<2, and I would say this is not super likely to get fixed in pydantic (see https://docs.pydantic.dev/latest/version-policy/#pydantic-v1):

_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x7f5cb91e01e0>: it's not the same object as pydantic.validators.int_validator

In this case, the simplest workaround seems to be to define your pydantic model in a separate file, as noted in #408 (comment).


rjurney commented Dec 14, 2023

Can someone remind me of what it means if this is fixed? I think it means Spark can serialize numpy arrays?
