
Question: What is the recommended way to serialize Pipelines with custom transformers? #17390

Closed
Ben-Epstein opened this issue May 29, 2020 · 9 comments


@Ben-Epstein

When building a Pipeline with custom transformers, what is the best way to serialize that for later use?

If you use pickle, you need to define those functions in the new environment, so that doesn't seem like a solution to me. I ran into the same issue with dill and joblib.

What is the best practice here?
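To make the failure concrete, here's a minimal repro of the problem (a lambda inside FunctionTransformer stands in for custom transformer code — pickle stores functions by reference, and a lambda has no importable name, so serialization fails outright):

```python
import pickle

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# The lambda stands in for custom transformer logic. pickle records
# functions by reference (module + qualified name) rather than by
# value, so anything without a stable importable name cannot be
# serialized -- and anything with one must be importable again at
# load time.
pipe = Pipeline([("double", FunctionTransformer(lambda X: X))])

try:
    pickle.dumps(pipe)
except Exception as exc:  # typically pickle.PicklingError
    print("plain pickle fails:", type(exc).__name__)
```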

Thanks!

@avinashpancham

+1. Currently facing the same issue...

@Ben-Epstein
Author

@avinashpancham I figured it out: you can use cloudpickle to serialize, and then unpickle with regular pickle when you want to use it.
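A minimal sketch of the round trip (the lambda stands in for custom transformer logic; cloudpickle serializes it by value, and the resulting bytes are a standard pickle stream):

```python
import pickle

import cloudpickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# The lambda stands in for custom transformer code that plain pickle
# cannot serialize by reference.
pipe = Pipeline([
    ("double", FunctionTransformer(lambda X: [[v * 2 for v in row] for row in X])),
])
pipe.fit([[1, 2]])

# cloudpickle captures the function body itself (by value).
payload = cloudpickle.dumps(pipe)

# The bytes are an ordinary pickle stream, so the standard pickle
# module loads them -- cloudpickle is not needed at load time.
restored = pickle.loads(payload)
print(restored.transform([[1, 2]]))  # [[2, 4]]
```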

@avinashpancham

Thanks, that indeed works :)

@mariyamiteva

I am currently facing an issue with serializing a Pipeline with transformers.

Could you share some code or add a link demonstrating the aforementioned cloudpickle approach?

Thanks!

@Ben-Epstein
Author

@mariyamiteva if you share your code here maybe I can help you out

@cakemountain

@mariyamiteva @avinashpancham @Ben-Epstein did you ever end up figuring this out? I'm also having this issue, even when I use cloudpickle (or dill for that matter) to serialize the Pipeline:

ModuleNotFoundError: No module named '<needfulmodule>'

If at all possible I would really like to not have to duplicate code to get this working; otherwise what's the point of even using serialization/deserialization?

Say I have a pipeline like this one defined in a module called processing.py:

# processing.py

vectorizer = Pipeline([
    ("do_something", do_something(run=True)),
    ("do_something_else", do_something_else(run=True)),
])

And then I serialize it as follows:

from cloudpickle import dump

with open("vectorizer.pkl", "wb") as pkl_file:
    dump(vectorizer, pkl_file)

When I go to deserialize it in another application:

from cloudpickle import load

with open("vectorizer.pkl", "rb") as pkl_file:
    vectorizer = load(pkl_file)

I get a stack trace:

 Traceback (most recent call last):
   File "/usr/src/app/question_answering/process.py", line 12, in _get_processors
     vectorizer = cloudpickle.load(vectorizer_file)
 ModuleNotFoundError: No module named 'processing'

@Ben-Epstein
Author

Ben-Epstein commented Jan 20, 2022

@cakemountain are you asking if the entire external python package can be pickled with your object? That's a new feature of cloudpickle I believe. It certainly makes sense that it would not be the default, otherwise all pickled objects would be pretty enormous. It's also quite hard to recursively figure out every package that an object depends on.

Nonetheless, it's an extremely useful feature when you need it (which can be often!)

I believe what you're looking for is here: cloudpipe/cloudpickle#417

@cakemountain

Thanks @Ben-Epstein, I appreciate the link. I spent a lot of time looking for exactly that and was unable to find it. Agreed, obviously pickling an entire module by default wouldn't make sense, but for the use case I described above it's nice to not have to copy-paste files across repositories (🤮) in order to get a deserialized Pipeline to work.

@crcastillo

crcastillo commented Aug 10, 2022

@cakemountain I've found cloudpickle (>=2.0.0) addresses your use case. You just need the additional function cloudpickle.register_pickle_by_value().

import cloudpickle

import processing
from processing import vectorizer

# Register the whole module so its code is embedded in the pickle
# rather than referenced by name.
cloudpickle.register_pickle_by_value(processing)

with open("vectorizer.pkl", "wb") as pkl_file:
    cloudpickle.dump(vectorizer, pkl_file)
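To see the by-value behavior end to end, here's a self-contained sketch that builds a throwaway in-memory module (fake_processing is a stand-in for your processing.py), registers it, and then loads the pickle after the module is gone:

```python
import pickle
import sys
import types

import cloudpickle

# Throwaway in-memory module standing in for processing.py.
mod = types.ModuleType("fake_processing")

def double(x):
    return x * 2

double.__module__ = "fake_processing"
mod.double = double
sys.modules["fake_processing"] = mod

# Registering the module forces cloudpickle to embed its code in the
# pickle (by value) instead of recording a module reference.
cloudpickle.register_pickle_by_value(mod)
payload = cloudpickle.dumps(mod.double)
cloudpickle.unregister_pickle_by_value(mod)

# Simulate a fresh environment where the module is unavailable.
del sys.modules["fake_processing"]

# Plain pickle can load it -- no 'fake_processing' import needed.
restored = pickle.loads(payload)
print(restored(21))  # 42
```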
