Add option to treat resources from specific modules as dynamic #391

Closed

Conversation

@kinghuang kinghuang commented Jul 8, 2020

Cloudpickle generally considers an object whose module (i.e., its __module__ attribute) is in sys.modules to be importable at unpickling time, and stores only a reference to the object instead of fully pickling the underlying code/data. This behaviour is inconvenient in specific cases. This PR adds an option for registering specific modules as "dynamic", so that cloudpickle fully pickles objects belonging to those modules instead of storing just a reference.
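For illustration, the default behaviour looks like this; mymodule and helper are hypothetical stand-ins for any importable module and one of its functions.

import pickletools

import cloudpickle
import mymodule  # hypothetical importable module defining `helper`

payload = cloudpickle.dumps(mymodule.helper)
# The disassembly shows only a global reference (module + qualified name),
# no code object: unpickling requires mymodule to be importable.
pickletools.dis(payload)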

Problem to solve

Dask.distributed

In Dask.distributed, code is pickled and sent from clients to workers for processing. For example, you might apply a function to every row in a dataframe like so.

df.apply(lambda x: x.sum(), axis=1)

The lambda function will get pickled and sent to the workers. This is fine for simple, self-contained lambdas. But if the Dask code is organized in a module, and the apply function is passed a function defined in that module, then the module has to be installed on the Dask workers, because cloudpickle will only send a reference to the function.

def sum_everything(row):
    return sum(row)

df.apply(sum_everything, axis=1)

This is highly inconvenient, as sum_everything in this example is just part of the Dask graph definition, not a shared piece of code meant to be distributed and installed on the Dask workers.

Dagster

Dagster pipelines can be defined in Python modules and run on remote clusters. As in the Dask use case, if a pipeline solid references a function in its module instead of being written entirely inline, that function will only get pickled as a reference. Again, this makes it necessary to distribute a module that's not meant to be distributed, just to satisfy the pickling process.

Interactive Development

Pickling code from modules is also convenient during development, where it can be cumbersome or time-consuming to build and install modules repeatedly in remote environments.

Solution

This PR adds a register_dynamic_module(module) function that takes a module object or the name of a module to treat as dynamic. The module's name is stored in a set named _CUSTOM_DYNAMIC_MODULES_BY_NAME.
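A minimal sketch of the registration helpers (the function and set names come from this PR's description and commit messages; the exact bodies here are illustrative, not the literal diff):

import types

# Names of modules whose contents should be pickled by value ("dynamic").
_CUSTOM_DYNAMIC_MODULES_BY_NAME = set()

def register_dynamic_module(module):
    """Register a module object or dotted module name as dynamic."""
    name = module.__name__ if isinstance(module, types.ModuleType) else module
    _CUSTOM_DYNAMIC_MODULES_BY_NAME.add(name)

def unregister_dynamic_module(module):
    """Remove a previously registered module."""
    name = module.__name__ if isinstance(module, types.ModuleType) else module
    _CUSTOM_DYNAMIC_MODULES_BY_NAME.discard(name)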

For the Dask example above, let's say the code was in a module named foo.bar. Before computing the Dask graph, the module can be registered as dynamic in any of the following ways.

import cloudpickle
from foo import bar

cloudpickle.register_dynamic_module(bar)        # Method 1: Use module object
cloudpickle.register_dynamic_module("foo.bar")  # Method 2: Specify module name
cloudpickle.register_dynamic_module("foo")      # Alternative: Pass in parent module or name

In _lookup_module_and_qualname(obj, name=None), a new condition is added after the checks for module_name is None and module_name == "__main__" to see if _is_dynamic_module(module_name) returns True. If so, None is returned, and obj is fully pickled just like in the None and __main__ cases.
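In simplified form, the change looks like this (_is_dynamic_module is sketched after the next paragraph; the remaining lookup logic is elided and unchanged):

def _lookup_module_and_qualname(obj, name=None):
    module_name = getattr(obj, "__module__", None)
    if module_name is None:
        return None  # no module: pickle obj fully, by value
    if module_name == "__main__":
        return None  # __main__ objects are always pickled by value
    if _is_dynamic_module(module_name):
        return None  # new check: registered modules behave like __main__
    # ...the rest of the function resolves and returns the
    # (module, qualname) pair used to pickle obj by reference.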

The _is_dynamic_module(module, submodules=True) function returns True if the module argument is in the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. If submodules is True (the default), the function will also return True if the module is a submodule of a registered dynamic module.

Given the submodule handling, the entire foo package could be registered as dynamic, which would cover foo.bar and any other submodules.
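Sketched out, with the submodule check as a dotted-prefix match (illustrative; the actual prefix handling in the diff may differ):

def _is_dynamic_module(module_name, submodules=True):
    if module_name in _CUSTOM_DYNAMIC_MODULES_BY_NAME:
        return True
    if submodules:
        # "foo.bar.baz" counts as dynamic if "foo" or "foo.bar" is registered.
        parts = module_name.split(".")
        return any(
            ".".join(parts[:i]) in _CUSTOM_DYNAMIC_MODULES_BY_NAME
            for i in range(1, len(parts))
        )
    return False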

Testing

I'm not sure how best to write test case(s) for this. For local spot testing, I added a submodule under cloudpickle named test, and verified things like so.

root@fbba985331f1:/project# python
Python 3.8.3 (default, Jun  9 2020, 17:39:39) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cloudpickle
>>> from cloudpickle import test
>>> cloudpickle.dumps(test.demo)
b'\x80\x05\x95\x1d\x00\x00\x00\x00\x00\x00\x00\x8c\x10cloudpickle.test\x94\x8c\x04demo\x94\x93\x94.'
>>> cloudpickle.register_dynamic_module(test)
>>> cloudpickle.dumps(test.demo)
b'\x80\x05\x95"\x02\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\r_builtin_type\x94\x93\x94\x8c\nLambdaType\x94\x85\x94R\x94(h\x02\x8c\x08CodeType\x94\x85\x94R\x94(K\x00K\x00K\x00K\x00K\x01KCC\x04d\x01S\x00\x94N\x8c\x01a\x94\x86\x94))\x8c%/project/cloudpickle/test/__init__.py\x94\x8c\x04demo\x94K\x01C\x02\x00\x01\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94\x8c\x10cloudpickle.test\x94\x8c\x08__name__\x94h\x13\x8c\x08__path__\x94]\x94\x8c\x19/project/cloudpickle/test\x94a\x8c\x08__file__\x94\x8c%/project/cloudpickle/test/__init__.py\x94uNNNt\x94R\x94\x8c\x1ccloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x1b}\x94}\x94(h\x14h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x13\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0.'

Other

I've been using a variation of this, written as a runtime patch on cloudpickle, for developing Dask code. It's worked well for me!

Closes #206

Create a set that will be used to declare modules whose resources
should be treated as dynamic during pickling. The default behaviour is
to reference module-backed resources, instead of pickling them, which
is inconvenient in certain use cases.
register_dynamic_module takes a module object or a module name and adds
it to the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. Registered modules can
be removed with unregister_dynamic_module.
@codecov

codecov bot commented Jul 8, 2020

Codecov Report

Merging #391 into master will decrease coverage by 1.81%.
The diff coverage is 52.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
- Coverage   91.40%   89.59%   -1.82%     
==========================================
  Files           3        3              
  Lines         640      663      +23     
  Branches      134      139       +5     
==========================================
+ Hits          585      594       +9     
- Misses         34       44      +10     
- Partials       21       25       +4     
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 83.38% <52.17%> (-3.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da2a604...f789f2f. Read the comment docs.

_is_dynamic_module returns True if the specified module is in the set
of modules that should be considered dynamic for pickling purposes. If
submodules is True, then the module's parent modules will be searched
to see if they were registered as dynamic.
Add a condition while looking up modules to see if the module is
registered as dynamic, returning None if so. This means resources
belonging to modules registered as dynamic will be treated as if they
had no module, or belonged to __main__.
@kinghuang kinghuang changed the title WIP: Add option to treat resources from specific modules as dynamic Add option to treat resources from specific modules as dynamic Jul 8, 2020
@ogrisel
Contributor

ogrisel commented Jul 15, 2020

This seems like a valid use case.

For testing, one solution would be to pickle a function defined in a module registered as dynamic, load the pickled function in a subprocess, and check that the module of that function is not in sys.modules (i.e. that the module has not been imported in the subprocess).

Have a look at tests that use from testutils import subprocess_worker for inspiration.
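Such a test might look roughly like this; mypkg (a module exposing a function f) is a hypothetical stand-in, and a plain subprocess stands in for testutils.subprocess_worker:

import subprocess
import sys

import cloudpickle
import mypkg  # hypothetical module under test

cloudpickle.register_dynamic_module(mypkg)
payload = cloudpickle.dumps(mypkg.f)  # pickled by value, not by reference

# Unpickle in a fresh interpreter and check that mypkg was never imported
# there (cloudpickle must be importable in the subprocess, mypkg must not be).
script = (
    "import pickle, sys\n"
    "f = pickle.loads({!r})\n"
    "assert 'mypkg' not in sys.modules\n"
    "f()\n"
).format(payload)
subprocess.run([sys.executable, "-c", script], check=True)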

On top of the missing test, I think it would also be nice to be able to register dynamic modules on a specific CloudPickler instance, for users who would like to write code without module-level side effects. WDYT?
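One hypothetical shape for such a per-instance API (nothing below is an existing cloudpickle interface; it only sketches the suggestion, with the wiring into the module-lookup logic left out):

import io
import pickle

from cloudpickle import CloudPickler  # the class is real; the kwarg below is not

class DynamicModulePickler(CloudPickler):
    """Hypothetical variant that carries the dynamic-module set on the
    pickler instance, so registration has no module-level side effects."""

    def __init__(self, file, protocol=None, dynamic_modules=()):
        super().__init__(file, protocol)
        # The lookup logic would consult this set instead of the global
        # _CUSTOM_DYNAMIC_MODULES_BY_NAME (wiring not shown).
        self.dynamic_modules = set(dynamic_modules)

buf = io.BytesIO()
DynamicModulePickler(buf, pickle.HIGHEST_PROTOCOL,
                     dynamic_modules={"foo.bar"}).dump(lambda x: x)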

@kinghuang
Author

Ah, I totally forgot about this PR. 😬 I'll try to find some time this week to add tests.

@bonelli

bonelli commented Nov 15, 2020

> Ah, I totally forgot about this PR. 😬 I'll try to find some time this week to add tests.

Please do, this PR would be extremely useful to me: https://stackoverflow.com/questions/64848079/sharing-a-dask-cluster-between-projects-with-different-module-versions

@bonelli

bonelli commented Nov 16, 2020

@kinghuang how would this work in Dask Distributed if two different clients connected to the same scheduler/workers used different versions of the same module at the same time (hence different source code but the same names)?

Would there be name-clashing issues, or is each pickled source managed separately during computation for its respective task?

@pierreglaser
Member

The use case and proposed implementation look reasonable; I can do a deeper review once you've added some tests, @kinghuang.

About naming: I've been trying to phase out the term dynamic, which is not the defining characteristic that should trigger a switch between cloudpickle's two serialization logics (by attribute/by value); see #357.

I would be tempted to use the terms deep/shallow serialization to distinguish between the two logics, and thus to use something like cloudpickle.force_deep_serialization(module_list). But I'm open to suggestions.

@jiangtianli91

Hello, I wonder whether there is any update on this pull request? Thanks!

@alexbui91

Hello. Are there plans to merge this branch to master?

@michaeldoron59

Can somebody please help get this PR accepted?
I would be happy to do it myself, but I lack the technical knowledge of what needs to be done.
Thanks

@nsmith-

nsmith- commented Feb 18, 2021

I might give the tests a shot, but I have some questions on the requirements:

  • Suppose foo.bar contains a function like:
def func():
    from . import baz
    return baz.something()

Should cloudpickle.register_dynamic_module("foo.bar") imply that all of foo is pickled so that baz can be accessed or is it up to the user to ensure the dynamic module is registered at the appropriate scope depth?

  • Is there any distinction to be made whether foo is a distributed package (i.e. pkg_resources.get_distribution("foo") returns something) or simply a file or directory in sys.path? Do we want to support pickling distributions? What if they are not pure python? Basically we probably want to preempt the question "Why can't I cloudpickle.register_dynamic_module("tensorflow")" or something similarly outrageous :)

@michaeldoron59

michaeldoron59 commented Feb 18, 2021

> I might give the tests a shot, but I have some questions on the requirements:
>
> * Suppose `foo.bar` contains a function like:
>
>   def func():
>       from . import baz
>       return baz.something()
>
>   Should cloudpickle.register_dynamic_module("foo.bar") imply that all of foo is pickled so that baz can be accessed or is it up to the user to ensure the dynamic module is registered at the appropriate scope depth?
>
> * Is there any distinction to be made whether `foo` is a distributed package (i.e. `pkg_resources.get_distribution("foo")` returns something) or simply a file or directory in `sys.path`? Do we want to support pickling distributions? What if they are not pure python? Basically we probably want to preempt the question "Why can't I `cloudpickle.register_dynamic_module("tensorflow")`" or something similarly outrageous :)
  • In my opinion, register_dynamic_module should pickle everything the pickled object actually needs, such as its code and the relevant globals and locals. It shouldn't go further and pickle every single object in the module, only what's relevant to the pickled object, and it should do so recursively, so that calls made by those objects won't crash either.

  • Regarding your foo.bar.baz example, I think it should crash if baz isn't installed on the server side. In that case, as a developer, I would move the import outside of func's scope, enabling cloudpickle to recognize it in the globals.

Of course, we're not aiming to support bizarre calls like the tensorflow one you mentioned (nor cloudpickle itself); that is way out of the scope of this project and shouldn't be considered at all. All we want to support is a developer working on the cloud in real time, with pure Python, without needing to upload their code for every single change, by allowing code to be sent on demand. Anything more exotic that requires installations and heavy dependencies can still be pickled via the reduce_ex function, and should be implemented by the user.
Again, registering a module doesn't mean we actually want to send the whole module; it just means that when pickling anything from it, we should serialize the code rather than assume it exists on the server side.

@Samreay
Contributor

Samreay commented Mar 9, 2021

Stumbled onto this after trying to pickle a local function that calls another local function that's in a different file. This would be great for our use case of saying "these modules are local code, so pickle them rather than trying to import them on the cloud". Any movement on this?

@mam619

mam619 commented Mar 9, 2021

Currently working on a few projects with mlflow. It's been hard to find a way to upload the extra functions I need to run my models on the cloud. This would make life much easier!

@Samreay
Contributor

Samreay commented Apr 20, 2021

Given there hasn't been movement on this PR for a few months, I've reimplemented the code changes and added tests over in PR #417.

@ogrisel
Contributor

ogrisel commented Jun 22, 2021

Closing in favor of #417.

@ogrisel ogrisel closed this Jun 22, 2021