
FunctionTask hashes depending on imports #457

Closed
JeffWigger opened this issue Apr 16, 2021 · 16 comments · Fixed by #517
Labels: bug, hackathon

Comments

@JeffWigger

What version of Pydra are you using?

Pydra version: 0.14.1
Python version: 3.8.5

This is a follow-up to #455.
I found another way to create inconsistent hash values for a FunctionTask; I came across it after applying the workaround for the first bug in my project. A decorated function that uses several functions imported at the top of the module also produces inconsistent hash values.

I could replicate the behaviour with the following minimal example.

from os.path import basename, splitext
import pydra

cache_dir_tmp = "/home/rerun_test/cache"

@pydra.mark.task
def gunzip_alt(in_file):
    print(in_file)
    out_file = basename(splitext(in_file)[0])
    return "hoi"

    
    
task4 = gunzip_alt(name = "task4", in_file = "/home/rerun_test/t.txt.gz", cache_dir=cache_dir_tmp)
print(task4.inputs.hash)
print(task4.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(task4)

print(task4.result(return_inputs=True))
print(pydra.__version__)

When rerunning the above program, I observed the hash randomly alternating between two values.

The issue does not occur if the two imports happen inside the gunzip_alt function:

import pydra

cache_dir_tmp = "/home/rerun_test/cache"

@pydra.mark.task
def gunzip_alt(in_file):
    from os.path import basename, splitext
    print(in_file)
    out_file = basename(splitext(in_file)[0])
    return "hoi"

    
    
task4 = gunzip_alt(name = "task4", in_file = "/home/rerun_test/t.txt.gz", cache_dir=cache_dir_tmp)
print(task4.inputs.hash)
print(task4.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(task4)

print(task4.result(return_inputs=True))
print(pydra.__version__)

I also did not observe the issue if only one imported function was used. If three imported functions were used inside gunzip_alt, I observed the hash alternating among six values.

When comparing the printed inputs of task4, I only noticed changes at the end of the function's byte representation (the _func argument), after the __globals__ keyword. I suspect that Python randomly chooses the order in which these imported functions are referenced inside __globals__, hence the counts 1, 2, and 6: the factorials of the number of imported functions used.

I am not sure whether this is a bug, or whether users should be required to import all functions they use inside the decorated function.
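
For what it's worth, the non-determinism can also be observed outside of Pydra by hashing the cloudpickled function directly. This is only a rough sketch under the assumption that Pydra hashes the cloudpickle byte string (sha256 here is just a stand-in for whatever hash Pydra actually uses); running this script several times can yield different digests:

# Rough sketch, not Pydra code: hash the cloudpickled function directly.
# With an affected cloudpickle version, the digest can change between runs
# of this script because the order of the captured globals is not stable.
import hashlib

import cloudpickle
from os.path import basename, splitext


def gunzip_alt(in_file):
    # Uses two module-level imports, mirroring the example above.
    return basename(splitext(in_file)[0])


print(hashlib.sha256(cloudpickle.dumps(gunzip_alt)).hexdigest())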

@JeffWigger added the bug label on Apr 16, 2021
@satra
Contributor

satra commented Apr 16, 2021

@JW-96 - this is both a bug and not a bug, but the full solution is complicated and would need some work.

the way function hashing works right now is to cloudpickle the function and hash the pickled state. so the hash is going to be dependent on your exact environment and the pickling process and will not survive if packages are updated or the function is switched. the nice thing though is that the pickled function will execute in a similar environment.

we were contemplating a more source based operation similar to what nipype uses. however, this would require significant function introspection and we were considering the dill package to do so. for this part we would hash the source code of the function up to some prescribed depth and recreate the function on a rerun. this would involve detecting functions and dependencies and somehow checking which to reconstruct and which can be relied on from libraries.
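
As a rough illustration of that source-based direction (a sketch only, under the assumption that hashing the text returned by inspect.getsource is an acceptable stand-in; this is not Pydra's or nipype's actual implementation):

# Sketch of a source-based hash: hash the function's source text instead of
# its pickled state, so the value does not depend on the pickling machinery
# or on how __globals__ happens to be ordered.
import hashlib
import inspect


def hash_function_source(func):
    # Covers only the function's own source; recursing into its dependencies
    # "up to some prescribed depth" would need extra introspection, e.g. via
    # the dill package mentioned above.
    src = inspect.getsource(func)
    return hashlib.sha256(src.encode("utf-8")).hexdigest()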

@djarecka
Collaborator

@satra - I wonder if we should be returning an error if imports are done outside the function

@JeffWigger
Author

JeffWigger commented Apr 21, 2021

The hashes are inconsistent not only when imported functions are used, but also when other global variables are referenced inside the function.
The example below randomly alternates between two possible hashes.

import pydra

cache_dir_tmp = "/home/rerun_test/cache"
global_var1 = "Hoi"
global_var2 = "Salut"

@pydra.mark.task
def hello1(greeting):
    res = global_var1 + ", " + global_var2 + " and " + greeting
    print(res)
    return res


hello_task = hello1(name="hello_task", greeting="Gruezi", cache_dir=cache_dir_tmp)
print(hello_task.inputs.hash)
print(hello_task.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(hello_task)

print(hello_task.result(return_inputs=True))
print(pydra.__version__)

I think this is more of a problem with Python itself; it should enforce a consistent order in __globals__.
@djarecka I agree that returning an error would be a sensible solution.

@satra
Contributor

satra commented Apr 21, 2021

We cannot raise errors since a key goal is to let users reuse functions defined in any library. We are not going to change the python ecosystem to put imports into functions. However, we can do the technical task of introspecting functions to do the kind of state preservation we need. For example,

  • filter globals to only keep the ones relevant to the function (see the sketch after this list)
  • introspect the function to see if there are other functions we need to reconstruct from the global namespace; we can ignore imported functions
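
A minimal sketch of the first bullet, filtering __globals__ down to the names the function's code object actually references (co_names over-approximates a little, since it also contains attribute names, and it does not descend into nested code objects):

# Sketch only: keep just the global names that the function's bytecode
# references. A complete solution would also walk nested code objects
# (closures, comprehensions) and decide what to do with imported modules.
def relevant_globals(func):
    referenced = set(func.__code__.co_names)
    return {
        name: value
        for name, value in func.__globals__.items()
        if name in referenced
    }

For the hello1 example earlier in the thread (applied to the plain, undecorated function), this would keep global_var1 and global_var2 while dropping unrelated module globals such as cache_dir_tmp.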

@effigies
Contributor

I might be missing something, but would submitting a patch to sort __globals__ in cloudpickle resolve most of the issues here?

I'd think we could patch here:

     f_globals_ref = _extract_code_globals(func.__code__)
-    f_globals = {k: func.__globals__[k] for k in f_globals_ref if k in
+    f_globals = {k: func.__globals__[k] for k in sorted(f_globals_ref) if k in
                  func.__globals__}

@satra
Contributor

satra commented Apr 21, 2021

Yes, it would solve the current issue in this thread, and we should do that.

But this will not address the general problem of hashing functions across environments. Also, at this point cloudpickle's README explicitly states that the pickled output should not be considered consistent. So having some other options in the pipeline would be good.

@effigies
Contributor

I see I didn't read carefully enough. You're suggesting replacing cloudpickle with our own hand-rolled version?

@satra
Contributor

satra commented Apr 21, 2021

Eventually; for now, your solution would cover this use case. I wouldn't have a problem doing this in cloudpickle itself, but they may consider it out of scope. We should set up a chat with Olivier.

@effigies
Contributor

Took a while to figure out how to write a test, but: cloudpipe/cloudpickle#418.

@satra
Contributor

satra commented Apr 21, 2021

thanks @effigies - let's see what they say. alternatively i don't think __globals__ is readonly (since it is base python), and we could sort it ourselves before sending it to cloudpickle.

let me take that back - this won't take care of globals from different modules i think.

@effigies
Contributor

this won't take care of globals from different modules i think.

_extract_code_globals is recursive, but I haven't attempted to test. Do you have a minimal test in mind?

@satra
Contributor

satra commented Apr 29, 2021

btw, this cloudpickle PR may also help (cloudpipe/cloudpickle#417)

@effigies
Contributor

cloudpipe/cloudpickle#428 is in, so any cloudpickle >=1.6.1 should not have the globals non-determinism. That PR does use insertion order, rather than sorting, so it's conceivable that non-determinism could work its way back in, but the tests indicate that (for now) it's resilient to the initial random state.

@satra
Contributor

satra commented Jun 30, 2021

nice! any idea of the release schedule?

@effigies
Contributor

Nope. It's been a bit. It looks like they're kind of active at the moment, so hopefully this burst will be followed by a release.

@effigies
Contributor

effigies commented Mar 7, 2022

Released in 2.0. Upgrading the cloudpickle dependency will close this.
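
For reference, a hypothetical pin expressed as a setup.py-style requirement (the actual change lives in Pydra's packaging metadata, via the linked fix):

# Hypothetical excerpt of install_requires; the real pin is in Pydra's
# packaging metadata, updated by the fix referenced above.
install_requires = [
    "cloudpickle >= 2.0.0",
]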

@djarecka added this to To do in Spring 2022 hacking via automation on Mar 7, 2022
Spring 2022 hacking automation moved this from To do to Done on Mar 31, 2022