
FunctionTask hashes depending on imports #457

Closed
JeffWigger opened this issue Apr 16, 2021 · 16 comments · Fixed by #517
Labels: bug, hackathon

Comments

@JeffWigger

What version of Pydra are you using?

Pydra version: 0.14.1
Python version: 3.8.5

This is a follow-up to #455.
I found another way to create inconsistent hash values for a FunctionTask; I came across it after applying the workaround for the first bug in my project. A decorated function that uses several functions imported at the top of the module also produces inconsistent hash values.

I could replicate the behaviour with the following minimal example.

from os.path import basename, splitext
import pydra

cache_dir_tmp = "/home/rerun_test/cache"

@pydra.mark.task
def gunzip_alt(in_file):
    print(in_file)
    out_file = basename(splitext(in_file)[0])
    return "hoi"

    
    
task4 = gunzip_alt(name = "task4", in_file = "/home/rerun_test/t.txt.gz", cache_dir=cache_dir_tmp)
print(task4.inputs.hash)
print(task4.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(task4)

print(task4.result(return_inputs=True))
print(pydra.__version__)

When rerunning the above program, I observed the hash randomly alternating between two values.

The issue does not occur if the two imports happen inside the gunzip_alt function:

import pydra

cache_dir_tmp = "/home/rerun_test/cache"

@pydra.mark.task
def gunzip_alt(in_file):
    from os.path import basename, splitext
    print(in_file)
    out_file = basename(splitext(in_file)[0])
    return "hoi"

    
    
task4 = gunzip_alt(name = "task4", in_file = "/home/rerun_test/t.txt.gz", cache_dir=cache_dir_tmp)
print(task4.inputs.hash)
print(task4.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(task4)

print(task4.result(return_inputs=True))
print(pydra.__version__)

I also did not observe the issue if only one imported function was used. If three imported functions were used inside gunzip_alt, I observed the hash alternating among six values.

When comparing the printed inputs of task4, I only noticed changes at the end of the function's byte representation (the _func argument), after the __globals__ keyword. I suspect that Python randomly chooses the order in which these imported functions are referenced inside __globals__, hence the counts 1, 2, and 6: the factorials of the number of imported functions used.

I am not sure whether this is a bug, or whether users should be required to import all functions they use inside the decorated function.
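
For what it's worth, the non-determinism can also be observed outside of Pydra by hashing the cloudpickled function directly. This is only a rough sketch under the assumption that Pydra hashes the cloudpickle byte string (sha256 here is just a stand-in for whatever hash Pydra actually uses); running this script several times can yield different digests:

# Rough sketch, not Pydra code: hash the cloudpickled function directly.
# With an affected cloudpickle version, the digest can change between runs
# of this script because the order of the captured globals is not stable.
import hashlib

import cloudpickle
from os.path import basename, splitext


def gunzip_alt(in_file):
    # Uses two module-level imports, mirroring the example above.
    return basename(splitext(in_file)[0])


print(hashlib.sha256(cloudpickle.dumps(gunzip_alt)).hexdigest())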

@JeffWigger added the bug label on Apr 16, 2021
@satra
Contributor

satra commented Apr 16, 2021

@JW-96 - this is both a bug and not a bug, but the full solution is complicated and would need some work.

the way function hashing works right now is to cloudpickle the function and hash the pickled state. so the hash is going to be dependent on your exact environment and the pickling process and will not survive if packages are updated or the function is switched. the nice thing though is that the pickled function will execute in a similar environment.

we were contemplating a more source based operation similar to what nipype uses. however, this would require significant function introspection and we were considering the dill package to do so. for this part we would hash the source code of the function up to some prescribed depth and recreate the function on a rerun. this would involve detecting functions and dependencies and somehow checking which to reconstruct and which can be relied on from libraries.
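
As a rough illustration of that source-based direction (a sketch only, under the assumption that hashing the text returned by inspect.getsource is an acceptable stand-in; this is not Pydra's or nipype's actual implementation):

# Sketch of a source-based hash: hash the function's source text instead of
# its pickled state, so the value does not depend on the pickling machinery
# or on how __globals__ happens to be ordered.
import hashlib
import inspect


def hash_function_source(func):
    # Covers only the function's own source; recursing into its dependencies
    # "up to some prescribed depth" would need extra introspection, e.g. via
    # the dill package mentioned above.
    src = inspect.getsource(func)
    return hashlib.sha256(src.encode("utf-8")).hexdigest()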

@djarecka
Collaborator

@satra - I wonder if we should be returning an error if imports are done outside the function

@JeffWigger
Author

JeffWigger commented Apr 21, 2021

The hashes are inconsistent not only when imported functions are used, but also when other global variables are referenced inside the function.
The example below randomly alternates between two possible hashes.

import pydra

cache_dir_tmp = "/home/rerun_test/cache"
global_var1 = "Hoi"
global_var2 = "Salut"

@pydra.mark.task
def hello1(greeting):
    res = global_var1 + ", " + global_var2 + " and " + greeting
    print(res)
    return res


hello_task = hello1(name="hello_task", greeting="Gruezi", cache_dir=cache_dir_tmp)
print(hello_task.inputs.hash)
print(hello_task.inputs)

with pydra.Submitter(plugin="cf", n_procs=6) as sub:
    sub(hello_task)

print(hello_task.result(return_inputs=True))
print(pydra.__version__)

I think this is more of a problem with Python itself; it should enforce a consistent order in __globals__.
@djarecka I agree that returning an error would be a sensible solution.

@satra
Contributor

satra commented Apr 21, 2021

We cannot raise errors since a key goal is to let users reuse functions defined in any library. We are not going to change the python ecosystem to put imports into functions. However, we can do the technical task of introspecting functions to do the kind of state preservation we need. For example,

  • filter globals to only keep the ones relevant to the function (see the sketch after this list)
  • introspect the function to see if there are other functions we need to reconstruct from the global namespace; we can ignore imported functions
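
A minimal sketch of the first bullet, filtering __globals__ down to the names the function's code object actually references (co_names over-approximates a little, since it also contains attribute names, and it does not descend into nested code objects):

# Sketch only: keep just the global names that the function's bytecode
# references. A complete solution would also walk nested code objects
# (closures, comprehensions) and decide what to do with imported modules.
def relevant_globals(func):
    referenced = set(func.__code__.co_names)
    return {
        name: value
        for name, value in func.__globals__.items()
        if name in referenced
    }

For the hello1 example earlier in the thread (applied to the plain, undecorated function), this would keep global_var1 and global_var2 while dropping unrelated module globals such as cache_dir_tmp.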

@effigies
Contributor

I might be missing something, but would submitting a patch to sort __globals__ in cloudpickle resolve most of the issues here?

I'd think we could patch here:

     f_globals_ref = _extract_code_globals(func.__code__)
-    f_globals = {k: func.__globals__[k] for k in f_globals_ref if k in
+    f_globals = {k: func.__globals__[k] for k in sorted(f_globals_ref) if k in
                  func.__globals__}

@satra
Contributor

satra commented Apr 21, 2021

Yes, it would solve the current issue in this thread, and we should do that.

But this will not address the general problem of hashing functions across environments. Also, at this point cloudpickle's README explicitly states that the pickled output should not be considered consistent. So having some other options in the pipeline would be good.

@effigies
Contributor

I see I didn't read carefully enough. You're suggesting replacing cloudpickle with our own hand-rolled version?

@satra
Contributor

satra commented Apr 21, 2021

Eventually; for now, your solution would cover this use case. I wouldn't have a problem doing this in cloudpickle itself, but they may consider it out of scope. We should set up a chat with Olivier.

@effigies
Contributor

Took a while to figure out how to write a test, but: cloudpipe/cloudpickle#418.

@satra
Contributor

satra commented Apr 21, 2021

thanks @effigies - let's see what they say. alternatively i don't think __globals__ is readonly (since it is base python), and we could sort it ourselves before sending it to cloudpickle.

let me take that back - this won't take care of globals from different modules i think.

@effigies
Contributor

this won't take care of globals from different modules i think.

_extract_code_globals is recursive, but I haven't attempted to test. Do you have a minimal test in mind?

@satra
Contributor

satra commented Apr 29, 2021

btw, this cloudpickle PR may also help (cloudpipe/cloudpickle#417)

@effigies
Contributor

cloudpipe/cloudpickle#428 is in, so any cloudpickle >=1.6.1 should not have the globals non-determinism. That PR does use insertion order, rather than sorting, so it's conceivable that non-determinism could work its way back in, but the tests indicate that (for now) it's resilient to the initial random state.

@satra
Contributor

satra commented Jun 30, 2021

nice! any idea of the release schedule?

@effigies
Contributor

Nope. It's been a bit. It looks like they're kind of active at the moment, so hopefully this burst will be followed by a release.

@effigies
Contributor

effigies commented Mar 7, 2022

Released in 2.0. Upgrading the cloudpickle dependency will close this.
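
For reference, a hypothetical pin expressed as a setup.py-style requirement (the actual change lives in Pydra's packaging metadata, via the linked fix):

# Hypothetical excerpt of install_requires; the real pin is in Pydra's
# packaging metadata, updated by the fix referenced above.
install_requires = [
    "cloudpickle >= 2.0.0",
]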

@djarecka added this to To do in Spring 2022 hacking via automation on Mar 7, 2022
Spring 2022 hacking automation moved this from To do to Done on Mar 31, 2022