Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status on cloudpickle non-determinism #453

Open
feizerl opened this issue Oct 15, 2021 · 2 comments
Open

Status on cloudpickle non-determinism #453

feizerl opened this issue Oct 15, 2021 · 2 comments
Labels
pickle issue An issue that comes from `pickle` itself, and not `cloudpickle`.

Comments

@feizerl
Copy link

feizerl commented Oct 15, 2021

Hello,

@ogrisel mentioned in this comment (#385 (comment)):

Anyway, having deterministic pickles is probably out of scope for cloudpickle so I would be in favor of closing this issue.

However, @ogrisel also pushed a PR (#428) which was released as part of cloudpickle 2.0.0 that tried to address non determinism owing to dictionary ordering.

I wanted to confirm what is the official status of the project regarding non determinism because I am still seeing non deterministic pickles in cloudpickle 2.0.0

Here is the pickletools.dis outputs of a function:

...
 2539: (                MARK  
 2540: \x8c                 SHORT_BINUNICODE 'correlation_matrix'                                                                           
 2560: \x94                 MEMOIZE    (as 115)
 2561: \x8c                 SHORT_BINUNICODE 'drifts_annual'                                                                                
 2576: \x94                 MEMOIZE    (as 116)
 2577: \x8c                 SHORT_BINUNICODE 'initial_prices'                                                                               
 2593: \x94                 MEMOIZE    (as 117)                                                                                             
 2594: \x8c                 SHORT_BINUNICODE 'volatilities_annual'                                                                          
 2615: \x94                 MEMOIZE    (as 118)                                                                                             
 2616: \x91                 FROZENSET  (MARK at 2539)
 2617: \x94             MEMOIZE    (as 119)  
...

pickle of a function on second attempt:

...
 2539: (                MARK  
 2540: \x8c                 SHORT_BINUNICODE 'volatilities_annual'                                                                          
 2561: \x94                 MEMOIZE    (as 115)
 2562: \x8c                 SHORT_BINUNICODE 'initial_prices'                                                                               
 2578: \x94                 MEMOIZE    (as 116)
 2579: \x8c                 SHORT_BINUNICODE 'drifts_annual'
 2594: \x94                 MEMOIZE    (as 117)                                                                                             
 2595: \x8c                 SHORT_BINUNICODE 'correlation_matrix'
 2615: \x94                 MEMOIZE    (as 118)                                                                                             
 2616: \x91                 FROZENSET  (MARK at 2539)
 2617: \x94             MEMOIZE    (as 119)
...       

As you can see, the entries are all the same, but shuffled around.

This function is part of a large project, so unfortunately I can't produce a short test case right now.

Notice that kubeflow pipelines implement caching by making sure that pickle of the function hasn't changed. (there is an option to not use pickle as well, but it has its own problems). Having a non deterministic cloudpickle invalidates the cache every time making that feature useless.

Thanks.

@pierreglaser
Copy link
Member

The pickle instructions you're displaying seems like reconstruction instruction for set (more precisely, FrozenSet) instances, which are not ordered (unlike dicts in Python > 3.7). I would imagine that pickles of set objects created by pickle (and not cloudpickle) are also non-deterministic. Not sure if we want to override this behavior (which would in any way work only for the CloudPickler inheriting from the pure-python Pickler, and not the fast C-backed Pickler).

@pierreglaser pierreglaser added the pickle issue An issue that comes from `pickle` itself, and not `cloudpickle`. label Oct 26, 2022
@ogrisel
Copy link
Contributor

ogrisel commented Oct 13, 2023

We could introduce a constructor parameter to implement a slower, deterministic version of cloudpickle. But, indeed I am not sure how we could do that with the subclass of the C-implementation of the CPython pickle.Pickler class.

Note that joblib.hash can do that already, but it does not analyse the content of dynamically defined functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pickle issue An issue that comes from `pickle` itself, and not `cloudpickle`.
Projects
None yet
Development

No branches or pull requests

3 participants