Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove code introduced by #17644 #21359

Closed
lesteve opened this issue Oct 18, 2021 · 3 comments · Fixed by #21539
Closed

Remove code introduced by #17644 #21359

lesteve opened this issue Oct 18, 2021 · 3 comments · Fixed by #21539

Comments

@lesteve
Copy link
Member

lesteve commented Oct 18, 2021

#17644 introduced code in Tree.__setstate__ to deal with pickles saved in a different endianness that the endianness on the machine the pickle is loaded on.

#21237 showed that this fix was not working since there is an error in Tree.__cinit__ (i.e .before __setstate__). This is probably better to remove this tricky code if it is not useful.

Side-comment: this problem is avoided with joblib 1.1 which loads arrays in native endianness (same behaviour as pickle).

While I am at it adding a test with a pickle generated on a big-endian machine would be nice.

@ogrisel
Copy link
Member

ogrisel commented Oct 20, 2021

I agree with the proposed strategy.

@thomasjpfan
Copy link
Member

+1 on removing. Do we want to consider this a bug fix for 1.0.1 or 1.0.2?

@lesteve
Copy link
Member Author

lesteve commented Oct 26, 2021

After some more investigation #17644 is still needed if using pickle, somehow joblib (with version >= 1.1.0) always uses native endianness whereas pickle keeps the pickle endianness for structured arrays ...

import pickle
import numpy as np
import joblib

arr = np.array([(1, 2.0)], dtype=[("myint", ">i8"), ("myfloat", ">f8")])
print(f"original dtype: {arr.dtype}")

numpy_dtype = pickle.loads(pickle.dumps(arr)).dtype
print(f"after numpy dump+load: {numpy_dtype}")

joblib.dump(arr, "/tmp/test.pkl")
joblib_dtype = joblib.load("/tmp/test.pkl").dtype
print(f"after joblib dump+load: {joblib_dtype}")

Output (on a little-endian machine i.e. most common):

original dtype: [('myint', '>i8'), ('myfloat', '>f8')]
after numpy dump+load: [('myint', '>i8'), ('myfloat', '>f8')]
after joblib dump+load: [('myint', '<i8'), ('myfloat', '<f8')]

Side-comment: for simple dtypes e.g. ('float64', 'int64' etc ...) both pickle and joblib use native endianness (i.e. don't respect the endianness of the pickle).

Why this matters in the context of #17644 is that NODE_DTYPE is a structured array with the following dtype (little-endian machine):

dtype=[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')])

node_array.dtype comes from the pickle and can have the opposite endianness if it has been pickled on a big endian machine and loaded with pickle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants