Tree pickle portability between 64bit and 32bit arch #19602

Closed
aryanxk02 opened this issue Mar 3, 2021 · 23 comments · Fixed by #21552

Comments

@aryanxk02

aryanxk02 commented Mar 3, 2021

Decision Tree Classifier: I get the following error when loading the pickled model through my Flask app:

C:\Users\Aryan\Desktop\heart-host>python app.py
C:\Users\Aryan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.tree.tree module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
  warnings.warn(message, FutureWarning)
Traceback (most recent call last):
  File "app.py", line 5, in <module>
    heart_dtc = pickle.load(open('dtc.pkl', 'rb'))
  File "sklearn\tree\_tree.pyx", line 607, in sklearn.tree._tree.Tree.__cinit__
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'
@glemaitre
Member

Are you running the same version of scikit-learn as the one used to create the pkl?

@aryanxk02
Author

aryanxk02 commented Mar 3, 2021

sklearn version: 0.22.2
YES

@glemaitre
Member

Could you as well give information about the OS and CPU architecture where the model was trained and where it is deployed?

@ogrisel
Member

ogrisel commented Mar 3, 2021

From the type error, one could guess that the model was pickled with Python 32 bit on Windows and loaded on Python 64 bit (e.g. on a Linux server). Is this the case?

If this is confirmed, we should probably consider this a bug of scikit-learn as @rth recently hinted in another issue or PR (that I failed to find).

However we would need to make sure that this bug has not already been fixed. @aryanxk02, can you try to check if you can reproduce the problem with scikit-learn 0.24 or, even better, a nightly build?

https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds

@aryanxk02
Author

> Could you as well give information about the OS and CPU architecture where the model was trained and where it is deployed?

currently on Windows 64 bit Operating System

@aryanxk02
Author

> From the type error, one could guess that the model was pickled with Python 32 bit on Windows and loaded on Python 64 bit (e.g. on a Linux server). Is this the case?
>
> If this is confirmed, we should probably consider this a bug of scikit-learn as @rth recently hinted in another issue or PR (that I failed to find).
>
> However we would need to make sure that this bug has not already been fixed. @aryanxk02, can you try to check if you can reproduce the problem with scikit-learn 0.24 or, even better, a nightly build?
>
> https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds

Okay, I developed the entire code (including training the ML model and saving it to a pickle file) and ran it on 32-bit Python through Jupyter Notebook on Windows.

Sure, I will try scikit-learn 0.24 and report back.

@ogrisel
Member

ogrisel commented Mar 3, 2021

What version / bitness of Python and OS do you use to run the flask app?

@aryanxk02
Author

> What version / bitness of Python and OS do you use to run the flask app?

  1. Python 3.8 - 32 bit
  2. Windows 10 - 64 bit

@ogrisel
Member

ogrisel commented Mar 3, 2021

If you use the same version and bitness of Python both for training the model and for running the flask app, then I do not understand how you can get this error.

Can you please run import sklearn; sklearn.show_versions() both in your flask app and in your jupyter notebook to check this?

@ogrisel
Member

ogrisel commented Mar 3, 2021

Also, can you check that you can run pickle.load(open('dtc.pkl', 'rb')) without error in your notebook right after saving the model to disk?
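
For illustration, a minimal roundtrip check along those lines (a sketch, not code from this thread; dtc.pkl is the file name from the traceback above):

import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Sketch: verify the pickle loads back in the same environment right
# after saving it, before deploying it anywhere else.
X, y = load_iris(return_X_y=True)
dtc = DecisionTreeClassifier().fit(X, y)

with open('dtc.pkl', 'wb') as f:
    pickle.dump(dtc, f)
with open('dtc.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored.score(X, y))  # should run without the dtype error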

@rth
Member

rth commented Mar 3, 2021

An identical issue was reported earlier in #7891 when unpickling on 32-bit ARM, and in pyodide/pyodide#521

BTW, shouldn't size_t be unsigned? The error message says long long, unless the "unsigned" part is swallowed somewhere. At least on 64-bit Linux, I get,

$ echo | gcc -E -xc -include 'stddef.h' - | grep size_t
typedef long unsigned int size_t;

In the above linked issue, I mentioned,

> In particular it sounds like Tree.n_classes returned a long long while it should have been size_t

but I no longer remember the reasoning behind that comment.

Edit: the WebAssembly VM is also little-endian, same as x86_64, so it's not a byte-order issue.

@aryanxk02
Author

aryanxk02 commented Mar 3, 2021

> If you use the same version and bitness of Python both for training the model and for running the flask app, then I do not understand how you can get this error.
>
> Can you please run import sklearn; sklearn.show_versions() both in your flask app and in your jupyter notebook to check this?

Hello, regarding this: I just checked my Python bitness in the Jupyter Notebook where I ran the entire code, and it was 64-bit. The one I used for Flask was 32-bit.
Thank you so much!

@rth
Member

rth commented Mar 4, 2021

Below are minimal steps to reproduce on 64-bit Linux with a 32-bit Docker image.

On the 64-bit Linux host,

conda create -n test-sklearn python=3.7.3
conda activate test-sklearn
pip install scikit-learn==0.24.1
mkdir tmp/
cd tmp/
python example.py

with example.py:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib

iris = load_iris()
clf = RandomForestClassifier().fit(iris.data, iris.target)

joblib.dump(clf, 'estimator.pkl')

Then start a 32-bit Docker image,

docker run -it --rm -v "$(pwd):/shared" i386/debian
apt update
apt install python3 python3-venv
python3 -m venv venv
source venv/bin/activate
pip install scikit-learn==0.24.1
python -c "import joblib; joblib.load('/shared/estimator.pkl')"

produces,

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/venv/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 585, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/venv/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 504, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.7/pickle.py", line 1085, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.7/pickle.py", line 1433, in load_reduce
    stack[-1] = func(*args)
  File "sklearn/tree/_tree.pyx", line 595, in sklearn.tree._tree.Tree.__cinit__
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'

I'll try to investigate in more details later.

@rth rth added the Bug label Mar 4, 2021
@aryanxk02
Author

@rth any guidance on how to fix this for Windows OS would be appreciated sir.

@bayesfactor

Why is this closed? I'm still having this issue with a saved random forest classifier: when training/saving on 64-bit Python and loading/deploying on 32-bit Python, I get the error

ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'

@aryanxk02
Author

> Why is this closed? I'm still having this issue with a saved random forest classifier: when training/saving on 64-bit Python and loading/deploying on 32-bit Python, I get the error
>
> ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'

Nope, both interpreters should be the same, either 64-bit or 32-bit. But the same!

@rth rth changed the title Error while running model in pkl format through Flask Tree pickle portability between 64bit and 32bit arch Oct 13, 2021
@rth
Member

rth commented Oct 13, 2021

Indeed it's still a valid issue. Re-opening.

The general issue is that we should stop using size_t as dtype in objects that get serialized, since it has a different size depending on the architecture.
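
To illustrate (a minimal sketch, not from the thread): np.intp, which backs SIZE_t in the tree code, changes width with the platform, so integer arrays written by a 64-bit pickle stop matching the expected buffer type on a 32-bit build:

import numpy as np

# np.intp mirrors the C Py_ssize_t used for SIZE_t in sklearn/tree/_tree.pyx:
# its itemsize is 8 on a 64-bit build and 4 on a 32-bit one.
print(np.dtype(np.intp).itemsize)

# A pickle stores concrete-width integers, so an array dumped as int64 on a
# 64-bit machine no longer matches the SIZE_t buffer on a 32-bit machine.
n_classes = np.array([3], dtype=np.int64)  # what a 64-bit pickle contains
print(n_classes.dtype == np.intp)          # True on 64-bit, False on 32-bit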

@rth rth reopened this Oct 13, 2021
@lesteve
Member

lesteve commented Oct 13, 2021

> The general issue is that we should stop using size_t as dtype in objects that get serialized, since it has a different size depending on the architecture.

Having looked recently at #21237 (a somewhat related issue, but about differing endianness rather than 64-bit vs 32-bit), I think there will very likely be more issues down the line, for example here:

if (node_ndarray.dtype != NODE_DTYPE):
    # possible mismatch of big/little endian due to serialization
    # on a different architecture. Try swapping the byte order.
    node_ndarray = node_ndarray.byteswap().newbyteorder()
    if (node_ndarray.dtype != NODE_DTYPE):
        raise ValueError('Did not recognise loaded array dytpe')

  • node_ndarray.dtype is from the pickle and is a structured dtype containing 64-bit dtypes
  • NODE_DTYPE is from the machine where the pickle is loaded and is a structured dtype containing 32-bit dtypes

Both dtypes won't match and there will be an error.
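
A minimal sketch of that failure mode (hypothetical field layout, not the real NODE_DTYPE):

import numpy as np

# The pickle carries 64-bit index fields while a 32-bit build expects
# 32-bit ones; swapping the byte order never changes the field width,
# so the second dtype check in __setstate__ fails as well.
pickled_dtype = np.dtype([('left_child', np.int64), ('threshold', np.float64)])
native_dtype = np.dtype([('left_child', np.int32), ('threshold', np.float64)])

print(pickled_dtype == native_dtype)                 # False
print(pickled_dtype.newbyteorder() == native_dtype)  # still False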

@rth
Member

rth commented Oct 13, 2021

Hmm, yes, the endianness issue is a bit more complex. And it looks like #17644 might have created more issues than it solved. But what do you think we should do for 32/64-bit compatibility, which is probably easier to manage?

Aren't there portable dtypes defined in numpy we could use? Say, replace all usize_t with np.int32 (or unsigned long long)? But I'm not sure how efficient that would be for scalars.

@ogrisel
Member

ogrisel commented Oct 13, 2021

I think the endianness issue is (quite) orthogonal to the int precision issue: even if we had pickles from the same endianness, we would still have a problem with the handling of the precision.

The reason is that the Cython Node datastructure and therefore the NODE_DTYPE are platform specific:

ctypedef np.npy_float32 DTYPE_t          # Type of X
ctypedef np.npy_float64 DOUBLE_t         # Type of y, sample_weight
ctypedef np.npy_intp SIZE_t              # Type for indices and counters

cdef struct Node:
    # Base storage structure for the nodes in a Tree object

    SIZE_t left_child                    # id of the left child of the node
    SIZE_t right_child                   # id of the right child of the node
    SIZE_t feature                       # Feature used for splitting the node
    DOUBLE_t threshold                   # Threshold value at the node
    DOUBLE_t impurity                    # Impurity of the node (i.e., the value of the criterion)
    SIZE_t n_node_samples                # Number of samples at the node
    DOUBLE_t weighted_n_node_samples     # Weighted number of samples at the node

The np.npy_intp type (which is equivalent to ssize_t in C) is platform specific: on 32 bit platforms this is equivalent to np.int32 while it's np.int64 on 64 bit platforms.

Similarly the __cinit__(self, int n_features, np.ndarray[SIZE_t, ndim=1] n_classes, int n_outputs) constructor expects a n_classes numpy array with a platform specific dtype.

I think the simplest solution would be to put some conversion logic in a new __init__ method (to initialize the n_classes array) as well as in the __setstate__ method, to convert from int32 or int64 arrays (or recarray fields in the case of NODE_DTYPE) on the fly. The conversion itself should be a matter of calling .astype on those arrays with the expected dtypes: np.intp for n_classes, and NODE_DTYPE for the nodes in __setstate__.

Note that when both __cinit__ and __init__ are defined, both are called with the same arguments in that order (https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#initialisation-methods-cinit-and-init), so it should be OK to delay the initialization and conversion of the n_classes attribute to the __init__ method, in order to do the runtime type conversion there.
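
A rough sketch of that conversion idea (a hypothetical helper, not the actual change merged in #21552):

import numpy as np

def _coerce_node_array(node_ndarray, expected_dtype):
    # Hypothetical helper: cast a node array pickled on another
    # architecture to the dtype this build expects; .astype handles
    # both the 32/64-bit width change and the byte order.
    if node_ndarray.dtype == expected_dtype:
        return node_ndarray
    return node_ndarray.astype(expected_dtype)

In recent numpy versions, .astype on a structured array casts field by field when the field names match, so the same approach can cover the NODE_DTYPE recarray as well as the plain n_classes array.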

@thomasjpfan
Member

thomasjpfan commented Oct 13, 2021

For us to maintain code that enables portability between 32/64-bit systems, I think we need a new CI job for testing. The test would consist of:

  1. Build 64bit wheel.
  2. Train models in all_estimators in a 64bit platform and pickle them.
  3. Build 32bit wheel.
  4. Load models in a 32bit platform and predict/transform.

Side note: Numpy has recently decided not to have 32-bit manylinux wheels, so building a 32-bit wheel for scikit-learn may become harder.

@ogrisel
Member

ogrisel commented Oct 13, 2021

That's a possibility but it's quite heavy to set up and maintain.

A simpler solution would be to dump a bunch of potentially problematic models (e.g. Cython-based models such as kd-trees, ball trees, RFs, GBRT and HistGBRT, generated by a small helper script) on a 64-bit machine once in a while, and store the results in a folder of the GitHub repo (the models should be parameterized to be small, e.g. a few kB). Then we can write a test that checks we can call score and get the expected accuracy, and run it on both the 32-bit and 64-bit CIs. This would also be useful for being aware of the (lack of) forward compat of the pickles between consecutive scikit-learn versions. When the class structure changes we will have to regenerate some pickle files, but at least we will know when and what we break before our users complain about it, even if we do not support pickle compat between scikit-learn versions :)
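
A sketch of such a test (hypothetical folder name and score threshold, not the actual scikit-learn test suite; assumes the pickles were generated by the helper script on a 64-bit machine):

import glob

import joblib
import pytest
from sklearn.datasets import load_iris

# Sketch: estimators pickled once on a 64-bit machine and committed to the
# repo; this test runs on both the 32-bit and 64-bit CIs and checks the
# stored models still load and score as expected.
EXPECTED_MIN_SCORE = 0.9  # hypothetical threshold

@pytest.mark.parametrize('path', sorted(glob.glob('stored_pickles/*.pkl')))
def test_pickle_portability(path):
    X, y = load_iris(return_X_y=True)
    est = joblib.load(path)  # must not raise on any architecture
    assert est.score(X, y) >= EXPECTED_MIN_SCORE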

@ogrisel
Member

ogrisel commented Oct 13, 2021

> Side note: Numpy has recently decided not to have 32-bit manylinux wheels, so building a 32-bit wheel for scikit-learn may become harder.

If they stop doing that, then maybe we should stop doing this as well... But I suppose 32-bit platforms will still be supported via Debian and conda-forge packages, for instance.
