
BUG: cannot use np.save and allow_pickle=True with data larger than 4 GB #26224

Open
khood5 opened this issue Apr 7, 2024 · 21 comments

khood5 commented Apr 7, 2024

Still getting

pickle.dump(array, fp, protocol=3, **pickle_kwargs)                                                             
OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

with

Python 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:35:02) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> print(pickle.format_version)
4.0
>>> exit

and

Python 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:35:02) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.26.4'
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit

I think this is incorrect; the protocol is still hard-coded to 3. Or am I missing something?

pickle.dump(array, fp, protocol=3, **pickle_kwargs)

Originally posted by @khood5 in #18784 (comment)

charris (Member) commented Apr 7, 2024

Just call dump with protocol=4.

khood5 (Author) commented Apr 7, 2024

Just call dump with protocol=4.

That line is in the numpy source, not my code. I have no way (that I know of) to change it without editing the source code, which is undesirable.

charris (Member) commented Apr 7, 2024

It is just the default; all you need to do to override it is specify it in the call.

khood5 (Author) commented Apr 7, 2024

Thank you @charris for your time, but I'm sorry, I don't understand. This is in the def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None): function, which does not offer a parameter for the pickle protocol. Furthermore, I am not calling write_array directly; I am calling np.save, which uses write_array, and if I am not mistaken save has the signature def save(file, arr, allow_pickle=True, fix_imports=True):, so again there is no option to override the default.

The full stack trace is:

File "/data2/khood/GitHub/MLAudio/convertDataToNumpy.py", line 40, in                                         np.save('neuroTrain.npy', neuroTrain)                                                                             
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/npyio.py", line 546, in save          format.write_array(fid, arr, allow_pickle=allow_pickle,                                                           
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/format.py", line 719, in write_array                                                                                                                      pickle.dump(array, fp, protocol=3, **pickle_kwargs)                                                             
OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher 

If there is a way to tell numpy to use the newer pickle protocol, please let me know!

charris (Member) commented Apr 8, 2024

Python functions can have default values for arguments, which is what you see as protocol=3. What that means is that if you don't specify the argument, the default value is used. What it doesn't mean is that you can't specify the value.

In [1]: import pickle

In [2]: import numpy as np

In [3]: a = np.ones(20)

In [4]: with open('test.pkl', 'wb') as f:
   ...:     pickle.dump(a, f, protocol=4)
   ...:

In [5]:

khood5 (Author) commented Apr 8, 2024

I appreciate your advice. Just to clarify, are you suggesting that I avoid using the numpy save method and instead opt for directly using pickle to dump the data? My concern is that using pickle might create a file in a different format than npy. Since downstream processes are expecting npy format, can you confirm if the formats are identical?

khood5 (Author) commented Apr 8, 2024

I suppose it does look like numpy's save is just wrapping the dump.

charris (Member) commented Apr 8, 2024

np.save works for normal files:

In [7]: a = np.zeros(2**30)

In [8]: with open('test.pkl', 'wb') as f:
   ...:     np.save(f, a)
   ...: 

In [9]: ls -l test.pkl
-rw-r--r--. 1 charris charris 8589934720 Apr  7 18:43 test.pkl

That is an 8 GB file. Are you trying to save an object array?

charris (Member) commented Apr 8, 2024

Also, are you running on Windows?

charris (Member) commented Apr 8, 2024

And what numpy version?

khood5 (Author) commented Apr 8, 2024

Linux (Ubuntu, I believe; I'm afk atm but I'll check soon), numpy '1.26.4'.

Yes, it has objects because it has multiple dimensions [10000, 7001, 201], and each dimension would be an ndarray, I believe.

charris (Member) commented Apr 8, 2024

and each dimension would be an ndarray, I believe

Why would each dimension be an ndarray? How did you make the array? What does array.dtype say?
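
For illustration, a homogeneous multidimensional array has a numeric dtype, while only a dtype=object array actually holds separate Python objects. A quick sketch (not from the original comment):

import numpy as np

a = np.zeros((3, 4, 5))        # ordinary multidimensional array
print(a.dtype)                 # float64 -- one homogeneous block, no nested ndarrays

b = np.empty(3, dtype=object)  # object array: each element is its own Python object
b[0] = np.zeros((4, 5))
b[1] = np.zeros((4, 6))        # elements may even have different shapes
b[2] = b"raw bytes"
print(b.dtype)                 # object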

charris (Member) commented Apr 8, 2024

And what is the exact call you make to np.save?

rkern (Member) commented Apr 8, 2024

@charris Please do look at the code he's linking. We explicitly call dump() with protocol=3, with no opportunity for users of np.save() to override it. He has an object array that happens to contain an object that pickle will not dump unless protocol >= 4. It's not an issue with large-ndarray pickling per se.
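
A minimal reproduction along those lines (hypothetical data, not the reporter's; it needs a little over 4 GiB of memory to run):

import numpy as np

# Object array whose single element is a bytes object larger than 4 GiB.
# np.save falls back to pickle for object arrays, and the hard-coded
# protocol=3 cannot serialize a bytes object this large.
arr = np.empty(1, dtype=object)
arr[0] = b"\x00" * (4 * 1024**3 + 1)

np.save("big.npy", arr, allow_pickle=True)
# OverflowError: serializing a bytes object larger than 4 GiB requires
# pickle protocol 4 or higher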

charris (Member) commented Apr 8, 2024

@rkern At this point, the question is why it is an object array.

khood5 (Author) commented Apr 8, 2024

Also, are you running on Windows?

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

Why would each dimension be an ndarray?

Because a multidimensional ndarray is essentially an ndarray of ndarrays.

In a prior discussion here, @charris mentioned that numpy was considering switching to pickle protocol 4 as the default. However, since that comment was made in April 2021 and it's now 2024, it's possible the line hard-coding protocol 3 was simply never updated, resulting in this bug. While it would be helpful to allow users to override the protocol version, I get that ensuring predictability is important. Finally, to add some context: like the original poster, I am also working with neural networks, which can make it hard to produce nice small snippets of reproducible code.

I was able to work around this by making sure all elements are floats. I don't know if you want to leave this open to address the fixed protocol version, so I'll let you close it if you want.

Thank you for your help! I really appreciate your time, and having you kinda walk me through it did make me realize there were objects in the array that probably should not have been there.
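
A sketch of that workaround, assuming every element really is a numeric sub-array of the same shape (illustrative shapes, not the actual [10000, 7001, 201] data):

import numpy as np

# Object array of same-shaped float sub-arrays (the problematic layout).
ragged = np.empty(3, dtype=object)
for i, block in enumerate([np.zeros((7, 5)), np.ones((7, 5)), np.full((7, 5), 2.0)]):
    ragged[i] = block

# Stacking into one homogeneous float array avoids pickle entirely:
# np.save writes plain binary data for numeric dtypes, with no 4 GiB limit.
dense = np.stack(list(ragged))       # shape (3, 7, 5), dtype float64
np.save("neuroTrain.npy", dense)     # allow_pickle is not needed here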

rkern (Member) commented Apr 8, 2024

@rkern At this point, the question is why it is an object array.

The user has an object array. They want to serialize it in NPY format. This is a thing that we support. However, they happen to have a (standard, builtin) object within that object array that pickle refuses to serialize with the pickle version that we hardcode. That's a problem for us to solve. I'm glad that this user found a better way to organize their arrays that avoided the problem, but it still exists.

ngoldbaum (Member) commented

Presumably we need a size check to see if we'll hit this corner case and in that case choose a different protocol. Unless of course it's safe to just use the newer protocol always, but I suspect protocol 3 is hardcoded because of backward compatibility concerns.
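
A rough sketch of what such a guard could look like (hypothetical helper, not actual NumPy code; note that nbytes undercounts object arrays because it only measures the element pointers, so a real check would need to be smarter):

import pickle

def _choose_protocol(array, default=3):
    # Hypothetical: keep protocol 3 for compatibility, but fall back to 4 when
    # the data is clearly too large for protocol 3's 4 GiB framing limit.
    # Caveat: for dtype=object arrays, nbytes counts only the object pointers,
    # not the referenced objects, so this simple check misses the case
    # reported in this issue.
    if array.nbytes >= 2**32:
        return 4
    return default

# write_array could then do, roughly:
# pickle.dump(array, fp, protocol=_choose_protocol(array), **pickle_kwargs)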

khood5 (Author) commented Apr 8, 2024

If I were you guys, I'd just change it to a default that can be overridden. You're already accepting kwargs for the dump; I'm not even really sure why the protocol was excluded from them.

khood5 (Author) commented Apr 8, 2024

Or just let pickle decide the default. According to the pickle docs, the current default is 4, and protocol 4 has been readable since Python 3.4, so compatibility shouldn't be an issue.
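
For reference, omitting protocol makes pickle use its own default, which is 4 on every Python that current NumPy supports (a small check, not from the original comment):

import io
import pickle

print(pickle.DEFAULT_PROTOCOL)   # 4 on Python 3.8 and later

buf = io.BytesIO()
pickle.dump([1, 2, 3], buf)      # protocol omitted -> pickle.DEFAULT_PROTOCOL is used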

mattip (Member) commented Apr 8, 2024

I think it would be fine to bump the protocol to 4. It has been the default since Python 3.8, and we now (NumPy 2.1) only support Python 3.10 and up.

ngoldbaum changed the title from "OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher" to "BUG: cannot use np.save and allow_pickle=True with data larger than 4 GB" on Apr 11, 2024