
BUG: cannot use np.save and allow_pickle=True with data larger than 4 GB #26224

Open
khood5 opened this issue Apr 7, 2024 · 21 comments

khood5 commented Apr 7, 2024

Still getting

pickle.dump(array, fp, protocol=3, **pickle_kwargs)                                                             
OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

with

Python 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:35:02) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> print(pickle.format_version)
4.0
>>> exit

and

Python 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:35:02) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.26.4'
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit

I think this is incorrect; the protocol is still hard-coded to 3. Or am I missing something?

pickle.dump(array, fp, protocol=3, **pickle_kwargs)

Originally posted by @khood5 in #18784 (comment)

charris (Member) commented Apr 7, 2024

Just call dump with protocol=4.

khood5 (Author) commented Apr 7, 2024

Just call dump with protocol=4.

That line is in the numpy source, not my code. I have no way (that I know of) to change it without editing the source code, which is undesirable.

charris (Member) commented Apr 7, 2024

It is just the default; all you need to do to override it is specify it in the call.

khood5 (Author) commented Apr 7, 2024

Thank you @charris for your time, but I'm sorry, I don't understand. This is in the def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None): function, which does not offer a parameter for the pickle protocol. Furthermore, I am not calling write_array directly; I am calling np.save, which uses write_array, and if I am not mistaken save has the signature def save(file, arr, allow_pickle=True, fix_imports=True):, so again there is no option to override the default.

The full stack trace is:

File "/data2/khood/GitHub/MLAudio/convertDataToNumpy.py", line 40, in                                         np.save('neuroTrain.npy', neuroTrain)                                                                             
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/npyio.py", line 546, in save          format.write_array(fid, arr, allow_pickle=allow_pickle,                                                           
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/format.py", line 719, in write_array                                                                                                                      pickle.dump(array, fp, protocol=3, **pickle_kwargs)                                                             
OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher 

If there is a way to tell numpy to use the newer pickle protocol, please let me know!

charris (Member) commented Apr 8, 2024

Python functions can have default values for arguments, which is what you see as protocol=3. What that means is that if you don't specify the argument, the default value is used. What it doesn't mean is that you can't specify the value.

In [1]: import pickle

In [2]: import numpy as np

In [3]: a = np.ones(20)

In [4]: with open('test.pkl', 'wb') as f:
   ...:     pickle.dump(a, f, protocol=4)
   ...:

In [5]:

khood5 (Author) commented Apr 8, 2024

I appreciate your advice. Just to clarify, are you suggesting that I avoid using the numpy save method and instead opt for directly using pickle to dump the data? My concern is that using pickle might create a file in a different format than npy. Since downstream processes are expecting npy format, can you confirm if the formats are identical?

khood5 (Author) commented Apr 8, 2024

I suppose it does look like numpy's save is just wrapping the dump.

charris (Member) commented Apr 8, 2024

np.save works for normal files:

In [7]: a = np.zeros(2**30)

In [8]: with open('test.pkl', 'wb') as f:
   ...:     np.save(f, a)
   ...: 

In [9]: ls -l test.pkl
-rw-r--r--. 1 charris charris 8589934720 Apr  7 18:43 test.pkl

That is an 8 GB file. Are you trying to save an object array?

charris (Member) commented Apr 8, 2024

Also, are you running on Windows?

charris (Member) commented Apr 8, 2024

And what numpy version?

khood5 (Author) commented Apr 8, 2024

Linux (Ubuntu, I believe; I'm afk atm but I'll check soon), numpy '1.26.4'.

Yes, it has objects because it has multiple dimensions [10000, 7001, 201], and each dimension would be an ndarray, I believe.

charris (Member) commented Apr 8, 2024

and each dimension would be an ndarray, I believe

Why would each dimension be an ndarray? How did you make the array? What does array.dtype say?
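
For illustration, a homogeneous multidimensional array has a numeric dtype, while only a dtype=object array actually holds separate Python objects. A quick sketch (not from the original comment):

import numpy as np

a = np.zeros((3, 4, 5))        # ordinary multidimensional array
print(a.dtype)                 # float64 -- one homogeneous block, no nested ndarrays

b = np.empty(3, dtype=object)  # object array: each element is its own Python object
b[0] = np.zeros((4, 5))
b[1] = np.zeros((4, 6))        # elements may even have different shapes
b[2] = b"raw bytes"
print(b.dtype)                 # object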

charris (Member) commented Apr 8, 2024

And what is the exact call you make to np.save?

rkern (Member) commented Apr 8, 2024

@charris Please do look at the code he's linking. We explicitly call dump() with protocol=3, with no opportunity for users of np.save() to override it. He has an object array that happens to contain an object that pickle will not dump unless protocol >= 4. It's not an issue with large-ndarray pickling per se.
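
A minimal reproduction along those lines (hypothetical data, not the reporter's; it needs a little over 4 GiB of memory to run):

import numpy as np

# Object array whose single element is a bytes object larger than 4 GiB.
# np.save falls back to pickle for object arrays, and the hard-coded
# protocol=3 cannot serialize a bytes object this large.
arr = np.empty(1, dtype=object)
arr[0] = b"\x00" * (4 * 1024**3 + 1)

np.save("big.npy", arr, allow_pickle=True)
# OverflowError: serializing a bytes object larger than 4 GiB requires
# pickle protocol 4 or higher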

charris (Member) commented Apr 8, 2024

@rkern At this point, the question is why it is an object array.

khood5 (Author) commented Apr 8, 2024

Also, are you running on Windows?

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

Why would each dimension be an ndarray?

Because a multidimensional ndarray is essentially an ndarray of ndarrays.

In a prior discussion here, @charris mentioned that numpy was considering switching to pickle protocol 4 as the default. However, since that comment was made in April 2021 and it's now 2024, it's possible the line hard-coding protocol 3 was simply never updated, resulting in this bug. While it would be helpful to allow users to override the protocol version, I get that ensuring predictability is important. Finally, to add some context: like the original poster, I am also working with neural networks, which can make it hard to produce nice small snippets of reproducible code.

I was able to work around this by making sure all elements are floats. I don't know if you want to leave this open to address the fixed protocol version, so I'll let you close it if you want.

Thank you for your help! I really appreciate your time, and having you kinda walk me through it did make me realize there were objects in the array that probably should not have been there.
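
A sketch of that workaround, assuming every element really is a numeric sub-array of the same shape (illustrative shapes, not the actual [10000, 7001, 201] data):

import numpy as np

# Object array of same-shaped float sub-arrays (the problematic layout).
ragged = np.empty(3, dtype=object)
for i, block in enumerate([np.zeros((7, 5)), np.ones((7, 5)), np.full((7, 5), 2.0)]):
    ragged[i] = block

# Stacking into one homogeneous float array avoids pickle entirely:
# np.save writes plain binary data for numeric dtypes, with no 4 GiB limit.
dense = np.stack(list(ragged))       # shape (3, 7, 5), dtype float64
np.save("neuroTrain.npy", dense)     # allow_pickle is not needed here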

rkern (Member) commented Apr 8, 2024

@rkern At this point, the question is why it is an object array.

The user has an object array. They want to serialize it in NPY format. This is a thing that we support. However, they happen to have a (standard, builtin) object within that object array that pickle refuses to serialize with the pickle version that we hardcode. That's a problem for us to solve. I'm glad that this user found a better way to organize their arrays that avoided the problem, but it still exists.

ngoldbaum (Member) commented

Presumably we need a size check to see if we'll hit this corner case and in that case choose a different protocol. Unless of course it's safe to just use the newer protocol always, but I suspect protocol 3 is hardcoded because of backward compatibility concerns.
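
A rough sketch of what such a guard could look like (hypothetical helper, not actual NumPy code; note that nbytes undercounts object arrays because it only measures the element pointers, so a real check would need to be smarter):

import pickle

def _choose_protocol(array, default=3):
    # Hypothetical: keep protocol 3 for compatibility, but fall back to 4 when
    # the data is clearly too large for protocol 3's 4 GiB framing limit.
    # Caveat: for dtype=object arrays, nbytes counts only the object pointers,
    # not the referenced objects, so this simple check misses the case
    # reported in this issue.
    if array.nbytes >= 2**32:
        return 4
    return default

# write_array could then do, roughly:
# pickle.dump(array, fp, protocol=_choose_protocol(array), **pickle_kwargs)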

khood5 (Author) commented Apr 8, 2024

If I were you guys, I'd just change it to a default that can be overridden. You're already accepting kwargs for the dump; I'm not even really sure why the protocol was excluded from them.

khood5 (Author) commented Apr 8, 2024

Or just let pickle decide the default. According to the pickle docs, the current default is 4, and protocol 4 has been readable since Python 3.4, so compatibility shouldn't be an issue.
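
For reference, omitting protocol makes pickle use its own default, which is 4 on every Python that current NumPy supports (a small check, not from the original comment):

import io
import pickle

print(pickle.DEFAULT_PROTOCOL)   # 4 on Python 3.8 and later

buf = io.BytesIO()
pickle.dump([1, 2, 3], buf)      # protocol omitted -> pickle.DEFAULT_PROTOCOL is used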

mattip (Member) commented Apr 8, 2024

I think it would be fine to bump the protocol to 4. It has been the default since Python 3.8, and we now (NumPy 2.1) only support Python 3.10 and up.

ngoldbaum changed the title from "OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher" to "BUG: cannot use np.save and allow_pickle=True with data larger than 4 GB" on Apr 11, 2024