Cannot convert Variable to NumPy array if it has an invalid missing value #1152

jpivarski · 2022-02-10T21:57:26Z

The missing_value attribute of a Variable is a string, which means that it could be incorrectly formatted (in NumPy's opinion). For example,

>>> v["deploy_date"]
<class 'netCDF4._netCDF4.Variable'>
float32 deploy_date(traj)
    long_name: Deployment date and time
    units: seconds since 1970-01-01 00:00:00 UTC
    missing_value: -1.e+34f
    history: From deplog.dat
unlimited dimensions: 
current shape = (1,)
filling on, default _FillValue of 9.969209968386869e+36 used

>>> np.asarray(v["deploy_date"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "src/netCDF4/_netCDF4.pyx", line 3947, in netCDF4._netCDF4.Variable.__array__
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

The error happens here:

netcdf4-python/src/netCDF4/_netCDF4.pyx

Lines 4897 to 4904 in 37a4088

    
    def _check_safecast(self, attname):
        # check to see that variable attribute exists
        # can can be safely cast to variable data type.
        if hasattr(self, attname):
            att = numpy.array(self.getncattr(attname))
        else:
            return False
        atta = numpy.array(att, self.dtype)

The first numpy.array cast is okay: the attribute's string value becomes a zero-dimensional NumPy array of Unicode type. The second cast fails because -1.e+34f is not a string that NumPy can convert to np.float32. I can reproduce the same thing directly:

>>> np.array("-1.e+34f")
array('-1.e+34f', dtype='<U8')
>>> np.array(np.array("-1.e+34f"), np.float32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '-1.e+34f'

Without the trailing f, it would be fine:

>>> np.array(np.array("-1.e+34"), np.float32)
array(-1.e+34, dtype=float32)

The irony is that this is correct formatting for C code, and that must be why it was entered this way. I think a library that accepts NetCDF4 files made by other languages should probably accept those languages' number formats, at least. Admittedly, that would complicate the logic of this function.
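As a rough illustration of what accepting the C format could look like, here is a minimal sketch of a tolerant parser. The helper name parse_c_float_literal is hypothetical and not part of netCDF4 or NumPy; it only handles decimal literals with a single trailing f, F, l, or L suffix, and it uses Python's built-in float(), which rejects '-1.e+34f' the same way NumPy does.

```python
def parse_c_float_literal(text):
    """Parse a decimal float, tolerating a C-style f/F/l/L suffix.

    Hypothetical helper for illustration only; not a netCDF4 or NumPy API.
    """
    text = text.strip()
    try:
        # Plain forms such as "-1.e+34" (and "inf", "nan") parse directly.
        return float(text)
    except ValueError:
        pass
    if text[-1:] in ("f", "F", "l", "L"):
        # Drop one trailing type suffix and retry; float() raises
        # ValueError if the remainder is still not a number.
        return float(text[:-1])
    raise ValueError(f"could not convert string to float: {text!r}")


print(parse_c_float_literal("-1.e+34f"))
```

Trying float() first keeps strings like "inf" from being mangled by the suffix stripping. Hexadecimal C float literals are not covered by this sketch.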

I'm using NetCDF4 version 1.5.7 from conda-forge.

jswhit · 2022-02-12T00:11:42Z

the right way to convert a netcdf variable to a numpy array is to slice it (v["deploy_date"][:])

jpivarski · 2022-02-12T00:14:58Z

It's the same problem because that goes through the same code path:

>>> import netCDF4 as nc
>>> import numpy as np

>>> f = nc.Dataset("drifter_9911352.nc")

>>> f.variables["deploy_date"]
<class 'netCDF4._netCDF4.Variable'>
float32 deploy_date(traj)
    long_name: Deployment date and time
    units: seconds since 1970-01-01 00:00:00 UTC
    missing_value: -1.e+34f
    history: From deplog.dat
unlimited dimensions: 
current shape = (1,)
filling on, default _FillValue of 9.969209968386869e+36 used

>>> np.asarray(f.variables["deploy_date"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "src/netCDF4/_netCDF4.pyx", line 3947, in netCDF4._netCDF4.Variable.__array__
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

>>> f.variables["deploy_date"][:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

jswhit · 2022-02-12T00:21:17Z

ah, I see - as you said in your original post the missing_value attribute is a string when it should be a float32. It needs to be the same type as the variable, or at least castable to that type.

jpivarski · 2022-02-12T00:22:30Z

In this case, it doesn't get cast because the trailing "f" (valid in C) is not a number format NumPy recognizes.

jswhit · 2022-02-12T00:24:35Z

right. so I'd say this is not a bug (or if it is, the right place to fix it is in numpy)

jpivarski · 2022-02-12T00:34:10Z

Is there a specification that puts constraints on missing_value? If the NetCDF4 spec says "it can be any number format recognized as a literal in C," for instance, then I'd expect this library to do that parsing. NumPy makes no claim to recognize C's float literals, which aren't valid in Python.

>>> -1.e+34f
  File "<stdin>", line 1
    -1.e+34f
           ^
SyntaxError: invalid syntax

(Recognizing this format when Python doesn't even do it would be an odd thing to expect of NumPy.)

If, on the other hand, the NetCDF4 spec leaves missing_value's format undefined, then I don't know where to draw the line. Users could put any value they want in there. At least, then, it would be nice to be able to override it.

To be able to read these files at all, I had to use _get (and read the Cython code to understand how it works). I know that I'm not supposed to be using that method at all—it starts with an underscore—but there was no other way to get at the data. (Until somebody pointed out to me today that h5py can read it.)

jswhit · 2022-02-12T01:35:32Z

the netcdf users guide states that _FillValue (which missing_value has been deprecated in favor of) "should have the same data type as the variable it describes". The python interface checks for this, and if the _FillValue or missing_value is not the same type as the variable, it tries to cast it to that type before using it to create a masked array. I think you can work around this by turning off the auto conversion to masked arrays using var.set_auto_mask(False) before using asarray. I can take a closer look if you post the file somewhere. netcdf4-python won't even let you create a missing_value that isn't castable to the variable type so I can't create my own test.

jswhit · 2022-02-17T19:02:05Z

@jpivarski does setting var.set_auto_mask(False) solve this for you?

jpivarski · 2022-02-17T19:15:55Z

Yes, it does. After setting this parameter to False, I can get a NumPy array from the Variable via np.asarray(var) and var[:].
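For anyone landing here with the same error, the workaround can be wrapped up roughly as follows. This is a small sketch: read_without_masking is a hypothetical helper name, and var is assumed to be a netCDF4.Variable (anything with set_auto_mask and slicing would work the same way).

```python
def read_without_masking(var):
    """Return var[:] with automatic masked-array conversion switched off.

    `var` is assumed to be a netCDF4.Variable; the helper name is hypothetical.
    """
    # With auto-masking off, slicing no longer tries to cast
    # missing_value/_FillValue, which is the step that raised ValueError.
    var.set_auto_mask(False)
    return var[:]
```

With the file from this thread, usage would look like read_without_masking(f.variables["deploy_date"]). Note that the setting stays in effect on the variable afterwards, and the returned array is not masked, so any missing values have to be handled by the caller.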

jswhit · 2022-02-17T19:36:24Z

Good. I've included a fix in PR #1154 that triggers a warning that the attribute can't be cast to the variable type, instead of failing when trying to create the masked array.
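The warn-instead-of-fail behavior described above can be sketched in plain Python as follows. This is not the code from PR #1154: the function name, the warning text, and the cast parameter are assumptions for illustration, with the built-in float standing in for numpy.array(att, self.dtype).

```python
import warnings


def check_safecast(value, cast=float, attname="missing_value"):
    """Return True if `value` can be cast; warn and return False otherwise.

    Illustrative stand-in only; `cast=float` replaces the NumPy cast
    used by the real library.
    """
    try:
        cast(value)
    except (ValueError, TypeError):
        # Report the problem but let the caller carry on without masking.
        warnings.warn(
            f"{attname} {value!r} cannot be cast to the variable type; ignoring it"
        )
        return False
    return True
```

A caller would then skip the mask for that attribute when the function returns False, rather than propagating the exception.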

jpivarski · 2022-02-17T19:38:42Z

Great, that should do it, thanks!

jswhit added a commit that referenced this issue Feb 17, 2022

fix for issue #1152 (don't fail if missing_value can't be cast)

5b1a69b

jswhit added a commit that referenced this issue Feb 17, 2022

add test for issue #1152

48d81da

jswhit closed this as completed Feb 18, 2022
