Cannot convert Variable to NumPy array if it has an invalid missing value #1152

Closed
jpivarski opened this issue Feb 10, 2022 · 11 comments

Comments

@jpivarski

The missing_value attribute of a Variable is a string, which means that it could be incorrectly formatted (in NumPy's opinion). For example,

>>> v["deploy_date"]
<class 'netCDF4._netCDF4.Variable'>
float32 deploy_date(traj)
    long_name: Deployment date and time
    units: seconds since 1970-01-01 00:00:00 UTC
    missing_value: -1.e+34f
    history: From deplog.dat
unlimited dimensions: 
current shape = (1,)
filling on, default _FillValue of 9.969209968386869e+36 used

>>> np.asarray(v["deploy_date"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "src/netCDF4/_netCDF4.pyx", line 3947, in netCDF4._netCDF4.Variable.__array__
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

The error happens here:

def _check_safecast(self, attname):
    # check to see that variable attribute exists
    # can can be safely cast to variable data type.
    if hasattr(self, attname):
        att = numpy.array(self.getncattr(attname))
    else:
        return False
    atta = numpy.array(att, self.dtype)

The first numpy.array call is fine: the attribute's string value becomes a zero-dimensional NumPy array with a Unicode dtype. The second fails because -1.e+34f is not a string that NumPy can parse as an np.float32. I can reproduce the same thing directly:

>>> np.array("-1.e+34f")
array('-1.e+34f', dtype='<U8')
>>> np.array(np.array("-1.e+34f"), np.float32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '-1.e+34f'

Without the trailing f, it would be fine:

>>> np.array(np.array("-1.e+34"), np.float32)
array(-1.e+34, dtype=float32)

The irony is that this is correct formatting for C code, and that must be why it was entered this way. I think a library that accepts NetCDF4 files made by other languages should probably accept those languages' number formats, at least. Admittedly, that would complicate the logic of this function.
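As a rough illustration of what accepting C-style literals could involve, here is a minimal sketch. The helper name parse_c_float_literal is hypothetical and is not part of netCDF4 or NumPy. It assumes that only decimal literals with a trailing f, F, l, or L suffix need handling and does not attempt hexadecimal floats.

```python
import re

# A C floating-point literal may end in f/F (float) or l/L (long double).
# The part before the suffix must end in a digit or a dot, so words such as
# "nanf" are not silently accepted.
_C_FLOAT_SUFFIX = re.compile(r"^(?P<number>.*[0-9.])[fFlL]$")


def parse_c_float_literal(text):
    """Parse a decimal float string that may carry a C type suffix.

    Hypothetical helper for illustration only.
    """
    text = text.strip()
    try:
        # Plain Python/NumPy-style floats ("-1.e+34", "inf") parse directly.
        return float(text)
    except ValueError:
        pass
    match = _C_FLOAT_SUFFIX.match(text)
    if match is None:
        raise ValueError(f"not a recognized float literal: {text!r}")
    # Retry without the suffix; this still raises ValueError for input
    # that is not a decimal float.
    return float(match.group("number"))


print(parse_c_float_literal("-1.e+34f"))
```

The returned Python float could then be cast to the variable's dtype, for example with np.float32(...), in place of the direct string-to-float32 conversion that fails above.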

I'm using NetCDF4 version 1.5.7 from conda-forge.

@jswhit
Collaborator

jswhit commented Feb 12, 2022

the right way to convert a netcdf variable to a numpy array is to slice it (v["deploy_date"][:])

@jpivarski
Author

It's the same problem because that goes through the same code path:

>>> import netCDF4 as nc
>>> import numpy as np

>>> f = nc.Dataset("drifter_9911352.nc")

>>> f.variables["deploy_date"]
<class 'netCDF4._netCDF4.Variable'>
float32 deploy_date(traj)
    long_name: Deployment date and time
    units: seconds since 1970-01-01 00:00:00 UTC
    missing_value: -1.e+34f
    history: From deplog.dat
unlimited dimensions: 
current shape = (1,)
filling on, default _FillValue of 9.969209968386869e+36 used

>>> np.asarray(f.variables["deploy_date"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "src/netCDF4/_netCDF4.pyx", line 3947, in netCDF4._netCDF4.Variable.__array__
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

>>> f.variables["deploy_date"][:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/netCDF4/_netCDF4.pyx", line 4445, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 4514, in netCDF4._netCDF4.Variable._toma
  File "src/netCDF4/_netCDF4.pyx", line 4801, in netCDF4._netCDF4.Variable._check_safecast
ValueError: could not convert string to float: '-1.e+34f'

@jswhit
Collaborator

jswhit commented Feb 12, 2022

ah, I see - as you said in your original post the missing_value attribute is a string when it should be a float32. It needs to be the same type as the variable, or at least castable to that type.

@jpivarski
Author

In this case, it doesn't get cast because the trailing "f" (valid in C) is not a number format NumPy recognizes.

@jswhit
Collaborator

jswhit commented Feb 12, 2022

right. so I'd say this is not a bug (or if it is, the right place to fix it is in numpy)

@jpivarski
Author

Is there a specification that puts constraints on missing_value? If the NetCDF4 spec says "it can be any number format recognized as a literal in C," for instance, then I'd expect this library to do that parsing. NumPy makes no claim to recognize C's float literals, which aren't valid in Python.

>>> -1.e+34f
  File "<stdin>", line 1
    -1.e+34f
           ^
SyntaxError: invalid syntax

(Recognizing this format when Python doesn't even do it would be an odd thing to expect of NumPy.)

If, on the other hand, the NetCDF4 spec leaves missing_value's format undefined, then I don't know where to draw the line. Users could put any value they want in there. At least, then, it would be nice to be able to override it.

To be able to read these files at all, I had to use _get (and read the Cython code to understand how it works). I know that I'm not supposed to be using that method at all—it starts with an underscore—but there was no other way to get at the data. (Until somebody pointed out to me today that h5py can read it.)

@jswhit
Collaborator

jswhit commented Feb 12, 2022

the netcdf users guide states that _FillValue (which missing_value has been deprecated in favor of) "should have the same data type as the variable it describes". The python interface checks for this, and if the _FillValue or missing_value is not the same type as the variable, it tries to cast it to that type before using it to create a masked array. I think you can work around this by turning off the auto conversion to masked arrays using var.set_auto_mask(False) before using asarray. I can take a closer look if you post the file somewhere. netcdf4-python won't even let you create a missing_value that isn't castable to the variable type so I can't create my own test.

@jswhit
Collaborator

jswhit commented Feb 17, 2022

@jpivarski does setting var.set_auto_mask(False) solve this for you?

@jpivarski
Author

Yes, it does. After setting this parameter to False, I can get a NumPy array from the Variable via np.asarray(var) and var[:].
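For anyone hitting the same error, the workaround can be sketched roughly as follows. The function name read_unmasked is made up for this example. The file and variable names in the usage comment are the ones from this thread. The imports are placed inside the function only so that the snippet can be pasted where netCDF4 is not installed.

```python
def read_unmasked(path, name):
    """Read one variable as a plain NumPy array, skipping masked-array conversion.

    Hypothetical helper for illustration; not part of netCDF4.
    """
    # Lazy imports: an assumption made for portability of this sketch.
    import netCDF4
    import numpy as np

    with netCDF4.Dataset(path) as ds:
        var = ds.variables[name]
        # With auto-masking off, the library does not need to cast
        # missing_value to the variable's dtype, so the failing check
        # is never reached.
        var.set_auto_mask(False)
        return np.asarray(var[:])


# Usage with the file from this thread:
# data = read_unmasked("drifter_9911352.nc", "deploy_date")
```

Note that with masking turned off, any missing values in the data come back as ordinary numbers, so they have to be identified by hand if that matters.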

@jswhit
Collaborator

jswhit commented Feb 17, 2022

Good. I've included a fix in PR #1154 that triggers a warning that the attribute can't be cast to the variable type, instead of failing when trying to create the masked array.
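The general warn-instead-of-fail pattern described here can be sketched as below. This is not the code from the PR: the function name try_cast_attribute, its cast parameter, and the warning text are all assumptions made for illustration, and plain Python float stands in for the variable's dtype.

```python
import warnings


def try_cast_attribute(raw_value, cast=float):
    """Try to cast an attribute value; warn and report failure instead of raising.

    Hypothetical helper for illustration. Returns (ok, value).
    """
    try:
        return True, cast(raw_value)
    except (TypeError, ValueError):
        warnings.warn(
            f"attribute value {raw_value!r} cannot be cast to the variable "
            "type and will be ignored when masking",
            stacklevel=2,
        )
        return False, None


print(try_cast_attribute("-1.e+34"))   # castable value
print(try_cast_attribute("-1.e+34f"))  # emits a warning, returns (False, None)
```

A caller would skip the mask for that attribute when ok is False, so the data can still be read.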

@jpivarski
Author

Great, that should do it, thanks!

jswhit added a commit that referenced this issue Feb 17, 2022
@jswhit jswhit closed this as completed Feb 18, 2022