
Filter error: bad id or parameters or duplicate filter #1205

Open
sval-dev opened this issue Nov 9, 2022 · 14 comments

Comments


sval-dev commented Nov 9, 2022

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue
    Version 1.6.1

  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.)
    Python 3.9.13 on Linux x86_64

  • a description of the issue with the steps needed to reproduce it
    When attempting to use zlib compression on str fields, an error is generated rather than working as expected:

Traceback (most recent call last):
  File "/app/som_resampler/som_resampler/tmp.txt", line 5, in <module>
    writer.createVariable("tmp", str, ('Block_Dim',), **comp_kwds)
  File "src/netCDF4/_netCDF4.pyx", line 2962, in netCDF4._netCDF4.Dataset.createVariable
  File "src/netCDF4/_netCDF4.pyx", line 4207, in netCDF4._netCDF4.Variable.__init__
  File "src/netCDF4/_netCDF4.pyx", line 2028, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Filter error: bad id or parameters or duplicate filter

Test case:

import netCDF4
writer = netCDF4.Dataset("/tmp/tmp.nc", "w")
comp_kwds = dict(zlib=True, shuffle=True, complevel=5)
writer.createDimension('Block_Dim', 180)
writer.createVariable("tmp", str, ('Block_Dim',), **comp_kwds)

Setting compression='zlib' instead of zlib=True has the same effect.
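Until this is resolved, one workaround is to pass compression keywords only for fixed-size datatypes. A minimal sketch (the helper name `comp_kwds_for` is my own, not part of netCDF4):

```python
def comp_kwds_for(datatype, zlib=True, shuffle=True, complevel=5):
    """Return compression kwargs for createVariable, or an empty dict
    for the Python str type (netCDF4 maps str to variable-length
    NC_STRING, which HDF5 cannot compress)."""
    if datatype is str:
        return {}
    return dict(zlib=zlib, shuffle=shuffle, complevel=complevel)

# Hypothetical usage with the test case above:
#   writer.createVariable("tmp", str, ('Block_Dim',), **comp_kwds_for(str))
```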

jswhit (Collaborator) commented Nov 9, 2022

Unfortunately, you can't use compression on variable-length string variables.


sval-dev commented Nov 9, 2022

It works fine for version 1.5.8 though, so this seems like a regression given that previously working code is now broken.

Can you clarify whether this is expected?

jswhit commented Nov 10, 2022

The HDF5 library does not support compression of variable length datatypes. The lib may not have returned an error code in earlier versions, but I don't think any compression was actually done.

See the discussion at https://forum.hdfgroup.org/t/compression-in-variable-length-datasets-not-working/1276

@sval-dev

That discussion seems to say that the filters are properly applied to variable-length datatypes, but that only the pointers are compressed, which doesn't result in appreciable compression gains.

I would argue that is still preferable to having working code get broken by an update to the library.

We have code that tries to apply zlib compression to all variables it creates, and it worked fine until the update.

It's straightforward for us to work around now that we know the issue, but this library has many users, and a breaking change on update that could be avoided (by updating the library code) seems undesirable.

Perhaps this can be changed to a warning instead (or otherwise patched) so that existing code isn't broken?
Would stripping compression params for VLType or str datatypes in the createVariable wrapper with a warning message make sense?

Is this something that I should try to address with the C library instead?
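To illustrate the suggestion, here is a rough sketch of what such stripping could look like (the helper and its keyword list are assumptions for illustration, not existing netCDF4 code):

```python
import warnings

# Keywords that request compression in createVariable (illustrative list).
_COMPRESSION_KWDS = ("zlib", "compression", "complevel", "shuffle",
                     "fletcher32")

def strip_compression_for_vlen(datatype, kwds, vltype_classes=()):
    """Drop compression-related keywords when the datatype is
    variable-length (str or a VLType instance), warning instead of
    letting netcdf-c raise a filter error."""
    is_vlen = datatype is str or isinstance(datatype, tuple(vltype_classes))
    if not is_vlen:
        return dict(kwds)
    stripped = {k: v for k, v in kwds.items() if k not in _COMPRESSION_KWDS}
    if len(stripped) != len(kwds):
        warnings.warn("compression is not supported for variable-length "
                      "datatypes; ignoring compression keywords")
    return stripped
```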

jswhit commented Nov 10, 2022

It's a netcdf-c lib issue - at some point it started to return a non-zero error code. We could just ignore the error code, but I'm not sure the variable would be correctly initialized.

jswhit commented Nov 10, 2022

Here is the relevant netcdf-c PR: Unidata/netcdf-c#2231. This went into version 4.9.0.


sval-dev commented Nov 10, 2022

Thanks for the additional info! Given the upstream PR, it seems like this would be a wontfix upstream.

Would you be willing to entertain a PR implementing the suggestion I made above about changing the Variable class (instantiated by createVariable) in src/netCDF4/_netCDF4.pyx to log a warning and strip compression/zlib arguments before proceeding for str and VLType?

This would allow old code to continue working as it was even though upstream has made the breaking change.

jswhit commented Nov 10, 2022

There are other types that compression will not work for besides VLTypes - seems like if this warning were added it should be more general.


sval-dev commented Nov 11, 2022

I see the following:
The datatype can be a numpy datatype object, or a string that describes
a numpy dtype object (like the dtype.str attribute of a numpy array).
Supported specifiers include: 'S1' or 'c' (NC_CHAR), 'i1' or 'b' or 'B' (NC_BYTE), 'u1' (NC_UBYTE), 'i2' or 'h' or 's' (NC_SHORT), 'u2' (NC_USHORT), 'i4' or 'i' or 'l' (NC_INT), 'u4' (NC_UINT), 'i8' (NC_INT64), 'u8' (NC_UINT64), 'f4' or 'f' (NC_FLOAT), 'f8' or 'd' (NC_DOUBLE).
datatype can also be a CompoundType instance
(for a structured, or compound array), a VLType instance
(for a variable-length array), or the python str builtin
(for a variable-length string array). Numpy string and unicode datatypes with
length greater than one are aliases for str.

From the above, it seemed like VLType and str were the variable length types but if other types should be included that seems reasonable to me.

Which other (variable-length) types should be checked for?

And to confirm, does the general idea seem sound and would you be open to a PR implementing this?

jswhit commented Nov 11, 2022

Yes, I would be open to a PR. Have to look at the HDF5 docs to see what datatypes support compression. Certainly all the primitive datatypes, not sure about Compound or Enum.


sval-dev commented Nov 11, 2022

I'm not sure what you mean here. I think all of the datatypes accept compression filters, but the error only pops up for variable-length types.

The error is thrown in libdispatch/dfilter.c, and the function that determines whether a variable is variable length is NC4_inq_type_fixed_size() over at /libsrc4/nc4type.c

There we see that variable length fields are either strings, VLType, or Compound types which themselves contain either a string or VLType.

You can also see there that ENUM and the primitive datatypes (i.e. xtype < NC_STRING) are not variable length and wouldn't trigger the error message.
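As a rough Python model of that C logic (the type tags and the compound-as-list representation here are my own, purely for illustration):

```python
def is_fixed_size(dtype):
    """Toy model of netcdf-c's NC4_inq_type_fixed_size(): strings and
    VLEN types are variable length, and a compound type (modeled here
    as a list of member dtypes) is variable length if any member is.
    Enums and primitive types are fixed size."""
    if isinstance(dtype, list):                  # compound: check members
        return all(is_fixed_size(m) for m in dtype)
    return dtype not in ("string", "vlen")
```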

jswhit commented Nov 11, 2022

OK - I wasn't sure if Enum and Compound types supported compression but I guess they do. So you just need to check for variable-length types.

@sval-dev

Thanks for the feedback!

There is now a discussion on this over at Unidata/netcdf-c#2554, and it would be cleaner to take this from upstream if possible, so I'll try to follow up there and update here on resolution.


sval-dev commented Aug 2, 2023

The discussion resulted in a PR upstream that addresses this issue, although there is still some discussion on further improvements.

Given the fix upstream, I think this may be safe to close. However, several versions of the C library will still generate the reported error when used with this Python wrapper, so a fix here might still be independently useful for users on those versions.

I'm still happy to create a PR here implementing what was proposed in #1205 (comment) if it would be helpful, but otherwise it seems like we can close this directly.
