Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bitinfo codec #503

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

thodson-usgs
Copy link

@thodson-usgs thodson-usgs commented Jan 28, 2024

Add a new bitinfo codec, which reimplements a numpy-based version of the bitrounding algorithm from the Julia package BitInformation.jl (and the Python package xbitinfo). When used with ds.to_zarr, the codec would compute the real information chunk-wise, which yields better results for data with variable information content.

import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")

from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

The number of bits of the mantissa to keep. The range allowed
depends on the dtype input data. If keepbits is
equal to the maximum allowed for the data type, this is equivalent
to no transform.
to no transform. Alternatively, pass a function to determine the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an example of where such a bitinformation function may be found?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namely, xbitinfo, which was the impetus for bitround codec. More recently, that group has advocated for computing bitinformation chunk by chunk, using something like

fn = 'air.zarr'
ds.to_compressed_zar(fn, compute=False, mode='w')

dims = ds.air.dims
len_dims = len(dims)
slices = slices_from_chunks(ds.air.chunks)

for b, block in enumerate(ds.air.data.to_delayed().ravel()):
  ds_block = xr.Dataset({'air':(dims, block.compute())})
  rounded_ds = bitrounding(ds_block)
  rounded_ds.to_zarr(fn, region={dims[d]:s for (d,s) in enumerate(slices[b])})

Modifying the codec could make chunk-wise bitrounding workflows much simpler, and my proposed change would be the minimal first step. The passed function only applies to the encoder, so it shouldn't introduce dependencies or vulnerabilities for the decoder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting information like this should go into the docstring, so that users can act on it

@thodson-usgs
Copy link
Author

Thanks, @martindurant, I'll work on example, but as I dig into this, I realize that we can't pass a function in this way: https://github.com/zarr-developers/zarr-python/blob/main/zarr/util.py#L56-64
will lead to

TypeError: Object of type function is not JSON serializable

I'd hoped this would work without modifying zarr-python, but for now, I'll go ahead and make a small modification and continue testing.

@martindurant
Copy link
Member

Ah, the default serialisation of the codecs, as dumped into the zarr metadata, looks at the __dict__ of the instance, and so is catching the function. Prefixing the attribute with "_" would solve this, or you could override get_config. There is actually a good argument for wanting to serialise the full state, since you can create an array without filling it or fill only part and come back to it later. I don't know how you would approach that without storing some indicator of the function in the metadata.

@thodson-usgs
Copy link
Author

Those are both great suggestions. I'll advance one more: add a stripped down xbitinfo to the bitround codec. The implementation looks fairly minimal. Let's see what comes from observingClouds/xbitinfo/issues/257 then reassess.

@joshmoore
Copy link
Member

A heads up that depending on custom code will impact the portability of the codec across language implementations.

@thodson-usgs
Copy link
Author

thodson-usgs commented Jan 30, 2024

observingClouds/xbitinfo/issues/257: their preference is my original suggestion. Potentially yielding something like:

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

@thodson-usgs
Copy link
Author

@martindurant, I tried your suggestion about prefixing, but I'm getting the general impression this approach won't work.
In this example

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

the object created by BitRound(helper_function) isn't actually passed through to the filtering step, only the dictionary. The filter is recreated from that dictionary before it is applied. In other words, if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

If that impression is correct, then reimplementing a basic bitinformation algorithm in numcodecs might be the simplest route.

@martindurant
Copy link
Member

if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

Ah, sorry

@pep8speaks
Copy link

pep8speaks commented Feb 2, 2024

Hello @thodson-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 94:1: E402 module level import not at top of file

Comment last updated at 2024-04-30 21:18:13 UTC

@thodson-usgs
Copy link
Author

Rather than including this in bitround, lets create a bitinfo codec that calls bitround. I've drafted the PR and solicited feedback from the xbitinfo devs. I will begin writing tests.

@thodson-usgs thodson-usgs changed the title WIP: Enable adaptive bitrounding (seeking feedback) WIP: Add bitinfo codec Feb 2, 2024
@thodson-usgs thodson-usgs changed the title WIP: Add bitinfo codec Add bitinfo codec Feb 5, 2024
@thodson-usgs
Copy link
Author

thodson-usgs commented Mar 1, 2024

@observingClouds, @martindurant
As my last commit indicates, I've reverted the codec to match the functionality of the xbitinfo implementation (albeit all computation will be done chunk-wise when called through ds.to_zarr()) I'm happy to continue working across libraries to improve functionality and reduce duplication; however, further development needs to come from xbitinfo, not the codec. In the meantime, I do not plan any further changes to this PR except any additional tests or bug fixes as needed.

@thodson-usgs
Copy link
Author

@martindurant, sorry to bother. Would you run these checks, or have I missed something? Thanks

@martindurant
Copy link
Member

I have no idea why the checks aren't running :|

@thodson-usgs
Copy link
Author

Maybe it's waiting for me to tick all the boxes? Seems a bit of a catch 22, but I'll give it a try.

@thodson-usgs
Copy link
Author

Let me know if this fixed it

git commit --amend --no-edit
git push --force-with-lease

otherwise, I'll open a new PR

@thodson-usgs
Copy link
Author

Windows and OSX builds failing with:

900+ lines of errors and warnings...

  × Building editable for numcodecs (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /usr/local/miniconda/envs/env/bin/python /usr/local/miniconda/envs/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_editable /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp4l4qy_42
  cwd: /Users/runner/work/numcodecs/numcodecs
  Building editable for numcodecs (pyproject.toml): finished with status 'error'
  ERROR: Failed building editable for numcodecs
Failed to build numcodecs
ERROR: Could not build wheels for numcodecs, which is required to install pyproject.toml-based projects
Error: Process completed with exit code 1.

I'm not sure what happened here, but I'll try to replicate this locally.

@thodson-usgs
Copy link
Author

thodson-usgs commented Apr 23, 2024

The Windows build failed because I had a test using dtype=float128, but I don't understand why OSX is failing to install numcodecs with python>=3.10. Can we run these one more time before I go digging into OSX?

@martindurant
Copy link
Member

Indeed, just osx left.

@thodson-usgs
Copy link
Author

Added a usage example and rebased. I have still not resolved why the OSX 3.10 build failed. Hopefully later this week.

@thodson-usgs
Copy link
Author

oh darn, my doc example gets tests too! 😞

@thodson-usgs
Copy link
Author

Well, this passed just fine in a clean, local environment.

But the GitHub runner is erroring with

...
043     >>> import xarray as xr
044     >>> ds = xr.tutorial.open_dataset("air_temperature")
045     >>> from numcodecs import Blosc, BitInfo
046     >>> compressor = Blosc(cname="zstd", clevel=3)
047     >>> filters = [BitInfo(info_level=0.99)]
048     >>> encoding = {"air": {"compressor": compressor, "filters": filters}}
049     >>> _ = ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
UNEXPECTED EXCEPTION: ImportError("cannot import name 'MutableMapping' from 'collections' (/usr/share/miniconda/envs/env/lib/python3.11/collections/__init__.py)")
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/env/lib/python3.11/doctest.py", line 1355, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest numcodecs.bitinfo.BitInfo[6]>", line 1, in <module>
  File "/usr/share/miniconda/envs/env/lib/python3.11/site-packages/xarray/core/dataset.py", line 2520, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...

Unless there is an obvious fix, I will keep the usage example but remove it from the doctest, then run some test builds on OSX, then resubmit.

@martindurant
Copy link
Member

I suggest you remove the example, perhaps marking it as doctest skip.

@thodson-usgs
Copy link
Author

(investigating local OSX build now)

@thodson-usgs
Copy link
Author

@martindurant, the problem with OSX seems to be a broader issue.
Note the previous commit to main is also failing (#527).
Perhaps we rerun the tests and proceed with any other issues, while we wait for an OSX fix.

thodson-usgs and others added 15 commits June 3, 2024 11:28
The xbitinfo implementation uses a tolerance factor of 1.5. I lowered the tolerance to 1.1, because I was getting poor results with my test data, but it was pointed out that the problem was that my test data were quantized. Quantization is an open issue with xbitinfo too, and we should address it first there, then patch the codec.
@thodson-usgs
Copy link
Author

rebased on main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants