Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsspec v2022.10.0 breaks MultiZarrToZarr #246

Closed
arongergely opened this issue Oct 27, 2022 · 6 comments · Fixed by fsspec/filesystem_spec#1087
Closed

fsspec v2022.10.0 breaks MultiZarrToZarr #246

arongergely opened this issue Oct 27, 2022 · 6 comments · Fixed by fsspec/filesystem_spec#1087

Comments

@arongergely
Copy link

arongergely commented Oct 27, 2022

My environment:
Ubuntu Jammy, python 3.8

I am creating virtual zarr files from NetCDF4s and unifying them into a single virtual Zarr as follows:

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


def create_virtual_zarr(nc_filepath):
    """
    Parses NetCDF4 file into a virtual Zarr.
    """
    open_args = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')
    with fsspec.open(nc_filepath, **open_args) as f:
        h5chunks = SingleHdf5ToZarr(f, nc_filepath, inline_threshold=-1)
        single_file_json = h5chunks.translate()
    return single_file_json


filepaths = ['./file1.nc', './file2.nc', './file3.nc']

#create virtual zarrs
virtual_zarrs = list(map(create_virtual_zarr, filepaths))

# combine all virtual Zarr files into one
single_virtual_zarr = MultiZarrToZarr(
    virtual_zarrs,
    remote_protocol='file',
    concat_dims=['time'],
    identical_dims=['y', 'x']
).translate()

This worked with with kerchunk v.0.0.9 and ffspec 2022.8.2.

However when upgrading fsspec to 2022.10.0 I get an error when calling MultiZarrToZarr(...).translate():

Traceback (most recent call last):
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 635, in <module>
    main()
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 632, in main
    forecast_statistics(None, proc_params)
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 373, in forecast_statistics
    single_virtual_zarr = MultiZarrToZarr(
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 468, in translate
    self.second_pass()
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 452, in second_pass
    bits = fs.cat(list(to_download.values()))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/reference.py", line 345, in cat
    bytes_out = fs.cat_ranges(new_paths, new_starts, new_ends)
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 765, in cat_ranges
    out.append(self.cat_file(p, s, e))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 719, in cat_file
    return f.read(end - f.tell())
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/local.py", line 337, in read
    return self.f.read(*args, **kwargs)
ValueError: read length must be non-negative or -1

Process finished with exit code 1

Dove into the code and the culprit seems to be some changes to ReferenceFileSystem in fsspec : fsspec/filesystem_spec#1063

When ReferenceFileSystem.cat() gets called within MultiZarrToZarr.second_pass() , the start/end positions of datasets are not preserved correctly, leading to some 0 length data.

During debug the erratic starts/ends were introduced by this subroutine due to a sorting error https://github.com/fsspec/filesystem_spec/blob/2022.10.0/fsspec/implementations/reference.py#L337-L344

@martindurant
Copy link
Member

Can you be more specific about what this sorting error is? The merge offsets function is regularly used by parquet reads, so I really hope it isn't broken!

@arongergely
Copy link
Author

@martindurant merge_offset_ranges() is called with sort=False, so it assumes that the start/end lists are already sorted. But in my case the within MultiZarrToZarr.second_pass() they they seem to be not or reverse sorted.

I had limited time to debug so far, this where I am at

@martindurant
Copy link
Member

Oh I see. So sorting the lists before this step should be enough to solve this.

@martindurant
Copy link
Member

Do you have time to test whether switching the arg to sort=True s enough?

@arongergely
Copy link
Author

arongergely commented Oct 27, 2022

Oh I see. So sorting the lists before this step should be enough to solve this.

Yes hopefully. Not through with debug yet.

Do you have time to test whether switching the arg to sort=True s enough?

Yes, will post an update here

@arongergely
Copy link
Author

Yep, sort=True solves it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants