
GeoArrowEngine error when reading Parquet files #241

Open
darribas opened this issue Mar 16, 2023 · 5 comments

@darribas

I am trying to read this dataset:

https://github.com/urbangrammarai/signatures_gb

I cloned it locally (the repo is about 50GB) and work in an environment created by:

conda create -n alpha dask-geopandas pyogrio ipykernel dask[distributed]

I load up libraries:

import geopandas
import dask_geopandas
from dask.distributed import LocalCluster, Client

client = Client(LocalCluster())

And then, I try to lazily read the dataset:

etcs = dask_geopandas.read_parquet(
    (
        '/home/jovyan/data/spatial_signatures'
        '/signatures_gb/form'
    )
)

Which returns:

Full error message
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/backends.py:133, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    132 try:
--> 133     return func(*args, **kwargs)
    134 except Exception as e:

File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:513, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
    512 # Extract global filesystem and paths
--> 513 fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
    514     path,
    515     filesystem,
    516     dataset_options,
    517     open_file_options,
    518     storage_options,
    519 )
    520 read_options["open_file_options"] = open_file_options

AttributeError: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 etcs = dask_geopandas.read_parquet(
      2     (
      3         '/home/jovyan/data/spatial_signatures'
      4         '/signatures_gb/form'
      5     )
      6 )

File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:111, in read_parquet(*args, **kwargs)
    110 def read_parquet(*args, **kwargs):
--> 111     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    112     # check if spatial partitioning information was stored
    113     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/backends.py:135, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    133     return func(*args, **kwargs)
    134 except Exception as e:
--> 135     raise type(e)(
    136         f"An error occurred while calling the {funcname(func)} "
    137         f"method registered to the {self.backend} backend.\n"
    138         f"Original Message: {e}"
    139     ) from e

AttributeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'

A couple of questions:

  1. Do you have any idea what's going on, and how to work around it?
  2. This is slightly off-topic for this issue, but while I'm at it: is it possible to read the dataset over the wire, directly from the repo, downloading only the partitions required for computation? (A rough sketch of what I mean is below.)
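For question 2, a sketch of what reading straight from the repo could look like, using fsspec's GitHub filesystem. This is untested: the github:// URL form shown, the main branch name, and whether only the partitions a computation touches actually get downloaded are all assumptions.

import dask_geopandas

# Sketch only: fsspec resolves github:// URLs of the form
# github://{org}:{repo}@{branch_or_sha}/{path}; "main" and the
# "form" subdirectory are assumed from the local clone layout.
etcs = dask_geopandas.read_parquet(
    "github://urbangrammarai:signatures_gb@main/form"
)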
@TomAugspurger
Contributor

Can you share the versions of dask and dask-geopandas you're using?

I can't reproduce it with this simple example, regardless of whether I create a client / LocalCluster.

In [15]: import dask_geopandas, geopandas

In [16]: df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)

In [17]: df.to_parquet("/tmp/out.parquet")

In [18]: dask_geopandas.read_parquet("/tmp/out.parquet/").compute()
Out[18]:
   BoroCode       BoroName     Shape_Leng    Shape_Area                                           geometry
0         5  Staten Island  330470.010332  1.623820e+09  MULTIPOLYGON (((970217.022 145643.332, 970227....
1         4         Queens  896344.047763  3.045213e+09  MULTIPOLYGON (((1029606.077 156073.814, 102957...
2         3       Brooklyn  741080.523166  1.937479e+09  MULTIPOLYGON (((1021176.479 151374.797, 102100...
3         1      Manhattan  359299.096471  6.364715e+08  MULTIPOLYGON (((981219.056 188655.316, 980940....
4         2          Bronx  464392.991824  1.186925e+09  MULTIPOLYGON (((1012821.806 229228.265, 101278...

That's with dask 2023.3.1 and dask-geopandas main.

@jorisvandenbossche
Member

This extract_filesystem method was added relatively recently (dask/dask#9699), but our GeoArrowEngine subclasses the dask engine, so I would expect that we just inherit that method.

@darribas
Author

darribas commented Mar 24, 2023

Can you share the versions of dask and dask-geopandas you're using?

>>> geopandas.__version__
'0.12.2'
>>> dask_geopandas.__version__
'v0.3.0'
>>> dask.__version__
'2023.1.1'

To be clear, those are the versions conda/mamba picks when I build the environment as described above.

I get the following related error:

Full error message
df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)

df.to_parquet("/tmp/out.parquet")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 3
      1 df = dask_geopandas.from_geopandas(geopandas.read_file(geopandas.datasets.get_path("nybb")), npartitions=2)
----> 3 df.to_parquet("/tmp/out.parquet")

File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask_geopandas/core.py:617, in GeoDataFrame.to_parquet(self, path, *args, **kwargs)
    614 """See dask_geopadandas.to_parquet docstring for more information"""
    615 from .io.parquet import to_parquet
--> 617 return to_parquet(self, path, *args, **kwargs)

File /opt/conda/envs/alpha/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:940, in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)
    931     raise ValueError(
    932         "User-defined key/value metadata (custom_metadata) can not "
    933         "contain a b'pandas' key.  This key is reserved by Pandas, "
    934         "and overwriting the corresponding value can render the "
    935         "entire dataset unreadable."
    936     )
    938 # Engine-specific initialization steps to write the dataset.
    939 # Possibly create parquet metadata, and load existing stuff if appending
--> 940 i_offset, fmd, metadata_file_exists, extra_write_kwargs = engine.initialize_write(
    941     df,
    942     fs,
    943     path,
    944     append=append,
    945     ignore_divisions=ignore_divisions,
    946     partition_on=partition_on,
    947     division_info=division_info,
    948     index_cols=index_cols,
    949     schema=schema,
    950     custom_metadata=custom_metadata,
    951     **kwargs,
    952 )
    954 # By default we only write a metadata file when appending if one already
    955 # exists
    956 if append and write_metadata_file is None:

AttributeError: type object 'GeoArrowEngine' has no attribute 'initialize_write'

@TomAugspurger
Contributor

Strange. I can't reproduce that using a new conda env with your commands.

As Joris says, this doesn't make sense because dask-geopandas inherits from the dask Arrow engine, so it must have the method.

@jtmiclat
Contributor

Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import:

try:
    # pyarrow is imported here, but is an optional dependency
    from dask.dataframe.io.parquet.arrow import (
        ArrowDatasetEngine as DaskArrowDatasetEngine,
    )
except ImportError:
    DaskArrowDatasetEngine = object

I think some envs pull in pyarrow by default, so you really need a clean env to test this. A solution would be to raise an import error (or warning) when GeoArrowEngine is used and pyarrow was not properly imported; a sketch of such a guard follows.
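A minimal sketch of what that guard could look like (hypothetical, not the actual dask-geopandas code):

import dask.dataframe as dd

try:
    from dask.dataframe.io.parquet.arrow import (
        ArrowDatasetEngine as DaskArrowDatasetEngine,
    )
except ImportError:
    DaskArrowDatasetEngine = None  # pyarrow is missing


class GeoArrowEngine(DaskArrowDatasetEngine or object):
    """Geo-aware parquet engine; a bare stub when pyarrow is absent."""


def read_parquet(*args, **kwargs):
    # Fail loudly up front instead of letting dask call engine hooks
    # (extract_filesystem, initialize_write, ...) that the stub lacks.
    if DaskArrowDatasetEngine is None:
        raise ImportError(
            "pyarrow is required to read parquet files with dask-geopandas"
        )
    return dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)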

To reiterate:

This fails:

pip install dask dask-geopandas

This works:

pip install dask dask-geopandas pyarrow
# or
pip install dask[complete] dask-geopandas
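One quick way to confirm the broken inheritance in an affected environment (a diagnostic, not a fix):

from dask_geopandas.io.parquet import GeoArrowEngine

# In a healthy env the MRO includes dask's ArrowDatasetEngine; in a
# pyarrow-less env the only base is ``object`` and hooks like
# extract_filesystem / initialize_write are missing.
print(GeoArrowEngine.__mro__)
print(hasattr(GeoArrowEngine, "extract_filesystem"))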
