ENH: support writing and filtered reading from bbox columns in GeoParquet #3282
Conversation
Realised that I have enforced …
To appropriately navigate the filtering of … I think I actually need to start with a refactor to bring the metadata/schema reading and validation to the front in both …
Force-pushed from adef50e to d817d6b
Thanks a lot for working on this! Didn't yet take a detailed look, but just a few quick comments.
geopandas/io/arrow.py (outdated)
@@ -78,6 +78,9 @@ def _create_metadata(df, schema_version=None):
    schema_version : {'0.1.0', '0.4.0', '1.0.0-beta.1', '1.0.0', None}
        GeoParquet specification version; if not provided will default to
        latest supported version.
    bbox_column_name : str, default None
I am not sure if it is needed to give the user control over the exact name (although the geoparquet spec allows it). This could also be a boolean flag to enable writing it or not.
geopandas/io/arrow.py (outdated)
geometry_bbox = df.bounds.rename(
    OrderedDict(
        [("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
    ),
    axis=1,
)
df[bbox_column_name] = geometry_bbox.to_dict("records")
I think this can be done more efficiently, avoiding going through Python dictionaries. We can create a pyarrow StructArray manually (and add it as a column to the converted table) using pyarrow.StructArray.from_arrays. Something like:

bounds = df.geometry.bounds
bbox_col = pa.StructArray.from_arrays([bounds["minx"], ...], names=["xmin", ...])
geopandas/io/arrow.py (outdated)
# read_metadata does not accept a filesystem keyword, so need to
# handle this manually (https://issues.apache.org/jira/browse/ARROW-16719)
if filesystem is not None:
    pa_filesystem = _ensure_arrow_fs(filesystem)
    with pa_filesystem.open_input_file(path) as source:
        metadata = parquet.read_metadata(source).metadata
Probably fine for this PR as you just moved this around, but just noticing that we should be able to simplify this now, because read_metadata has had a filesystem keyword for several pyarrow releases.
This sounds good. In theory we could keep a fallback to use …
geopandas/io/arrow.py (outdated)
if bbox_column_name:
    bounds = df.bounds
    df[bbox_column_name] = StructArray.from_arrays(
I think you will have to add this column directly to the pyarrow Table (so after table = Table.from_pandas(df, preserve_index=index) below). Otherwise this line will still convert the pyarrow struct array to a numpy object-dtype array of Python dicts, so it does not avoid this unnecessarily costly conversion.
Force-pushed from 1a7bf68 to 3bed861
PyArrow provides a utility function to convert DNF-format filters to an expression: …
Thanks Joris, I had just found that a couple of hours ago - the latest commit has that incorporated. I think my last outstanding item was about the user's control of the written bbox column name. I was wondering if we need to account for the use case where a user reads a geoparquet file with a non-standard bbox column name, makes edits and then saves, but wants to preserve the non-standard bbox column name? If not, I can simplify to save the bbox column as a predicate and just use …
Force-pushed from 88c9618 to ee3d6ff
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Not a complete review but some small initial comments on docstrings and edge cases
geopandas/io/arrow.py (outdated)
return True


def _check_if_bbox_column_in_parquet(geo_metadata):
seems like this does not have the most descriptive name, since this appears to check if the "covering" metadata key is present?
geopandas/io/arrow.py (outdated)
bbox_filter = _get_parquet_bbox_filter(geo_metadata, bbox) if bbox else None

if_bbox_column_exists = _check_if_bbox_column_in_parquet(geo_metadata)
Seems like this should throw if read_bbox_column is true and if_bbox_column_exists is false.
Thanks for the update! Some more comments.
geopandas/io/arrow.py (outdated)
read_bbox_column: bool, default False
    The bbox column is a struct with the minimum rectangular box that
    encompasses the geometry. It is computationally expensive to read
    in a struct into a GeoDataFrame. As such, it is default to not
Suggested change:
- in a struct into a GeoDataFrame. As such, it is default to not
+ into a GeoDataFrame. As such, the default is to not

(when in a GeoDataFrame, it's not really a "struct" anymore, but in practice a Python dictionary at the moment (until pandas supports a native struct type), so let's just keep it general: "read into a GeoDataFrame")
geopandas/io/arrow.py (outdated)
encompasses the geometry. It is computationally expensive to read
in a struct into a GeoDataFrame. As such, it is default to not
read in this column unless explictly specified as True.
If ``columns`` arguement is used and contains ``bbox``,
Suggested change:
- If ``columns`` arguement is used and contains ``bbox``,
+ If ``columns`` argument is used and contains ``bbox``,

(and +1 on letting the columns keyword override this)
geopandas/io/tests/test_arrow.py (outdated)
]


def test_filters_format_as_expression(tmpdir, naturalearth_lowres):
Maybe parametrize the above test and this one on a filters param, as I think that's the only thing that differs here? (to avoid some duplication)
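The suggested deduplication could look like this (test name and filter values are illustrative; the Expression form is left as a placeholder string rather than a real pyarrow expression):

```python
import pytest

# Run the same round-trip test once per filters form instead of keeping two
# near-identical test functions.
@pytest.mark.parametrize(
    "filters",
    [
        [("continent", "==", "Africa")],  # DNF list form
        "expression-form-placeholder",    # pyarrow.compute.Expression form
    ],
)
def test_filters_roundtrip(filters):
    # Body would write naturalearth_lowres to parquet and read it back with
    # the given filters, asserting the same filtered result for both forms.
    pass
```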
Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Was discussing with @martinfleis in person here, and related to the …
Could we back this column with a pyarrow struct series (https://github.com/pandas-dev/pandas/pull/54977/files)? However, as you say, it's unlikely to be useful, and one can also get it if explicitly specified in …
In theory one can indeed specify bbox in the list of columns and then specify through pandas to get it as an ArrowDtype with struct. But since that is not yet a default, I would still leave that up to the user.
I am going to be a bit unconventional and already merge this (we want to show this today at GeoPython's tutorial). But please still leave some review, there are a few follow-ups to take care of anyway: …
Thanks for your help and guidance on this one @jorisvandenbossche, @m-richards, @martinfleis. Should I open a new issue for these follow-ups to tie this one off? Good luck at the GeoPython tutorial today.
That may be wise, thanks!
Resolves #3252

When write_bbox_covering==True, the covering.bbox entry in the metadata and a per-row bbox column are written. The default is for bbox not to be written. The format of the bbox is a tuple of (xmin, ymin, xmax, ymax), the same as the output of GeoDataFrame.bounds.

When reading, bbox can be specified and is converted to a pyarrow.compute.Expression to filter read_parquet. It first checks the schema to ensure the file has covering.bbox data in it and raises an informative error message if not. As conversion of struct data to geopandas is expensive, the default is that this column will not be read in unless explicitly specified.

TODO:

[x] Currently still getting it to work with partitioned datasets and fsspec. It needs to read the schema/metadata early to check for the bbox column if we are going to filter by it. Also need to omit reading the bbox column as default behaviour, but feeding directories and fsspec to parquet.read_schema is not as well supported as parquet.read_table. edit - Addressed by using pyarrow.dataset.dataset() to obtain the schema; this works with file strings, directories, file objects and fsspec in memory. I assume this function does not actually load the data and allows efficient access to the schema. Part of me wants to also use this to change the read_table function, but perhaps that is out of scope.

[x] Figure out how to splice the filters kwarg with bbox, given filters may be fed in DNF format while bbox is currently converted to a pyarrow.compute expression. parquet.filter_to_expression()

[?] Resolve whether flexibility of what column name to give bbox should be provided to the user. Edge cases to be considered: what if that column name already exists, what flexibility should be afforded to the user in reading, and how do we address a user who reads from a parquet file with a non-standard bbox column name and wants to save with that same non-standard name.

[?] Confirm if acceptable that the read_bbox_column predicate is ignored if an explicit list of column names is fed to read_parquet.