ENH: (Geo)Arrow interoperability & Arrow PyCapsule Interface #3156
Comments
+1 on that. It would be a nice feature to have in 1.0 if we manage to get it in on time (by the end of March roughly).
The actual code would likely be quite simple. More challenging is coming to consensus on some different options like: should the export always go through WKB?
Thanks for opening the issue!
That's indeed the tricky question. I have long been thinking about improving the (Py)Arrow compatibility (but it seems I never opened an issue for this). Right now a geopandas -> pyarrow conversion doing But one of the reasons I hesitated doing that was because I didn't know the answer on which conversion best to choose .. ;) I am also thinking that, given there are multiple options, we should probably also create a public user-facing method that does this arrow conversion, and where you can then specify which type conversion you want. In that case, you have an alternative for when the default in
@kylebarron In addition (but I don't know if other libraries would actually make use of this), this could be a use case for
After a bit of reflection, I agree and possibly we should hold off for a bit before implementing a default As a general question, where should geopandas -> geoarrow conversion live? Given that any geoarrow integration is limited to shapely methods and pyarrow is already an optional dependency, it seems like having an implementation in geopandas is reasonable. What about, as you said, an initial I've grown accustomed to working with arrow metadata on the (As an aside, does Ideally if/when shapely adds full geoarrow integration (shapely/shapely#1953), the In the longer term, options to customize coordinate type (interleaved vs separated) and row group size may be desired.
Yeah I'm not quite sure how that would work here.
Is your feature request related to a problem?
In light of pandas-dev/pandas#56587, it would be awesome if GeoPandas were able to handle the Arrow PyCapsule Interface for interoperability with GeoArrow for reading or writing or both.
Describe the solution you'd like
As part of `geopandas.read_parquet` there exists an `_arrow_to_geopandas` function and as part of `GeoDataFrame.to_parquet` there exists a `_geopandas_to_arrow` function. It would be nice to have a public API for exporting to and (possibly) importing from GeoArrow data.

Now that Pandas has included (pandas-dev/pandas#56587) an `__arrow_c_stream__` dunder, I think the best public API for exporting GeoArrow data is to add an `__arrow_c_stream__` method to GeoDataFrame as well.

Exporting data is simpler than importing data because third-party data is not guaranteed to have WKB-encoded geometries and shapely does not yet support arbitrary GeoArrow geometries (ref shapely/shapely#1953).

One question is whether the data export process should always use WKB for simplicity and interoperability or ever try to use `to_ragged_array`. In geoarrow/geoarrow-rs#477 I found that converting a GeoDataFrame to GeoArrow via `to_ragged_array` could be up to 4x faster than converting to WKB and parsing the WKB.

Importing data is more involved, due to the variation of encodings that GeoArrow supports. It may be best to leave this for a separate discussion?
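To make the WKB vs. ragged-array question a bit more concrete, here is a small standard-library-only sketch of the two layouts for a pair of linestrings. It is an illustration only: the helper names `linestring_to_wkb` and `linestrings_to_ragged` are made up for this example and are not shapely or geopandas APIs, and the real `to_ragged_array` works on NumPy arrays rather than Python lists.

```python
import struct

# Two 2D linestrings as plain coordinate lists (illustrative data).
lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
]


def linestring_to_wkb(coords):
    """Encode one 2D linestring as little-endian WKB (hypothetical helper)."""
    # Byte-order flag (1 = little endian), uint32 geometry type
    # (2 = LineString), uint32 point count, then x/y doubles per point.
    out = struct.pack("<BII", 1, 2, len(coords))
    for x, y in coords:
        out += struct.pack("<dd", x, y)
    return out


def linestrings_to_ragged(geoms):
    """Flatten linestrings into one coordinate buffer plus offsets
    (hypothetical helper)."""
    coords, offsets = [], [0]
    for geom in geoms:
        coords.extend(geom)
        offsets.append(len(coords))
    return coords, offsets


# WKB: one self-describing binary blob per geometry, which a consumer
# has to parse again to get at the coordinates.
wkb_column = [linestring_to_wkb(g) for g in lines]

# Ragged array: a single shared coordinate buffer; geometry i spans
# coords[offsets[i]:offsets[i + 1]], so no per-geometry parsing is needed.
coords, offsets = linestrings_to_ragged(lines)
```

The WKB route costs an encode step on export and a parse step on import for every geometry, while the ragged layout is already close to what a native GeoArrow linestring array stores, which is one plausible reading of the speed difference mentioned above.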
API breaking implications
None.
Describe alternatives you've considered
This data conversion could be implemented in a separate library, however I think that there are good reasons to implement it here. The benefit of Arrow and the PyCapsule Interface specifically is that the ecosystem no longer needs to write `N * M` connectors (where every application needs to write direct support for every other application). E.g. in geoarrow/geoarrow-rs#477 and in lonboard I wrote special cases for GeoDataFrames, but if we implement `__arrow_c_stream__` directly on GeoDataFrame, then the need for special casing GeoDataFrames for Arrow will lessen over time.

Outside of my own libraries other beneficiaries would include GDAL support. For example if/when OSGeo/gdal#9132 and/or geopandas/pyogrio#314 are implemented, if they look for an `__arrow_c_stream__` method on input, then you'd be able to pass a `GeoDataFrame` to those functions directly, without the user needing to think about conversions.

Additional context
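As a rough sketch of what "look for an `__arrow_c_stream__` method on input" could mean on the consumer side, the snippet below dispatches on the dunder instead of on concrete types. Everything here is hypothetical: `FakeArrowProducer` stands in for an object like a GeoDataFrame, `open_arrow_stream` is an invented consumer entry point, and the returned string is only a placeholder for the PyCapsule that a real producer would hand back.

```python
class FakeArrowProducer:
    """Hypothetical stand-in for any object exporting an Arrow stream."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer returns a PyCapsule named "arrow_array_stream";
        # this placeholder string only marks that the method was called.
        return "placeholder-for-arrow_array_stream-capsule"


def open_arrow_stream(obj):
    """Hypothetical consumer entry point: accept any Arrow stream producer.

    The consumer never checks for GeoDataFrame (or any other concrete
    type); it only checks for the protocol method.
    """
    export = getattr(obj, "__arrow_c_stream__", None)
    if export is None:
        raise TypeError(
            f"{type(obj).__name__} does not export an Arrow C stream"
        )
    return export()


stream = open_arrow_stream(FakeArrowProducer())
```

With this shape, each producer implements the dunder once and each consumer checks for it once, which is where the reduction from `N * M` special cases comes from.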