
Optimize GeoArrow/Arrow IO #1953

Draft
wants to merge 25 commits into main
Conversation

@paleolimbot commented Dec 18, 2023

Getting closer! This now works in both directions.

Currently requires a development version of nanoarrow for Python:

pip install "https://github.com/paleolimbot/arrow-nanoarrow/archive/3041a02594b9d8955fb578924befbb2adb7e53d4.zip#egg=nanoarrow&subdirectory=python"
import pyogrio
import pyarrow as pa
import geoarrow.pyarrow as ga
import numpy as np
from shapely.geoarrow import to_pyarrow, from_arrow, infer_pyarrow_type, Encoding

# http://geoarrow.org/data
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line-wkb.arrow -o ns-water-water_line-wkb.arrow
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip -o file.zip
# ! unzip -d . file.zip
# ! rm file.zip
df = pyogrio.read_dataframe("ns-water-water_line.fgb")

%timeit pa.array(df.geometry.to_wkb())
#> 1.13 s ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from shapely.io import to_ragged_array
%timeit to_ragged_array(df.geometry)
#> 450 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit to_pyarrow(df.geometry, Encoding.WKB)
#> 190 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_pyarrow(df.geometry, Encoding.GEOARROW)
#> 220 ms ± 591 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_pyarrow(df.geometry, Encoding.GEOARROW_INTERLEAVED)
#> 217 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit infer_pyarrow_type(df.geometry)
#> 45 ms ± 790 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

chunked = to_pyarrow(df.geometry)
%timeit from_arrow(chunked)
#> 407 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@jorisvandenbossche (Member)

Just for clarity of the scope of the custom code for shapely: all the .c/.h files are currently vendored (from https://github.com/geoarrow/geoarrow-c-geos and https://github.com/geoarrow/geoarrow-c), and essentially only _geoarrow.pyx/geoarrow.py is new code here?

@paleolimbot (Author)

only _geoarrow.pyx/geoarrow.py is new code here?

Yes! I will try to also add the Shapely->GeoArrow half to this PR...in theory that is easier (no allocating GEOSGeometry!).

It is possible (likely, I would say) that the intermittent crash that I'm observing is due to an error within geoarrow-c-geos...the tests for that are pretty minimal and I'd like to try to replicate the crash there where it's maybe easier to spot the issue.

from shapely._geoarrow import ArrayReader, GeoArrowGEOSException # NOQA


def from_arrow(arrays, schema):
Contributor:

Can we let arrays be either a single array or multiple? In the single case we'd check for __arrow_c_array__ on the input directly?

In the future, this could check for __arrow_c_stream__ for a chunked array?

schema could also be optional here, in which case you just assert that the schemas of each array chunk are the same? Or, really, there's no reason to disallow arrays with differing schemas, right? So it seems preferable to load the schema from __arrow_c_array__ anyways, unless instantiating ArrayReader is expensive.
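The dunder-based dispatch suggested above can be sketched roughly like this (a hypothetical helper, not the PR's code; the `Fake*` classes stand in for real PyCapsule producers such as pyarrow objects):

```python
# Sketch: decide how to consume input based on which Arrow PyCapsule
# protocol dunders the producer exposes. A stream producer (chunked data)
# exposes __arrow_c_stream__; a single contiguous array exposes
# __arrow_c_array__; a plain iterable is the legacy "list of arrays" path.

def classify_arrow_input(obj):
    """Return which protocol we would consume `obj` through."""
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"      # iterate arrays from the stream, schema included
    if hasattr(obj, "__arrow_c_array__"):
        return "array"       # single array, schema comes with the capsule
    if hasattr(obj, "__iter__"):
        return "iterable"    # explicit sequence of array-like chunks
    raise TypeError(f"{type(obj).__name__} does not export Arrow data")


class FakeStream:  # stand-in for e.g. pyarrow.ChunkedArray
    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError


class FakeArray:  # stand-in for e.g. pyarrow.Array
    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError
```

Because the schema travels inside each capsule, a separate `schema` argument becomes unnecessary in every branch, which matches the suggestion to load it from `__arrow_c_array__` directly.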

Author:

You're right on all counts! I didn't get very far in the implementation before hitting a segfault...probably the final version will use nanoarrow.carray_stream() to sanitize the input as an array stream, since that's basically what the array reader is designed to read (which would include anything that implements __arrow_c_array__).

Author:

This should now work! Now the signature is just from_arrow(array_or_stream) 🙂

Contributor:

Presumably this needs a non-table stream, right? I.e. it'll error on struct input that doesn't conform to points? I suppose if you had a table with two float columns, x and y, that would "unintentionally work" because of overloading between record batches and struct columns? Is that something we would want to promote or not? If it had x, y, and some attribute column, then it would fail?

Author:

I think it will error for anything that isn't an extension type at the moment, although this is a deeper geoarrow-c (current) limitation. It would probably be fine to accept some non-extension storage types, but this gets hairy since multipoint/linestring share a storage type. But it should eventually (probably) accept a binary (WKB) and text (WKT) array.

class Encoding(enum.Enum):
    WKB = SchemaCalculator.ENCODING_WKB
    WKT = SchemaCalculator.ENCODING_WKT
    GEOARROW = SchemaCalculator.ENCODING_GEOARROW
Contributor:

Does this default to separated encoding? Since GEOS is interleaved I have been defaulting to interleaved encoding for all my Python APIs (And I need interleaved for lonboard). But not sure which we should suggest.
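For reference, the two coordinate layouts under discussion differ only in how the buffers are arranged; a minimal NumPy sketch (assumed shapes, not geoarrow code):

```python
import numpy as np

# Hypothetical (n, 2) coordinate block, the shape GEOS-style APIs hand back
coords = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Interleaved (GeoArrow fixed-size-list layout, matching GEOS's native order):
# a single buffer [x0, y0, x1, y1, ...]
interleaved = coords.ravel()

# Separated (GeoArrow struct layout): one contiguous buffer per dimension
xs = np.ascontiguousarray(coords[:, 0])
ys = np.ascontiguousarray(coords[:, 1])

# Converting between the two is a reshape/stack away in NumPy
roundtrip = np.column_stack([xs, ys])
```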

Author:

This PR only just got to the point where we can test! I haven't been able to observe meaningful differences between the two in terms of timing, although for things like small arrays of very huge linestrings I wonder if it would pop through. The real benefit would be if GEOS made this type of export available internally, since then the overhead of all the individual C API calls wouldn't overwhelm the timing.

I think that the performance characteristics (e.g., write to file, filter, take, calculate bounding box) of the separated coordinate type (particularly for points) are better, so the default here is separated. I need to benchmark that to be sure, though.

For a library like lonboard that prefers interleaved values, you can always request them.

exporter = to_arrow(some_array)
# Not sure if pa.data_type() exists or if it handles generic __arrow_c_schema__ imports yet...
type = pa.data_type(exporter)
if type.geometry_type != ga.GeometryType.GEOMETRY:
    pa.array(exporter, type.with_coord_type(ga.CoordType.INTERLEAVED))
else:
    pa.array(exporter, type)

@kylebarron (Contributor)

I think to_pyarrow but from_arrow might be a little confusing for users. Do you expect a to_nanoarrow as well? Should we have to_arrow with a potential keyword argument in the future to switch arrow impl?

@paleolimbot (Author)

Good call! I was thinking that to_arrow() should return a GeoArrowExporter. Then, it's the caller's responsibility to invoke the dependency they had in mind.

exporter = to_arrow(some_array)
pa.array(exporter)
# geoarrow.rust.core.PointArray.from_arrow(exporter)
# nanoarrow.Array(exporter)
# pa.chunked_array(exporter)

@kylebarron (Contributor)

Good call! I was thinking that to_arrow() should return a GeoArrowExporter. Then, it's the caller's responsibility to invoke the dependency they had in mind.

That extra level of indirection seems like it would be annoying for most users. I'd be +1 on conditionally importing either pyarrow or nanoarrow and returning those directly.

@paleolimbot (Author)

Definitely annoying. Probably to_nanoarrow() and/or to_pyarrow() then? Something like lonboard might want to instantiate the GeoArrowExporter directly to avoid a situation where the user has neither nanoarrow nor pyarrow installed.

Comment on lines +91 to +96
if len(chunks) == 1:
    return requested_schema, chunks[0]
else:
    raise ValueError(
        f"Can't export geometries to a single chunk ({len(chunks)} chunks present)"
    )
Contributor:

It's a little weird to me that we have both __arrow_c_array__ and __arrow_c_stream__ on the same class. I'd have usually expected that there would be a single underlying data representation in a class, and that would inform which PyCapsule API is defined on that class.

Maybe it makes sense to have both ArrayBuilder and ChunkedArrayBuilder? Maybe there are some occasions where a user wants to have all geometries in a single array. I.e. does pyarrow.Table.from_pandas always create a table with a single chunk? It looks like it does. In that case, exporting a non-chunked array of GeoArrow data would be important, so that the geometries could be appended on the table. And we'd presumably never want to raise an exception here.

(Ideally, I'd love pyarrow.Table.from_pandas to optionally export a chunked Table, but alas)

Author:

Functionally they are the same unless the export is WKB and there's more than 2GB of WKB in the array. In that case, __arrow_c_array__ will error and __arrow_c_stream__ won't. In nanoarrow for Python, I try to almost always use __arrow_c_stream__ (and fall back to __arrow_c_array__ to generate a length-one stream).

I think it is fine to have both (nanoarrow's Array class does this too)...if the caller needs or prefers contiguous memory, they can call __arrow_c_array__ (if they don't, they should probably call __arrow_c_stream__).
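The 2 GB figure above comes from Arrow's 32-bit value offsets on the plain binary type; a quick back-of-envelope check (hypothetical sizes, not data from this PR):

```python
# Arrow's `binary` type stores values behind int32 offsets, so a single
# array can address at most 2**31 - 1 bytes of WKB. Past that, the export
# must either chunk the data (the __arrow_c_stream__ path) or switch to
# a 64-bit-offset type like `large_binary`.
INT32_MAX = 2**31 - 1

# Hypothetical workload: two million geometries averaging 2 kB of WKB each
total_wkb_bytes = 2_000_000 * 2_000  # 4 GB of WKB values
fits_in_single_chunk = total_wkb_bytes <= INT32_MAX
```

With this workload `fits_in_single_chunk` is false, so `__arrow_c_array__` would have to error while `__arrow_c_stream__` could split the values across chunks.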

Contributor:

Functionally they are the same

Even in the cases where there's a single chunk under the hood, I disagree that they're functionally the same because usually there are distinct APIs for contiguous and chunked arrays.

So far in my experience with the PyCapsule APIs, it's been expected that you can infer the storage format of the producer based on the dunder method that exists on the object. Is this not why pyarrow.Array and pyarrow.RecordBatch have __arrow_c_array__ implemented but not __arrow_c_stream__, and vice versa for pyarrow.ChunkedArray and pyarrow.Table?

I very often use the presence of the dunder methods to decide on the most efficient way to operate on data, and I'd much prefer community consensus around whether both APIs should ever exist on a single class.

See apache/arrow#40648

Author:

It's pretty easy to separate and I don't mind doing it, although I think that you will almost always want to call the chunked version because it's not that hard to end up with >2GB of WKB. Calling the Array version is not great for the generic case because a downstream library will get errors when converting big (mixed) arrays.

@kylebarron (Contributor) commented Mar 18, 2024

Something like lonboard might want to instantiate the GeoArrowExporter directly to avoid a situation where the user has neither nanoarrow nor pyarrow installed

Lonboard depends on pyarrow directly, so we can always rely on exporting to the pyarrow version.

It would be lovely to remove this dependency (especially as that's the biggest blocker to getting lonboard to work in Wasm in jupyterlite/pyodide), but we rely heavily on pyarrow.Table.from_pandas, which I don't really want to reimplement myself in Rust

@jorisvandenbossche (Member)

but we rely heavily on pyarrow.Table.from_pandas, which I don't really want to reimplement myself in Rust

We should maybe reimplement that in nanoarrow ..

@paleolimbot (Author)

We should maybe reimplement that in nanoarrow ..

Or use nanoarrow to do it in Pandas?

@jorisvandenbossche (Member) commented Mar 19, 2024

@paleolimbot apologies for the delay in response here. And thanks for exploring this!

Some general concerns / thoughts:

  • Shapely is a volunteer-only project with limited maintainer capacity (we are already too slow in reviewing and are not always able to keep up with new GEOS features), and this PR is adding a lot of code. I know part of it is vendored (but I don't know if that code is already used elsewhere as well?), but we still need to understand and know that (not much documented) vendored code well enough to review and maintain the shapely-specific code. I am a bit afraid about the current scope of code addition here.
  • Some parts of what is implemented here feel somewhat duplicative with existing functionality of shapely. For example, the type inference could maybe also be done with a shapely alternative based on np.unique(shapely.get_type_id(arr))? While this might mostly be part of the vendored code, from shapely's point of view it might still be a duplicative addition.
  • Further thinking about the scope, the main thing we want here is to implement the GEOS-specific aspects of producing/consuming Arrow data (i.e. those parts where you need to build against GEOS). So it might also be an option to focus on that and leave it at capsules, not bothering with pyarrow or geoarrow-pyarrow or nanoarrow or ..?
  • Should we advocate for moving some core functionality of this into GEOS itself? (that might also be useful for GDAL?)
  • I did have some old preliminary cython code that just optimizes to_ragged_array (Add cython algo to get offset arrays for flat coordinates representation pygeos/pygeos#93). From a quick test using your water_line test data, that's actually still a bit faster than this PR (though of course not yet as feature-complete, e.g. missing-value support). I'll have to profile in a bit more detail to understand this better.

@paleolimbot (Author)

Thanks for taking a look! It's definitely just an exploration, and one that has been helpful for me to flesh out the scope of some of these projects.

I am a bit afraid about the current scope of code addition here.

That's fair...the reason it would "have to" be here is just because this is where GEOS lives in pip land (if I understand how that works). It could also live in GEOS, but that would only work for very new GEOS versions (if they even want this in the first place). I'm happy to look after this code as long as it lives here (and the review process may help me get familiar enough with the internals to help elsewhere).

np.unique(shapely.get_type_id(arr))

Sure! The type inference and export are completely independent. The version here ignores EMPTYs and has some POINT + MULTIPOINT -> MULTIPOINT logic too for a higher chance of success in getting a single output type. The logic should handle exporting length-one MULTIxxx to the simple type, too (but not much real-world testing yet).
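That promotion logic might look roughly like this (a sketch, not the vendored geoarrow-c implementation; the type ids are the GEOS values that shapely.get_type_id returns):

```python
import numpy as np

# shapely.get_type_id values: 0=POINT, 1=LINESTRING, 3=POLYGON,
# 4=MULTIPOINT, 5=MULTILINESTRING, 6=MULTIPOLYGON
PROMOTE = {0: 4, 1: 5, 3: 6}  # promote simple -> multi


def infer_common_type_id(type_ids):
    """Collapse mixed simple/multi input to a single output type id,
    or None (meaning: fall back to WKB). Hypothetical helper."""
    unique = set(np.unique(type_ids).tolist())
    if len(unique) == 1:
        return unique.pop()
    # e.g. POINT + MULTIPOINT promotes everything to MULTIPOINT
    promoted = {PROMOTE.get(t, t) for t in unique}
    return promoted.pop() if len(promoted) == 1 else None
```

So POINT mixed with MULTIPOINT still infers a single MULTIPOINT output, while truly mixed input (say POINT plus LINESTRING) has no common type and would need WKB.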

Should we advocate for moving some core functionality of this into GEOS itself?

Definitely...moving this inside the C API boundary would definitely be better. It would still require something that is basically geoarrow-c-geos and tests that are basically geoarrow-c-geos' tests. Even if this never gets merged, it is probably helpful for this PR to exist to demonstrate the utility of handling Arrow IO. I don't know if it would help GDAL (which I think has its own class hierarchy for geometries).

I did have some preliminary cython code that just optimizes to_ragged_array

Great! I think everything that optimizes to_ragged_array will help here (which can in turn help anywhere else that geoarrow-c-geos is useful, like perhaps in R). Another approach may be for geoarrow-c to just use nanoarrow + GEOS since these things are just a bag of buffers, after all.

leave it at capsules, not bothering with pyarrow or geoarrow-pyarrow or nanoarrow

Definitely! I would prefer just capsules for the interface + nanoarrow as optional (and only for running the tests).
