
Optimize GeoArrow/Arrow IO #1953

Draft
wants to merge 25 commits into main
Conversation

@paleolimbot commented Dec 18, 2023

Getting closer! This now works in both directions.

Currently requires a development version of nanoarrow for Python:

pip install "https://github.com/paleolimbot/arrow-nanoarrow/archive/3041a02594b9d8955fb578924befbb2adb7e53d4.zip#egg=nanoarrow&subdirectory=python"
import pyogrio
import pyarrow as pa
import geoarrow.pyarrow as ga
import numpy as np
from shapely.geoarrow import to_pyarrow, from_arrow, infer_pyarrow_type, Encoding

# http://geoarrow.org/data
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line-wkb.arrow -o ns-water-water_line-wkb.arrow
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip -o file.zip
# ! unzip -d . file.zip
# ! rm file.zip
df = pyogrio.read_dataframe("ns-water-water_line.fgb")

%timeit pa.array(df.geometry.to_wkb())
#> 1.13 s ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from shapely.io import to_ragged_array
%timeit to_ragged_array(df.geometry)
#> 450 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit to_pyarrow(df.geometry, Encoding.WKB)
#> 190 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_pyarrow(df.geometry, Encoding.GEOARROW)
#> 220 ms ± 591 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_pyarrow(df.geometry, Encoding.GEOARROW_INTERLEAVED)
#> 217 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit infer_pyarrow_type(df.geometry)
#> 45 ms ± 790 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

chunked = to_pyarrow(df.geometry)
%timeit from_arrow(chunked)
#> 407 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@jorisvandenbossche (Member)

Just for clarity of the scope of the custom code for shapely: all the .c/.h files are currently vendored (from https://github.com/geoarrow/geoarrow-c-geos and https://github.com/geoarrow/geoarrow-c), and essentially only _geoarrow.pyx/geoarrow.py is new code here?

@paleolimbot (Author)

only _geoarrow.pyx/geoarrow.py is new code here?

Yes! I will try to also add the Shapely->GeoArrow half to this PR...in theory that is easier (no allocating GEOSGeometry!).

It is possible (likely, I would say) that the intermittent crash that I'm observing is due to an error within geoarrow-c-geos...the tests for that are pretty minimal and I'd like to try to replicate the crash there where it's maybe easier to spot the issue.

from shapely._geoarrow import ArrayReader, GeoArrowGEOSException # NOQA


def from_arrow(arrays, schema):
Contributor:

Can we let arrays be either a single array or multiple? In the single case we'd check for __arrow_c_array__ on the input directly?

In the future, this could check for __arrow_c_stream__ for a chunked array?

schema could also be optional here, in which case you just assert that the schemas of each array chunk are the same? Or, really, there's no reason to disallow arrays with differing schemas, right? So it seems preferable to load the schema from __arrow_c_array__ anyways, unless instantiating ArrayReader is expensive.
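The dunder-based dispatch suggested above can be sketched roughly like this (a hypothetical helper, not the PR's code; the `Fake*` classes stand in for real PyCapsule producers such as pyarrow objects):

```python
# Sketch: decide how to consume input based on which Arrow PyCapsule
# protocol dunders the producer exposes. A stream producer (chunked data)
# exposes __arrow_c_stream__; a single contiguous array exposes
# __arrow_c_array__; a plain iterable is the legacy "list of arrays" path.

def classify_arrow_input(obj):
    """Return which protocol we would consume `obj` through."""
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"      # iterate arrays from the stream, schema included
    if hasattr(obj, "__arrow_c_array__"):
        return "array"       # single array, schema comes with the capsule
    if hasattr(obj, "__iter__"):
        return "iterable"    # explicit sequence of array-like chunks
    raise TypeError(f"{type(obj).__name__} does not export Arrow data")


class FakeStream:  # stand-in for e.g. pyarrow.ChunkedArray
    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError


class FakeArray:  # stand-in for e.g. pyarrow.Array
    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError
```

Because the schema travels inside each capsule, a separate `schema` argument becomes unnecessary in every branch, which matches the suggestion to load it from `__arrow_c_array__` directly.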

Author:

You're right on all counts! I didn't get very far in the implementation before hitting a segfault...probably the final version will use nanoarrow.carray_stream() to sanitize the input as an array stream, since that's basically what the array reader is designed to read (which would include anything that implements __arrow_c_array__).

Author:

This should now work! Now the signature is just from_arrow(array_or_stream) 🙂

Contributor:

Presumably this needs a non-table stream, right? I.e. it'll error on struct input that doesn't conform to points? I suppose if you had a table with two float columns, x and y, that would "unintentionally work" because of overloading between record batches and struct columns? Is that something we would want to promote or not? If it had x, y, and some attribute column, then it would fail?

Author:

I think it will error for anything that isn't an extension type at the moment, although this is a deeper geoarrow-c (current) limitation. It would probably be fine to accept some non-extension storage types, but this gets hairy since multipoint/linestring share a storage type. But it should eventually (probably) accept a binary (WKB) and text (WKT) array.

class Encoding(enum.Enum):
    WKB = SchemaCalculator.ENCODING_WKB
    WKT = SchemaCalculator.ENCODING_WKT
    GEOARROW = SchemaCalculator.ENCODING_GEOARROW
Contributor:

Does this default to separated encoding? Since GEOS is interleaved I have been defaulting to interleaved encoding for all my Python APIs (And I need interleaved for lonboard). But not sure which we should suggest.
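For reference, the two coordinate layouts under discussion differ only in how the buffers are arranged; a minimal NumPy sketch (assumed shapes, not geoarrow code):

```python
import numpy as np

# Hypothetical (n, 2) coordinate block, the shape GEOS-style APIs hand back
coords = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Interleaved (GeoArrow fixed-size-list layout, matching GEOS's native order):
# a single buffer [x0, y0, x1, y1, ...]
interleaved = coords.ravel()

# Separated (GeoArrow struct layout): one contiguous buffer per dimension
xs = np.ascontiguousarray(coords[:, 0])
ys = np.ascontiguousarray(coords[:, 1])

# Converting between the two is a reshape/stack away in NumPy
roundtrip = np.column_stack([xs, ys])
```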

Author:

This PR only just got to the point where we can test! I haven't been able to observe meaningful differences between the two in terms of timing, although for things like small arrays of very huge linestrings I wonder if it would pop through. The real benefit would be if GEOS made this type of export available internally, since then the overhead of all the individual C API calls wouldn't overwhelm the timing.

I think that the performance characteristics (e.g., write to file, filter, take, calculate bounding box) of the separated coordinate type (particularly for points) are better, so the default here is separated. I need to benchmark that to be sure, though.

For a library like lonboard that prefers interleaved values, you can always request them.

exporter = to_arrow(some_array)
# Not sure if pa.data_type() exists or if it handles generic __arrow_c_schema__ imports yet...
type = pa.data_type(exporter)
if type.geometry_type != ga.GeometryType.GEOMETRY:
    pa.array(exporter, type.with_coord_type(ga.CoordType.INTERLEAVED))
else:
    pa.array(exporter, type)

@kylebarron (Contributor)

I think to_pyarrow but from_arrow might be a little confusing for users. Do you expect a to_nanoarrow as well? Should we have to_arrow with a potential keyword argument in the future to switch arrow impl?

@paleolimbot (Author)

Good call! I was thinking that to_arrow() should return a GeoArrowExporter. Then, it's the caller's responsibility to invoke the dependency they had in mind.

exporter = to_arrow(some_array)
pa.array(exporter)
# geoarrow.rust.core.PointArray.from_arrow(exporter)
# nanoarrow.Array(exporter)
# pa.chunked_array(exporter)

@kylebarron (Contributor)

Good call! I was thinking that to_arrow() should return a GeoArrowExporter. Then, it's the caller's responsibility to invoke the dependency they had in mind.

That extra level of indirection seems like it would be annoying for most users. I'd be +1 on conditionally importing either pyarrow or nanoarrow and returning those directly.

@paleolimbot (Author)

Definitely annoying. Probably to_nanoarrow() and/or to_pyarrow() then? Something like lonboard might want to instantiate the GeoArrowExporter directly to avoid a situation where the user has neither nanoarrow nor pyarrow installed.

Comment on lines +91 to +96
if len(chunks) == 1:
    return requested_schema, chunks[0]
else:
    raise ValueError(
        f"Can't export geometries to a single chunk ({len(chunks)} chunks present)"
    )
Contributor:

It's a little weird to me that we have both __arrow_c_array__ and __arrow_c_stream__ on the same class. I'd have usually expected that there would be a single underlying data representation in a class, and that would inform which PyCapsule API is defined on that class.

Maybe it makes sense to have both ArrayBuilder and ChunkedArrayBuilder? Maybe there are some occasions where a user wants to have all geometries in a single array. I.e. does pyarrow.Table.from_pandas always create a table with a single chunk? It looks like it does. In that case, exporting a non-chunked array of GeoArrow data would be important, so that the geometries could be appended on the table. And we'd presumably never want to raise an exception here.

(Ideally, I'd love pyarrow.Table.from_pandas to optionally export a chunked Table, but alas)

Author:

Functionally they are the same unless the export is WKB and there's more than 2GB of WKB in the array. In that case, __arrow_c_array__ will error and __arrow_c_stream__ won't. In nanoarrow for Python, I try to almost always use __arrow_c_stream__ (and fall back to __arrow_c_array__ to generate a length-one stream).

I think it is fine to have both (nanoarrow's Array class does this too)...if the caller needs or prefers contiguous memory, they can call __arrow_c_array__ (if they don't, they should probably call __arrow_c_stream__).
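The 2 GB figure above comes from Arrow's 32-bit value offsets on the plain binary type; a quick back-of-envelope check (hypothetical sizes, not data from this PR):

```python
# Arrow's `binary` type stores values behind int32 offsets, so a single
# array can address at most 2**31 - 1 bytes of WKB. Past that, the export
# must either chunk the data (the __arrow_c_stream__ path) or switch to
# a 64-bit-offset type like `large_binary`.
INT32_MAX = 2**31 - 1

# Hypothetical workload: two million geometries averaging 2 kB of WKB each
total_wkb_bytes = 2_000_000 * 2_000  # 4 GB of WKB values
fits_in_single_chunk = total_wkb_bytes <= INT32_MAX
```

With this workload `fits_in_single_chunk` is false, so `__arrow_c_array__` would have to error while `__arrow_c_stream__` could split the values across chunks.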

Contributor:

Functionally they are the same

Even in the cases where there's a single chunk under the hood, I disagree that they're functionally the same because usually there are distinct APIs for contiguous and chunked arrays.

So far in my experience with the PyCapsule APIs, it's been expected that you can infer the storage format of the producer based on the dunder method that exists on the object. Is this not why pyarrow.Array and pyarrow.RecordBatch have __arrow_c_array__ implemented but not __arrow_c_stream__, and vice versa for pyarrow.ChunkedArray and pyarrow.Table?

I very often use the presence of the dunder methods to decide on the most efficient way to operate on data, and I'd much prefer community consensus around whether both APIs should ever exist on a single class.

See apache/arrow#40648

Author:

It's pretty easy to separate and I don't mind doing it, although I think that you will almost always want to call the chunked version because it's not that hard to end up with >2GB of WKB. Calling the Array version is not great for the generic case because a downstream library will get errors when converting big (mixed) arrays.

@kylebarron (Contributor) commented Mar 18, 2024

Something like lonboard might want to instantiate the GeoArrowExporter directly to avoid a situation where the user has neither nanoarrow nor pyarrow installed

Lonboard depends on pyarrow directly, so we can always rely on exporting to the pyarrow version.

It would be lovely to remove this dependency (especially as that's the biggest blocker to getting lonboard to work in Wasm in jupyterlite/pyodide), but we rely heavily on pyarrow.Table.from_pandas, which I don't really want to reimplement myself in Rust

@jorisvandenbossche (Member)

but we rely heavily on pyarrow.Table.from_pandas, which I don't really want to reimplement myself in Rust

We should maybe reimplement that in nanoarrow ..

@paleolimbot (Author)

We should maybe reimplement that in nanoarrow ..

Or use nanoarrow to do it in Pandas?

@jorisvandenbossche (Member) commented Mar 19, 2024

@paleolimbot apologies for the delay in response here. And thanks for exploring this!

Some general concerns / thoughts:

  • Shapely is a volunteer-only project with limited maintainer capacity (we are already too slow in reviewing and are not always able to keep up with new GEOS features), and this PR is adding a lot of code. I know part of it is vendored (but I don't know if that code is already used elsewhere as well?), but we still need to understand and know that (not much documented) vendored code well enough to review and maintain the shapely-specific code. I am a bit afraid about the current scope of code addition here.
  • Some parts of what is implemented here feel somewhat duplicative with existing functionality of shapely. For example, the type inference could maybe also be done with a shapely alternative based on np.unique(shapely.get_type_id(arr))? While this might mostly be part of the vendored code, from shapely's point of view it might still be a duplicative addition.
  • Further thinking about the scope, the main thing we want here is to implement the GEOS-specific aspects of producing/consuming Arrow data (i.e. those parts where you need to build against GEOS). So it might also be an option to focus on that and leave it at capsules, not bothering with pyarrow or geoarrow-pyarrow or nanoarrow or ..?
  • Should we advocate for moving some core functionality of this into GEOS itself? (that might also be useful for GDAL?)
  • I did have some old preliminary cython code that just optimizes to_ragged_array (Add cython algo to get offset arrays for flat coordinates representation pygeos/pygeos#93). From a quick test using your water_line test data, that's actually still a bit faster than this PR (though of course not yet as feature-complete, e.g. missing-value support). I'll have to profile in a bit more detail to understand this better.

@paleolimbot (Author)

Thanks for taking a look! It's definitely just an exploration, and one that has been helpful for me to flesh out the scope of some of these projects.

I am a bit afraid about the current scope of code addition here.

That's fair...the reason it would "have to" be here is just because this is where GEOS lives in pip land (if I understand how that works). It could also live in GEOS, but that would only work for very new GEOS versions (if they even want this in the first place). I'm happy to look after this code as long as it lives here (and the review process may help me get familiar enough with the internals to help elsewhere).

np.unique(shapely.get_type_id(arr))

Sure! The type inference and export are completely independent. The version here ignores EMPTYs and has some POINT + MULTIPOINT -> MULTIPOINT logic too for a higher chance of success in getting a single output type. The logic should handle exporting length-one MULTIxxx to the simple type, too (but not much real-world testing yet).
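That promotion logic might look roughly like this (a sketch, not the vendored geoarrow-c implementation; the type ids are the GEOS values that shapely.get_type_id returns):

```python
import numpy as np

# shapely.get_type_id values: 0=POINT, 1=LINESTRING, 3=POLYGON,
# 4=MULTIPOINT, 5=MULTILINESTRING, 6=MULTIPOLYGON
PROMOTE = {0: 4, 1: 5, 3: 6}  # promote simple -> multi


def infer_common_type_id(type_ids):
    """Collapse mixed simple/multi input to a single output type id,
    or None (meaning: fall back to WKB). Hypothetical helper."""
    unique = set(np.unique(type_ids).tolist())
    if len(unique) == 1:
        return unique.pop()
    # e.g. POINT + MULTIPOINT promotes everything to MULTIPOINT
    promoted = {PROMOTE.get(t, t) for t in unique}
    return promoted.pop() if len(promoted) == 1 else None
```

So POINT mixed with MULTIPOINT still infers a single MULTIPOINT output, while truly mixed input (say POINT plus LINESTRING) has no common type and would need WKB.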

Should we advocate for moving some core functionality of this into GEOS itself?

Definitely...moving this inside the C API boundary would definitely be better. It would still require something that is basically geoarrow-c-geos and tests that are basically geoarrow-c-geos' tests. Even if this never gets merged, it is probably helpful for this PR to exist to demonstrate the utility of handling Arrow IO. I don't know if it would help GDAL (which I think has its own class hierarchy for geometries).

I did have some preliminary cython code that just optimizes to_ragged_array

Great! I think everything that optimizes to_ragged_array will help here (which can in turn help anywhere else that geoarrow-c-geos is useful, like perhaps in R). Another approach may be for geoarrow-c to just use nanoarrow + GEOS since these things are just a bag of buffers, after all.

leave it at capsules, not bothering with pyarrow or geoarrow-pyarrow or nanoarrow

Definitely! I would prefer just capsules for the interface + nanoarrow as optional (and only for running the tests).
