Optimize GeoArrow/Arrow IO #1953
Conversation
Just for clarity of the scope of the custom code for shapely: all the .c/.h files are currently vendored (from https://github.com/geoarrow/geoarrow-c-geos and https://github.com/geoarrow/geoarrow-c), and essentially only …
Yes! I will try to also add the Shapely->GeoArrow half to this PR...in theory that is easier (no allocating GEOSGeometry!). It is possible (likely, I would say) that the intermittent crash that I'm observing is due to an error within geoarrow-c-geos...the tests for that are pretty minimal and I'd like to try to replicate the crash there where it's maybe easier to spot the issue.
Force-pushed from 8135fc7 to c0bdf68
shapely/geoarrow.py (outdated)
    from shapely._geoarrow import ArrayReader, GeoArrowGEOSException  # NOQA

    def from_arrow(arrays, schema):
Can we let `arrays` be either a single array or multiple? In the single case we'd check for `__arrow_c_array__` on the input directly? In the future, this could check for `__arrow_c_stream__` for a chunked array?

`schema` could also be optional here, in which case you just assert that the schemas of each array chunk are the same? Or, really, there's no reason to disallow arrays with differing schemas, right? So it seems preferable to load the schema from `__arrow_c_array__` anyways, unless instantiating `ArrayReader` is expensive.
You're right on all counts! I didn't get very far in the implementation before hitting a segfault...probably the final version will use `nanoarrow.carray_stream()` to sanitize the input as an array stream, since that's basically what the array reader is designed to read (which would include anything that implements `__arrow_c_array__`).
This should now work! Now the signature is just `from_arrow(array_or_stream)` 🙂
Presumably this needs a non-table stream, right? I.e. it'll error on struct input that doesn't conform to points? I suppose if you had a table with two float columns, x and y, that would "unintentionally work" because of overloading between record batches and struct columns? Is that something we would want to promote or not? If it had `x`, `y`, and some `attribute` column, then it would fail?
I think it will error for anything that isn't an extension type at the moment, although this is a deeper geoarrow-c (current) limitation. It would probably be fine to accept some non-extension storage types, but this gets hairy since multipoint/linestring share a storage type. But it should eventually (probably) accept a binary (WKB) and text (WKT) array.
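The multipoint/linestring collision mentioned here can be illustrated with the (separated-coordinate) GeoArrow storage layouts. The type strings below are written informally and the mapping is illustrative, not taken from the PR:

```python
# Informal GeoArrow storage layouts (separated coordinates), keyed by
# extension name.  Without the extension metadata, some geometry types
# cannot be distinguished by storage type alone.
POINT = "struct<x: double, y: double>"
STORAGE = {
    "geoarrow.point": POINT,
    "geoarrow.linestring": f"list<{POINT}>",
    "geoarrow.multipoint": f"list<{POINT}>",
    "geoarrow.polygon": f"list<list<{POINT}>>",
    "geoarrow.multilinestring": f"list<list<{POINT}>>",
    "geoarrow.multipolygon": f"list<list<list<{POINT}>>>",
}

# multipoint and linestring share a storage type, as do polygon and
# multilinestring, which is why a bare (non-extension) storage type
# can't be mapped back to a geometry type unambiguously:
assert STORAGE["geoarrow.multipoint"] == STORAGE["geoarrow.linestring"]
assert STORAGE["geoarrow.multilinestring"] == STORAGE["geoarrow.polygon"]
```

WKB (binary) and WKT (text/string) storage would be unambiguous in the sense that each maps to exactly one serialized encoding, which is presumably why accepting those non-extension inputs is less hairy.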
Force-pushed from 1dcd790 to fd31e2b
    class Encoding(enum.Enum):
        WKB = SchemaCalculator.ENCODING_WKB
        WKT = SchemaCalculator.ENCODING_WKT
        GEOARROW = SchemaCalculator.ENCODING_GEOARROW
Does this default to separated encoding? Since GEOS is interleaved I have been defaulting to interleaved encoding for all my Python APIs (And I need interleaved for lonboard). But not sure which we should suggest.
This PR only just got to the point where we can test! I haven't been able to observe meaningful differences between the two in terms of timing, although for things like small arrays of very huge linestrings I wonder if it would pop through. The real benefit would be if GEOS made this type of export available internally, since then the overhead of all the individual C API calls wouldn't overwhelm the timing.
I think that the performance characteristics (e.g., write to file, filter, take, calculate bounding box) of the separated coordinate type (particularly for points) are better, so the default here is separated. I need to benchmark that to be sure, though.

For a library like lonboard that prefers interleaved values, you can always request them:
    exporter = to_arrow(some_array)
    # Not sure if pa.data_type() exists or if it handles generic __arrow_c_schema__ imports yet...
    type = pa.data_type(exporter)
    if type.geometry_type != ga.GeometryType.GEOMETRY:
        pa.array(exporter, type.with_coord_type(ga.CoordType.INTERLEAVED))
    else:
        pa.array(exporter, type)
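For reference, the two coordinate representations differ only in buffer arrangement. A small sketch of the same three points in each, using plain Python lists to stand in for Arrow buffers (illustrative only, not the PR's code):

```python
# Three 2D points in each GeoArrow coordinate representation.
# Separated: one buffer per dimension; interleaved: x/y pairs in one buffer.
separated = {"x": [0.0, 1.0, 2.0], "y": [10.0, 11.0, 12.0]}
interleaved = [0.0, 10.0, 1.0, 11.0, 2.0, 12.0]


def to_interleaved(coords):
    """Zip separated x/y buffers into a single interleaved buffer."""
    out = []
    for x, y in zip(coords["x"], coords["y"]):
        out.extend([x, y])
    return out


assert to_interleaved(separated) == interleaved
```

GEOS stores coordinates interleaved internally, which is why an interleaved default avoids a conversion step for consumers like lonboard, while columnar operations (filter, take, bounding box) tend to favor separated buffers.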
I think …

Good call! I was thinking that …

That extra level of indirection seems like it would be annoying for most users. I'd be +1 on conditionally importing either pyarrow or nanoarrow and returning those directly.

Definitely annoying. Probably …
    if len(chunks) == 1:
        return requested_schema, chunks[0]
    else:
        raise ValueError(
            f"Can't export geometries to single chunk ({len(chunks)} required)"
        )
It's a little weird to me that we have both `__arrow_c_array__` and `__arrow_c_stream__` on the same class. I'd have usually expected that there would be a single underlying data representation in a class, and that would inform which PyCapsule API is defined on that class.

Maybe it makes sense to have both `ArrayBuilder` and `ChunkedArrayBuilder`? Maybe there are some occasions where a user wants to have all geometries in a single array. I.e. does `pyarrow.Table.from_pandas` always create a table with a single chunk? It looks like it does. In that case, exporting a non-chunked array of GeoArrow data would be important, so that the geometries could be appended on the table. And we'd presumably never want to raise an exception here.

(Ideally, I'd love `pyarrow.Table.from_pandas` to optionally export a chunked Table, but alas)
Functionally they are the same unless the export is WKB and there's more than 2GB of WKB in the array. In that case, `__arrow_c_array__` will error and `__arrow_c_stream__` won't. In nanoarrow for Python, I try to almost always use `__arrow_c_stream__` (and fall back to `__arrow_c_array__` to generate a length-one stream).

I think it is fine to have both (nanoarrow's `Array` class does this too)...if the caller needs or prefers contiguous memory, they can call `__arrow_c_array__` (if they don't, they should probably call `__arrow_c_stream__`).
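The pattern being described, one class exposing both protocols, with the stream side always valid and the array side valid only for contiguous data, can be mocked like this. The class name is hypothetical and the "chunks" are toy strings rather than real capsules:

```python
class ChunkedResult:
    """Toy model of an exporter holding one or more chunks.

    __arrow_c_stream__ always works; __arrow_c_array__ only works
    when the data happens to be a single contiguous chunk.
    """

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def __arrow_c_stream__(self, requested_schema=None):
        # A stream can yield however many chunks exist.
        return iter(self.chunks)

    def __arrow_c_array__(self, requested_schema=None):
        if len(self.chunks) != 1:
            raise ValueError(
                f"Can't export {len(self.chunks)} chunks as a single array"
            )
        return self.chunks[0]


single = ChunkedResult(["chunk0"])
many = ChunkedResult(["chunk0", "chunk1"])

assert single.__arrow_c_array__() == "chunk0"
assert list(many.__arrow_c_stream__()) == ["chunk0", "chunk1"]
```

A caller that needs contiguous memory opts into the array path and accepts the possible error; everyone else takes the stream path and never hits it.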
> Functionally they are the same

Even in the cases where there's a single chunk under the hood, I disagree that they're functionally the same because usually there are distinct APIs for contiguous and chunked arrays.

So far in my experience with the PyCapsule APIs, it's been expected that you can infer the storage format of the producer based on the dunder method that exists on the object. Is this not why `pyarrow.Array` and `pyarrow.RecordBatch` have `__arrow_c_array__` implemented but not `__arrow_c_stream__`, and vice versa for `pyarrow.ChunkedArray` and `pyarrow.Table`?
I very often use the presence of the dunder methods to decide on the most efficient way to operate on data, and I'd much prefer community consensus around whether both APIs should ever exist on a single class.
It's pretty easy to separate and I don't mind doing it, although I think that you will almost always want to call the chunked version because it's not that hard to end up with >2GB of WKB. Calling the Array version is not great for the generic case because a downstream library will get errors when converting big (mixed) arrays.
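The 2GB figure comes from Arrow's variable-size binary layout, whose value offsets are signed 32-bit integers, so one non-chunked `binary` array can address at most 2^31 - 1 bytes. A quick sanity check of how fast WKB hits that ceiling (the helper name and the 1000-vertex example are illustrative):

```python
# Arrow's `binary` type stores element boundaries as int32 offsets, so a
# single chunk can address at most 2**31 - 1 bytes of WKB payload.
MAX_INT32 = 2**31 - 1


def wkb_linestring_bytes(n_vertices):
    """Approximate WKB size of a 2D linestring.

    1 byte byte-order + 4 bytes geometry type + 4 bytes vertex count,
    then 16 bytes (two float64s) per vertex.
    """
    return 9 + 16 * n_vertices


# How many 1000-vertex linestrings fit in one non-chunked binary array?
per_geom = wkb_linestring_bytes(1000)   # 16009 bytes each
max_geoms = MAX_INT32 // per_geom       # 134142 geometries
# Past that, the export must either chunk the output (the stream path)
# or use large_binary with int64 offsets.
```

So a dataset of a couple hundred thousand detailed linestrings already forces the chunked/stream path, which supports defaulting callers to `__arrow_c_stream__`.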
Lonboard depends on pyarrow directly, so we can always rely on exporting to the pyarrow version. It would be lovely to remove this dependency (especially as that's the biggest blocker to getting lonboard to work in Wasm in jupyterlite/pyodide), but we rely heavily on …

We should maybe reimplement that in nanoarrow...

Or use nanoarrow to do it in Pandas?
@paleolimbot apologies for the delay in response here. And thanks for exploring this! Some general concerns / thoughts: …
Thanks for taking a look! It's definitely just an exploration, and one that has been helpful for me to flesh out the scope of some of these projects.
That's fair...the reason it would "have to" be here is just because this is where GEOS lives in pip land (if I understand how that works). It could also live in GEOS, but that would only work for very new GEOS versions (if they even want this in the first place). I'm happy to look after this code as long as it lives here (and the review process may help me get familiar with enough with the internals to help elsewhere).
Sure! The type inference and export are completely independent. The version here ignores EMPTYs and has some POINT + MULTIPOINT -> MULTIPOINT logic too for a higher chance of success in getting a single output type. The logic should handle exporting length-one MULTIxxx to the simple type, too (but not much real-world testing yet).
Definitely...the benefits of moving this inside the C API boundary are definitely better. It would still require something that is basically geoarrow-c-geos and tests that are basically geoarrow-c-geos' tests. Even if this never gets merged, it is probably helpful for this PR to exist to demonstrate the utility of handling Arrow IO. I don't know if it would help GDAL (which I think has its own class hierarchy for geometries).
Great! I think everything that optimizes …

Definitely! I would prefer just capsules for the interface + nanoarrow as optional (and only for running the tests).
Getting closer! This now works in both directions.
Currently requires a development version of nanoarrow for Python: