Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow MultiPoint initialization #1954

Open
jobh opened this issue Dec 18, 2023 · 5 comments
Open

Very slow MultiPoint initialization #1954

jobh opened this issue Dec 18, 2023 · 5 comments

Comments

@jobh
Copy link

jobh commented Dec 18, 2023

Expected behavior and actual behavior.

I expect that initializing a large MultiPoint geometry is roughly as expensive as initializing a LineString from the same coordinates. Instead, it is ~2000x slower.

Steps to reproduce the problem.

(using ipython syntax for statement timing):

import numpy as np
import shapely

coords = np.random.uniform(size=(1000,2))

%timeit shapely.LineString(coords)
# --> 3.39 µs

%timeit shapely.MultiPoint(coords)
# --> 6.64 ms

Most of that extra time is in the non-vectorized check for empty geometries, which can be bypassed:

%timeit shapely.multipoints(coords)
# --> 119 µs

This is still a factor of ~50x slower. I haven't investigated why. But vectorizing that empty-geometry check seems like low-hanging fruit for a ~50x speedup.

Operating system

Ubuntu

Shapely version and provenance

Shapely v2.0.1 from conda-forge. I did check the latest main, it seems like the non-vectorized error check is still present.

@JamesParrott
Copy link

JamesParrott commented Jan 1, 2024

The non-vectorised check does indeed look like it could be improved (if only by iterating over points, not over a range).

for i in range(m):

But users can choose multipoints or MultiPoint, if the speedup is important to them.

As to the why, MultiPoint is typed for any sequence, whereas multipoints expects an array like argument. It (multipoints) is also wrapped in a multithreading decorator, and calls out to create_collection which is implemented in a C extension.

@jobh
Copy link
Author

jobh commented Jan 2, 2024

Thanks for the explanation @JamesParrott!

As you say, users can choose the faster variant, as I did. I guess there is no easy vectorization fix if the sequence can be heterogeneous.

Maybe it would help to put a pointer to this option in the documentation or docstring? Feel free to close the issue if desired, I opened it only because I was so surprised by the almost 2000x performance difference compared to LineString initialization.

@JamesParrott
Copy link

The docs talk about ufuncs at the start. But even so, if multipoints is a ufunc, yes indeed the docs could be much clearer. I would add something at the start of the Geometry / API section here:

https://shapely.readthedocs.io/en/stable/geometry.html

Otherwise, Shapely already forces users to install Numpy. The Shapely docs should prioritise the parts of the API that make the most of Numpy.

@jorisvandenbossche
Copy link
Member

Thanks for the issue! It seems this can be improved indeed. While you have the option to use shapely.multipoints() manually, we can easily improve MultiPoint(..) as well specifically in the case of a numpy array. We did a similar fix to the LineString(..) (and LinearRing/Polygon) constructor at #1602. Adding the same check to the MultiPoint constructor (ensuring for a numpy array we directly call multipoints(..) under the hood) speeds things up:

In [1]: import numpy as np
   ...: import shapely
   ...: 
   ...: coords = np.random.uniform(size=(1000,2))

In [2]: %timeit shapely.MultiPoint(coords)
11.7 ms ± 95.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- main branch
334 µs ± 6.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # <-- with the patch

It's true that this is still slower than LineString, but I think that's an inherent difference: a LineString is a simpler geometry, backed by a single coordinate sequence under the hood, while a MultiPoint is a collection of Point objects, and thus more expensive to create.

yes indeed the docs could be much clearer. I would add something at the start of the Geometry / API section here:

Yes, that's a good point. That page could definitely mention the vectorized functions to create geometries. And in addition, also each docstring of Point/LineString/Polygon/.. etc could mention the equivalent ufunc.

@jorisvandenbossche
Copy link
Member

-> #1961 with the patch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants