Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fast-path implementation of point-to-point distance #2038

Open
martinfleis opened this issue Apr 15, 2024 · 5 comments
Open

fast-path implementation of point-to-point distance #2038

martinfleis opened this issue Apr 15, 2024 · 5 comments

Comments

@martinfleis
Copy link
Contributor

I was looking at the differences in performance of distance shown in this benchmark and found out, that sf has a special fast path implementation of distance computation if all geometries are points on a plane - https://github.com/r-spatial/sf/blob/32aa6c8c53005245493540b1359bc35fe50f0115/R/geom-measures.R#L200-L203

I think we could do the same, as if we measure the distance between points using numpy, we can get significant performance boosts. See below a replication of that benchmark using GEOS 3.12.1 and shapely main:

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points.reshape(-1, 1))
# 4 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords[:, None] - coords, axis=2)
# 328 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# but we can also be explicit, and faster
(shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, None][:, :, 0] - coords[:,0])**2 + (coords[:, None][:, :,1] - coords[:,1])**2)
# 71.2 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is more than 50x faster than the GEOS version with identical results.

Would there be an appetite to get this implemented directly in shapely?

@theroggy
Copy link
Contributor

Somewhat related I noticed the following stackoverflow question that is somewhat related, but rather about geographic distances...
https://stackoverflow.com/questions/78324162/need-to-compute-distances-between-each-of-60-000-coordinates

I also wonder why it would/can/is faster that way than the implementation in GEOS. Maybe the implementation is GEOS could be improved?

@martinfleis
Copy link
Contributor Author

Given the GEOS API is designed to do distance between single points, it is the best to do this kind of optimisation here. See libgeos/geos#1066

@martinfleis
Copy link
Contributor Author

Also see libgeos/geos#1067

@jorisvandenbossche
Copy link
Member

Thanks for opening that upstream issue, that seems to have triggered a nice speed-up ;)

sf has a special fast path implementation of distance computation if all geometries are points on a plane - https://github.com/r-spatial/sf/blob/32aa6c8c53005245493540b1359bc35fe50f0115/R/geom-measures.R#L200-L203

Does sf keep track that all the geometries in a collection (dataframe) are of a certain type?

It would be _some_overhead to have to check this (although also not that big, below for a binary case, not matrix):

In [25]: import shapely
    ...: import numpy as np

In [26]: points = shapely.points(np.random.rand(4000, 2))
    ...: points2 = shapely.points(np.random.rand(4000, 2))

In [27]: %timeit shapely.distance(points, points2)
671 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: %timeit (shapely.get_type_id(points) == 0).all()
33.8 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Or it could be a specialized API? (distance_points, and let it be the responsibility of the user to know and/or verify this)

@martinfleis
Copy link
Contributor Author

Does sf keep track that all the geometries in a collection (dataframe) are of a certain type?

If I am reading that code correctly, sf checks whether the input is a subclass of sfc_POINT to determine the geometry type. sf reports geometry type when it prints a dataframe (see docs). So probably yes, they are likely aware of the type and do not need to check all geoms before passing them to distance function.

It would be _some_overhead to have to check this

To minimise this overhead for clearly non-point arrays, you can first check for the geom type of the first element and check all only if the first one is point. Though given how tiny it is, it may not be worth the effort.

I would prefer including this in distance rather than in a separate function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants