fast-path implementation of point-to-point distance #2038

martinfleis · 2024-04-15T08:01:41Z

I was looking at the differences in performance of distance shown in this benchmark and found out, that sf has a special fast path implementation of distance computation if all geometries are points on a plane - https://github.com/r-spatial/sf/blob/32aa6c8c53005245493540b1359bc35fe50f0115/R/geom-measures.R#L200-L203

I think we could do the same, as if we measure the distance between points using numpy, we can get significant performance boosts. See below a replication of that benchmark using GEOS 3.12.1 and shapely main:

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points.reshape(-1, 1))
# 4 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords[:, None] - coords, axis=2)
# 328 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# but we can also be explicit, and faster
(shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, None][:, :, 0] - coords[:,0])**2 + (coords[:, None][:, :,1] - coords[:,1])**2)
# 71.2 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is more than 50x faster than the GEOS version with identical results.

Would there be an appetite to get this implemented directly in shapely?

The text was updated successfully, but these errors were encountered:

theroggy · 2024-04-15T08:20:49Z

Somewhat related I noticed the following stackoverflow question that is somewhat related, but rather about geographic distances...
https://stackoverflow.com/questions/78324162/need-to-compute-distances-between-each-of-60-000-coordinates

I also wonder why it would/can/is faster that way than the implementation in GEOS. Maybe the implementation is GEOS could be improved?

martinfleis · 2024-04-15T15:11:52Z

Given the GEOS API is designed to do distance between single points, it is the best to do this kind of optimisation here. See libgeos/geos#1066

martinfleis · 2024-04-15T15:59:37Z

Also see libgeos/geos#1067

jorisvandenbossche · 2024-04-17T08:46:05Z

Thanks for opening that upstream issue, that seems to have triggered a nice speed-up ;)

sf has a special fast path implementation of distance computation if all geometries are points on a plane - https://github.com/r-spatial/sf/blob/32aa6c8c53005245493540b1359bc35fe50f0115/R/geom-measures.R#L200-L203

Does sf keep track that all the geometries in a collection (dataframe) are of a certain type?

It would be _some_overhead to have to check this (although also not that big, below for a binary case, not matrix):

In [25]: import shapely
    ...: import numpy as np

In [26]: points = shapely.points(np.random.rand(4000, 2))
    ...: points2 = shapely.points(np.random.rand(4000, 2))

In [27]: %timeit shapely.distance(points, points2)
671 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: %timeit (shapely.get_type_id(points) == 0).all()
33.8 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Or it could be a specialized API? (distance_points, and let it be the responsibility of the user to know and/or verify this)

martinfleis · 2024-04-17T09:21:36Z

Does sf keep track that all the geometries in a collection (dataframe) are of a certain type?

If I am reading that code correctly, sf checks whether the input is a subclass of sfc_POINT to determine the geometry type. sf reports geometry type when it prints a dataframe (see docs). So probably yes, they are likely aware of the type and do not need to check all geoms before passing them to distance function.

It would be _some_overhead to have to check this

To minimise this overhead for clearly non-point arrays, you can first check for the geom type of the first element and check all only if the first one is point. Though given how tiny it is, it may not be worth the effort.

I would prefer including this in distance rather than in a separate function.

martinfleis mentioned this issue Apr 15, 2024

More complex benchmarks kadyb/vector-benchmark#13

Open

martinfleis mentioned this issue Apr 15, 2024

Performance optimisation of point-to-point distance libgeos/geos#1066

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fast-path implementation of point-to-point distance #2038

fast-path implementation of point-to-point distance #2038

martinfleis commented Apr 15, 2024

theroggy commented Apr 15, 2024

martinfleis commented Apr 15, 2024

martinfleis commented Apr 15, 2024

jorisvandenbossche commented Apr 17, 2024

martinfleis commented Apr 17, 2024

fast-path implementation of point-to-point distance #2038

fast-path implementation of point-to-point distance #2038

Comments

martinfleis commented Apr 15, 2024

theroggy commented Apr 15, 2024

martinfleis commented Apr 15, 2024

martinfleis commented Apr 15, 2024

jorisvandenbossche commented Apr 17, 2024

martinfleis commented Apr 17, 2024