Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: use equals_identical for __eq__ #1760

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Feb 13, 2023

Closes #1732, closes #1775

This might be a bit overkill for just implementing __eq__, but this should also allow to actually expose "equals_identical" as a ufunc for all GEOS versions (and not only for GEOS>=3.12). And it is also faster than the easier python alternative (but not sure how significant this will be in real-world cases):

geom = shapely.LineString(np.random.randn(1000, 2))

In [19]: %timeit geom == geom
10.8 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [20]: %timeit np.allclose(shapely.get_coordinates(geom), shapely.get_coordinates(geom), rtol=0, equal_nan=True)
76.5 µs ± 599 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jorisvandenbossche
Copy link
Member Author

Additional complication: I forgot that GEOSCoordSeq_copyToBuffer_r is only available for more recent GEOS versions (>= 3.10), so if we want to go this way, we need a fallback for older versions (we actually do already have a similar fallback implemented for the equivalent GEOSCoordSeq_copyFromBuffer_r).

The other reason that some tests are failing is because this PR actually changes behaviour regarding NaNs. On main, we don't regard such geometries as equal:

In [2]: import shapely

In [3]: line_string_nan = shapely.LineString([(np.nan, np.nan), (np.nan, np.nan)])

In [4]: line_string_nan
Out[4]: <LINESTRING (NaN NaN, NaN NaN)>

In [5]: line_string_nan == line_string_nan
Out[5]: False

while with this PR, I made NaNs in the same location as equal (following what was done for GEOSEqualsIdentical upstream in libgeos/geos#810).

With shapely 1.x, this also resulted in False, because we implemented __eq__ as tuple(self.coords) == tuple(other.coords), which ends up comparing the scalar coordinate values, and then nan != nan.

@idanmiara
Copy link
Contributor

@jorisvandenbossche Thanks for fixing this issue!

Copy link
Collaborator

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche

A few relatively minor comments on this so far; assume that this is still WIP waiting on GEOSCoordSeq_copyToBuffer_r.

Should this use GEOSEqualsIdentical if available from GEOS and skip the custom implementation here?

shapely/tests/geometry/test_equality.py Outdated Show resolved Hide resolved
shapely/tests/geometry/test_equality.py Outdated Show resolved Hide resolved
shapely/tests/geometry/test_equality.py Outdated Show resolved Hide resolved
shapely/tests/geometry/test_equality.py Outdated Show resolved Hide resolved
src/pygeos.c Outdated Show resolved Hide resolved
src/pygeos.c Outdated Show resolved Hide resolved
src/pygeos.c Show resolved Hide resolved
src/pygeos.c Outdated Show resolved Hide resolved
src/pygeos.c Outdated Show resolved Hide resolved
src/pygeos.c Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

@brendan-ward thanks a lot for the detailed review.
Higher level question: does that mean you would be OK with the behaviour change to consider NaNs in the same location in the coordinate sequence as equal? Because that's a deviation from what __eq__ does now, so something we should decide consciously.

@idanmiara
Copy link
Contributor

idanmiara commented Feb 14, 2023

@brendan-ward thanks a lot for the detailed review. Higher level question: does that mean you would be OK with the behaviour change to consider NaNs in the same location in the coordinate sequence as equal? Because that's a deviation from what __eq__ does now, so something we should decide consciously.

For me the proposed behavior makes much more sense, that NaNs would be considered equal.
Say for instance we can compare safely 3D points that happen to have Z=NaN assert Point(1,2,NaN) == Point(1,2,NaN)
I was actually going to rely on this equality on some project, and if they don't compare as equal I would need to workaround it.

@brendan-ward
Copy link
Collaborator

does that mean you would be OK with the behaviour change to consider NaNs in the same location in the coordinate sequence as equal?

Good to call me on it; I intentionally sidestepped answering that question. 😄
(was only suggesting comments to better indicate this direction)

I personally haven't relied on this behavior and thus would not be impacted either way (so take my perspective with a grain of salt), and find NaNs to be challenging in coordinate sequences throughout the codebase (I always go back to the comment in pygeos #306 of NaNs make everything bad, don't let them in). I think what you have done here is reasonable in that they are functionally equivalent if they occur at exactly the same position even though NaNs are not themselves comparable.

So long as this is clearly noted in the changelog, that hopefully avoids surprises for users, and I really hope very few if any are relying on specific comparisons where NaNs suddenly switch from incomparable (equals is false) to equivalent if functionally equal.

@jorisvandenbossche
Copy link
Member Author

Yes, fully agreed on that ideally you never have NaNs in coordinate sequences to start with, and we can improve APIs to prevent users from getting that.
However, if (for whatever reason) you have NaNs, one should be able to test if two geometries are identical (including their NaNs) in some way. The question then of course is whether __eq__ needs to be this way. If we add an equals_identical() function, people can also use that.

I also don't have real-world experience to base an opinion on, but I have the feeling it makes sense to let __eq__ be this "identical" check, since you have equals() for spatial equality and equals_exact() if you want to be able to specify some tolerance.
Whatever we choose, it's always somewhat ambiguous what __eq__ exactly does, because of it being an operator and not a function (also right now, users don't necessarily know whether it calls equals() or equals_exact())

@idanmiara
Copy link
Contributor

if (for whatever reason) you have NaNs, one should be able to test if two geometries are identical (including their NaNs) in some way. The question then of course is whether __eq__ needs to be this way.

(Hope I understood correctly) I strongly agree that __eq__ needs to return True on two geometries that are equal and have NaNs in the same locations, like Point(1,2,NaN) == Point(1,2,NaN)

@jorisvandenbossche
Copy link
Member Author

OK, updated this PR for the review comments. This is now "generally" passing, i.e. for most GEOS versions except for some corner cases where GEOS 3.8 seems to behave differently (for empty geometries, have to further diagnose this). The tests for GEOS main are failing, but those are failing on main as well (and related to libgeos/geos#832).

I also opened a dedicated issue for the API question regarding how NaNs are treated for visibility: #1775

@idanmiara
Copy link
Contributor

Hi @jorisvandenbossche is there an estimated release date for the next release including this important fix? Thanks!!!

@brendan-ward
Copy link
Collaborator

Thanks for the updates here @jorisvandenbossche . Can you please add a CHANGELOG entry?

The only part I'm still struggling with is the equality tests with Z nan values that are dependent on the GEOS version, where it returns False for GEOS <= 3.9 and GEOS >= 3.12, but True for GEOS 3.10-3.11. The tests document this behavior, which if I'm following correctly, is existing behavior independent of this PR. But since this doesn't have an obvious docstring where we could note this behavior, I wonder if maybe adding an Equality section to docs/geometry.rst might be the place to add such a note? (and also a place to clarify the 3 different types of equality tests: __eq__, equals (predicate), equals_exact).

src/pygeos.c Show resolved Hide resolved
@idanmiara
Copy link
Contributor

Hi @jorisvandenbossche @brendan-ward ,
Just wandering if any help is needed with this PR.
Please advise. 😄

jorisvandenbossche and others added 3 commits October 8, 2023 23:03
Co-authored-by: Idan Miara <idan@miara.com>
Co-authored-by: Idan Miara <idan@miara.com>
@jorisvandenbossche jorisvandenbossche changed the title REGR: fix __eq__ to compare for identical geometries (not ignoring z dimension) API: use equals_identical for __eq__ Oct 9, 2023
@coveralls
Copy link

coveralls commented Nov 26, 2023

Pull Request Test Coverage Report for Build 8233067448

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 3 of 3 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 88.057%

Totals Coverage Status
Change from base Build 8222997709: 0.2%
Covered Lines: 2588
Relevant Lines: 2939

💛 - Coveralls

@jorisvandenbossche
Copy link
Member Author

This is ready for another review.

@jorisvandenbossche
Copy link
Member Author

@mwtoews I added some minimal tests for (in)equality with M coordinates

Copy link
Member

@mwtoews mwtoews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have a few suggestions which you can take or leave.

shapely/predicates.py Outdated Show resolved Hide resolved
shapely/predicates.py Outdated Show resolved Hide resolved
@@ -203,6 +203,6 @@ extern enum ShapelyErrorCode coordseq_from_buffer(GEOSContextHandle_t ctx,
npy_intp cs2,
GEOSCoordSequence** coord_seq);
extern int coordseq_to_buffer(GEOSContextHandle_t ctx, const GEOSCoordSequence* coord_seq,
double* buf, unsigned int size, unsigned int dims);
double* buf, unsigned int size, int hasZ, int hasM);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a consistent naming convention:

Suggested change
double* buf, unsigned int size, int hasZ, int hasM);
double* buf, unsigned int size, int has_z, int has_m);

(same with src/geos.c)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed more consistent style here, but in several other places in the src files, we already use hasZ pattern, so going to leave this one as is for this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be left for now and further discussed in #1648 as it can change later.

But to clarify my suggestion a bit more, the argument naming convention "foo_bar" is used in src/*.h header files, and even skimming header files from numpy.get_include() too. Data types and internal variables are fine to take whatever convention they wish (i.e. "fooBar").

src/pygeos.c Outdated Show resolved Hide resolved
jorisvandenbossche and others added 2 commits March 11, 2024 13:45
Co-authored-by: Mike Taves <mwtoews@gmail.com>
@jorisvandenbossche
Copy link
Member Author

@mwtoews thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants