Invalid Hash implementation #142

Open · orlp opened this issue Oct 2, 2023 · 5 comments
orlp (Contributor) commented Oct 2, 2023

Currently the Hash implementation just assumes that the FloatCore type has mantissa/exponent widths like f64's. This is not necessarily true: one could have a valid custom FloatCore type that does not fit these constraints.

Furthermore, the hash implementation is rather slow, because it reconstructs the bit representation from integer_decode (https://rust.godbolt.org/z/hTTqnaso6). That's 10-15 cycles per float for what should be a no-op, which can be more expensive than the hash function itself!

I propose we add a StdFloat sealed trait that is implemented for f32 and f64 only, and use that as the bound for the Hash implementation. Then the hash implementation can use .to_bits().
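Roughly, something along these lines (StdFloat, its hash_bits method, and the OrderedFloat stand-in are illustrative names only, not the crate's current API; a real version would have to match the crate's Eq semantics exactly):

```rust
use std::hash::{Hash, Hasher};

// Sealed so that only f32 and f64 can ever implement StdFloat.
mod private {
    pub trait Sealed {}
    impl Sealed for f32 {}
    impl Sealed for f64 {}
}

pub trait StdFloat: private::Sealed + Copy {
    type Bits: Hash;
    // Bit pattern to hash, canonicalized so that values which compare
    // equal also hash equally.
    fn hash_bits(self) -> Self::Bits;
}

impl StdFloat for f32 {
    type Bits = u32;
    fn hash_bits(self) -> u32 {
        // All NaNs hash alike; `+ 0.0` maps -0.0 to +0.0.
        if self.is_nan() { f32::NAN.to_bits() } else { (self + 0.0).to_bits() }
    }
}

impl StdFloat for f64 {
    type Bits = u64;
    fn hash_bits(self) -> u64 {
        if self.is_nan() { f64::NAN.to_bits() } else { (self + 0.0).to_bits() }
    }
}

// Stand-in for the crate's wrapper type, to keep the sketch self-contained.
pub struct OrderedFloat<T>(pub T);

impl<T: StdFloat> Hash for OrderedFloat<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash_bits().hash(state);
    }
}
```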

Unfortunately, this is a breaking change. However, I must reiterate that the current implementation is invalid for the FloatCore trait bound: it's possible to write a perfectly sane type implementing FloatCore that would violate Eq/Hash parity when used in OrderedFloat, through no fault of the original type.


An alternative solution is to feed the sign, mantissa and exponent returned by f.integer_decode() separately into the hasher. However, this would make the hash implementation slower still.
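For comparison, a minimal sketch of that alternative (NaN canonicalization, which OrderedFloat's Eq would still require, is omitted here):

```rust
use std::hash::{Hash, Hasher};
use num_traits::float::FloatCore;

// Feed the integer_decode() parts into the hasher one by one.
// This avoids re-packing bits, but hashes three fields per float.
fn hash_float_parts<F: FloatCore, H: Hasher>(f: F, state: &mut H) {
    let (mantissa, exponent, sign) = f.integer_decode();
    mantissa.hash(state);
    exponent.hash(state);
    sign.hash(state);
}
```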

mbrubeck (Collaborator) commented Oct 2, 2023

Currently the Hash implementation just assumes that the FloatCore type has mantissa/exponent widths like f64's. This is not necessarily true: one could have a valid custom FloatCore type that does not fit these constraints.

Hmm. The FloatCore::integer_decode docs require that it “returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.”

We throw away the top few bits of the mantissa and exponent, which could be used in some custom implementations. This may increase collisions, but should not cause the Hash implementation to be incorrect.
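For illustration, packing the decoded parts back into f64-shaped fields necessarily truncates anything wider. A sketch of the idea (not our exact code):

```rust
// f64 field layout: 1 sign bit, 11 exponent bits, 52 mantissa bits.
const MAN_MASK: u64 = 0x000f_ffff_ffff_ffff;
const EXP_MASK: u64 = 0x7ff0_0000_0000_0000;
const SIGN_MASK: u64 = 0x8000_0000_0000_0000;

// Mantissa bits above 52 and exponent bits above 11 are discarded, so
// distinct values of a wider custom float type can collide.
fn packed_bits(mantissa: u64, exponent: i16, sign: i8) -> u64 {
    (mantissa & MAN_MASK)
        | (((exponent as u16 as u64) << 52) & EXP_MASK)
        | ((((sign < 0) as u64) << 63) & SIGN_MASK)
}
```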

Can you give an example of a correct FloatCore implementation that leads to an invalid Hash implementation?

I propose we add a StdFloat sealed trait that is implemented for f32 and f64 only, and use that as the bound for the Hash implementation. Then the hash implementation can use .to_bits().

Or if necessary we could implement Hash directly just for OrderedFloat<f32>, OrderedFloat<f64>, NotNan<f32>, and NotNan<f64>.
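A sketch of that option for one of the four impls (stand-in type definitions included so the snippet stands on its own; the other three impls would follow the same pattern):

```rust
use std::hash::{Hash, Hasher};

// Stand-ins for the crate's wrapper types.
pub struct OrderedFloat<T>(pub T);
pub struct NotNan<T>(pub T);

impl Hash for OrderedFloat<f64> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Canonicalize so values that compare equal hash equally:
        // all NaNs hash alike, and `+ 0.0` maps -0.0 to +0.0.
        let bits = if self.0.is_nan() {
            f64::NAN.to_bits()
        } else {
            (self.0 + 0.0).to_bits()
        };
        bits.hash(state);
    }
}

impl Hash for NotNan<f64> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // NotNan excludes NaN by construction, so only -0.0 needs care.
        (self.0 + 0.0).to_bits().hash(state);
    }
}
```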

Furthermore, the hash implementation is rather slow

Ouch, this is unfortunate. I wish we had specialization so we could easily fix this without a breaking change...

mbrubeck (Collaborator) commented Oct 2, 2023

Can you give an example of a correct FloatCore implementation that leads to an invalid Hash implementation?

Ah, I guess one possibility is that the mantissa is not normalized, so that the number 2 (for example) could be represented in multiple ways like (1, 1, 1) and (2, 0, 1) and (4, -1, 1).
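A quick check of those triples, written in integer_decode()'s (mantissa, exponent, sign) order:

```rust
fn main() {
    // Three hypothetical decodings of the value 2.0.
    let reprs: [(u64, i16, i8); 3] = [(1, 1, 1), (2, 0, 1), (4, -1, 1)];

    for &(mantissa, exponent, sign) in &reprs {
        // Each one recovers 2.0 via sign * mantissa * 2^exponent ...
        let value = sign as f64 * mantissa as f64 * 2f64.powi(exponent as i32);
        assert_eq!(value, 2.0);
    }
    // ... but hashing the triples (or bits packed from them) without
    // normalizing first yields different hashes for the same value,
    // breaking the Eq/Hash contract.
}
```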

mbrubeck (Collaborator) commented Oct 2, 2023

I believe the generic implementation could be fixed by adding a normalization step, but this would make it even slower. In the absence of specialization, I am leaning toward your proposal of implementing Hash only for the primitive types, using to_bits. Feel free to submit a pull request, or I can do it when I have time.
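Roughly, the normalization step would reduce each decoded triple to a single canonical form before hashing. A sketch (NaN would still need to be special-cased before decoding):

```rust
use std::hash::{Hash, Hasher};
use num_traits::float::FloatCore;

// Strip trailing zero bits from the mantissa and fold them into the
// exponent, and force a canonical zero, so every value has exactly one
// representation. E.g. (1, 1, 1), (2, 0, 1), and (4, -1, 1) all
// normalize to (1, 1, 1).
fn normalize(mantissa: u64, exponent: i16, sign: i8) -> (u64, i16, i8) {
    if mantissa == 0 {
        // One canonical zero, so that -0.0 and +0.0 hash alike.
        return (0, 0, 1);
    }
    let shift = mantissa.trailing_zeros();
    (mantissa >> shift, exponent + shift as i16, sign)
}

fn hash_normalized<F: FloatCore, H: Hasher>(f: F, state: &mut H) {
    let (mantissa, exponent, sign) = f.integer_decode();
    normalize(mantissa, exponent, sign).hash(state);
}
```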

I think the large majority of users only care about the primitive types, so performance in this common case is probably more important than support for very rare use cases. If there is demand for Hash for custom float types, we can add a way to opt in to that later.

orlp (Contributor, Author) commented Oct 2, 2023

We throw away the top few bits of the mantissa and exponent, which could be used in some custom implementations. This may increase collisions, but should not cause the Hash implementation to be incorrect.

You are right, extra collisions technically don't make it incorrect, although I do consider such systematic collisions problematic. So my initial statement was overstated.

Ah, I guess one possibility is that the mantissa is not normalized, so that the number 2 (for example) could be represented in multiple ways like (1, 1, 1) and (2, 0, 1) and (4, -1, 1).

That said, you are right again: this does make it technically incorrect.

Ouch, this is unfortunate. I wish we had specialization so we could easily fix this without a breaking change...

So do I... so do I...

Or if necessary we could implement Hash directly just for OrderedFloat<f32>, OrderedFloat<f64>, NotNan<f32>, and NotNan<f64>.

My biggest concern with this is that it breaks users' generic code in a way that's more annoying to fix. At least with a StdFloat trait the code can be made generic again relatively easily by adding a StdFloat bound.
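For example, building on the illustrative StdFloat/OrderedFloat sketch further up (with the crate's real OrderedFloat, the Hash + Eq bounds below are satisfied for f32 and f64 automatically), downstream code stays generic with one extra bound:

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Previously generic over any FloatCore; it stays generic by bounding
// on the (hypothetical) sealed StdFloat trait instead.
fn dedup<T: StdFloat>(values: Vec<OrderedFloat<T>>) -> HashSet<OrderedFloat<T>>
where
    OrderedFloat<T>: Hash + Eq,
{
    values.into_iter().collect()
}
```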

mbrubeck (Collaborator) commented Dec 6, 2023

My current plan is to adopt your suggested fix, starting with ordered-float 5.0.0.

I’m waiting a bit to fix this in case there are other breaking changes we can bundle in the same release.
