Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement Hasher trait for Hasher #87

Closed
wants to merge 1 commit into from
Closed

Conversation

Luro02
Copy link

@Luro02 Luro02 commented May 19, 2020

src/lib.rs Outdated Show resolved Hide resolved
@oconnor663
Copy link
Member

(Just dropping in to make a general apology for not being responsive this week. I'm ramping up at a new job. I will probably need to get to this PR over the weekend.)

@oconnor663
Copy link
Member

oconnor663 commented May 23, 2020

Could you say more about the use case here? blake3::Hash implements core::hash::Hash, so I'd expect most callers should finalize manually and use the Hash rather than the Hasher as a map key. (Note that finalize is a &self method, and you can update and finalize multiple times if necessary, as this PR does in finish.)

One of my worries is that blake3::Hasher is "very mutable". It's entire purpose as a type is to be mutated with update. Using it as a HashMap key makes it tempting to call update directly on a map's keys. But that would violate the HashMap contract and lead to weird logic bugs like keys disappearing. (Not technically undefined behavior, but still dangerous and very hard to debug.)

@Luro02
Copy link
Author

Luro02 commented May 24, 2020

Could you say more about the use case here?

When I was looking through the documentation I found the blake3::Hasher struct, which reminded me of the Hasher trait that has almost the same interface, so I thought it might be useful to have the trait implemented.

This trait makes it possible to hash any type implementing Hash with the blake3::Hasher, so one could for example create a blake3 hash for a struct, without converting it to bytes first:

use std::hash::Hash;
use std::hash::Hasher as _;

use blake3::Hasher;

#[derive(Hash)]
pub struct Example {
    pub field_0: String,
    pub field_1: usize,
}

fn main() {
    let example = Example {
        field_0: "Hello".into(),
        field_1: 123,
    };

    let mut hasher = Hasher::new();

    // this does only work when the `Hasher` trait is implemented for `Hasher`
    example.hash(&mut hasher);

    let hash = hasher.finish();

    assert_eq!(11927595719148982277, hash);
}

Sometimes it isn't even possible to convert a struct to bytes, in which case only the Hasher trait works.

blake3::Hash implements core::hash::Hash, so I'd expect most callers should finalize manually and use the Hash rather than the Hasher as a map key.

By using Hasher as a map key you mean something like this?:

use std::collections::HashMap;
use blake3::Hasher;

let mut map = HashMap::new();

map.insert(Hasher::new(), 0_usize);

(this is not possible, because blake3::Hasher would have to implement the Hash trait).

It is possible to use the blake3::Hasher to hash the HashMap keys through something like this:

use std::collections::HashMap;
use std::hash::BuildHasher;

/// This is a custom type implementing the `BuildHasher` trait,
/// which is required by `HashMap::with_hasher`
struct Blake3HasherBuilder;

impl BuildHasher for Blake3HasherBuilder {
    // NOTE: this only works if blake3::Hasher implements the `Hasher` trait
    type Hasher = blake3::Hasher;
    
    fn build_hasher(&self) -> Self::Hasher { blake3::Hasher::new() }
}

let mut map = HashMap::with_hasher(hash_builder);

map.insert("hello", "world"); // the "hello" key would be hashed with blake3::Hasher instead of the `DefaultHasher`

One of my worries is that blake3::Hasher is "very mutable". It's entire purpose as a type is to be mutated with update. Using it as a HashMap key makes it tempting to call update directly on a map's keys. But that would violate the HashMap contract and lead to weird logic bugs like keys disappearing. (Not technically undefined behavior, but still dangerous and very hard to debug.)

I think you might have confused the Hash and Hasher traits.

Hash

The Hash trait should be implemented by all types that can be hashed. For example

struct HelloNumber {
    hello: &'static str,
    number: usize,
}

can be hashed and therefore one would implement the Hash trait for it:

use std::hash::Hash;

impl Hash for HelloNumber {
    fn hash<H>(&self, state: &mut H)
    where
        H: std::hash::Hasher,
    {
        state.write(self.hello.as_bytes());
        state.write_usize(self.number);
    }
}

or one could simply derive it:

#[derive(Hash)]
struct HelloNumber {
    hello: &'static str,
    number: usize,
}

One is then able to use it wherever a type has to implement the Hash trait. For example as a HashMap key (requires the type to also implement Eq and PartialEq).

Hasher

The Hash trait is generic and allows to use any Hasher that implements the Hasher trait, which makes it possible to hash types, with any Hasher that I want, e.g. md5, sha1, sha256 or I could create my own hasher:

use std::hash::{Hasher, Hash};

// Each time you write some data to the `Hasher` the counter is incremented by one.
pub struct StupidHasher(pub u64);

impl Hasher for StupidHasher {
    fn finish(&self) -> u64 { self.0 }

    fn write(&mut self, _bytes: &[u8]) { self.0 += 1 }
}

fn main() {
    #[derive(Hash)]
    struct Example {
        pub field_1: &'static str,
        pub field_2: usize,
    }
    
    let example = Example {
        field_1: "hello".into(),
        field_2: 123
    };

    let mut stupid_hasher = StupidHasher(0);

    example.hash(&mut stupid_hasher);

    // 1st call: hash b"hello"
    // 2nd call: hash 0xFF 
    // dont know why, but this is how it is implemented:
    // https://doc.rust-lang.org/src/core/hash/mod.rs.html#611-616
    // 3rd call: hash 123_usize
    assert_eq!(stupid_hasher.0, 3);
    assert_eq!(stupid_hasher.0, stupid_hasher.finish());
}

@oconnor663
Copy link
Member

I think you might have confused the Hash and Hasher traits.

I definitely did! I need to reread this from the beginning.

@newpavlov
Copy link

newpavlov commented Jun 10, 2020

core::hash::Hasher trait is generally not intended for functions like MD-5, SHA-2, etc. This crate already implements digest and crypto-mac traits, so unless you plan to use BLAKE3 with hash tables I don't think Hasher trait should be implemented by it. Speed of BLAKE3 may make it suitable for hash tables, but it's worth to first compare performance with the default hasher (right now it's SipHash).

@oconnor663
Copy link
Member

oconnor663 commented Jun 21, 2020

Now that I understand better what's going on here, my biggest reservation is that std::hash::Hasher limits its output to 8 bytes. That's too small for most cryptographic purposes. (Though it works well for hash tables, which I think was the main (only?) design goal for the trait.) I think there's a real risk that users might not realize this limitation, and might think that the advertised cryptographic properties of BLAKE3 apply even when it's used through this interface, which could lead to security issues.

This trait makes it possible to hash any type implementing Hash with the blake3::Hasher, so one could for example create a blake3 hash for a struct, without converting it to bytes first:

Coincidentally, we've been thinking a lot about this exact problem at work. Uncanny timing! I think the problem of hashing structured data is really interesting and useful but also really subtle. (There must be rigorous papers on this, but I don't know of any. If anyone reading along can link to research, please do.) My tldr on this is that I think there are some commonly expected properties that the Rust standard library's approach is missing, so "do what the Rust standard library does" is unlikely to be a good solution.

Diving into that, here are some example questions:

  • Is the std::hash::Hash serialization of a structure guaranteed to be consistent across different machines? This is a requirement for any hash that might get published, for example by saving it in a file or writing it to a database. I think the standard library's implementation does not have this property. For example, integers are currently hashed according to their native endianness. Also the hash of a usize depends on the word size of the machine.
  • Is std::hash::Hash serialization guaranteed to be stable across different versions of the Rust compiler? This is also a requirement for any hash that might get published. I'm not aware of any cases where it's changed in practice, but also I don't think we have any guarantee that it won't.
  • What does it mean to hash an unordered container like a HashMap? Is it disallowed? Is it nondeterministic? Or does it imply sorting (and therefore an Ord requirement on the contents)? I think the standard library disallows this by not implementing std::hash::Hash for such containers.
  • What changes to a struct definition will change its hash? Changing its name? Changing the names of its fields? Changing the order of its fields? Can two structurally distinct objects give the same hash? Is there any way to extend a struct with new optional fields, while preserving its hash in cases where those fields are unset? (I believe Rust's answers are no, no, yes, yes, no.)

That very last question in particular, about whether new fields can be added in a backwards-compatible way, is a central design goal of Protobuf and a very useful property of many other serialization formats like JSON and MessagePack. I think that an Industrial Strength™ design for hashing structured data will need to take a position on this, and I think users are likely to need to know the details of that position and design around them.

@newpavlov
Copy link

In RustCrypto we had the following discussions about hashing structures some time ago:
RustCrypto/hashes#29
RustCrypto/utils#2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants