Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to use custom hash functions in hashing.hash and Memory #1232

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

judahrand
Copy link

@judahrand judahrand commented Oct 15, 2021

This Pull Request is a suggestion to improve #343.

It adds the ability to register and use custom hash functions. Additionally, it exposes the ability to choose which hash function is used to hash arguments in Memory.

import pandas as pd
import numpy as np
import scipy.sparse

import joblib
import timeit


rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(100000, 100))
X = rng.rand(100000, 100)
X_csr = scipy.sparse.rand(1000, 10000, random_state=rng)

for data in [df, X, X_csr]:
    print('# {}, shape={}'.format(type(data).__name__, data.shape))
    print('MD5       joblib.hash          ', end='')
    print(timeit.timeit("joblib.hash(data, hash_name='md5')", globals=globals(), number=100))
    print('XXH3_64   joblib.hash          ', end='')
    print(timeit.timeit("joblib.hash(data, hash_name='xxh3_64')", globals=globals(), number=100))
# DataFrame, shape=(100000, 100)
MD5       joblib.hash          14.737673957999998
XXH3_64   joblib.hash          0.4058136670000021
# ndarray, shape=(100000, 100)
MD5       joblib.hash          14.70679475
XXH3_64   joblib.hash          0.37937962499999855
# coo_matrix, shape=(1000, 10000)
MD5       joblib.hash          0.3085604160000024
XXH3_64   joblib.hash          0.020856333000001115

@codecov
Copy link

codecov bot commented Oct 15, 2021

Codecov Report

Merging #1232 (babb2da) into master (55d97ab) will increase coverage by 0.14%.
The diff coverage is 98.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1232      +/-   ##
==========================================
+ Coverage   88.68%   88.82%   +0.14%     
==========================================
  Files          47       47              
  Lines        7052     7106      +54     
==========================================
+ Hits         6254     6312      +58     
+ Misses        798      794       -4     
Impacted Files Coverage Δ
joblib/hashing.py 91.60% <94.73%> (+0.37%) ⬆️
joblib/memory.py 95.44% <100.00%> (-0.19%) ⬇️
joblib/test/test_hashing.py 99.12% <100.00%> (+0.04%) ⬆️
joblib/test/test_memory.py 98.65% <100.00%> (+0.03%) ⬆️
joblib/_parallel_backends.py 96.09% <0.00%> (+2.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55d97ab...babb2da. Read the comment docs.

@judahrand
Copy link
Author

Ideally, it would be nice to default to xxhash if available but I think this is likely to break people's caches when dependencies change. So probably best to just let people set it manually if they want it?

@judahrand
Copy link
Author

@tomMoral I saw you reviewed the other PR which attempted this. As can be seen from the linked issue and my basic benchmarks the use here is for large Dataframe and array arguments.

@judahrand judahrand force-pushed the custom-hasher branch 2 times, most recently from 57d7e1e to dc728bb Compare October 15, 2021 21:30
Copy link
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the possibility to easily choose the hash library seems a nice addition and I would like to help merging this.

My main question is about the API design: should we require to register the hash function? I feel that the registration would not really be useful as if one instanciates Memory in 2 modules that can be imported separately, one would need to register the hash func in each of them (or import joblib in a parent module). Also, if 2 lib register the same hash, this would raise an error on import.

I think we could simplify the API by replacing the hash_name by a hash_func that can take a string or a callable. If the hash_func is a string, we simply call hashlib.new(hash_func) else call hash_func(). This way, we don't have to register the hash_func. This would simplify using the xxhash for instance with the following code:

import xxhash
mem = Memory(hash_func=xxhash.xxh3_64)

Also, as we add the possibility of changing the hash func, we could have collision between hash from different arguments. This comment by @ogrisel #343 (comment) propose a solution by adding a tag to the hash and raising a warning it does not match. Note that to avoid braing the backward compat, one would need to default to md5: if no tag is present.

if hash_name not in hashing._HASHES:
raise ValueError("Valid options for 'hash_name' are {}. "
"Got hash_name={!r} instead."
.format(hash_name, hash_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.format(hash_name, hash_name))
.format(hashing._HASHES, hash_name))

@@ -495,3 +495,23 @@ def test_wrong_hash_name():
with raises(ValueError, match=msg):
data = {'foo': 'bar'}
hash(data, hash_name='invalid')


def test_right_regist_hash():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_right_regist_hash():
def test_right_register_hash():

Comment on lines +18 to +21
try:
import xxhash
except ImportError:
xxhash = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably remove this part. The rational is that as this is not the default, users would mainly discover this through the doc, and we could explain how to register this when they create a Memory object?

Else, this results in an extra lib import on all processes, which might be costly ( not sure how heavy this lib is?)

@jjerphan
Copy link
Contributor

jjerphan commented Feb 2, 2022

Hi @judahrand, are you still working on this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants