Add ability to use custom hash functions in `hashing.hash` and `Memory` #1232

judahrand · 2021-10-15T20:18:33Z

This Pull Request is a suggestion to improve #343.

It adds the ability to register and use custom hash functions. Additionally, it exposes the ability to choose which hash function is used to hash arguments in Memory.

import pandas as pd
import numpy as np
import scipy.sparse

import joblib
import timeit


rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(100000, 100))
X = rng.rand(100000, 100)
X_csr = scipy.sparse.rand(1000, 10000, random_state=rng)

for data in [df, X, X_csr]:
    print('# {}, shape={}'.format(type(data).__name__, data.shape))
    print('MD5       joblib.hash          ', end='')
    print(timeit.timeit("joblib.hash(data, hash_name='md5')", globals=globals(), number=100))
    print('XXH3_64   joblib.hash          ', end='')
    print(timeit.timeit("joblib.hash(data, hash_name='xxh3_64')", globals=globals(), number=100))

# DataFrame, shape=(100000, 100)
MD5       joblib.hash          14.737673957999998
XXH3_64   joblib.hash          0.4058136670000021
# ndarray, shape=(100000, 100)
MD5       joblib.hash          14.70679475
XXH3_64   joblib.hash          0.37937962499999855
# coo_matrix, shape=(1000, 10000)
MD5       joblib.hash          0.3085604160000024
XXH3_64   joblib.hash          0.020856333000001115

codecov · 2021-10-15T20:21:27Z

Codecov Report

Merging #1232 (babb2da) into master (55d97ab) will increase coverage by 0.14%.
The diff coverage is 98.24%.

@@            Coverage Diff             @@
##           master    #1232      +/-   ##
==========================================
+ Coverage   88.68%   88.82%   +0.14%     
==========================================
  Files          47       47              
  Lines        7052     7106      +54     
==========================================
+ Hits         6254     6312      +58     
+ Misses        798      794       -4

| Impacted Files | Coverage Δ | |
|---|---|---|
| joblib/hashing.py | `91.60% <94.73%> (+0.37%)` | ⬆️ |
| joblib/memory.py | `95.44% <100.00%> (-0.19%)` | ⬇️ |
| joblib/test/test_hashing.py | `99.12% <100.00%> (+0.04%)` | ⬆️ |
| joblib/test/test_memory.py | `98.65% <100.00%> (+0.03%)` | ⬆️ |
| joblib/_parallel_backends.py | `96.09% <0.00%> (+2.34%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

judahrand · 2021-10-15T21:01:21Z

Ideally, it would be nice to default to xxhash if available but I think this is likely to break people's caches when dependencies change. So probably best to just let people set it manually if they want it?

judahrand · 2021-10-15T21:17:27Z

@tomMoral I saw that you reviewed the other PR which attempted this. As can be seen from the linked issue and my basic benchmarks, the use case here is large DataFrame and array arguments.

tomMoral

Overall, the possibility to easily choose the hash library seems like a nice addition and I would like to help get this merged.

My main question is about the API design: should we require registering the hash function? I feel that the registration would not really be useful: if one instantiates Memory in 2 modules that can be imported separately, one would need to register the hash func in each of them (or import joblib in a parent module). Also, if 2 libs register the same hash, this would raise an error on import.

I think we could simplify the API by replacing hash_name with a hash_func that can take a string or a callable. If hash_func is a string, we simply call hashlib.new(hash_func); otherwise we call hash_func(). This way, we don't have to register the hash_func. This would simplify using xxhash, for instance, with the following code:

import xxhash
mem = Memory(hash_func=xxhash.xxh3_64)

Also, as we add the possibility of changing the hash func, we could have collisions between hashes from different arguments. This comment by @ogrisel #343 (comment) proposes a solution: add a tag to the hash and raise a warning if it does not match. Note that to avoid breaking backward compatibility, one would need to default to md5: if no tag is present.
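The tagging idea could look roughly like the sketch below. The `name:digest` layout and the helper names (`tag_digest`, `split_digest`, `check_hash_name`) are assumptions made for illustration; the linked comment may specify a different format.

```python
import warnings

DEFAULT_HASH = "md5"  # assumed legacy default for untagged digests


def tag_digest(hash_name, digest):
    """Prefix a hex digest with the name of the hash that produced it."""
    return "{}:{}".format(hash_name, digest)


def split_digest(stored):
    """Return (hash_name, digest); untagged values are treated as md5."""
    name, sep, digest = stored.partition(":")
    if not sep:
        # No tag found: assume the digest predates tagging.
        return DEFAULT_HASH, stored
    return name, digest


def check_hash_name(stored, expected_name):
    """Warn and return False if `stored` was produced by another hash."""
    name, _ = split_digest(stored)
    if name != expected_name:
        warnings.warn(
            "Cached digest uses {!r} but {!r} was requested.".format(
                name, expected_name
            )
        )
        return False
    return True
```

Treating untagged digests as md5 keeps existing caches readable, which is the backward-compatibility point made above.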

tomMoral · 2021-10-18T03:32:55Z

joblib/memory.py

+        if hash_name not in hashing._HASHES:
+            raise ValueError("Valid options for 'hash_name' are {}. "
+                             "Got hash_name={!r} instead."
+                             .format(hash_name, hash_name))


Suggested change

.format(hash_name, hash_name))

.format(hashing._HASHES, hash_name))

tomMoral · 2021-10-18T03:42:49Z

joblib/test/test_hashing.py

@@ -495,3 +495,23 @@ def test_wrong_hash_name():
    with raises(ValueError, match=msg):
        data = {'foo': 'bar'}
        hash(data, hash_name='invalid')
+
+
+def test_right_regist_hash():


Suggested change

def test_right_regist_hash():

def test_right_register_hash():

tomMoral · 2021-10-18T03:58:50Z

joblib/hashing.py

+try:
+    import xxhash
+except ImportError:
+    xxhash = None


I would probably remove this part. The rationale is that since this is not the default, users would mainly discover it through the docs, and we could explain there how to register it when they create a Memory object.

Otherwise, this results in an extra lib import in all processes, which might be costly (not sure how heavy this lib is?)
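One alternative, not proposed in the review itself, would be to defer the import until the optional library is actually requested, so that processes which never use it pay nothing. A minimal sketch, assuming a hypothetical helper `get_optional_module`:

```python
import importlib

_optional_modules = {}  # cache so each lookup is attempted only once


def get_optional_module(name):
    """Import `name` on first use; return None if it is not installed."""
    if name not in _optional_modules:
        try:
            _optional_modules[name] = importlib.import_module(name)
        except ImportError:
            _optional_modules[name] = None
    return _optional_modules[name]


# Nothing is imported until a caller actually asks for the module.
xxhash = get_optional_module("xxhash")
xxhash_available = xxhash is not None
```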

jjerphan · 2022-02-02T16:59:36Z

Hi @judahrand, are you still working on this PR?

judahrand added 3 commits October 15, 2021 21:14

Add ability to register custom hash function

16f5f89

Register xxh32 if xxhash is installed

c715da3

Allow for selection of hash function in Memory

f2a7277

judahrand mentioned this pull request Oct 15, 2021

Speed up hashing of DataFrames and Series #1231

Open

judahrand force-pushed the custom-hasher branch from 52c9858 to 4065c14 Compare October 15, 2021 21:05

judahrand mentioned this pull request Oct 15, 2021

add the possibility to use custom hash function #1130

Open

judahrand force-pushed the custom-hasher branch 2 times, most recently from a4efea6 to d2f49b8 Compare October 15, 2021 21:15

judahrand force-pushed the custom-hasher branch 2 times, most recently from 57d7e1e to dc728bb Compare October 15, 2021 21:30

Add tests

f82cebb

judahrand force-pushed the custom-hasher branch from dc728bb to f82cebb Compare October 15, 2021 21:31

Use xxh3_64 instead of xxh32

babb2da

tomMoral requested changes Oct 21, 2021
