
Speed up hashing of DataFrames and Series #1231

Open · wants to merge 7 commits into base: main
Conversation

judahrand

@judahrand judahrand commented Oct 15, 2021

This Pull Request aims to address #343.

Using the internal Pandas serialization methods can be significantly faster than Pickle.

import pandas as pd

import joblib
import timeit


df = pd.read_parquet('dataframe.parquet')

print('# {}, shape={}'.format(type(df).__name__, df.shape))
print('MD5       joblib.hash          ', end='')
print(timeit.timeit("joblib.hash(df, hash_name='md5')", globals=globals(), number=10) / 10)

On master:

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          3.7768595833

With changes (with PyArrow):

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          0.061573899999999994

With changes (without PyArrow):

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          0.39262601660000007

These results can be improved even further by using a faster hash function as made possible in #1232
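For a rough sense of that headroom (independent of the specifics of #1232), standard-library digests can be compared directly on the same payload. This is an illustrative sketch, not the PR's code; the 50 MiB payload size is an arbitrary stand-in for a serialized mid-sized DataFrame:

```python
import hashlib
import timeit

# Dummy payload standing in for a serialized DataFrame
# (assumption: ~50 MiB is representative of a mid-sized frame).
payload = b"x" * (50 * 1024 * 1024)

for name in ("md5", "sha1", "blake2b"):
    # Average wall time over 5 runs of hashing the full payload.
    t = timeit.timeit(lambda: hashlib.new(name, payload).hexdigest(),
                      number=5) / 5
    print(f"{name:8s} {t:.4f}s")
```

On typical 64-bit hardware blake2b is markedly faster than md5 for large buffers, which is why swapping the digest compounds with the serialization speedup above.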


codecov bot commented Oct 15, 2021

Codecov Report

Patch coverage: 100.00%; project coverage change: -0.08% ⚠️

Comparison is base (2303143) 94.90% compared to head (4043750) 94.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1231      +/-   ##
==========================================
- Coverage   94.90%   94.83%   -0.08%     
==========================================
  Files          44       44              
  Lines        7308     7356      +48     
==========================================
+ Hits         6936     6976      +40     
- Misses        372      380       +8     
Impacted Files Coverage Δ
joblib/hashing.py 92.64% <100.00%> (+1.41%) ⬆️
joblib/test/common.py 87.75% <100.00%> (+1.70%) ⬆️
joblib/test/test_hashing.py 99.14% <100.00%> (+0.07%) ⬆️

... and 3 files with indirect coverage changes


@judahrand judahrand force-pushed the pandas-hasher branch 2 times, most recently from 36a12cd to feabf7f Compare October 15, 2021 17:56
@judahrand judahrand changed the title Add PandasHasher to speed up hashing of DataFrame and Series arguments Speed up hashing of DataFrames and Series Oct 15, 2021
@judahrand judahrand force-pushed the pandas-hasher branch 2 times, most recently from 5f68e0a to deb997e Compare October 17, 2021 18:15
Contributor

@tomMoral tomMoral left a comment

This looks like a nice change indeed! Thanks @judahrand.

A couple of questions about the to_feather use and cache invalidation.

Comment on lines +272 to +280
try:
# This is by far the fastest way to serialize a Pandas object
# but requires Pyarrow to be installed.
obj.to_feather(buf)
except (ImportError, ValueError):
# If to_feather is not available, fall back to to_pickle. This
# implementation seems to be much faster than the standard call
# to Pickle.
obj.to_pickle(buf)
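The fallback logic under review can be exercised in isolation. A minimal sketch, assuming pandas is installed (pyarrow optional); `hash_frame` is a hypothetical helper name for illustration, not the PR's actual code:

```python
import hashlib
import io

import pandas as pd

def hash_frame(df, hash_name="md5"):
    """Hash a DataFrame via feather when pyarrow is available,
    falling back to pandas' pickle serialization otherwise."""
    buf = io.BytesIO()
    try:
        # Fast path: needs pyarrow, and feather rejects some frames
        # (e.g. a non-default index raises ValueError).
        df.to_feather(buf)
    except (ImportError, ValueError):
        buf = io.BytesIO()  # discard anything partially written
        df.to_pickle(buf)
    h = hashlib.new(hash_name)
    h.update(buf.getbuffer())
    return h.hexdigest()
```

Note that the two branches produce different bytes for the same frame, which is exactly the cache-invalidation concern raised below: the digest depends on whether pyarrow happens to be installed.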
I am not sure about this one, as it would mean that installing pyarrow breaks the cache?
Also, does changing to to_pickle break the current cache or not?

@jjerphan (Contributor)

Information in #581 might also be relevant for this PR.

total abuse of the Pickler class.
"""
if isinstance(obj, self.pd.DataFrame):
buf = io.BytesIO()
Using a buffer creates a copy and is thus prohibitive. The best possible alternative would be to use pickle's protocol 5 (described by PEP 574) with a custom File Object (i.e. a class implementing the write and close methods) to avoid any memory copy. Such a File Object has to subclass the Pickler class. This makes the contribution slightly harder.
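A minimal sketch of that idea using only the standard library (`hash_via_pickle5` and `_HashWriter` are hypothetical names, not joblib API). With protocol 5, buffer-exporting objects such as numpy arrays are handed to `buffer_callback` as zero-copy views, so they can be hashed without ever materializing a serialized copy:

```python
import hashlib
import pickle

class _HashWriter:
    """File-like sink: write() feeds a running hash instead of
    accumulating serialized bytes in memory."""
    def __init__(self, hash_name="md5"):
        self._hash = hashlib.new(hash_name)

    def write(self, data):
        self._hash.update(data)
        return len(data)

    def close(self):
        pass

    def hexdigest(self):
        return self._hash.hexdigest()

def hash_via_pickle5(obj, hash_name="md5"):
    writer = _HashWriter(hash_name)

    def hash_buffer(pickle_buffer):
        # Protocol 5 (PEP 574) delivers large contiguous payloads
        # here as PickleBuffer views; raw() exposes the memory
        # without copying, and returning None keeps the buffer
        # out-of-band, i.e. out of the pickle stream entirely.
        writer.write(pickle_buffer.raw())
        return None

    pickle.Pickler(writer, protocol=5,
                   buffer_callback=hash_buffer).dump(obj)
    return writer.hexdigest()
```

The pickle stream metadata and the out-of-band buffers are hashed in a deterministic interleaved order, so equal objects still get equal digests.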

@tomMoral (Contributor)

Actually, I no longer observe the same behavior: for a DataFrame, hashing is faster on master than on this branch.
With the following script:

import numpy as np
import pandas as pd
import joblib
import timeit

import io
import hashlib

def hash_pandas(obj):
    _hash = hashlib.new("md5")
    buf = io.BytesIO()
    obj.to_pickle(buf)
    _hash.update(buf.getvalue())
    return _hash.hexdigest()

df = pd.DataFrame({k: np.random.randn(3000000) for k in 'abcde'})

print("with to_pickle:", timeit.timeit("hash_pandas(df)", globals=globals(), number=10) / 10)
print("with joblib hash:", timeit.timeit("joblib.hash(df, hash_name='md5')", globals=globals(), number=10) / 10)

I get for master:

with to_pickle: 0.14789599039941095
with joblib hash: 0.128276090498548

and for this branch (with pyarrow installed):

with to_pickle: 0.15177922020084225
with joblib hash: 0.23214846260088962
