
Speed up hashing of DataFrames and Series #1231

Open · wants to merge 7 commits into base: main
Conversation

judahrand

@judahrand judahrand commented Oct 15, 2021

This Pull Request aims to address #343.

Using the internal Pandas serialization methods can be significantly faster than Pickle.

import pandas as pd

import joblib
import timeit


df = pd.read_parquet('dataframe.parquet')

print('# {}, shape={}'.format(type(df).__name__, df.shape))
print('MD5       joblib.hash          ', end='')
print(timeit.timeit("joblib.hash(df, hash_name='md5')", globals=globals(), number=10) / 10)

On master:

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          3.7768595833

With changes (with PyArrow):

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          0.061573899999999994

With changes (without PyArrow):

# DataFrame, shape=(131712, 5)
MD5       joblib.hash          0.39262601660000007

These results can be improved even further by using a faster hash function as made possible in #1232
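For a rough sense of that headroom (independent of the specifics of #1232), standard-library digests can be compared directly on the same payload. This is an illustrative sketch, not the PR's code; the 50 MiB payload size is an arbitrary stand-in for a serialized mid-sized DataFrame:

```python
import hashlib
import timeit

# Dummy payload standing in for a serialized DataFrame
# (assumption: ~50 MiB is representative of a mid-sized frame).
payload = b"x" * (50 * 1024 * 1024)

for name in ("md5", "sha1", "blake2b"):
    # Average wall time over 5 runs of hashing the full payload.
    t = timeit.timeit(lambda: hashlib.new(name, payload).hexdigest(),
                      number=5) / 5
    print(f"{name:8s} {t:.4f}s")
```

On typical 64-bit hardware blake2b is markedly faster than md5 for large buffers, which is why swapping the digest compounds with the serialization speedup above.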


codecov bot commented Oct 15, 2021

Codecov Report

Patch coverage: 100.00%; project coverage change: -0.08% ⚠️

Comparison is base (2303143) 94.90% compared to head (4043750) 94.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1231      +/-   ##
==========================================
- Coverage   94.90%   94.83%   -0.08%     
==========================================
  Files          44       44              
  Lines        7308     7356      +48     
==========================================
+ Hits         6936     6976      +40     
- Misses        372      380       +8     
Impacted Files Coverage Δ
joblib/hashing.py 92.64% <100.00%> (+1.41%) ⬆️
joblib/test/common.py 87.75% <100.00%> (+1.70%) ⬆️
joblib/test/test_hashing.py 99.14% <100.00%> (+0.07%) ⬆️

... and 3 files with indirect coverage changes


@judahrand judahrand force-pushed the pandas-hasher branch 2 times, most recently from 36a12cd to feabf7f Compare October 15, 2021 17:56
@judahrand judahrand changed the title Add PandasHasher to speed up hashing of DataFrame and Series arguments Speed up hashing of DataFrames and Series Oct 15, 2021
@judahrand judahrand force-pushed the pandas-hasher branch 2 times, most recently from 5f68e0a to deb997e Compare October 17, 2021 18:15
Contributor

@tomMoral tomMoral left a comment

This looks like a nice change indeed! Thanks @judahrand.

A couple of questions about the to_feather use and cache invalidation.

Comment on lines +272 to +280
try:
# This is by far the fastest way to serialize a Pandas object
# but requires Pyarrow to be installed.
obj.to_feather(buf)
except (ImportError, ValueError):
# If to_feather is not available, fall back to to_pickle. This
# implementation seems to be much faster than the standard call
# to Pickle.
obj.to_pickle(buf)
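The fallback logic under review can be exercised in isolation. A minimal sketch, assuming pandas is installed (pyarrow optional); `hash_frame` is a hypothetical helper name for illustration, not the PR's actual code:

```python
import hashlib
import io

import pandas as pd

def hash_frame(df, hash_name="md5"):
    """Hash a DataFrame via feather when pyarrow is available,
    falling back to pandas' pickle serialization otherwise."""
    buf = io.BytesIO()
    try:
        # Fast path: needs pyarrow, and feather rejects some frames
        # (e.g. a non-default index raises ValueError).
        df.to_feather(buf)
    except (ImportError, ValueError):
        buf = io.BytesIO()  # discard anything partially written
        df.to_pickle(buf)
    h = hashlib.new(hash_name)
    h.update(buf.getbuffer())
    return h.hexdigest()
```

Note that the two branches produce different bytes for the same frame, which is exactly the cache-invalidation concern raised below: the digest depends on whether pyarrow happens to be installed.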
I am not sure about this one, as it would mean that installing pyarrow breaks the cache?
Also, does changing to to_pickle break the current cache or not?

@jjerphan (Contributor)

Information in #581 might also be relevant for this PR.

total abuse of the Pickler class.
"""
if isinstance(obj, self.pd.DataFrame):
buf = io.BytesIO()
Using a buffer creates a copy and is thus prohibitive. The best possible alternative would be to use pickle's protocol 5 (described by PEP 574) with a custom File Object (i.e. a class implementing the write and close methods) to avoid any memory copy. Such a File Object has to subclass the Pickler class. This makes the contribution slightly harder.
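A minimal sketch of that idea using only the standard library (`hash_via_pickle5` and `_HashWriter` are hypothetical names, not joblib API). With protocol 5, buffer-exporting objects such as numpy arrays are handed to `buffer_callback` as zero-copy views, so they can be hashed without ever materializing a serialized copy:

```python
import hashlib
import pickle

class _HashWriter:
    """File-like sink: write() feeds a running hash instead of
    accumulating serialized bytes in memory."""
    def __init__(self, hash_name="md5"):
        self._hash = hashlib.new(hash_name)

    def write(self, data):
        self._hash.update(data)
        return len(data)

    def close(self):
        pass

    def hexdigest(self):
        return self._hash.hexdigest()

def hash_via_pickle5(obj, hash_name="md5"):
    writer = _HashWriter(hash_name)

    def hash_buffer(pickle_buffer):
        # Protocol 5 (PEP 574) delivers large contiguous payloads
        # here as PickleBuffer views; raw() exposes the memory
        # without copying, and returning None keeps the buffer
        # out-of-band, i.e. out of the pickle stream entirely.
        writer.write(pickle_buffer.raw())
        return None

    pickle.Pickler(writer, protocol=5,
                   buffer_callback=hash_buffer).dump(obj)
    return writer.hexdigest()
```

The pickle stream metadata and the out-of-band buffers are hashed in a deterministic interleaved order, so equal objects still get equal digests.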

@tomMoral (Contributor)

Actually, I no longer observe the same behavior: for a DataFrame, hashing is faster on master than on this branch.
With the following script:

import numpy as np
import pandas as pd
import joblib
import timeit

import io
import hashlib

def hash_pandas(obj):
    _hash = hashlib.new("md5")
    buf = io.BytesIO()
    obj.to_pickle(buf)
    _hash.update(buf.getvalue())
    return _hash.hexdigest()

df = pd.DataFrame({k: np.random.randn(3000000) for k in 'abcde'})

print("with to_pickle:", timeit.timeit("hash_pandas(df)", globals=globals(), number=10) / 10)
print("with joblib hash:", timeit.timeit("joblib.hash(df, hash_name='md5')", globals=globals(), number=10) / 10)

I get for master:

with to_pickle: 0.14789599039941095
with joblib hash: 0.128276090498548

and for this branch (with pyarrow installed):

with to_pickle: 0.15177922020084225
with joblib hash: 0.23214846260088962
