Speed up hashing of DataFrames and Series #1231
Conversation
Codecov Report — patch coverage and impacted files:

```
@@            Coverage Diff             @@
##           master    #1231      +/-   ##
==========================================
- Coverage   94.90%   94.83%   -0.08%
==========================================
  Files          44       44
  Lines        7308     7356      +48
==========================================
+ Hits         6936     6976      +40
- Misses        372      380       +8
==========================================
```
Force-pushed from 36a12cd to feabf7f.
Force-pushed the "`PandasHasher` to speed up hashing of DataFrame and Series arguments" commit from 5f68e0a to deb997e.
Force-pushed from deb997e to 7dcc5ef.
This looks like a nice change indeed! Thanks @judahrand.
A couple of questions on `to_feather` use and cache invalidation.
```python
try:
    # This is by far the fastest way to serialize a Pandas object,
    # but requires pyarrow to be installed.
    obj.to_feather(buf)
except (ImportError, ValueError):
    # If to_feather is not available, fall back to to_pickle. This
    # implementation seems to be much faster than the standard call
    # to Pickle.
    obj.to_pickle(buf)
```
I am not sure about this one, as it would mean that installing pyarrow would break the cache. Also, does changing to `to_pickle` break the current cache or not?
Information in #581 might also be relevant for this PR.
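A minimal sketch of the concern (illustrative only, not joblib's actual code): the same DataFrame serializes to different bytes under `to_feather` and `to_pickle`, so a hash-based cache key would change the moment pyarrow becomes available.

```python
import hashlib
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Bytes produced by pandas' pickle-based serialization
buf_pickle = io.BytesIO()
df.to_pickle(buf_pickle)
pickle_key = hashlib.md5(buf_pickle.getvalue()).hexdigest()

try:
    # Bytes produced by feather (requires pyarrow; ValueError covers
    # frames feather cannot represent, e.g. non-default indexes)
    buf_feather = io.BytesIO()
    df.to_feather(buf_feather)
    feather_key = hashlib.md5(buf_feather.getvalue()).hexdigest()
    # Different serializations -> different cache keys for the same data
    print(pickle_key != feather_key)
except (ImportError, ValueError):
    print("pyarrow not installed; only the pickle-based key is available")
```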
```python
    total abuse of the Pickler class.
    """
    if isinstance(obj, self.pd.DataFrame):
        buf = io.BytesIO()
```
Using a buffer creates a copy and is thus prohibitive. The best possible alternative would be to use pickle's protocol 5 (described by PEP 574) with a custom file object (i.e. a class implementing the `write` and `close` methods) to avoid any memory copy. Such a file object has to work with the `Pickler` class. This makes the contribution slightly harder.
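The copy-free idea can be sketched as follows (names are illustrative, not joblib internals): a minimal file-like sink whose `write` feeds bytes straight into the hash, so the full serialized payload is never materialized in memory.

```python
import hashlib
import pickle


class HashWriter:
    """File-like sink: every chunk the pickler writes goes straight
    into the hash, so no serialized copy accumulates in memory."""

    def __init__(self, hash_obj):
        self._hash = hash_obj

    def write(self, data):
        self._hash.update(data)
        return len(data)

    def close(self):
        pass


def hash_via_stream(obj, hash_name="md5"):
    # Protocol 5 (PEP 574) also offers out-of-band buffers via the
    # buffer_callback argument for further copy avoidance.
    h = hashlib.new(hash_name)
    pickle.dump(obj, HashWriter(h), protocol=5)
    return h.hexdigest()
```

`pickle.dump` streams into any object with a `write` method, and `DataFrame.to_pickle` also accepts a file-like object, so the same sink would work there.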
Actually, I do not observe the same behavior anymore: the hashing is faster with:

```python
import hashlib
import io
import timeit

import joblib
import numpy as np
import pandas as pd

def hash_pandas(obj):
    _hash = hashlib.new("md5")
    buf = io.BytesIO()
    obj.to_pickle(buf)
    _hash.update(buf.getvalue())
    return _hash.hexdigest()

df = pd.DataFrame({k: np.random.randn(3000000) for k in 'abcde'})
print("with to_pickle:", timeit.timeit("hash_pandas(df)", globals=globals(), number=10) / 10)
print("with joblib hash:", timeit.timeit("joblib.hash(df, hash_name='md5')", globals=globals(), number=10) / 10)
```

I get: …
and for this branch (with arrow installed): …
This Pull Request aims to address #343.
Using the internal Pandas serialization methods can be significantly faster than Pickle.
On master: …
With changes (with PyArrow): …
With changes (without PyArrow): …
These results can be improved even further by using a faster hash function, as made possible in #1232.
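Put together, the approach can be sketched as a standalone function (a hedged sketch; the PR implements this inside joblib's hashing machinery, and `hash_frame` is a hypothetical name): serialize with feather when pyarrow is available, fall back to pandas' own pickling, then hash the bytes.

```python
import hashlib
import io


def hash_frame(obj, hash_name="md5"):
    """Hash a DataFrame or Series via pandas' own serializers."""
    buf = io.BytesIO()
    try:
        # Fast path: feather, which requires pyarrow. ValueError covers
        # frames feather cannot represent (e.g. non-default indexes);
        # AttributeError covers Series, which has no to_feather method.
        obj.to_feather(buf)
    except (ImportError, ValueError, AttributeError):
        # Fallback: pandas' pickle-based serialization, on a fresh
        # buffer in case a partial feather write occurred.
        buf = io.BytesIO()
        obj.to_pickle(buf)
    return hashlib.new(hash_name, buf.getvalue()).hexdigest()
```

Equal frames produce equal digests, so the digest can serve as a cache key; which serializer is used still depends on whether pyarrow is installed, which is the cache-invalidation concern raised above.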