
Serialize all pyarrow extension arrays efficiently #9740

Merged: 7 commits into dask:main from pyarrow-extension-slice-pickle on Dec 15, 2022

Conversation

jrbourbeau (Member)

This PR swaps out our custom logic for pickling arrow-backed extension arrays in favor of the implementation in the upcoming pandas=2 release (xref pandas-dev/pandas#49078). As discussed in #9613, the new implementation is much more straightforward while being roughly as performant. It also applies to all ArrowExtensionArrays, not just ArrowStringArray.

I'll want to run the changes here against the notebook Ian provided in #9613 to make sure the performance benchmarks still hold, but the changes should be ready for review.

cc @mroeschke for visibility

Closes #9613 (Use upstream pandas pickling protocol for pyarrow string series)
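For orientation, here is a minimal sketch of the copyreg dispatch pattern this PR relies on. It is illustrative only: the `_data` attribute name varies across pandas versions, and the real rebuild logic is the function quoted from the diff just below.

import copyreg

import pandas as pd
import pyarrow as pa


def rebuild_arrowextensionarray(chunks):
    # Reassemble the pickled chunks into a pandas extension array
    # (the real function from this PR's diff is quoted just below).
    return pd.arrays.ArrowExtensionArray(pa.chunked_array(chunks))


def reduce_arrowextensionarray(array):
    # Concatenating the chunks copies only the rows the (possibly sliced)
    # array actually references, so the pickle payload no longer drags the
    # full parent buffers along. NOTE: `_data` is illustrative; the
    # attribute holding the underlying ChunkedArray varies across pandas
    # versions.
    combined = pa.concat_arrays(array._data.chunks)
    return rebuild_arrowextensionarray, ([combined],)


# pickle consults copyreg.dispatch_table, so this reroutes pickling of the
# extension array class through the reduce function above.
copyreg.dispatch_table[pd.arrays.ArrowExtensionArray] = reduce_arrowextensionarray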

def rebuild_arrowextensionarray(chunks):
    array = pa.chunked_array(chunks)
    if PANDAS_GT_150:
        return pd.arrays.ArrowExtensionArray(array)
mroeschke (Contributor)

For maximal backwards compat, if the data was pyarrow string I think it will be necessary to know whether it was pd.StringDtype("pyarrow") vs pd.ArrowDtype, because the former should be constructed using ArrowStringArray while the latter is constructed via ArrowExtensionArray.
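To illustrate the distinction, a minimal example (the class names in the comments are what pandas constructs for each dtype):

import pandas as pd
import pyarrow as pa

# Two pyarrow-backed string series with different dtypes and hence
# different backing extension array classes:
s1 = pd.Series(["a", "b"], dtype=pd.StringDtype("pyarrow"))   # ArrowStringArray
s2 = pd.Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))  # ArrowExtensionArray

assert type(s1.array).__name__ == "ArrowStringArray"
assert type(s2.array).__name__ == "ArrowExtensionArray"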

jrbourbeau (Member Author)

Thanks for taking a look @mroeschke. Is the reasoning for this the more feature-complete pyarrow-backed string implementation we talked about earlier in pandas-dev/pandas#50074 (comment)? Or something else?

mroeschke (Contributor)

Yeah exactly. If in 2.0 ArrowExtensionArray has feature parity with ArrowStringArray, I suppose you could always use ArrowExtensionArray, since this is all internal dask serialization and not external pickling per se?

jrbourbeau (Member Author)

Thought about this more and decided to just return whatever the original type is. Given that we're patching pickle, I think it makes sense to always have the same input/output type. I think this is consistent with your previous comment, but wanted to highlight it just in case.
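One way to make the roundtrip type-preserving is to thread the concrete class through the reduce tuple; a sketch (not necessarily the exact shape of the merged diff, and `_data` remains illustrative as above):

import pandas as pd
import pyarrow as pa


def rebuild_arrowextensionarray(type_, chunks):
    # Rebuild with whichever extension array class was pickled, so an
    # ArrowStringArray round-trips as an ArrowStringArray and an
    # ArrowExtensionArray as an ArrowExtensionArray.
    return type_(pa.chunked_array(chunks))


def reduce_arrowextensionarray(array):
    # Pass the concrete class alongside the data.
    combined = pa.concat_arrays(array._data.chunks)
    return rebuild_arrowextensionarray, (type(array), [combined])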

jrbourbeau (Member Author) left a comment

Okay, after testing the implementation here against the notebook Ian put together (https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e) I uncovered some issues that have been resolved in the latest commit. Leaving a couple of comments to highlight what was wrong and what changed.

Comment on lines +43 to +44
for type_ in [pd.arrays.ArrowExtensionArray, pd.arrays.ArrowStringArray]:
    copyreg.dispatch_table[type_] = reduce_arrowextensionarray
jrbourbeau (Member Author)

When available, we need to make sure to register copyreg entries for both pd.arrays.ArrowExtensionArray and pd.arrays.ArrowStringArray, so that both pyarrow string implementations in pandas pick up the serialization fixes here. I've added a test which makes sure we handle both the pd.StringDtype("pyarrow") and pd.ArrowDtype(pa.string()) cases.
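A roundtrip test along these lines covers both dtypes (a sketch of the idea, not the exact test added in this PR):

import pickle

import pandas as pd
import pyarrow as pa
import pytest


@pytest.mark.parametrize(
    "dtype", [pd.StringDtype("pyarrow"), pd.ArrowDtype(pa.string())]
)
def test_pickle_roundtrip_preserves_type(dtype):
    s = pd.Series(["a", "b", "c"], dtype=dtype)
    roundtripped = pickle.loads(pickle.dumps(s))
    pd.testing.assert_series_equal(roundtripped, s)
    # The backing extension array class must survive the roundtrip
    assert type(roundtripped.array) is type(s.array)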

sliced_pickled = pickle.dumps(expected_sliced)

# Make sure slicing gives a large reduction in serialized bytes
assert len(full_pickled) > len(sliced_pickled) * 3
jrbourbeau (Member Author)

Previously this assert was assert len(full_pickled) > len(sliced_pickled). It turns out that even without any of the serialization patches here, this assert would still pass. This is because pickled sliced extension arrays were still smaller than the pickled original extension array, but only a tiny bit smaller (e.g. 80.20 kiB for the sliced array vs. 80.29 kiB for the original array). When we apply the copyreg patches, we see a much larger reduction in serialized size (e.g. 799 B for the sliced array vs. 80.23 kiB for the original array).

I've gone ahead and added an extra factor of 3 here to mean "we see a significant reduction in the serialized size".

cc @mroeschke as I suspect the corresponding tests in pandas will see something similar
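(For anyone reproducing this locally, a self-contained version of the measurement looks roughly like the following; the series length is an arbitrary choice and the byte counts will differ from the figures above.)

import pickle

import pandas as pd

expected = pd.Series(range(10_000)).astype("string[pyarrow]")
expected_sliced = expected.head(2)

full_pickled = pickle.dumps(expected)
sliced_pickled = pickle.dumps(expected_sliced)

# With the copyreg patches applied, the sliced pickle should be far smaller.
assert len(full_pickled) > len(sliced_pickled) * 3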

mroeschke (Contributor)

Ah, thanks for the tip! I'll go ahead and make this check stricter on the pandas side too, just to validate.

mroeschke (Contributor)

Hmm, interestingly, when testing pandas's similar test I am not seeing a huge difference:

______________________________________________________________________ test_pickle_roundtrip ______________________________________________________________________

    @skip_if_no_pyarrow
    def test_pickle_roundtrip():
        # GH 42600
        expected = pd.Series(range(10), dtype="string[pyarrow]")
        expected_sliced = expected.head(2)
        full_pickled = pickle.dumps(expected)
        sliced_pickled = pickle.dumps(expected_sliced)

        # Testing that pickling the sliced object results in a _significant_ (2x)
        # reduction in serialized size
>       assert len(full_pickled) > len(sliced_pickled) * 2
E       assert 818 > (778 * 2)
E        +  where 818 = len(b"\x80\x04\x95'\x03\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\x8c..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.")
E        +  and   778 = len(b'\x80\x04\x95\xff\x02\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.')

pandas/tests/arrays/string_/test_string_arrow.py:213: AssertionError

In [1]: 818 / 778
Out[1]: 1.051413881748072

Still a reduction, but I'll probably just leave this test alone on the pandas side.

jrbourbeau mentioned this pull request on Dec 14, 2022
jrbourbeau (Member Author) left a comment

Planning to merge tomorrow if no further comments.

@jrbourbeau jrbourbeau merged commit d2c9e39 into dask:main Dec 15, 2022
@jrbourbeau jrbourbeau deleted the pyarrow-extension-slice-pickle branch December 15, 2022 17:55
ian-r-rose (Collaborator)

Thanks for this @jrbourbeau and @mroeschke!

jrbourbeau (Member Author)

Thanks for all the initial work @ian-r-rose! That notebook was super useful.
