
Serialize all pyarrow extension arrays efficiently #9740

Merged: 7 commits into dask:main from pyarrow-extension-slice-pickle on Dec 15, 2022

Conversation

jrbourbeau (Member)

This PR swaps out our custom logic for pickling arrow-backed extension arrays in favor of the implementation in the upcoming pandas=2 release (xref pandas-dev/pandas#49078). As discussed in #9613, the new implementation is much more straightforward while being roughly as performant. It also applies to all ArrowExtensionArrays, not just ArrowStringArray.

I'll want to run the changes here against the notebook Ian provided in #9613 to make sure the performance benchmarks still hold, but the changes should be ready for review.

cc @mroeschke for visibility

Closes #9613 (Use upstream pandas pickling protocol for pyarrow string series)
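For orientation, here is a minimal sketch of the copyreg dispatch pattern this PR relies on. It is illustrative only: the `_data` attribute name varies across pandas versions, and the real rebuild logic is the function quoted from the diff just below.

import copyreg

import pandas as pd
import pyarrow as pa


def rebuild_arrowextensionarray(chunks):
    # Reassemble the pickled chunks into a pandas extension array
    # (the real function from this PR's diff is quoted just below).
    return pd.arrays.ArrowExtensionArray(pa.chunked_array(chunks))


def reduce_arrowextensionarray(array):
    # Concatenating the chunks copies only the rows the (possibly sliced)
    # array actually references, so the pickle payload no longer drags the
    # full parent buffers along. NOTE: `_data` is illustrative; the
    # attribute holding the underlying ChunkedArray varies across pandas
    # versions.
    combined = pa.concat_arrays(array._data.chunks)
    return rebuild_arrowextensionarray, ([combined],)


# pickle consults copyreg.dispatch_table, so this reroutes pickling of the
# extension array class through the reduce function above.
copyreg.dispatch_table[pd.arrays.ArrowExtensionArray] = reduce_arrowextensionarray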

def rebuild_arrowextensionarray(chunks):
    array = pa.chunked_array(chunks)
    if PANDAS_GT_150:
        return pd.arrays.ArrowExtensionArray(array)
mroeschke (Contributor)

For maximal backwards compat, if the data was pyarrow string I think it will be necessary to know whether it was pd.StringDtype("pyarrow") vs pd.ArrowDtype, because the former should be constructed using ArrowStringArray while the latter is constructed via ArrowExtensionArray.
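To illustrate the distinction, a minimal example (the class names in the comments are what pandas constructs for each dtype):

import pandas as pd
import pyarrow as pa

# Two pyarrow-backed string series with different dtypes and hence
# different backing extension array classes:
s1 = pd.Series(["a", "b"], dtype=pd.StringDtype("pyarrow"))   # ArrowStringArray
s2 = pd.Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))  # ArrowExtensionArray

assert type(s1.array).__name__ == "ArrowStringArray"
assert type(s2.array).__name__ == "ArrowExtensionArray"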

jrbourbeau (Member Author)

Thanks for taking a look @mroeschke. Is the reasoning for this the more feature-complete pyarrow-backed string implementation we talked about earlier in pandas-dev/pandas#50074 (comment)? Or something else?

mroeschke (Contributor)

Yeah exactly. If in 2.0 ArrowExtensionArray has feature parity with ArrowStringArray, I suppose you could always use ArrowExtensionArray, since this is all internal dask serialization and not external pickling per se?

jrbourbeau (Member Author)

Thought about this more and decided to just return whatever the original type is. Given that we're patching pickle, I think it makes sense to always have the same input/output type. I think this is consistent with your previous comment, but wanted to highlight it just in case.
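One way to make the roundtrip type-preserving is to thread the concrete class through the reduce tuple; a sketch (not necessarily the exact shape of the merged diff, and `_data` remains illustrative as above):

import pandas as pd
import pyarrow as pa


def rebuild_arrowextensionarray(type_, chunks):
    # Rebuild with whichever extension array class was pickled, so an
    # ArrowStringArray round-trips as an ArrowStringArray and an
    # ArrowExtensionArray as an ArrowExtensionArray.
    return type_(pa.chunked_array(chunks))


def reduce_arrowextensionarray(array):
    # Pass the concrete class alongside the data.
    combined = pa.concat_arrays(array._data.chunks)
    return rebuild_arrowextensionarray, (type(array), [combined])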

jrbourbeau (Member Author) left a comment

Okay, after testing the implementation here against the notebook Ian put together (https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e) I uncovered some issues that have been resolved in the latest commit. Leaving a couple of comments to highlight what was wrong and what changed.

Comment on lines +43 to +44
for type_ in [pd.arrays.ArrowExtensionArray, pd.arrays.ArrowStringArray]:
    copyreg.dispatch_table[type_] = reduce_arrowextensionarray
jrbourbeau (Member Author)

When available, we need to make sure to register copyreg entries for both pd.arrays.ArrowExtensionArray and pd.arrays.ArrowStringArray, so that both pyarrow string implementations in pandas pick up the serialization fixes here. I've added a test which makes sure we handle both the pd.StringDtype("pyarrow") and pd.ArrowDtype(pa.string()) cases.
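A roundtrip test along these lines covers both dtypes (a sketch of the idea, not the exact test added in this PR):

import pickle

import pandas as pd
import pyarrow as pa
import pytest


@pytest.mark.parametrize(
    "dtype", [pd.StringDtype("pyarrow"), pd.ArrowDtype(pa.string())]
)
def test_pickle_roundtrip_preserves_type(dtype):
    s = pd.Series(["a", "b", "c"], dtype=dtype)
    roundtripped = pickle.loads(pickle.dumps(s))
    pd.testing.assert_series_equal(roundtripped, s)
    # The backing extension array class must survive the roundtrip
    assert type(roundtripped.array) is type(s.array)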

sliced_pickled = pickle.dumps(expected_sliced)

# Make sure slicing gives a large reduction in serialized bytes
assert len(full_pickled) > len(sliced_pickled) * 3
jrbourbeau (Member Author)

Previously this assert was assert len(full_pickled) > len(sliced_pickled). It turns out that even without any of the serialization patches here, this assert would still pass. This is because pickled sliced extension arrays were still smaller than the pickled original extension array, but only a tiny bit smaller (e.g. 80.20 kiB for the sliced array vs. 80.29 kiB for the original array). When we apply the copyreg patches, we see a much larger reduction in serialized size (e.g. 799 B for the sliced array vs. 80.23 kiB for the original array).

I've gone ahead and added an extra factor of 3 here to mean "we see a significant reduction in the serialized size".

cc @mroeschke as I suspect the corresponding tests in pandas will see something similar
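(For anyone reproducing this locally, a self-contained version of the measurement looks roughly like the following; the series length is an arbitrary choice and the byte counts will differ from the figures above.)

import pickle

import pandas as pd

expected = pd.Series(range(10_000)).astype("string[pyarrow]")
expected_sliced = expected.head(2)

full_pickled = pickle.dumps(expected)
sliced_pickled = pickle.dumps(expected_sliced)

# With the copyreg patches applied, the sliced pickle should be far smaller.
assert len(full_pickled) > len(sliced_pickled) * 3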

mroeschke (Contributor)

Ah, thanks for the tip! I'll go ahead and make this check stricter on the pandas side too, just to validate.

mroeschke (Contributor)

Hmm, interestingly, when testing pandas's similar test I am not seeing a huge difference:

______________________________________________________________________ test_pickle_roundtrip ______________________________________________________________________

    @skip_if_no_pyarrow
    def test_pickle_roundtrip():
        # GH 42600
        expected = pd.Series(range(10), dtype="string[pyarrow]")
        expected_sliced = expected.head(2)
        full_pickled = pickle.dumps(expected)
        sliced_pickled = pickle.dumps(expected_sliced)

        # Testing that pickling the sliced object results in a _significant_ (2x)
        # reduction in serialized size
>       assert len(full_pickled) > len(sliced_pickled) * 2
E       assert 818 > (778 * 2)
E        +  where 818 = len(b"\x80\x04\x95'\x03\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\x8c..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.")
E        +  and   778 = len(b'\x80\x04\x95\xff\x02\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.')

pandas/tests/arrays/string_/test_string_arrow.py:213: AssertionError

In [1]: 818 / 778
Out[1]: 1.051413881748072

Still a reduction, but I'll probably just leave this test alone on the pandas side.

jrbourbeau mentioned this pull request on Dec 14, 2022
jrbourbeau (Member Author) left a comment

Planning to merge tomorrow if no further comments.

@jrbourbeau jrbourbeau merged commit d2c9e39 into dask:main Dec 15, 2022
@jrbourbeau jrbourbeau deleted the pyarrow-extension-slice-pickle branch December 15, 2022 17:55
ian-r-rose (Collaborator)

Thanks for this @jrbourbeau and @mroeschke!

jrbourbeau (Member Author)

Thanks for all the initial work @ian-r-rose! That notebook was super useful.
