Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418

yuanx749 · 2024-04-25T11:56:08Z

According to the docs, if regex is None and the length of pat is not 1, pat is treated as a regular expression.
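As a rough illustration of that documented rule, here is a minimal pure-Python sketch. `pat_is_regex` is a hypothetical helper name, not a pandas function; it only mirrors the docstring wording and does not handle compiled patterns.

```python
from typing import Optional


def pat_is_regex(pat: str, regex: Optional[bool] = None) -> bool:
    """Return True if ``pat`` should be treated as a regular expression.

    Hypothetical helper that mirrors the docstring wording only.
    """
    if regex is None:
        # Documented default: a single character is a literal separator,
        # anything longer is interpreted as a regular expression.
        return len(pat) != 1
    # An explicit True/False always wins over the length heuristic.
    return bool(regex)
```

Under this rule, `pat_is_regex(".")` is False (literal dot), while `pat_is_regex(r"\s+")` is True unless the caller passes `regex=False`.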

closes BUG: Series.str.split broken with pyarrow strings and regex argument #58321
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

WillAyd · 2024-04-28T12:24:15Z

pandas/core/arrays/arrow/array.py

@@ -2579,7 +2579,7 @@ def _str_split(
            n = None
        if pat is None:
            split_func = pc.utf8_split_whitespace
-        elif regex:
+        elif regex or (regex is None and len(pat) != 1):


Thanks for the PR - I'm not sure this is the right fix though. Do you see where the behavior deviates between the different string types? This current fix seems like it would apply a behavior change to all types

Thanks for the review. The behavior deviates here.

string[pyarrow] goes through

pandas/pandas/core/strings/object_array.py

Line 327 in a1fc8e8

def _str_split(

while pd.ArrowDtype(pa.string()) goes through

pandas/pandas/core/arrays/arrow/array.py

Line 2571 in a1fc8e8

def _str_split(

The docstring of str.split says this about regex: "If None and pat length is not 1, treats pat as a regular expression."

This behavior is implemented in the first _str_split but not in the second, so I added this condition to the second _str_split to fix the issue.
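To make the difference concrete, the following is a simplified sketch of how a single string could be split when the regex=None rule is honored. It is not the pandas implementation; `split_one` is a hypothetical name, and the sketch operates on one Python string rather than an array.

```python
import re
from typing import List, Optional, Union


def split_one(
    value: str,
    pat: Union[str, "re.Pattern[str]", None] = None,
    n: int = -1,
    regex: Optional[bool] = None,
) -> List[str]:
    """Split one string following the documented rule (illustrative only)."""
    if pat is None:
        # No pattern: split on runs of whitespace.
        return value.split(None, n)
    if isinstance(pat, re.Pattern):
        compiled = pat
    elif regex is True or (regex is None and len(pat) != 1):
        # The corner case from this PR: regex=None and a multi-character pat.
        compiled = re.compile(pat)
    else:
        compiled = None
    if compiled is not None:
        # re uses maxsplit=0 for "no limit", while str.split uses -1.
        return compiled.split(value, maxsplit=max(n, 0))
    return value.split(pat, n)
```

A backend that skips the `regex is None and len(pat) != 1` branch would treat a pattern such as `-+` as a literal separator, which is the kind of divergence described above.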

Ah OK thanks that is helpful. Is there a way to make these implementations look more alike? I see what you are trying to accomplish here but it's hard to tell the corner cases where these may still diverge. Is there a reason why the implementations need to differ at all?

My initial intention was to make as few changes as possible.

To make it more coherent, I would rather set regex=True for the corner case before calling _str_split in the code below. Do you think it's OK?

pandas/pandas/core/strings/accessor.py

Lines 911 to 913 in a1fc8e8

        if is_re(pat):
            regex = True
        result = self._data.array._str_split(pat, n, expand, regex)

I moved the logic that determines whether pat is a regex outside, so that the two _str_split implementations look more alike. Could you review again?
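The idea of resolving the flag once, before dispatching to a backend, might look roughly like the sketch below. All names here (`resolve_regex_flag`, `backend_split`, `split`) are hypothetical and the backend is a toy working on a single string; the actual accessor code in the PR may differ.

```python
import re
from typing import List, Optional


def resolve_regex_flag(pat, regex: Optional[bool]) -> bool:
    """Turn regex=None into a plain bool before any backend is called."""
    if isinstance(pat, re.Pattern):
        if regex is False:
            raise ValueError("a compiled pattern cannot be used with regex=False")
        return True
    if regex is None:
        # Documented default: only multi-character patterns are regexes.
        return pat is not None and len(pat) != 1
    return regex


def backend_split(value: str, pat, regex: bool) -> List[str]:
    """Toy backend: it never sees regex=None, only True or False."""
    if pat is None:
        return value.split()
    if regex:
        return re.split(pat, value)
    return value.split(pat)


def split(value: str, pat=None, regex: Optional[bool] = None) -> List[str]:
    # Every backend receives the same, already-resolved flag.
    return backend_split(value, pat, resolve_regex_flag(pat, regex))
```

With this shape, each backend only needs an `if regex:` branch, so there is a single place where the regex=None rule can diverge.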

jorisvandenbossche · 2024-04-29T11:46:40Z

pandas/tests/extension/test_arrow.py

@@ -2296,6 +2296,16 @@ def test_str_split_pat_none(method):
    tm.assert_series_equal(result, expected)


+def test_str_split_regex_none():


Can you move this test to pandas/tests/strings/test_split_partition.py, so we can parametrize this with all the different string dtype implementations, ensuring the different ones all behave the same?

Done in the new commit.
But the tests look a bit ugly to me, because the expected output of pd.ArrowDtype(pa.string()) has different array dtype from the cases of other string dtypes. Maybe it's better to keep the test separate in test_arrow.py?

WillAyd · 2024-04-30T12:28:55Z

pandas/conftest.py

+        "string[python]",
+        pytest.param("string[pyarrow]", marks=td.skip_if_no("pyarrow")),
+        pytest.param("string[pyarrow_numpy]", marks=td.skip_if_no("pyarrow")),
+        pytest.param(pd.ArrowDtype(pa.string()), marks=td.skip_if_no("pyarrow")),


I would actually prefer to just add the pd.ArrowDtype(pa.string()) to the existing string dtypes instead of copying and creating a new fixture. Guessing that causes a lot of other test failures?

Ah, I forgot that pd.ArrowDtype(pa.string()) was not actually in the fixture, so my suggestion led you a bit in the wrong direction. Sorry!

Right now adding this to the main any_string_dtype fixture will indeed give quite a few failures, yes. I agree that it might be better to actually do that (and it would be interesting to see which tests actually fail), but that's for another PR / out of scope for this bug fix (doing so would also require removing some tests that now only exist for the arrow string dtype, to not keep things duplicated).

Yes, a lot of other tests need to be adjusted if adding ArrowDtype to the fixture.

So for this PR, should I just add the test in pandas/tests/extension/test_arrow.py?

Opened #58495 so we can track the larger issue

jorisvandenbossche · 2024-04-30T13:04:54Z

pandas/core/strings/object_array.py

-            elif regex is False:
-                new_pat = pat
-            # regex is None so link to old behavior #43563
            else:
-                if len(pat) == 1:
-                    new_pat = pat
-                else:
-                    new_pat = re.compile(pat)
+                new_pat = pat


I think we want to keep this, otherwise it would not be a bugfix for pd.ArrowDtype(pa.string()) but changing behaviour for all other string dtypes?

As discussed above #58418 (comment), the bug is caused by this logic not being implemented in the _str_split method of pd.ArrowDtype(pa.string()). To fix it, and to make the two _str_split implementations look more alike, I moved this logic outside, before the call to _str_split. So I think the behavior for the other dtypes has not changed.

Moreover I think the existing tests have covered all combinations of parameters, and as long as they all pass, the behaviors should still be the same.
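One way to sanity-check that claim outside of pandas is to compare the two arrangements over a small grid of inputs. The sketch below is an assumption-laden toy: `old_style` imitates the removed branch, `new_style` imitates the hoisted decision, and neither is the real pandas code.

```python
import itertools
import re


def old_style(value: str, pat: str, regex):
    """Decision made inside the backend (imitates the removed branch)."""
    if regex is True:
        new_pat = re.compile(pat)
    elif regex is False:
        new_pat = pat
    else:
        # regex is None: single characters stay literal.
        new_pat = pat if len(pat) == 1 else re.compile(pat)
    if isinstance(new_pat, re.Pattern):
        return new_pat.split(value)
    return value.split(new_pat)


def new_style(value: str, pat: str, regex):
    """Decision hoisted to the caller; the backend only sees a bool."""
    if regex is None:
        regex = len(pat) != 1
    return re.split(pat, value) if regex else value.split(pat)


values = ["a.b.c", "a--b-c", "x1y22z", ""]
pats = [".", "-", "-+", r"\d+", "ab"]
flags = [True, False, None]

# Collect every (value, pat, regex) combination where the results differ.
mismatches = [
    (v, p, r)
    for v, p, r in itertools.product(values, pats, flags)
    if old_style(v, p, r) != new_style(v, p, r)
]
print(mismatches)
```

An empty list here only shows that the toy versions agree on this grid; it does not replace the parametrized tests in the pandas suite.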

yuanx749 added 3 commits April 25, 2024 19:24

Add test

0349837

Fix

4948784

Add whatsnew

14c059c

yuanx749 marked this pull request as ready for review April 25, 2024 14:02

WillAyd requested changes Apr 28, 2024


jorisvandenbossche reviewed Apr 29, 2024


mroeschke added the Strings (String extension data type and string data) and Arrow (pyarrow functionality) labels Apr 29, 2024

yuanx749 added 3 commits April 30, 2024 13:02

Merge remote-tracking branch 'upstream/main' into split-arrow

f55ca62

Move logic outside

f0c2097

Move test

6f93a8d

yuanx749 requested review from WillAyd and jorisvandenbossche April 30, 2024 08:57

yuanx749 force-pushed the split-arrow branch 2 times, most recently from 5031be7 to 6f93a8d Compare April 30, 2024 11:56

WillAyd reviewed Apr 30, 2024


jorisvandenbossche reviewed Apr 30, 2024


WillAyd mentioned this pull request Apr 30, 2024

BUG: Add pyarrow strings to any_string_dtype fixture #58495

Open

3 tasks


