Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: support nan-like null strings in [l,r]strip #26392

Merged
merged 2 commits into from May 10, 2024

Conversation

ngoldbaum
Copy link
Member

Fixes #26390.

It turns out that strip has been missing from the stringdtype tests for unary and binary ufuncs and so the tests for null support were never triggered. I fixed that by adding all the strip-like functions to the tests.

Note that this adds another case where we have a goto next_step and an extra block to avoid compliation errors about jumping over variable declarations. I'd prefer to go through all of these and rewrite the loops to use for loops instead of while (N--), this just keeps things consistent with the rest of the file.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and can just put it in, since it just fixes a bug, comments are "would be nice to improve" really (not sure the OOM error path that looks wrong particularly important but happy to defer).

One other thing that struck me is: lstrip/rstrip should be pretty much identical except for calling a different core function.
That should really be consolidated either via C++ or via static data. But that is something to look in later I think.

else {
npy_gil_error(PyExc_ValueError,
"Cannot strip null values that are not strings");
"Can only strip null values that are strings "
"or NaN-like values");
goto fail;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One day might be nice to see if this check isn't common among a few ufuncs and just writing ufunc-name() cannot support ... with a single helper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's definitely some code duplication. I added this as a task to #25693.

@@ -1169,6 +1189,7 @@ string_lrstrip_whitespace_strided_loop(
npy_static_string s = {0, NULL};
int s_isnull = NpyString_load(allocator, ps, &s);


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

But doesn't matter.

if (NpyString_pack(oallocator, ops, new_buf, new_buf_size) < 0) {
npy_gil_error(PyExc_MemoryError, "Failed to pack string in %s",
ufunc_name);
goto fail;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, missing the PyMem_RawFree(new_buf);? I probably said that before, but it would seem nice to avoid reallocating a small buffer every time here (maybe directly allocating the result although that isn't possible if the operation is in-place).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this error path indicates there's memory corruption or a bug in the implementation so a memory leak is the least of your worries if you get here. That said, easy enough to add a PyMem_RawFree, thanks for the catch.

Comment on lines 1495 to 1496
assert_array_equal(sizes, np.strings
.str_len(self.a))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert_array_equal(sizes, np.strings
.str_len(self.a))
assert_array_equal(sizes, np.strings.str_len(self.a))

@ngoldbaum
Copy link
Member Author

ngoldbaum commented May 10, 2024

One other thing that struck me is: lstrip/rstrip should be pretty much identical except for calling a different core function.
That should really be consolidated either via C++ or via static data.

Are you talking about the separation between split_whitespace and split_chars? Because both variants consolidate the [l,r]strip distinction into a single strided loop, see the use of STRIPTYPE in stringdtype_ufuncs.cpp.

If it's strip_whitespace and strip_chars, I can take a look at is consolidating those two functions for stringdtype if it turns out the reason for the distinction is to support fixed-width strings but if my vague recollections from six months ago about this are correct we needed it because of limitations of the ufunc system around optional arguments.

@seberg
Copy link
Member

seberg commented May 10, 2024

Ah, nvm that part... might be nice to do something anyway, but I somehow had mindslipped into reading the second copy to be rsplit and the first lsplit.

@ngoldbaum
Copy link
Member Author

Thanks for giving this a look Sebastian. I'm self-merging because this fixes some tests in pandas (see https://github.com/pandas-dev/pandas/actions/runs/8971939049/job/24638804051?pr=58578 with a bunch of failures related to this) and it would be nice to have this in the RC.

@ngoldbaum ngoldbaum merged commit e86c581 into numpy:main May 10, 2024
65 checks passed
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: numpy.strings.lstrip/strip/rstrip fails with nulls
3 participants