ENH: add support for nan-like null strings in string replace #26355

ngoldbaum · 2024-04-27T02:19:25Z

This fixes an issue similar to the one fixed by #26353.

In particular, right now np.strings.replace calls the count ufunc to get the number of replacements. This is necessary for fixed-width strings, but it turns out to make it impossible to support null strings in replace.

I went ahead and instead found the replacement counts inline in the ufunc loop. This lets me add support for nan-like null strings, which it turns out pandas needs.
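
As a rough illustration of the idea (in pure Python, not the C++ ufunc loop this PR touches), the sketch below checks for a null first and only then counts matches inline, per element. The names `is_nan_like` and `replace_loop` are hypothetical, and a float NaN in a plain list stands in for a nan-like null entry.

```python
import math


def is_nan_like(value):
    # Stand-in for a nan-like null entry: a float NaN stored among strings.
    return isinstance(value, float) and math.isnan(value)


def replace_loop(values, old, new):
    out = []
    for s in values:
        if is_nan_like(s):
            # Null input: propagate the null and never count on it.
            out.append(s)
            continue
        # The count is found inline, inside the per-element loop.
        found = s.count(old)
        out.append(s.replace(old, new, found) if found else s)
    return out


result = replace_loop(["spam spam", float("nan"), "eggs"], "spam", "parrot")
```

Here the null entry passes through untouched while the other entries are replaced as usual.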

I marked this one as a backport and issued it separately from the other PR because the ufuncs fixed by the other PR aren't going to be in numpy 2.0.

mhvk

Looks good modulo some nitpicks...

mhvk · 2024-04-28T13:17:52Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+                        goto next_step;
+                    }
+                    else {
+                    npy_gil_error(PyExc_ValueError,


Indentation off

mhvk · 2024-04-28T13:18:27Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+                    }
+                    else {
+                    npy_gil_error(PyExc_ValueError,
+                                  "Only nan-like null values are not supported "


Delete "Only"?

Thanks, fixed the double-negative and tweaked the wording. Hopefully the version I just pushed reads better.

mhvk · 2024-04-28T13:20:24Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

-        Buffer<ENCODING::UTF8> buf2((char *)i2s.buf, i2s.size);
-        Buffer<ENCODING::UTF8> buf3((char *)i3s.buf, i3s.size);
-        Buffer<ENCODING::UTF8> outbuf(new_buf, max_size);
+        {


Why the new indentation? It already is in the loop.

(And it makes reviewing harder...)

It's because of the new use of goto next_step, I need to define a new lexical scope or define a bunch of variables at the top of the for loop that are only used at the bottom of it, otherwise the compiler complains about jumping over variable declarations.

I'd probably have gone for top of the for-loop myself, but no big deal...

While I don't hate the while (N--) loop in general, I don't think a goto is nice for loop control flow, and I'd much prefer a long for loop instead.
But this file already uses this pattern in a few places, so it doesn't matter here.

mhvk · 2024-04-28T13:22:22Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp


-        PyMem_RawFree(new_buf);
+            npy_int64 found_count = string_count<ENCODING::UTF8>(buf1, buf2, start, end);


I'd just hard code buf1, buf2, 0, NPY_MAX_INT64 - that seems clearer than defining variables that are only used here.

mhvk · 2024-04-29T19:51:31Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

-                                  "as search strings for replace");
+                        npy_gil_error(PyExc_ValueError,
+                                      "Only NaN-like null strings can be used "
+                                      "as as search strings for replace");


Now clearer, but this has a double "as as"

seberg

Looks good to me too, and if we are in a rush, we could put it in.

However, what we are missing are tests for the error paths, I think even the now fixed nan-like null path is untested?

Unless I mind-slipped, I also think the size calculation is odd and should try to use count now?

seberg · 2024-04-30T06:21:53Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+            }
+            else {
+                // replace i2 with i3
+                max_size = i1s.size * (i3s.size/i2s.size + 1);


That didn't change, but now that you have count you should use it, I think.

Also, I am confused by the division. It seems correct, but a bit overly complicated, since you can use i1s.size + difference, giving:

change = i2s.size >= i3s.size ? 0 : i3s.size - i2s.size;
max_size = i1s.size + count * change;

I.e. we replace at most count items. (It might be fewer if string_count can find overlapping matches. If overlaps are impossible in string_count, then I guess the count might be exact.)

Thanks. I agree this logic here is poorly motivated and using the count directly makes more sense.
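
To make the two bounds concrete, here is a small pure-Python check. It is only a sketch: `old_bound`, `count_bound` and `check` are illustrative names, sizes are UTF-8 byte lengths, the search string is assumed non-empty, and `bytes.count` (non-overlapping matches) stands in for string_count.

```python
def old_bound(n1, n2, n3):
    # Bound from the hunk above: i1s.size * (i3s.size / i2s.size + 1),
    # using integer division as in C.
    return n1 * (n3 // n2 + 1)


def count_bound(n1, n2, n3, count):
    # Suggested bound: each replacement grows the string by at most
    # (n3 - n2) bytes, and never by a negative amount.
    change = 0 if n2 >= n3 else n3 - n2
    return n1 + count * change


def check(s, old, new):
    s_b, old_b, new_b = (x.encode("utf-8") for x in (s, old, new))
    count = s_b.count(old_b)
    actual = len(s_b.replace(old_b, new_b))
    sizes = (len(s_b), len(old_b), len(new_b))
    return actual, count_bound(*sizes, count), old_bound(*sizes)


# Each tuple is (actual size, count-based bound, division-based bound).
growing = check("spam spam", "spam", "parrot")
shrinking = check("hello", "l", "")
```

In the growing case the count-based bound matches the actual output size, while the division-based bound over-allocates. In the shrinking case both bounds fall back to the input size.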

seberg · 2024-04-30T06:25:42Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

-        }
+            npy_int64 found_count = string_count<ENCODING::UTF8>(
+                    buf1, buf2, 0, NPY_MAX_INT64);
+            if (found_count == -2) {


Suggested change

-            if (found_count == -2) {
+            if (found_count < 0) {

Yes, it returns -2 due to fastsearch, but let's clarify that it can't actually return -1
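
A minimal sketch of the defensive check, in pure Python. Everything here is hypothetical: `count_or_error` is not numpy's string_count, and the condition under which it returns -2 is invented purely so the sketch has a failure case.

```python
def count_or_error(haystack, needle, limit):
    # Hypothetical counter: returns a non-negative count, or the
    # negative code -2 when the count exceeds `limit`.
    found = haystack.count(needle)
    return -2 if found > limit else found


def replace_checked(haystack, needle, new, limit=2**63 - 1):
    found = count_or_error(haystack, needle, limit)
    # Testing `< 0` treats every negative value as a failure, so the
    # caller does not depend on which sentinel the counter happens to use.
    if found < 0:
        raise RuntimeError("count failed")
    return haystack.replace(needle, new, found)
```

With `found < 0`, a counter that later started returning another negative code would still be caught by the same branch.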

seberg · 2024-04-30T06:29:05Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

-        Buffer<ENCODING::UTF8> buf2((char *)i2s.buf, i2s.size);
-        Buffer<ENCODING::UTF8> buf3((char *)i3s.buf, i3s.size);
-        Buffer<ENCODING::UTF8> outbuf(new_buf, max_size);
+        {


While I don't hate the while (N--) loop in general, I don't think a goto is nice for loop control flow, and I'd much prefer a long for loop instead.
But this file already uses this pattern in a few places, so it doesn't matter here.

seberg · 2024-04-30T06:32:09Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+                    else {
+                        npy_gil_error(PyExc_ValueError,
+                                      "Only NaN-like null strings can be used "
+                                      "as search strings for replace");


(just a curious note for now)

I think default strings don't actually hit this, right? The only subtlety (which I don't care about) is that we probably don't mutate the default string stored on the dtype, but rather insert the same string every time.

Ah good point; this error message isn't quite right, using a string as a missing string is also supported. Will update the error to match this.

Not sure what you're getting at about mutating strings, but that's why they're static strings that store the string data in a const buffer. Anyone mutating it is going out of their way to do so.

I was thinking of:

dt1 = StringDType(na_value="spam")
replace(arr(..., dtype=dt1), "spam", "parrot")

doesn't give a StringDType(na_value="parrot"), I think, so it "bloats" memory.

I don't mind that enough to worry (at least for now, I think this is a niche feature).

EDIT: Sorry, first edit didn't use the same replacement as was the na_value... Also, to be clear, I am not sure that should happen!
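
The three cases discussed in this thread can be modelled in a few lines of pure Python. This is a sketch under assumptions, not the numpy implementation: `NULL`, `replace_one` and the error text are hypothetical, and the `na_value` argument only mimics how the dtype's null is configured (a string default, a NaN, or something else).

```python
import math

NULL = object()  # hypothetical marker for a missing entry


def replace_one(s, old, new, na_value):
    operands = [s, old, new]
    if any(op is NULL for op in operands):
        if isinstance(na_value, str):
            # String default: nulls behave like the default string.
            operands = [na_value if op is NULL else op for op in operands]
        elif isinstance(na_value, float) and math.isnan(na_value):
            # NaN-like null: the result is null as well.
            return NULL
        else:
            # Any other kind of null is rejected (wording is illustrative).
            raise ValueError("unsupported null in replace")
    s, old, new = operands
    return s.replace(old, new)


with_default = replace_one(NULL, "spam", "parrot", "spam")
with_nan = replace_one(NULL, "spam", "parrot", float("nan"))
```

In this model the missing entry with a "spam" default comes out as the ordinary string "parrot". The default itself is untouched, which is the behaviour the example above describes.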

ngoldbaum · 2024-04-30T19:05:52Z

numpy/_core/tests/test_stringdtype.py

@@ -1218,6 +1218,7 @@ def test_unary(string_array, unicode_array, function_name):
    "strip",
    "lstrip",
    "rstrip",
+    "replace"


@seberg this change makes sure the error paths are tested.

ngoldbaum · 2024-04-30T19:08:42Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+                    else {
+                        npy_gil_error(PyExc_ValueError,
+                                      "Only NaN-like null strings can be used "
+                                      "as search strings for replace");


Ah good point; this error message isn't quite right, using a string as a missing string is also supported. Will update the error to match this.

Not sure what you're getting at about mutating strings, but that's why they're static strings that store the string data in a const buffer. Anyone mutating it is going out of their way to do so.

ngoldbaum · 2024-04-30T19:09:52Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+            }
+            else {
+                // replace i2 with i3
+                max_size = i1s.size * (i3s.size/i2s.size + 1);


Thanks. I agree this logic here is poorly motivated and using the count directly makes more sense.

seberg · 2024-04-30T20:29:06Z

Thanks for following up on the count also!

ngoldbaum added the 09 - Backport-Candidate label Apr 27, 2024

ngoldbaum requested a review from lysnikolaou April 27, 2024 02:19

github-actions bot added the 01 - Enhancement label Apr 27, 2024

mhvk reviewed Apr 28, 2024

ngoldbaum added 2 commits April 29, 2024 13:08

ENH: add support for nan-like null strings in replace

a8eca17

MNT: respond to PR feedback

4cc651b

ngoldbaum force-pushed the fix-replace-nulls branch from 79ea50d to 4cc651b Compare April 29, 2024 19:08

mhvk reviewed Apr 29, 2024

MNT: typo fix

e85e7a5

seberg reviewed Apr 30, 2024

ngoldbaum commented Apr 30, 2024

MNT: respond to sebastian's comments

ffec406

seberg merged commit 4e6d2bf into numpy:main Apr 30, 2024
65 checks passed

charris mentioned this pull request May 2, 2024

ENH: add support for nan-like null strings in string replace #26374

Merged

charris removed the 09 - Backport-Candidate label May 2, 2024
