
string(-like)-specific split (and related methods?) #420

Open
abarnert opened this issue Apr 14, 2020 · 2 comments

Labels
pr-welcome We are open to PRs that fix this issue - leave a note if you're working on it!

Comments

@abarnert
Contributor

I've got a program that iterates over string.split(sep), not expecting huge strings… and then someone feeds it a 30MB string, and it's suddenly using 240MB of extra memory. Perfect use case for (more-)itertools. Unfortunately, there doesn't seem to be a good replacement.

If I wanted to split on whitespace I could just use split_at(string, str.isspace). If I wanted to split on a single character, split_at(string, sep.__eq__). But what if I want to split on a (potentially) multi-character separator? You can do it with windowed (although I think it ends up getting easier to read with groupby than with any of the splitters), or you can do it with locate (which would even let you store the indices and create a randomly-accessible substring view), but I don't think there's a simple one-liner that's readable and has the same edge-case semantics as split. (Plus, whatever you write, for any of these cases, you still have to map ''.join over the results since you're hoping to iterate over strings.)
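For concreteness, the single-character workaround plus the ''.join pass looks like this (a minimal sketch; string and sep here are stand-in values, not from my program):

from more_itertools import split_at

string = 'alpha\nbeta\ngamma'  # stand-in input
sep = '\n'                     # single-character separator only

# split_at yields lists of characters, so each piece needs a ''.join
pieces = map(''.join, split_at(string, sep.__eq__))
assert list(pieces) == string.split(sep)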

A general-purpose split_windowed or something should be doable. But it could be a lot simpler, and probably a lot faster, if it could rely on its input being sequences. And even more so if it could rely on its input being sufficiently string-like that it can use the find method (which includes str and all bytes-like types, even things like memoryview and mmap). I think either of those seems appropriate for more-itertools.

Here's something I found in one of my old projects for the string-like-only version, which passes all of the test_split tests in the CPython 3.8 stdlib test suite (although that doesn't test start and end):

def isplit(string, sep=None, maxsplit=-1, start=None, end=None):
    """Equivalent to
        iter(string[start:end].split(sep, maxsplit))

    ... but without building the list. Arguments sep and maxsplit have
    the same meaning as in str.split, and arguments start and end have
    the same meaning as in other methods like str.find. (That is, they
    are interpreted as in slice notation, including negative values.)
    """
    # Normalize start and end the way slice notation does.
    if start is None:
        start = 0
    elif start < 0:
        start += len(string)
    if end is None:
        end = len(string)
    elif end < 0:
        end += len(string)
    if sep is None:
        # Caveat: each whitespace character counts as its own separator
        # here, so runs of whitespace yield empty strings, where
        # str.split(None) would collapse them and strip the ends.
        def find(start, end):
            for i in range(start, end):
                if string[i].isspace():
                    return i
            return -1
        seplen = 1
    elif not sep:
        raise ValueError('empty separator')
    else:
        seplen = len(sep)
        def find(start, end):
            return string.find(sep, start, end)
    # maxsplit == -1 never counts down to 0, so it means "no limit".
    while maxsplit:
        pos = find(start, end)
        if pos == -1:
            yield string[start:end]
            return
        yield string[start:pos]
        start = pos + seplen
        maxsplit -= 1
    yield string[start:end]
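
A few quick checks of the intended semantics:

>>> list(isplit('a--b--c', '--'))
['a', 'b', 'c']
>>> list(isplit('a--b--c', '--', maxsplit=1))
['a', 'b--c']
>>> list(isplit('spam eggs  ham'))  # note the empty string from the double space
['spam', 'eggs', '', 'ham']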

I couldn't find code lying around for the other two versions, but if you think one of them would fit better, I could probably write them. I think a function that only takes sequences, or only takes string-likes, but returns an iterator rather than a list, is still a good fit for more-itertools, like sliced and other examples.
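In the meantime, here's a rough sketch of what the sequence-only version might look like (hypothetical, untested against the stdlib suite); it compares slices instead of calling find, so it works on lists and tuples too, at the cost of the fast substring search:

def split_sliced(seq, sep, maxsplit=-1):
    # Hypothetical sequence-only variant: relies only on len(),
    # slicing, and == between slices, so it isn't limited to
    # string-likes. Worst case is O(len(seq) * len(sep)).
    seplen = len(sep)
    if not seplen:
        raise ValueError('empty separator')
    start = i = 0
    while maxsplit and i <= len(seq) - seplen:
        if seq[i:i+seplen] == sep:
            yield seq[start:i]
            start = i = i + seplen
            maxsplit -= 1
        else:
            i += 1
    yield seq[start:]

# e.g. list(split_sliced([1, 0, 0, 2, 3], [0, 0])) == [[1], [2, 3]]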

For evidence that it's significantly faster than the iterator-based workarounds even in the single-character cases where split_at works, here's a (snipped) IPython transcript (string is the contents of the web2 dictionary; consume and split_at are from more_itertools, groupby from itertools):

$ ipython3
In [267]: %timeit consume(string.split('\n'))
23.1 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [268]: %timeit consume(isplit(string, '\n'))
106 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)    
In [270]: %timeit consume(''.join(g) for k, g in groupby(string, '\n'.__eq__) if not k)
241 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [273]: %timeit consume(''.join(w) for w in split_at(string, '\n'.__eq__))
329 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

$ ipypy3
In [8]: %timeit consume(string.split('\n'))
24.2 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit consume(isplit(string, '\n'))
46.8 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: %timeit consume(''.join(g) for k, g in groupby(string, '\n'.__eq__) if not k)
147 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit consume(''.join(w) for w in split_at(string, '\n'.__eq__))
95.9 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Of course the difference is less dramatic in the isspace case, and a lot bigger with multi-character separators (assuming I even got the windowed version right…). And of course when performance really matters you're going to want a C implementation (maybe just compiling the above with Cython?). But I think there are reasonable cases where a ~4x slowdown in exchange for constant memory rather than ~8N bytes for the list is worthwhile, while a ~14x slowdown is pushing it.

I haven't thought too much about which other substring-ish functions could use similar treatment. Obviously rsplit. I don't think there's much point in fixed-count things like partition, startswith, etc., and I think find/index is already handled by locate, so it may be just split and rsplit.

I don't have a good thought on what name might fit well. split_string if it requires a string-like, split_slices if it only requires a sequence, split_windowed if it has to be fully general? I'm not sure.

@bbayles
Collaborator

bbayles commented Apr 15, 2020

Nice writeup. I'm hesitant to keep adding stuff to the split functions, because they've become complicated and difficult to reason about already. And I don't want this to be a string library.

That said, this is something I've run into and it would be nice to have a thing that did multiple-item splits on more than just strings.

What does your groupby idea look like? And is there a way to use replace (which has a window_size) and then one of the splits to do this incrementally?

@abarnert
Contributor Author

Apparently I didn't save my windowed-groupby implementation anywhere, so I'd have to rebuild it. The actual grouping is easy—just (g for k, g in groupby(windowed(string, len(sep)), tuple(sep).__eq__) if not k), but then you have to adjust the fenceposts and select the right chars from the right windows for the final result, and I can't remember why that ended up being more readable or faster than what I did with postprocessing split_at. And at any rate, it was still ugly and slow, just not as ugly and slow. (And I may not have come up with the best split_at design anyway.)

When I get a chance, I can try to build it again both ways, and also with replace. I suspect they're all going to be significantly harder to understand and slower (albeit less verbose) than looping over string.find, but maybe one of them will be good enough that it's worth using that (plus mapping ''.join) instead of getting a whole other third-party library (which may not even exist yet…) to do string-specific splitting.
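
As a first cut, the replace version might look something like this (an untested sketch: it collapses each occurrence of the separator to a one-character sentinel, and assumes the sentinel never appears in the input):

from more_itertools import replace, split_at

def isplit_replace(string, sep, sentinel='\x00'):
    # replace() passes window_size consecutive items to pred, so the
    # predicate can match a multi-character separator.
    pred = lambda *window: window == tuple(sep)
    collapsed = replace(string, pred, sentinel, window_size=len(sep))
    # Now the separator is a single item, and split_at works directly.
    return (''.join(piece) for piece in split_at(collapsed, sentinel.__eq__))

# e.g. list(isplit_replace('a--b--c', '--')) == ['a', 'b', 'c']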

@bbayles bbayles added the pr-welcome We are open to PRs that fix this issue - leave a note if you're working on it! label Jul 10, 2022