Optimize heapq.merge with for-break(-else) pattern? #108880
Is this on a debug or an optimized build of Python? What are the results for short lists? What if one list is much shorter than the others? What if we merge 100 lists of different lengths (1, 2, ... 100 elements) in order of increasing or decreasing length? Try to find the worst cases for the new code, and we will see whether such cases can be ignored. |
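One way to build the many-lists stress case suggested here (a sketch of mine; the seed and names are illustrative, not from the issue):

```python
import heapq
import random

random.seed(0)  # illustrative; any seed works

# 100 sorted lists of lengths 1, 2, ..., 100, in increasing-length order.
# Use lists[::-1] as the input for the decreasing-length variant.
lists = [sorted(random.randrange(10 ** 4) for _ in range(length))
         for length in range(1, 101)]

merged = list(heapq.merge(*lists))
assert merged == sorted(x for lst in lists for x in lst)
```

Timing both variants (and the same lists fed to the proposed merge) would show whether the order in which short iterators are exhausted matters.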
Extended benchmark with other types (converting the input lists to those types beforehand):
The last one is an iterator with a Python class Iterator:

class Iterator:
    def __init__(self, iterable):
        self.iter = iter(iterable)
    def __iter__(self):
        return self
    def __next__(self):
        return next(self.iter)

Full code:

import random
from timeit import timeit
from statistics import mean, stdev
from collections import deque
import sys
from heapq import *
def merge_proposal(*iterables, key=None, reverse=False):
    '''Merge multiple sorted inputs into a single sorted output.

    Similar to sorted(itertools.chain(*iterables)) but returns a generator,
    does not pull the data into memory all at once, and assumes that each of
    the input streams is already sorted (smallest to largest).

    >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
    [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]

    If *key* is not None, applies a key function to each element to determine
    its sort order.

    >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))
    ['dog', 'cat', 'fish', 'horse', 'kangaroo']
    '''
    h = []
    h_append = h.append

    if reverse:
        _heapify = _heapify_max
        _heappop = _heappop_max
        _heapreplace = _heapreplace_max
        direction = -1
    else:
        _heapify = heapify
        _heappop = heappop
        _heapreplace = heapreplace
        direction = 1

    if key is None:
        for order, it in enumerate(map(iter, iterables)):
            for value in it:
                h_append([value, order * direction, it])
                break
        _heapify(h)
        while len(h) > 1:
            while True:
                value, order, it = s = h[0]
                yield value
                for s[0] in it:
                    _heapreplace(h, s)   # restore heap condition
                    break
                else:
                    _heappop(h)          # remove empty iterator
                    break
        if h:
            # fast case when only a single iterator remains
            value, order, it = h[0]
            yield value
            yield from it
        return

    # Omitted the code for non-None key case
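As a standalone illustration of the fetch-one-element idiom the proposal relies on (my sketch, not from the issue): an unconditional break consumes exactly one item, and the for-loop's else clause runs only when the loop body never ran, i.e. when the iterator is exhausted.

```python
def take_one(it):
    """Fetch one element from *it* without try/except StopIteration."""
    for value in it:
        break               # unconditional break: consumed exactly one item
    else:
        return False, None  # loop body never ran: iterator was exhausted
    return True, value

it = iter([10, 20])
print(take_one(it))   # (True, 10)
print(take_one(it))   # (True, 20)
print(take_one(it))   # (False, None)
```

This is the same control flow the proposed merge uses with `for s[0] in it:` to pull the next value directly into the heap entry.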
funcs = merge, merge_proposal

n = 10 ** 4
iterables = [
    sorted(random.choices(range(n), k=n))
    for _ in range(3)
]

expect = list(merge(*iterables))
for f in funcs:
    result = list(f(*iterables))
    print(result == expect, f.__name__)

class Iterator:
    def __init__(self, iterable):
        self.iter = iter(iterable)
    def __iter__(self):
        return self
    def __next__(self):
        return next(self.iter)

types = [
    ('list', list),
    ('tuple', tuple),
    ('dict', dict.fromkeys),
    ('deque', deque),
    ('generator', lambda iterable: (x for x in iterable)),
    ('string', lambda iterable: ''.join(map(chr, iterable))),
    ('class Iterator', Iterator),
]

for label, converter in types:
    print(label)
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e3 for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
    for _ in range(100):
        for f in funcs:
            converted = list(map(converter, iterables))
            t = timeit(lambda: deque(f(*converted), 0), number=1)
            times[f].append(t)
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

print('Python:', sys.version)
|
I don't know. I don't have a functional PC at the moment, did this on the linked ATO site now. But I got similar results a while ago on my Windows PC with the standard installer from python.org.
How much (if this isn't covered by your next request)?
Increasing lengths:
Decreasing lengths:
Used setup:

n = 10 ** 4
iterables = [
    sorted(random.choices(range(n), k=length))
    for length in range(1, 101)[::-1]
]
See "class Iterator" in the updated extended benchmark above: |
Noting that the proposal is actually much slower under PyPy (7.3.12 on 64-bit Windows), leaving it slower than under current CPython. That's not uncommon for "clever" code - PyPy likes code as straightforward and "primitive" as possible. Wish I could say something more helpful, but I don't know enough. Offhand, PyPy is fiercely focused on optimizing loops, and a loop that - by design - only ever goes around at most once (
|
Sorry - fighting more illusions here! PyPy shows a very similar "slowdown" if I copy/paste the current CPython heapq.py's But I'm at a loss to account for where it comes from. CPython has never had an implementation like it, and I don't understand PyPy's workflow well enough to figure out how to access its checkin history. In any case, I was comparing two entirely different |
OK, apples to apples under PyPy. If I copy current CPython main |
Huh! I thought that code rang a bell. Looks like PyPy adopted a
|
Thanks. From now on I'll also copy the original to ensure apples-to-apples. I've seen Dennis's efforts when I played with my own completely different alternatives. I might've given up on them when I saw the rejection of Dennis's, don't remember. Maybe there's hope for this one, as it's not completely different but just a small change. And I think it's not just faster but also nicer, at least the initialization and end parts. More so for the version with key: Current:
for order, it in enumerate(map(iter, iterables)):
    try:
        next = it.__next__
        value = next()
        h_append([key(value), order * direction, value, next])
    except StopIteration:
        pass

Proposal:

for order, it in enumerate(map(iter, iterables)):
    for value in it:
        h_append([key(value), order * direction, value, it])
        break

I imagine someone not used to such for-break(-else) usage (i.e., most people) might still prefer the current version, at least of the merging part. Although getting |
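A quick standalone check (my sketch, not from the issue) that the two initialization styles skip empty iterables and build the same entries. The iterator/`__next__` fields are left out of the entries here so the two results are directly comparable; `key=len` and `direction = 1` are stand-ins:

```python
iterables = [['dog', 'horse'], [], ['cat', 'fish', 'kangaroo']]
key, direction = len, 1

# Current style: explicit __next__ plus StopIteration handling.
h_current = []
for order, it in enumerate(map(iter, iterables)):
    try:
        nxt = it.__next__      # 'nxt' to avoid shadowing the builtin
        value = nxt()
        h_current.append([key(value), order * direction, value])
    except StopIteration:
        pass

# Proposed style: for-break takes one element; empties are skipped naturally.
h_proposal = []
for order, it in enumerate(map(iter, iterables)):
    for value in it:
        h_proposal.append([key(value), order * direction, value])
        break

print(h_current)   # [[3, 0, 'dog'], [3, 2, 'cat']]
assert h_current == h_proposal
```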
The pypy confusion is all on me. PyPy usually uses the Python code that ships with CPython, but not always, and I was careless in not checking that first in this case (indeed, it uses CPython's

Every way of interleaving a collection of iterators some of which may "end early" has some kind of "boy - that sure looks weird at first!" wart. The single weirdest thing to me here is the current code's:

yield from next.__self__

I've never used that, and don't expect I ever will, but it's obvious from context what it must do, and I trust that Raymond got the inscrutable details right 😄.

Dennis was aiming at much bigger speedups, and at least under PyPy he got them.

At least under-appreciated and perhaps wholly un-appreciated here: building a heap with tuple entries adds a steep time penalty because of the "tuple" part. While the heap code only cares about So, e.g., if we're merging ints, the merge code never does One thing I learned from implementing sorting is that

If I take the current CPython "no-key" merge code, and fiddle it just a little to put instances of this class on the heap instead:

class Item:
    __slots__ = 'val', 'next'

    def __init__(self, val, next):
        self.val = val
        self.next = next

    def __lt__(x, y):
        return x.val < y.val

then the heap code calls This is part of why Dennis's code is so much quicker under PyPy too (it's also avoiding "triple comparison" overheads for embedding the real values of interest in tuples).

But short of all that, your proposed code here looks like simple changes for modest speed gain. +0.5 from me. |
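To see the point concretely, here's a small standalone sketch (the Loud counter class is mine, not from the comment): with Item entries, every heap comparison goes through Item.__lt__, which touches only the val fields, and the next field is never compared at all.

```python
import heapq

class Item:
    __slots__ = 'val', 'next'
    def __init__(self, val, next):
        self.val = val
        self.next = next
    def __lt__(x, y):
        return x.val < y.val

class Loud:
    """Value wrapper that counts how often it is compared."""
    compares = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Loud.compares += 1
        return self.v < other.v

# Heap entries are Items; the 'next' field (a dummy iterator here) is
# carried along but never participates in comparisons.
h = [Item(Loud(v), iter(())) for v in (5, 3, 8, 1)]
heapq.heapify(h)
order = [heapq.heappop(h).val.v for _ in range(4)]
print(order)  # [1, 3, 5, 8]
print('value comparisons:', Loud.compares)
```

With tuple entries like (value, order, it) the same experiment would also hit the tie-breaker fields whenever two values compare equal.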
Some food for thought. Merging has some properties that heaps can be specialized for. One was mentioned before: find a way to convince the code to call

Another: for reasons explained in code comments, for general use the Python heap code first moves heap "holes" all the way to leaf positions, and then puts the next value into that hole and bubbles it up. That's because for "random" entry orders, and overwhelmingly so in an in-place heap sort, "the next value" is most likely to end up closer to a leaf than to the root.

But a different common way is to put the next value into the hole at once, and move it down a level only if one of the child nodes is smaller. This has a much better best case (the next value may belong in the hole at once), the same worst case (about 2 * lg n compares), and a worse expected case for "random" (etc) insertion orders.

But in the specific application of merging, we're replacing the smallest overall value with the next-smallest in the stream that winner came from. So it's probably close to the value we just yielded, and so probably belongs high in the heap. That makes the alternative way much more attractive. Indeed, the great attraction of tournament-style "loser trees" for merging is that replacement always takes exactly lg n compares. But the alternative heap method can get away with fewer at times.

Anyway, how practical is this? Very, but under PyPy. The only thing that needs to change is the implementation of

Under CPython it loses. That's because CPython is using a C-coded

Here's the code. The only change to

def heapaltreplace(heap, item):
    endpos = len(heap)
    pos = 0
    # Don't do tuple compares! Compare the primary and secondary keys directly.
    # And, as usual, stick to doing only __lt__ compares.
    key1, key2 = item[:2]
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child, and ck1 & ck2 to
        # that child's keys.
        ck1, ck2 = heap[childpos][:2]
        rightpos = childpos + 1
        if rightpos < endpos:
            rk1, rk2 = heap[rightpos][:2]
            if rk1 < ck1 or (rk1 == ck1 and rk2 < ck2):
                childpos = rightpos
                ck1, ck2 = rk1, rk2
        if ck1 < key1 or (ck1 == key1 and ck2 < key2):
            heap[pos] = heap[childpos]
            pos = childpos
            childpos = 2*pos + 1
        else:
            break
    # The slot at pos is empty now.
    heap[pos] = item
|
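As a sanity check (my sketch; the sift-down routine from the comment above is restated so the snippet is self-contained): repeatedly replacing the root with fresh keyed entries and then draining with heapq.heappop must yield the keys in sorted order if the heap invariant survived every replacement.

```python
import heapq
import random

def heapaltreplace(heap, item):
    # Sift the new item down from the root, descending a level only when
    # a child is smaller; compare the two keys directly, __lt__ only.
    endpos = len(heap)
    pos = 0
    key1, key2 = item[:2]
    childpos = 2 * pos + 1
    while childpos < endpos:
        ck1, ck2 = heap[childpos][:2]
        rightpos = childpos + 1
        if rightpos < endpos:
            rk1, rk2 = heap[rightpos][:2]
            if rk1 < ck1 or (rk1 == ck1 and rk2 < ck2):
                childpos = rightpos
                ck1, ck2 = rk1, rk2
        if ck1 < key1 or (ck1 == key1 and ck2 < key2):
            heap[pos] = heap[childpos]
            pos = childpos
            childpos = 2 * pos + 1
        else:
            break
    heap[pos] = item

random.seed(2)
heap = [[random.random(), i, 'payload'] for i in range(50)]
heapq.heapify(heap)
for i in range(200):
    heapaltreplace(heap, [random.random(), 50 + i, 'payload'])
popped = [heapq.heappop(heap)[:2] for _ in range(50)]
print('drained in sorted order:', popped == sorted(popped))
```

The secondary integer key keeps all (key1, key2) pairs distinct, so the drain order is fully determined.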
Those are some good ideas, and I've tried some more ideas myself and I'm quite interested to pursue them all later, but here I'll stick with just the for-break-else one, hoping it's small enough to actually get adopted. |
Feature or enhancement
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
https://discuss.python.org/t/optimize-heapq-merge-with-for-break-else-pattern/32931?u=pochmann
(@serhiy-storchaka said to do this on GitHub instead)
Proposal:
@rhettinger just mentioned a willingness for an optimized heapq.merge:

I propose to optimize heapq.merge by using what I call the for-break(-else) pattern in order to get only the next element of an iterator (using an unconditional break), which in my experience is significantly faster than calling next() or __next__ and can also lead to simpler code (depends on the case).

Benchmark for merging three sorted lists of 10,000 elements each:

Here's a comparison with the current implementation:

Initialization: Put the non-empty iterators into the heap (note I include the iterator itself instead of its __next__):

Merging while multiple iterators remain:

End when only one iterator remains:

Benchmark script:

Attempt This Online!
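A minimal standalone timing sketch of that claim (mine, not the issue's benchmark; absolute numbers vary by machine and Python build), comparing a next()-driven loop with a for-break fetch over the same iterator:

```python
from timeit import timeit

values = list(range(10 ** 5))

def via_next(values):
    it = iter(values)
    total = 0
    while True:
        try:
            total += next(it)
        except StopIteration:
            return total

def via_for_break(values):
    it = iter(values)
    total = 0
    while True:
        for value in it:    # fetch exactly one element ...
            total += value
            break           # ... via an unconditional break
        else:
            return total    # iterator exhausted

assert via_next(values) == via_for_break(values) == sum(values)
for f in (via_next, via_for_break):
    print(f.__name__, timeit(lambda: f(values), number=20))
```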