Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tail and some others can work better if they'll use a length hints (pep 424) #510

Open
code-of-kpp opened this issue Mar 10, 2021 · 8 comments
Labels
pr-welcome We are open to PRs that fix this issue - leave a note if you're working on it!

Comments

@code-of-kpp
Copy link

code-of-kpp commented Mar 10, 2021

I understand, that recipes were ported directly from the docs, but things like tail do not perform great in some situations.

Note, that when deque is constructed it consumes almost everything from the iterator, so creation of tail iterator is not lazy.

A better possible implementation is:

def tail(n, iterable):
    if isinstance(iterable, collections.abc.Sequence):
        return iter(iterable[-n:])
    hint: int = operator.length_hint(iterable)
    if hint != 0 and n >= hint:
        return itertools.islice(iterable, hint - n, hint)
    return iter(collections.deque(iterable, maxlen=n))

This will make things like

>>> tail(100, range(1 << 30))
<range_iterator at 0x...>
>>> next(tail(100, range(1 << 30)))
1073741724

really fast, things like

>>> tail(100, reversed(range(1 << 30))) # lazy
<itertools.islice at 0x...>
>>> next(tail(100, reversed(range(1 << 30))))
99

somewhat faster (but at least lazy).

And hopefully will "automagically" improve in the future for some other cases like enumerate and map.

@code-of-kpp
Copy link
Author

Please, note, that this is not a PR, as length_hint is not guaranteed to be accurate (but it looks lite it is in practice), so actual implementation should consider other options of the length estimation, like isinstance(obj, Sized) and/or trusting length hints for types, that are known to have accurate hint.

@bbayles
Copy link
Collaborator

bbayles commented Mar 12, 2021

Thanks for the suggestion. I'm open to the "if it's a Sequence, slice it" optimization - you can make a PR for that? But because the "length hint" isn't necessarily true, I'd rather not rely on it.

@code-of-kpp
Copy link
Author

How about using abc.Sized to cover sets and dicts, for example and wrapping the call to deque in a yeild to force laziness?

@code-of-kpp
Copy link
Author

Another approach is to wrap deque call in map and chain or something like that. This way the return type of the function can be a little more transparent, e.g. range_iterator or islice in some cases.

Anyway this can be a breaking change as previously data were effectively copied and original iterator was exhausted before the next item picked from the tail(), forcing StopIteration and exiting loops with original iterator-parameter.

@bbayles bbayles added the pr-welcome We are open to PRs that fix this issue - leave a note if you're working on it! label Mar 14, 2021
@bbayles
Copy link
Collaborator

bbayles commented Oct 18, 2021

To be clear, the pr-welcome is for the tail function.

I'm open to the "if it's a Sequence, slice it" optimization

@nanouasyn
Copy link
Contributor

if isinstance(iterable, collections.abc.Sequence):
    return iter(iterable[-n:])

In this implementation, memory will be immediately allocated for all elements of the tail if we work with lists, strings or tuples. But maybe we're not going to use all the tail elements. For example, if then we plan to use takewhile or something like that. Maybe we should use islice for all Sized iterables?

I'm inclined to do this:

def tail(n, iterable):
    if isinstance(iterable, collections.abc.Sized):
        length = len(iterable)
        yield from islice(iterable, max(0, length - n), length)
    else:
        yield from deque(iterable, maxlen=n)

In this case, we will get the laziest version of tail. What do you think about it?

@rhettinger
Copy link
Contributor

because the "length hint" isn't necessarily true, I'd rather not rely on it.

This is a good decision. I see people misimplement iterators with a __length_hint__ that doesn't change as the iterator gets consumed or that fails to update when an underlying iterable is mutated.

I'm open to the "if it's a Sequence, slice it" optimization

This is a great idea and will greatly improve performance for short tails of long sequences (which is a typical use case).

Note, that when deque is constructed it consumes almost everything from the iterator, so creation of tail iterator is not lazy.

It is not clear to me that this is always (or even sometimes) better. The iterator protocol is split into two phases, iterator creation and iterator consumption, a setup phase and a take care of business phase.

If the two steps are done together, creating the iterator and consuming it immediately, then lazy evaluation has no effect on the user experience. Code in the following form will see zero benefit:

for line in tail(file):
      ...

If the two steps are not done together, a setup phase and later execution phase, it is reasonable to want to shift as much work as possible to the setup phase so that the execution phase is maximally responsive for the user. This is the same reason that we precompile regexes in templating engines — we want the pattern search to run as quickly as possible when requested.

Likewise, lazy evaluation precludes timely closing of the underlying resource:

with open(somefile) as source_file:
    it = tail(10, source_source)

...

for line in tail(it):   # This would fail since the file is already closed
      ...

So, this would be a breaking change. The mitigation would involve with_iter and leaving the file open during the interval between the setup and execution phase. Alternatively, a user would have to wrap the tail call with list to force eager evaluation so that they can close the underlying resource.

@rhettinger
Copy link
Contributor

One other thought: The Sized ABC only checks for the presence of __len__, so the usual (and faster) way is a try/except with len():

try:
    n = len(iterable)
except TypeError:
    deque version
else:
    islice version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-welcome We are open to PRs that fix this issue - leave a note if you're working on it!
Projects
None yet
Development

No branches or pull requests

4 participants