Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False positive consider-using-tuple for x in set, because frozen sets are faster #4776

Closed
samueltardieu opened this issue Jul 30, 2021 · 22 comments · Fixed by #4841
Closed
Labels
False Positive 🦟 A message is emitted but nothing is wrong with the code performance
Milestone

Comments

@samueltardieu
Copy link

samueltardieu commented Jul 30, 2021

Bug description

The change introduced in #4768 is unwarranted.

Considering the following code:

def f(a):
    if a in {"abc", "def", "ghi", "klm", "nop"}:
        print(a)

def g():
    for a in {"abc", "def", "ghi", "klm", "nop"}:
        print(a)

it suggests to use a tuple in both cases:

$ pylint /tmp/t.py -v
************* Module t
/tmp/t.py:2:12: R6102: Consider using an in-place tuple instead of set (consider-using-tuple)
/tmp/t.py:6:13: R6102: Consider using an in-place tuple instead of set (consider-using-tuple)

While the second suggestion is right (in-place sets should not be used for iterating over their content), the first one (added in #4768) is a very bad advice. In the case of a membership lookup a set is faster, compiled as a frozenset by Python, and the right tool for the job.

I would suggest that this commit be reverted before the release as it encourages bad practices. A new style check could rather suggest to replace in-place lists and tuples in membership tests by sets when the data type is known to be hashable (as strings or numbers are for example).

Configuration

No response

Command used

pylint /tmp/t.py

Pylint output

************* Module t
/tmp/t.py:2:12: R6102: Consider using an in-place tuple instead of set (consider-using-tuple)
/tmp/t.py:6:13: R6102: Consider using an in-place tuple instead of set (consider-using-tuple)

Expected behavior

I would expect to see the second suggestion (which is sound), and not the first one (which is a bad advice).

Pylint version

pylint 2.10.0-dev0 (5e5f48d05fb6f1322a7420d6ea7586966b08f80a)

OS / Environment

ArchLinux

Additional dependencies

No response

@samueltardieu samueltardieu added Bug 🪲 Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Jul 30, 2021
@Pierre-Sassoulas Pierre-Sassoulas added Discussion 🤔 performance False Positive 🦟 A message is emitted but nothing is wrong with the code and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling Bug 🪲 labels Jul 30, 2021
@Pierre-Sassoulas
Copy link
Member

Hey thank you for opening the issue. Do you have a source for the performance comparison between the two ?

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.10.0 milestone Jul 30, 2021
@samueltardieu
Copy link
Author

samueltardieu commented Jul 30, 2021

>>> def in_set(a): return a in {"abc", "def", "ghi", "klm", "nop"}
...
>>> def in_tuple(a): return a in ("abc", "def", "ghi", "klm", "nop")
...
>>> from timeit import timeit
>>> timeit(lambda: in_set("foobar"))
0.08363517999532633
>>> timeit(lambda: in_tuple("foobar"))
0.1204142420028802

Using tuples (or lists for that matter) instead of sets require that comparisons are done sequentially until the member is found or the tuple. Also, using tuples or lists requires placing the expected-most-often-found elements first as they will be compared first, while in a set a hash comparison will find the matching element (or the absence of the key) in few operations.

As is documented in Python source Objects/setobject.c:

   Use cases for sets differ considerably from dictionaries where looked-up
   keys are more likely to be present.  In contrast, sets are primarily
   about membership testing where the presence of an element is not known in
   advance.  Accordingly, the set implementation needs to optimize for both
   the found and not-found case.

Looking up a non-existing member is the worst case when using tuples or lists because all elements must be compared to the looked-up key before declaring it a miss, while sets are also optimized for this use-case.

@samueltardieu
Copy link
Author

samueltardieu commented Jul 30, 2021

Note that a set lookup is amortized O(1) while tuple and list lookup is O(n). However, when n is small of course it makes no sense to compare costs this way, and this is not what I'm doing in the comment above. Python set objects (and particularly frozenset objects) are optimized for this particular use case including for small collections with very few objects.

Also, I've seen many code fragments looking like:

if s in {"abc", "def", "ghi"}:
  do_something(s)
elif s in {"jkl", "mno", "por"}:
  do_something_else(s)

One must realize that s hash is computed only once and cached into the str structure. Look for "por" or for an non-existing member takes 6 comparisons instead of two optimized set lookups.

@Pierre-Sassoulas Pierre-Sassoulas changed the title Unwarranted new suggestion False positive consider-using-tuple for x in set`, because frozen sets are faster Jul 30, 2021
@Pierre-Sassoulas Pierre-Sassoulas changed the title False positive consider-using-tuple for x in set`, because frozen sets are faster False positive consider-using-tuple for x in set, because frozen sets are faster Jul 30, 2021
@cdce8p
Copy link
Member

cdce8p commented Jul 30, 2021

As I have explained here before, the main reason for this change wasn't performance, but rather code style. That's also why I added it to the optional CodeStyleExtension. As it says in the documentation

Checkers that can improve code consistency. As such they don't necessarily provide a performance benefit and are often times opinionated.

The main goal is to improve code consistency, even if not always the most absolute most performant.

--
There are a few points that you neglected to consider so far:

  • Using tuples everywhere (for loops and in comparisons) will make it easier for beginners to understand when and how to use them.
  • I haven't seen many, if any at all, tutorials using sets for this comparison. It is much more common to suggest lists which is what this change is mainly about.
  • It is not obvious that Python converts the in-place set to a frozenset from the code alone.
  • The in-place defined data structure most often only contains one to three elements. Everything more than that should gets it's one variables where frozensets are more efficient.
  • The performance difference might be 30% in the worst case, but that is a false assumption. Even / especially for small tuples / sets you are just as likely that the item you're looking for is inside the container. With that the difference might only be a few percent. I would argue you won't notice a difference.
  • Tuples are more memory efficient, by a lot! (Although again you probably won't notice a difference.)
  • Tuples are still better than lists.

--
As explained above, I don't see a reason to change this check.
The only thing might worth considering would be to add an additional check which would suggest to use a frozenset for
containers larger than x items in in comparisons.

--

from pympler import asizeof
t = ("aa", "bb", "cc", "dd")
s = {"aa", "bb", "cc", "dd"}
s2 = frozenset(t)
print(f"Tuple: {asizeof.asizeof(t)}")
print(f"Set: {asizeof.asizeof(s)}")
print(f"Frozenset: {asizeof.asizeof(s2)}")
--
Tuple: 296
Set: 440
Frozenset: 440

@samueltardieu
Copy link
Author

The main goal is to improve code consistency, even if not always the most absolute most performant.

That's where our opinions differ I guess. I do not think it is ok to suggest bad practices to a user, especially a beginner, just for the sake of "code consistency". Someone using a set for a membership test is doing nothing wrong and is using the right tool, I would not understand why we would actively suggest using something inferior.

@cdce8p
Copy link
Member

cdce8p commented Jul 30, 2021

As I explained, it isn't inferior. You only considered the performance aspect, which isn't the whole picture in this case.

@samueltardieu
Copy link
Author

Also you write

It is not obvious that Python converts the in-place set to a frozenset from the code alone.

This is easy to check:

>>> import dis
>>> def f():
...     if s in {"abc", "def", "ghi"}:
...             do_something(s)
...     elif s in {"jkl", "mno", "por"}:
...             do_something_else(s)
...
>>> dis.dis(f)
  2           0 LOAD_GLOBAL              0 (s)
              2 LOAD_CONST               1 (frozenset({'def', 'ghi', 'abc'}))
              4 CONTAINS_OP              0
              6 POP_JUMP_IF_FALSE       18

  3           8 LOAD_GLOBAL              1 (do_something)
             10 LOAD_GLOBAL              0 (s)
             12 CALL_FUNCTION            1
             14 POP_TOP
             16 JUMP_FORWARD            16 (to 34)

  4     >>   18 LOAD_GLOBAL              0 (s)
             20 LOAD_CONST               2 (frozenset({'mno', 'jkl', 'por'}))
             22 CONTAINS_OP              0
             24 POP_JUMP_IF_FALSE       34

  5          26 LOAD_GLOBAL              2 (do_something_else)
             28 LOAD_GLOBAL              0 (s)
             30 CALL_FUNCTION            1
             32 POP_TOP
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE

@cdce8p
Copy link
Member

cdce8p commented Jul 30, 2021

I didn't question that behavior. But consider the following, not everyone knows about it.
Just from looking at s in {"a", "b", "c"} someone might not realize that it's converted to a frozenset.

@samueltardieu
Copy link
Author

samueltardieu commented Jul 30, 2021

As I explained, it isn't inferior. You only considered the performance aspect, which isn't the whole picture in this case.

This is wrong, the performance aspects bother me less than the bad practices ones: one should not actively suggest using sequential scanning for membership tests. This is CS 101. Even the official Python tutorial for set (https://docs.python.org/3/tutorial/datastructures.html) presents the membership test as "fast membership testing", i.e. the right tool for this.

I'm not asking for the reverse suggestion to be added although I think this would be a good idea, I'm just asking not to suggest the user to change their code from set to tuple in the case of membership test.

@samueltardieu
Copy link
Author

Just from looking at s in {"a", "b", "c"} someone might not realize that it's converted to a frozenset.

The fact that it is a frozenset rather than a set is an implementation detail, as lookups in a set would be as efficient. The important thing is that it is compiled as a set at compilation time, this is what makes the performances good.

If small sets were built at execution time each time a membership tests was done, I would totally agree with you that tuples would be better, but this is not the case.

@samueltardieu
Copy link
Author

By the way, sorry if I sound passionate about the issue, but with my CS teacher hat on I would hate to see my students change their code from a static set membership test to a tuple one, and it would affect their grade negatively if they had done so without a good justification for using a sequentially scanning tool instead of a hashing one.

@cdce8p
Copy link
Member

cdce8p commented Jul 30, 2021

Thank you @samueltardieu for the discussion about this topic.

I took some time to think about it more and looked at some more code. As it turns out Python is better optimized than I was aware of (although I should probably have expected as much). That made me question if the whole consider-using-tuple check is at all worth it.

import dis
def f(s):
    if s in ["a", "b", "c"]:
        return True
# ---
>>> dis.dis(f)
 50           0 LOAD_FAST                0 (s)
              2 LOAD_CONST               1 (('a', 'b', 'c'))
              4 CONTAINS_OP              0
              6 POP_JUMP_IF_FALSE        6 (to 12)

 51           8 LOAD_CONST               2 (True)
             10 RETURN_VALUE

 50     >>   12 LOAD_CONST               0 (None)
             14 RETURN_VALUE

As you can see from the example above, even lists are internally converted to tuples 🤦🏻 Same is true for for loops, btw. As such there is no real value in suggesting either as better tuple or list.

--
Regarding the performance differences, if we just look at tuple / list vs. set / frozenset, the following is true:

  • As you mentioned (multiple times), sets are almost always faster then tuples for membership tests. (Except for the first element, but that should be replaced by an equality check.)
  • tuples are faster when iterating over them.

--
What do we make with that?
I'm not sure there is any value at all to keeping the consider-using-tuple check around. The only place for it would be for in-place sets in iterations, but I haven't seen any of them in practice.
With regards to sets in comparisons, they are slightly faster but not (yet?) common. Many will still prefer lists / tuples, just because that's what they are used to (I am, although because of a previous misconception). If it get's the job done, it's often times good enough in Python.

--
I'm considering removing the check completely. What do you two think we should do moving forward @samueltardieu @Pierre-Sassoulas ?

--
Some alternatives to consider (mix and match)

  1. Leave the check for sets in iterations. (not really used)
  2. Add new check for tuple / list in comparisons.
  3. Add new check to suggest the use of list because they look better often times?
  4. Have checks (disabled by default) to suggest tuple or list respectively in iterations. Project could choose which style they like. (caveat starred expressions aren't allowed in tuples: [*args])

--
Update
starred expressions are possible in tuples

@samueltardieu
Copy link
Author

samueltardieu commented Jul 30, 2021

Indeed, Python prefers to build read-only versions (frozenset for a set and tuple for a list) when it can determine at compilation time that the object will not be modified. This gives a slight performance advantage in cpython in the case list vs tuple as the tuple is read-only and the interpreter does not need to temporarily increment the reference count of the element it is considering in the tuple case while it does in the list case (as the element could be dropped by another thread during the comparison, or even by the comparison function itself).

Concerning lists vs tuples, as a fan of immutable data structures, I would say that using tuples instead of lists where possible is best because it encourages people at thinking about immutable structures; however, as a long-time Pythonista myself, I tend to use lists because they look nicer than tuples, especially when they have only one element as they don't require the extra comma, and as you mention [*args] is useful as well.

I would keep the check warning about the use of in-place sets in iterations and suggest using tuples or lists, as I cannot think of any good reason for using a set there. Moreover, switching to lists or tuples will not only be more efficient but also give a consistent traversal order across Python interpreters and versions, while the set guarantees nothing in this domain.

@scop
Copy link
Contributor

scop commented Jul 30, 2021

Just a quick note, if data structures get compiled to other, optimized ones (such as lists to tuples and sets to frozensets like they are shown to be here), but nothing documents such behavior and that it's going to stay, then relying on it would be relying on internal implementation/optimization detail which might or might not hold between Python versions and Python implementations (cpython, pypy, ...). Which isn't good.

There's nothing wrong with recommending a semantically correct way to do things if there is one, and in that sense I think there's still value in this check, if the "correct" way can be agreed on, even if it would not bring any performance benefits.

@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Jul 30, 2021

The way I understand, it sets are better performance wise for membership and worst for iterating, while list/tuple are equivalent. Let's not bother pylint's user for something that is equivalent and focus on the set vs list/tuple.

How about we remove the current consider-using-tuple as @cdce8p suggested because it seem really opinionated considering what python does under the hood, and replace it by use-sets-for-membership when a list or tuple is used in a membership and use-list-or-tuple-for-iterating when a set is used for iterating (admittedly I've never seen any code doing that) ?

Multi choice vote here:

  • Remove consider-using-tuple entirely 👀
  • Add use-sets-for-membership later 👍
  • Add use-list-or-tuple-for-iterating or do-not-use-sets-for-iterating later 🎉

Planning wise we could remove consider-using-tuple from 2.10, and add the two new checks later ?

@scop
Copy link
Contributor

scop commented Jul 31, 2021

I've seen literal sets used for iterating a bunch of times. IIRC all of them have been inadvertent, and some of them sources of actual bugs due to the unstable order, such as pre-commit/pre-commit#1117. So in my experience something flagging those is most welcome. Not sure which of the choices represent this wish of having some way to check it best, but I'm placing it on 🎉 above.

Re use-list-or-tuple-for-iterating, I'd personally try to nudge towards using tuples there too, as to me it's the most idiomatic way and does not rely on undocumented (I think?) optimization/implementation details. Perhaps by making the identifier use-tuple-or-list-for-iterating, or something like do-not-use-sets-for-iterating and indicate preference towards tuples over lists also in the help message in both cases.

Is the plan BTW to limit the check against sets for iteration for literal sets only, or all of them? There's a difference to me there; non-literal is more likely to be intentional/harmless, maybe that shouldn't cause a warning at all, dunno.

@cdce8p
Copy link
Member

cdce8p commented Aug 1, 2021

I've seen literal sets used for iterating a bunch of times.

Certainly an argument in favor of adding / keeping the check for iterations.

use-list-or-tuple-for-iterating

Maybe we should split this into two separate checks. One use-sequence-for-iterating, and another consider-using-tuple if a list is used for in-place iterations. That way projects could choose if they want to enforce tuples but independently still keep a check to make sure no in-place sets are used there.

starred expressions in tuples

I mentioned earlier that starred expressions [*args] aren't possible in tuples. That statement was incorrect. After looking at it again, I found that it does work! You just need to remember the trailing comma (which I forget the first time). That means that in-place lists can be fully replaced by in-place tuples.

--
@scop The current consider-using-tuple check is limited to in-place / literal uses of list and set. I agree with your assumption that non-literal uses are more likely intentional.

@Pierre-Sassoulas
Copy link
Member

Maybe we should split this into two separate checks. One use-sequence-for-iterating, and another consider-using-tuple if a list is used for in-place iterations.

That make sense, I like that.

@Pierre-Sassoulas
Copy link
Member

Moving that issue to 2.11 as use-set-for-membership raise problem with unashable element see #4841 (comment) .

@Pierre-Sassoulas
Copy link
Member

We realized in #4841 that some object not being hashable means that suggesting to use set is impossible for something like this:

     letters= ["a", "b"] 
     numbers = [1, 42] 
 if "c" in (letters, numbers,):
     return 

What would be the preferred way for you going forward ?

  • Do not add use-set-for-membership in pylint at all 👀
  • Add use-set-for-membership and keep it like it is, the user will choose between making the object hashable or silencing pylint 👍
  • Add use-set-for-membership but check that all objects in the original container implements __hash__ and __cmp__ or __eq__ (ie they're hashable) 🚀

@cdce8p
Copy link
Member

cdce8p commented Aug 17, 2021

I'd prefer not to add use-set-for-membership. As this is a refactoring suggestion, I believe it would only be worth it if it has a clear real world performance benefit and / or no negative side effects. Neither is the case.

We have to consider it might not be obvious for some that sets require hashable items. Someone might start with a list, get the suggestion to use a set and later adds an item that isn't hashable without realizing it. An error that wasn't necessary to begin with.

Furthermore, there is more maintenance required if we add some kind of heuristic. It won't be perfect so we would need to fix corner cases constantly. In the end we might even need a new check to recognize the error I described earlier.

IMO it's just not worth it.

--
Instead of a direct replacement for all, we could consider something like consider-set-for-membership for lists / tuples with a lot of items (e.g., > 5). There the performance difference might be more noticeable.

@Pierre-Sassoulas
Copy link
Member

This will be fixed by the extension SetMembershipChecker in 2.11.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
False Positive 🦟 A message is emitted but nothing is wrong with the code performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants