Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support detection of shadowed built-in python modules #120

Closed
asfaltboy opened this issue Jan 9, 2024 · 13 comments
Closed

Support detection of shadowed built-in python modules #120

asfaltboy opened this issue Jan 9, 2024 · 13 comments

Comments

@asfaltboy
Copy link
Contributor

Similar to how check_import() compares to names of builtins but this time covering packages like logging or secrets which is a common python gotcha.

The list of these packages exists as the sys.stdlib_module_names tuple from 3.10, so I hope 🙏 we could trivially support this rule in python 3.10+ to begin with.

Some example references:

@gforcada
Copy link
Owner

Hi @asfaltboy thanks for taking the time to provide feedback! 🤗 🎉

So, the idea would be that rather than checking the content of the files, we would check the name of the files themselves and compare them to the list of builtin modules, right? 🤔

That's a very good idea IMHO! ✨

A quick look at the code though shows that usually flake8 is giving us either an ast tree or the raw input, if the user pipes their python code to flake8. Maybe there is an extra option to get the filenames, I would have to investigate.

I'm happy to review PRs on that direction though, if you feel like looking into it! 🌟

@asfaltboy
Copy link
Contributor Author

Hi @gforcada nice to meet you 👋 my apologies for not getting back to you sooner.

I haven't actually thought about how we'd solve it, but what you say does make sense. Thank you for setting me on the right course.

I have checked out the and wrote a test, and now that I'm thinking about the implementation I have a couple of questions:

  1. Do we want to error without pointing to a statement locality? Is a script shadowing a built-in module name being in the project, on it's own, harmful enough? Even if we don't import it?
  2. Or, do we want to reference the import statement that imports the script? And if this is what we want, how do we know when to add it... it seems to me that to support this case, we'd need to change our run into a 2-pass approach:
    a. first run the check for all the files and keep track of all shadowing module names
    b. next, rerun the original checks, as before, adding a new check alongside the import validation that exist today. The new check can simply confirm that the imported name exists in the previously found shadowing module names.

I'm leaning a bit towards (1) as this approach is simpler to start with, and would protect against future import or improper usage downstream (see here for an example).

@gforcada
Copy link
Owner

gforcada commented Feb 5, 2024

@asfaltboy hi! 👋🏾 no need to apologize, I'm not specially known for replying fast in GitHub 😅

your 1) idea sounds good, can we achieve that within flake8 though? 🤔 that would be interesting to know.

Otherwise my idea would be to check the imports, and if the import is not a stdlib import, i.e. from my.package import logging then we can error on that line.

It might be tricky though, as some distributions might want to do that on purpose, but as you were saying, those distributions surely know how to turn off a flake8 warning, while the ones that do not do it on purpose probably would be happy to know about this 😄

Let me know if you need help, otherwise, unfortunately, I would not push it myself, at least not right now

@asfaltboy
Copy link
Contributor Author

can we achieve that within flake8 though? 🤔 that would be interesting to know.

This is the first thing I'm going to check. I'm hoping that filename is already being passed from flake8 according to the property in BuiltinsChecker. But I will check end to end locally and report back in #121

Otherwise my idea would be to check the imports, and if the import is not a stdlib import, i.e. from my.package import logging then we can error on that line.

My fear is that even if we don't import the package it will break when using a package that wants to import the builtin. I certainly experienced this with at least a couple of libraries. I think it was with transformers and scipy, but will try to provide a 'Minimal reproducible example' so we can reason about it easier.

Let me know if you need help, otherwise, unfortunately, I would not push it myself, at least not right now

No worries dude. I have suddenly come across a bit of extra time, but also this feature is not very urgent either. I will do my best to do small bits of incremental work on this.

@gforcada
Copy link
Owner

And flake8-builtins==2.3.0 is released with it! 🎉

Thanks for contributing!! ✨

@reata
Copy link

reata commented Apr 7, 2024

Hi all, thanks for this wonderful library and feature. Much appreciated.

I like this idea to check module name conflict with python built-in. However, A005 sort of works unintended for me. I'm developing this sqllineage package, where I provide the following modules:

from sqllineage.core.parser.sqlfluff.extractors import copy
from sqllineage.core.parser.sqlfluff.extractors import select
from sqllineage import io

They're conflicting with builtin modules, true. But I would argue that these modules, in particular io, are common enough for package owners to name things after. As long as it's not used as top-level module, it's no harm at all.

I'm turning off A005 for now and would love to know your thoughts. Thanks.

@gforcada
Copy link
Owner

gforcada commented Apr 8, 2024

Hi @reata, thanks for using and providing feedback about flake8-builtins 😊

@asfaltboy provided a --builtins-allowed-modules option that one can use to ignore certain builtin modules (in your case io), would that work for you?

The way A005 operates is by looking at the filenames of the modules, rather than at the imports, we have some other codes for that (I think).

@asfaltboy
Copy link
Contributor Author

asfaltboy commented Apr 8, 2024

Thanks both, I'll try to add more details.

This new A005 rule is intended more for application devs than library devs, as it occurs when modules in the python path shadow the built-in modules.

For instance, a user of sqllineage library could create two modules in their current dir:

  1. an empty typing.py file
  2. a run.py file that contains import sqllineage

When a user runs run.py they'll get this error:

python run.py 
Traceback (most recent call last):
  File "/home/asfaltboy/Code/oss/temp/run.py", line 1, in <module>
    from sqllineage import runner
  File "/home/asfaltboy/Code/oss/sqllineage/sqllineage/runner.py", line 4, in <module>
    from typing import Dict, List, Optional, Tuple
ImportError: cannot import name 'Dict' from 'typing' (/home/asfaltboy/Code/oss/temp/typing.py)

With the A005 rule, the user can check their project for this issue, and identify the root cause before the code even runs.

Note that an automated testing with a tool like pytest won't necessarily catch this, as they change the way modules are imported (e.g run_test.py importing run.py does not crash):

❯ flake8 ./
./typing.py:0:1: A005 the module is shadowing a Python builtin module "typing"

@asfaltboy
Copy link
Contributor Author

Note, while testing this out just now, I noticed that python also comes with a bunch of frozen modules, that are necessary for the import system to function in the first place:

>>> import sys
>>> from importlib import FrozenImporter
>>> [k for k, m in sys.modules.items() if m.__loader__ is FrozenImporter]
['_frozen_importlib', '_frozen_importlib_external', 'zipimport', 'codecs', 'abc', 'io', 'stat', '_collections_abc', 'genericpath', 'posixpath', 'os.path', 'os', '_sitebuiltins', 'site', 'importlib._bootstrap', 'importlib._bootstrap_external', 'importlib.machinery']

As you can see, io is one of these, which means it's "pre-cached" when we start the interpreter, and having an io.py module in the current dir does not break things...

However, I'm not sure how stable this list of frozen packages may change over time, and I'm not sure if it's a good idea to complicate the rule's behaviour and exclude packages in this lib, as it might be confusing to end users

@reata
Copy link

reata commented Apr 9, 2024

Thanks both for the detailed information.

I second to @asfaltboy that the A005 is intended more for application devs than library devs. So I would prefer turning off A005 to --builtins-allowed-modules for my use case.

I'd be very interested to know if/when you decide to adapt A005 for libraries so that sqllineage.io module won't be seen as shadowing built-in modules.

BTW: nice to know about this 'frozen modules' thing that there's safeguard. I always thought we're empowered to shadow every built-in module and should tread carefully.

@matrss
Copy link

matrss commented Apr 11, 2024

Based on how I understand what this lint rule is supposed to check for, we get a few false-positives in Open-MSS/MSS#2312.

E.g. we get

mslib/utils/time.py:0:1: A005 the module is shadowing a Python builtin module "time"

because the file is named time.py, but an import of that module would be import mslib.utils.time so there is no name collision (unless, of course, it is imported in a way to get a collision: from mslib.utils import time, but that would be an issue in the importing file, not in the imported module). AFAIU there is no way that this file could ever shadow the builtin time module, unless <...>/mslib/utils (in this case) was added to PYTHONPATH (and that should never happen).

I think this lint rule should consider the full names by which a module is actually importable, and not just check the file name.

@asfaltboy
Copy link
Contributor Author

asfaltboy commented Apr 11, 2024

I think excluding nested modules, assuming flake8 runs from the root of the project, would prevent most of these false positives in library packages (like with @reata 's report above).

Though, we might introduce false negatives, for application projects that choose to nest their scripts out of convenience. E.g: scripts/logging.py, with this directory later added to PYTHONPATH by cd-ing into it for instance.

Is the false positive issue significantly more frequent than the above pattern 🤔? If it is, then I think we should go ahead with this change. If not, then perhaps we need another solution.

@gforcada what are your thoughts?

Side note: I think it would be hard to support all plausible project structure/ import patterns. My hope was that the new rule would discourage people from using the built-in names entirely. That said, I can see that the common pattern of namespaced/nested modules likely avoids the issue, so it's reasonable to support it

@gforcada
Copy link
Owner

Sorry to get late to the party 😓

My thoughts:

  • for junior developers OR small code bases, preventing them from naming their modules time, io, etc I think it is a good idea and we should keep this check enabled by default.

  • for senior developers OR big code bases, where it makes sense to name your modules that overlap with python builtins: you probably are configuring other flake8 plugins, so either disabling this rule completely or using per-file-ignores might not be too much hassle 🍀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants