
Excessive memory usage to scan big projects #1495

Closed
tardyp opened this issue May 23, 2017 · 16 comments
Labels
Enhancement ✨ Improvement to a component, performance

Comments

@tardyp commented May 23, 2017

Steps to reproduce

  1. git clone https://github.com/buildbot/buildbot
  2. cd buildbot
  3. make virtualenv
  4. . .venv/bin/activate
  5. make pylint
  6. top

Current behavior

pylint consumes up to 1GB of memory

Expected behavior

pylint should use less memory

pylint --version output

pylint 1.7.1,
astroid 1.5.2
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]

We are looking for tips to improve the situation, as our CI Docker machines are starting to get randomly killed by the OOM killer.

@rogalski (Contributor) commented May 31, 2017

Thanks for submitting an issue!

I want to give it a shot, although I have never really memory-profiled a Python app. Any tips would be appreciated.

@tardyp (Author) commented May 31, 2017

@rogalski you may want to try https://pypi.python.org/pypi/memory_profiler; I have never used it myself, though.
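
For illustration, a minimal sketch of how memory_profiler could be pointed at pylint; the lint target and arguments are assumptions, not something from this thread:

# A memory_profiler sketch: decorate the function to profile, then run
# "python -m memory_profiler this_script.py" for line-by-line memory stats.
from memory_profiler import profile
from pylint import lint

@profile
def run_pylint():
    # The target path is illustrative; exit=False keeps the interpreter
    # alive so the profiler can print its report (some pylint versions
    # call this parameter do_exit).
    lint.Run(['pylint/'], exit=False)

if __name__ == '__main__':
    run_pylint()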

@beledouxdenis commented Jun 1, 2017

I just had a look at this memory issue as well. Memory is quite a critical point for our project too, and all we needed was a simple syntax check on each file; there was no need for a global check. Each file could be checked stand-alone.

I could have run the executable on each file, but then the processing time increased quite a bit.

Here is the solution I finally found, which reduced the memory used from 1.4 GB to less than 400 MB: we called pylint from a Python script directly, not through the executable, and cleared the astroid_cache of the MANAGER after each check. I extracted the meaningful part of the code:

from os import walk
from os.path import join
from pylint import lint
from pylint.lint import MANAGER

# Reporter, PLUGIN_MODULES, options and paths are defined elsewhere
# (see the note about the Reporter class and options below).
reporter = Reporter()
linter = lint.PyLinter(reporter=reporter)
linter.load_default_plugins()
linter.load_plugin_modules(PLUGIN_MODULES)
linter.load_command_line_configuration(options)

for path in paths:
    for root, dirs, files in walk(path):
        for file in filter(lambda x: x.endswith('.py'), files):
            linter.check(join(root, file))
            # Drop astroid's module cache after each file to cap memory use.
            MANAGER.clear_cache()

if reporter.messages:
    out = '\n'.join('%s:%s (%s) %s' % (message['path'], message['line'], message['id'], message['message'])
                    for message in reporter.messages)

If you would like the options variable or the Reporter() class, I can share them.
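
For reference, a collecting reporter compatible with the snippet above might look like the sketch below; it assumes pylint's BaseReporter API, since the author's actual class was not shared:

# A sketch of a collecting Reporter; the original class was not shared,
# so this is an assumption built on pylint's BaseReporter API.
from pylint.reporters import BaseReporter

class Reporter(BaseReporter):
    def __init__(self):
        super().__init__()
        self.messages = []

    def handle_message(self, msg):
        # Keep only the fields the summary at the end of the snippet uses.
        self.messages.append({
            'path': msg.path,
            'line': msg.line,
            'id': msg.msg_id,
            'message': msg.msg,
        })

    def _display(self, layout):
        # Nothing to print per run; messages are collected instead.
        pass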

@beledouxdenis commented Jun 1, 2017

Well, I was a bit too enthusiastic: it does reduce the memory used, indeed, but it also multiplies the processing time quite a bit.

I guess the cache is quite useful :D. But it caches too much in our case. I will check what exactly takes up so much space in this cache, and see if it can be disabled somehow.

@beledouxdenis commented Jun 1, 2017

New POC:

from os import walk
from os.path import join

import pylru

from pylint import lint
from pylint.lint import MANAGER


class lrucache(pylru.lrucache):
    # astroid calls setdefault() on its cache, which pylru does not implement.
    def setdefault(self, key, value):
        if key in self:
            return self[key]
        self[key] = value
        return value


# Replace astroid's unbounded module cache with a 200-entry LRU cache.
MANAGER.astroid_cache = lrucache(200)

reporter = Reporter()  # custom collecting reporter, as in the previous snippet
linter = lint.PyLinter(reporter=reporter)
linter.load_default_plugins()
linter.load_plugin_modules(PLUGIN_MODULES)
linter.load_command_line_configuration(options)

for path in paths:
    for root, dirs, files in walk(path):
        for file in filter(lambda x: x.endswith('.py'), files):
            linter.check(join(root, file))
            # Empty the imports checker's per-module stack, which otherwise
            # grows without bound (see the discussion below).
            del linter._checkers['imports'][0]._imports_stack[:]

if reporter.messages:
    out = '\n'.join('%s:%s (%s) %s' % (message['path'], message['line'], message['id'], message['message'])
                    for message in reporter.messages)
    raise AssertionError('\n' + out)  # the original snippet used self.fail() from a test case

There were two things that consumed a lot of memory for our project:

  • linter._checkers['imports'][0]._imports_stack, which seems to be a cache for the imports-order check,
  • MANAGER.astroid_cache, which caches the imported modules and matters for the processing time over the whole project. By replacing it with an LRU cache, I lose about 30 seconds of processing time, but it saves 1 GB of memory. Basically, the LRU cache just avoids keeping modules that will no longer be accessed, or will be accessed less often, later in the run, which saves memory.

I guess you have to choose between time and memory by sizing the LRU cache accordingly. In our case memory is what we have less of, and 200 items in the cache seems to be a good setting for us.

The reason I overrode the pylru.lrucache class is that setdefault was not implemented, while it is used by astroid to cache things.

Of course, for the del _imports_stack[:], just deleting it is not a proper solution, as it probably messes up the imports-order check, and maybe others. A better solution would be either:

  • to use an LRU here as well, but this is a list, and LRUs are dicts (it could be replaced by a dict nevertheless, I guess),
  • to remove the cached imports that are no longer needed as time goes by,
  • to store fewer things, rather than the entire import/node information. I guess storing all of it is actually not necessary for the imports-order check, or the other potential checks.

I am no expert on this library, so what I say could be wrong :D. I am open to your remarks :)

@PCManticore (Contributor) commented

Without looking much at the code, I think we can safely avoid storing the entire nodes in _import_stack and keep just the attributes we need, as in the sketch below. For the astroid cache, I'm not sure what a good solution would be; maybe an LRU could work, but someone has to test what that would entail.
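
A sketch of that first idea, written as a method body to drop into pylint's ImportsChecker; the summary type and stored fields are assumptions, not an actual patch:

# A sketch (not pylint's actual code) of recording a lightweight summary
# instead of whole astroid nodes, so the full AST can be garbage-collected.
import collections

ImportSummary = collections.namedtuple(
    'ImportSummary', 'modname lineno col_offset level')

def _record_import(self, node, importedmodname):
    # Store only the attributes the import checks need.
    self._imports_stack.append(ImportSummary(
        modname=importedmodname,
        lineno=node.lineno,
        col_offset=node.col_offset,
        level=getattr(node, 'level', None),
    ))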

@PCManticore PCManticore added the Enhancement ✨ Improvement to a component label Jun 1, 2017
@rogalski (Contributor) commented Jun 2, 2017

Btw. is it a regression from 1.6.x? We started to infer results of safe_infer in 1.7.x; maybe that accounts for the excessive memory usage (although the cache of module AST trees seems like a much more likely cause).

@beledouxdenis commented Jun 2, 2017

We just found out why we had to empty _imports_stack manually, and this might be a bug. I let you be the judge:

Modules are appended to this _imports_stack list in the method _record_import:
https://github.com/PyCQA/pylint/blob/master/pylint/checkers/imports.py#L548

Which is called in the two methods visit_import and visit_importfrom, which are themselves called when any of the below messages is activated:

  • import-error
  • relative-beyond-top-level
  • cyclic-import
  • wildcard-import
  • deprecated-module
  • relative-import
  • reimported
  • import-self
  • misplaced-future
  • multiple-imports
  • wrong-import-order
  • ungrouped-imports
  • wrong-import-position

There is indeed a method that empties this list regularly, leave_module:

However, it seems to be called only when one of the three below messages is activated:

  • wrong-import-order
  • ungrouped-imports
  • wrong-import-position

This makes us conclude that the list can be filled but never emptied, depending on which messages you activate in the linter.

That is our case: we activated, for instance, the message relative-import, but none of the three messages above that empty this _imports_stack. Once I add one of those three messages, say wrong-import-order, then, without emptying the list manually (with del linter._checkers['imports'][0]._imports_stack[:] in my code above), the memory used is the same as when I empty the list manually without activating wrong-import-order.

Maybe

@check_messages(*(MSGS.keys()))

should be put on leave_module instead of

@check_messages('wrong-import-order', 'ungrouped-imports', 'wrong-import-position')

Well, at least not on the whole method, as it checks things specific to the above three messages/checks, but at least on the part

        self._imports_stack = []
        self._first_non_import_node = None

That said, as I am no expert on this library, maybe the list being emptied only for these 3 messages is by design. That would be odd, though, as enabling an extra check in the linter makes me end up using less memory.
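
A sketch of that suggestion, written as a method body for pylint's ImportsChecker: the @check_messages guard comes off leave_module, the order-related checks are guarded by hand, and the state reset always runs. This is an illustration of the idea, not the actual patch:

# Sketch only: always reset per-module state in leave_module, and gate
# the order-related checks on message activation instead of the decorator.
def leave_module(self, node):
    if (self.linter.is_message_enabled('wrong-import-order')
            or self.linter.is_message_enabled('ungrouped-imports')
            or self.linter.is_message_enabled('wrong-import-position')):
        pass  # ...run the existing order/grouping/position checks here...
    # Reset unconditionally so _imports_stack cannot grow across modules
    # even when those three messages are disabled.
    self._imports_stack = []
    self._first_non_import_node = None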

@xmo-odoo (Contributor) commented Jun 2, 2017

Btw. is it a regression from 1.6.x?

Our issues (cf. @beledouxdenis's comments) happen on 1.6.4.

beledouxdenis added a commit to odoo-dev/odoo that referenced this issue Jun 14, 2017
See pylint-dev/pylint#1495

The pylint test now uses less memory:
 - Thanks to a LRU cache for the astroid cache (caching the modules body)
 - Activating the test `ungrouped-imports` then ignoring its messages
   due to a memory leak if not activated
   See pylint-dev/pylint#1495 (comment)
rogalski added a commit to rogalski/pylint that referenced this issue Aug 6, 2017
rogalski added a commit that referenced this issue Aug 7, 2017
@PCManticore PCManticore added this to TODO in Faster pylint Aug 31, 2018
@tardyp (Author) commented Oct 28, 2018

Coming back to this issue: we have been gradually increasing our VM requirement for pylint checking from 1 GB to 2 GB to 4 GB over this year. Now we are starting to have instabilities again; I am not sure whether it is OOM or just the 20-minute no-output timeout. It would be nice to have progress output in order to avoid no-output timeouts, but it looks like there is no such option, is there?

@PCManticore (Contributor) commented

@tardyp right now there is no option for progress output, but it could definitely be useful.
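
In the meantime, a rough workaround when driving pylint programmatically is to print each module from a custom reporter as the linter reaches it. A sketch assuming pylint's reporter API; ProgressReporter is a made-up name, not a pylint option:

# Sketch: emit one line per module so CI sees regular output.
import sys

from pylint.lint import Run
from pylint.reporters.text import TextReporter

class ProgressReporter(TextReporter):
    def on_set_current_module(self, module, filepath):
        print('pylint: checking %s' % (filepath or module),
              file=sys.stderr, flush=True)
        super().on_set_current_module(module, filepath)

# The target path is illustrative; exit=False keeps the process alive
# (older pylint versions call this parameter do_exit).
Run(['mypackage/'], reporter=ProgressReporter(), exit=False)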

jmbowman added a commit to openedx/edx-platform that referenced this issue Mar 2, 2020
The last time we tried this upgrade we encountered timeouts on the quality job, which it now appears were due to the worker running pylint common running out of memory and killing the Jenkins process. Switching to a different worker type with double the RAM (8 GB vs. 4 GB) seems to have fixed this; about 5.5 GB was used. Upstream is aware of the high memory usage on large projects, it's apparently due primarily to a cache of parsed modules: pylint-dev/pylint#1495 .

Even after disabling some of the new checks that have been added, the new version of pylint found about twice as much to complain about. Just bumping the threshold for now to unblock the Django upgrade, we can try automated utilities like pyupgrade to fix some of these later.
@martinpitt commented

I ran into this again with the current version, 2.6.0. The pylint processes pile up more and more memory over time; after a few minutes they were taking over 1.5 GiB of resident RAM. Given that it analyzes hundreds of little files, this sure smells like a memory leak:

$ podman run -it --rm --memory=2G fedora:33

(Not specific to podman -- you can use docker, or a VM, or anything where you can control the amount of RAM)

and inside:

dnf install -y python3-pip git 
pip install pylint
git clone --depth=1 https://github.com/rhinstaller/anaconda
cd anaconda
pylint -j0 pyanaconda/

In top you can see VIRT and RES memory constantly increasing. This happens regardless of -j1 or -j0 (which selects 4 parallel processes for me). Near the end it looks like:

#     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
#   88765 martin    20   0 1597416   1,5g   6580 R  96,3   9,8   5:19.73 pylint

martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
pylint hits 2 GiB per process pretty often [1], thus still causing OOM
failures [2]. Bump the divisor.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
martinpitt added a commit to martinpitt/anaconda that referenced this issue Oct 14, 2020
max() was bogus -- we want *both* RAM and number of CPUs to limit the
number of jobs, not take whichever is highest.

[1] pylint-dev/pylint#1495
[2] pylint-dev/pylint#3899

Cherry-picked from master PR rhinstaller#2923

Related: rhbz#1885635
@hippo91 (Contributor) commented Oct 17, 2020

@martinpitt thanks for your message. Note that there is currently a PR, pylint-dev/astroid#837, that addresses this problem.

@hippo91 (Contributor) commented Oct 31, 2020

@martinpitt PR pylint-dev/astroid#847 has been merged into master. Can you please test whether it improves your situation?

@martinpitt commented

Sorry for taking so long to get back to this! With parallel processes (-j0) I get weird effects (hangs, etc.), but I'll investigate these as part of #3899. I re-ran my reproducer from above with -j1, and memory usage with current pylint 2.6.0 and astroid 2.4.2 is still high:

 18588 martin    20   0 1544748   1,5g   6788 R  99,7   9,5   3:30.76 pylint                                                                                      

I applied the patch from pylint-dev/astroid#847 :

# curl https://github.com/PyCQA/astroid/commit/d62349a424c549b4634c90e471c9f876b99edfeb.patch | patch  /usr/local/lib/python3.9/site-packages/astroid/manager.py

and that makes the memory usage much more reasonable indeed, about 400 MB RSS. Many thanks!

@hippo91 (Contributor) commented Nov 24, 2020

@martinpitt thanks for your feedback. I will close this issue, then.

@hippo91 hippo91 closed this as completed Nov 24, 2020
@DanielNoord DanielNoord moved this from TODO to Done in Faster pylint Feb 10, 2022