
Continuously increasing memory usage when pylint is run via its API #792

Closed
char101 opened this issue May 22, 2020 · 25 comments · Fixed by #1521
Labels
Bug 🪳 · Help wanted 🙏 (outside help would be appreciated, good for new contributors) · topic-performance

Comments

char101 commented May 22, 2020

Steps to reproduce

When using pylint in-process to lint a file (in this case from vim), the editor's memory usage increases continually. The increase is even larger when the cache is cleared. I believe this happens because the module is still referenced by lru_cache even after it has been deleted from the manager cache, so each new parse adds a new cache entry.

import gc

import astroid
import psutil
from pylint import lint
from pylint.reporters.base_reporter import BaseReporter


class Reporter(BaseReporter):
    def _display(self, layout):
        pass


def run1():
    print('run without clearing cache')
    for _ in range(5):
        lint.Run(['--reports', 'n', 'pyparsing1.py'], reporter=Reporter(), exit=False)
        print(psutil.Process().memory_info().rss)


def run2():
    print('run with cache clear')
    for _ in range(5):
        lint.Run(['--reports', 'n', 'pyparsing1.py'], reporter=Reporter(), exit=False)
        # Drop every module from the astroid manager cache.
        astroid.builder.MANAGER.astroid_cache.clear()
        gc.collect()
        print(psutil.Process().memory_info().rss)


def run3():
    print('run with clearing only the checked module cache')
    for _ in range(5):
        lint.Run(['--reports', 'n', 'pyparsing1.py'], reporter=Reporter(), exit=False)
        # Drop only the module that was just linted.
        del astroid.builder.MANAGER.astroid_cache['pyparsing1']
        gc.collect()
        print(psutil.Process().memory_info().rss)


def main():
    run1()
    run2()
    run3()


if __name__ == '__main__':
    main()

The linted file is the 253 KB pyparsing.py, renamed to pyparsing1.py.

run without clearing cache
118296576
130756608
143630336
156053504
168882176
run with cache clear
181768192
263176192
345382912
427147264
507686912
run with clearing only the checked module cache
589205504
615079936
641482752
665550848
689803264

Current behavior

Increasing memory usage

Expected behavior

Stable memory usage

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.4.1

char101 commented May 22, 2020

LookupMixIn uses lru_cache on its lookup method with unlimited size, which means any instance of a class that uses LookupMixIn will be cached indefinitely.
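
To make the failure mode concrete, here is a minimal standalone sketch (illustrative only, not astroid's actual code): an unbounded functools.lru_cache on a method includes self in the cache key, so every instance that ever called the method stays reachable from the module-level cache.

import functools
import gc
import weakref


class LookupDemo:
    # Stand-in for LookupMixIn: an unbounded cache on a method.
    @functools.lru_cache(maxsize=None)
    def lookup(self, name):
        return name.upper()


obj = LookupDemo()
obj.lookup('x')
probe = weakref.ref(obj)
del obj
gc.collect()
print(probe() is not None)  # True: the cache still references the instance
LookupDemo.lookup.cache_clear()
gc.collect()
print(probe() is None)      # True: only clearing the cache frees it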

@wzorgdrager

Is there any workaround for this?

richtier commented Jan 2, 2021

Perhaps this: https://stackoverflow.com/a/50699209

import functools
import gc

gc.collect()
# Find every functools.lru_cache wrapper alive in the process...
wrappers = [
    a for a in gc.get_objects()
    if isinstance(a, functools._lru_cache_wrapper)]

# ...and empty them all.
for wrapper in wrappers:
    wrapper.cache_clear()

Pierre-Sassoulas added the Bug 🪳, Help wanted 🙏, and topic-performance labels on May 29, 2021
superbobry commented Jul 29, 2021

I wonder if it would make sense for astroid to implement public APIs for controlling caches? For example, astroid could have its own lru_cache wrapper which would record all created LRU caches in a weak container and allow clearing all of them. @Pierre-Sassoulas wdyt?

As an aside, pylint has similar issues and any solution we implement here could probably be applied to pylint as well.

@Pierre-Sassoulas (Member)

Right now the "cache" is simply a mutable default argument value in a function, so some changes need to be made to make that happen. What do you have in mind for this API, @superbobry?

superbobry commented Jul 29, 2021

Sorry if I was vague. I was thinking of something like:

import functools
import weakref

# Weak container: tracking a cache here must not itself keep it alive.
LRU_CACHES = weakref.WeakSet()

def lru_cache(maxsize=128, typed=False):
    def wrapper(func):
        cached_func = functools.lru_cache(maxsize=maxsize, typed=typed)(func)
        LRU_CACHES.add(cached_func)
        return cached_func
    return wrapper

# ...

def clear_caches():
    for c in LRU_CACHES:
        c.cache_clear()

so that any internal astroid/pylint API which needs to be cached is explicitly tracked and invalidated via clear_caches(). I suspect we might also consider invalidating MANAGER caches, but I'm not very familiar with what it caches and when.

@Pierre-Sassoulas (Member)

This seems better than what we have right now. But I think we need some auto-clearing mechanism, because what would otherwise trigger the cache clearing in pylint? I'm absolutely not a caching expert, but maybe we can count the number of times a cache entry is used and keep only the "useful" entries?

@superbobry (Contributor)

I was assuming clear_caches is a public endpoint which can be called by astroid/pylint users. Wdyt?

char101 commented Aug 2, 2021

Last time I checked, the problem was that the cache (or is it a variable?) is a tree of module dependencies. That means, for example, that the os module could be cached twice if we are linting two different files which both import os. It would be more efficient, both in memory and execution time, to use a key-based cache.

@superbobry (Contributor)

Hey @char101, this sounds interesting. Could you link me to the relevant bits of the code where this is happening?

char101 commented Aug 5, 2021

Sorry, it was more than a year ago. Looking at it now, pylint's memory usage seems much better.

module.py

import os

main.py

import gc

import astroid
import psutil

from pylint import lint
from pylint.reporters.base_reporter import BaseReporter


class Reporter(BaseReporter):
    def _display(self, layout):
        pass


def main():
    prev = psutil.Process().memory_info().rss
    print(prev)
    # Lint the same small module repeatedly and print the RSS delta.
    for _ in range(30):
        lint.Run(['--reports', 'n', 'module.py'], reporter=Reporter(), exit=False)
        gc.collect()
        rss = psutil.Process().memory_info().rss
        print(rss, rss - prev)
        prev = rss


if __name__ == '__main__':
    main()

40730624
48734208 8003584
48771072 36864
48779264 8192
48783360 4096
48807936 24576
48820224 12288
48840704 20480
48922624 81920
48955392 32768
48959488 4096
48959488 0
48967680 8192
48988160 20480
49008640 20480
49020928 12288
49090560 69632
49090560 0
49090560 0
49090560 0
49102848 12288
49119232 16384
49143808 24576
49221632 77824
49221632 0
49221632 0
49221632 0
49221632 0
49287168 65536
49287168 0
49287168 0

Unfortunately, a recent pylint installed from PyPI throws an exception when linting pyparsing.py:

Exception on node <ClassDef.ParseResults l.502 at 0x2052926eca0> in file 'R:\pylint\pyparsing1.py'
Traceback (most recent call last):
  File "R:\pylint\main.py", line 26, in <module>
    main()
  File "R:\pylint\main.py", line 18, in main
    lint.Run(['--reports', 'n', 'pyparsing1.py'], reporter=Reporter(), exit=False)
  File "R:\pylint\env\lib\site-packages\pylint\lint\run.py", line 384, in __init__
    linter.check(args)
  File "R:\pylint\env\lib\site-packages\pylint\lint\pylinter.py", line 975, in check
    self._check_files(
  File "R:\pylint\env\lib\site-packages\pylint\lint\pylinter.py", line 1009, in _check_files
    self._check_file(get_ast, check_astroid_module, name, filepath, modname)
  File "R:\pylint\env\lib\site-packages\pylint\lint\pylinter.py", line 1035, in _check_file
    check_astroid_module(ast_node)
  File "R:\pylint\env\lib\site-packages\pylint\lint\pylinter.py", line 1172, in check_astroid_module
    retval = self._check_astroid_module(
  File "R:\pylint\env\lib\site-packages\pylint\lint\pylinter.py", line 1217, in _check_astroid_module
    walker.walk(ast_node)
  File "R:\pylint\env\lib\site-packages\pylint\utils\ast_walker.py", line 77, in walk
    self.walk(child)
  File "R:\pylint\env\lib\site-packages\pylint\utils\ast_walker.py", line 79, in walk
    callback(astroid)
  File "R:\pylint\env\lib\site-packages\pylint\checkers\classes.py", line 904, in leave_classdef
    self._check_unused_private_attributes(node)
  File "R:\pylint\env\lib\site-packages\pylint\checkers\classes.py", line 983, in _check_unused_private_attributes
    assign_attr.expr.name == "cls"
AttributeError: 'Subscript' object has no attribute 'name'

With the 245 KB test_tables.py from PyTables:

40759296
186531840 145772544
207368192 20836352
223911936 16543744
223764480 -147456
225419264 1654784
225538048 118784
225759232 221184
225861632 102400
226070528 208896
226131968 61440
226328576 196608
226582528 253952
226619392 36864
226947072 327680
227172352 225280
227127296 -45056
227127296 0
227164160 36864
227405824 241664
227405824 0
227418112 12288
227495936 77824
227700736 204800
227700736 0
227708928 8192
227840000 131072
227971072 131072
227971072 0
227975168 4096
228040704 65536

@Pierre-Sassoulas (Member)

@char101 are you using the latest pylint? This is supposed to be fixed by pylint-dev/pylint#4439 (i.e. astroid 2.6.5).

char101 commented Aug 5, 2021

I was using the version from PyPI:

> env\scripts\pip list
Package           Version
----------------- -------
astroid           2.6.6
colorama          0.4.4
isort             5.9.3
lazy-object-proxy 1.6.0
mccabe            0.6.1
pip               20.2.3
psutil            5.8.0
pylint            2.9.6
setuptools        49.2.1
toml              0.10.2
wrapt             1.12.1

superbobry added a commit to superbobry/astroid that referenced this issue Aug 19, 2021
The mutable default is problematic when astroid is used as a library,
because it effectively becomes a memory leak, see pylint-dev#792.

This commit moves the cache to the global namespace and adds a public
API entry point to clear it.
Pierre-Sassoulas pushed a commit that referenced this issue Aug 27, 2021
* Removed mutable default value in _inference_tip_cache

The mutable default is problematic when astroid is used as a library,
because it effectively becomes a memory leak, see #792.

This commit moves the cache to the global namespace and adds a public
API entry point to clear it.

* Removed the itertools.tee call from _inference_tip_cached

This commit is not expected to affect behavior and, if anything, should
improve memory usage, because the result is only materialized once (before,
it was also stored in full inside itertools.tee).

keichi commented Nov 8, 2021

I've dug into this issue, as it is still present in the latest release (2.8.4). The following tweaks combined fixed the memory leak for me (RSS is <100 MB after parsing ~100K files):

  1. Disabling @lru_cache used without a capacity bound in the following places:
    • astroid/interpreter/_import/spec.py
    • astroid/interpreter/objectmodel.py
    • astroid/nodes/node_classes.py
    • astroid/transforms.py
  2. Disabling @_cached_generator defined and used in inference.py
  3. Calling astroid.astroid_manager.MANAGER.clear_cache()
  4. Calling astroid.context.clear_inference_tip_cache()

In summary, I think we need to implement a common custom caching mechanism as suggested by @superbobry, and expose an API to clear all the caches. What do you think?
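
For anyone hitting this from a long-lived process, a minimal sketch of applying steps 3 and 4 from the list above between runs (assuming an astroid version that ships clear_inference_tip_cache, i.e. one containing the commit referenced earlier in this thread):

from astroid.astroid_manager import MANAGER
from astroid.context import clear_inference_tip_cache
from pylint import lint


def lint_many(paths, reporter):
    for path in paths:
        lint.Run(['--reports', 'n', path], reporter=reporter, exit=False)
        # Steps 3 and 4 from the list above:
        MANAGER.clear_cache()
        clear_inference_tip_cache()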

@Pierre-Sassoulas (Member)

Thank you for digging into this @keichi. Maybe we can handle 1) by specifying a capacity in @lru_cache and adding an API to change it programmatically? For 2) I don't know the implications of removing it; it seems like it's all or nothing? For 3) and 4) we can add an API to clear all caches instead of having to call multiple functions? Is this what you had in mind?

char101 commented Nov 8, 2021

Another option is to use SQLite or a key-value database to store the cache (possibly compressed); then memory wouldn't be a problem. It might also help speed up execution of the CLI command.

superbobry commented Nov 9, 2021

@char101 switching to SQLite would hide the problem rather than fix it. Any unbounded cache indexed on an astroid AST node is leaky when pylint is used as a library. On top of that, bounded caches could also be expensive if they retain a reference to the linter (indirectly, of course).

@keichi are you willing to give the common caching mechanism a go? Happy to review the PR if @Pierre-Sassoulas agrees it is worth exploring.

@Pierre-Sassoulas my suggestion was to add an API for flushing all caches. I don't have a use-case for changing cache capacity on the fly, but perhaps others do? I would say, though, that all caches should be bounded, and even then it might still be desirable to flush them periodically or in between API calls to reduce memory usage.

@Pierre-Sassoulas (Member)

> Happy to review the PR if @Pierre-Sassoulas agrees it is worth exploring.

Yes, continuously increasing cache without any solution to clear it is bad :)

> I don't have a use-case for changing cache capacity on the fly

Not on the fly, but I guess astroid runs on pretty varied architectures. If the bound is hard-coded, the question of exactly what it should be will arise. The default value we could use is probably in the range of 2 GB to 8 GB, but the ability to change it if you have 256 GB of RAM will come up at some point, I assume.

char101 commented Nov 9, 2021

The module ASTs reference each other, right? Trimming the cache might not necessarily free the objects unless the whole cache is cleared.

Removing a module AST from the cache while it is still being referenced by other modules could result in a memory leak.

So, to be safe, it might be necessary to loop over the cache, use the gc module to find the objects whose referrer count is only 1, and remove only those objects from the cache. This process might need to run several times until the target memory usage is reached, as shown in the sketch below.
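
A rough, CPython-specific sketch of that idea (a hypothetical helper, not an existing astroid API; the refcount threshold is an implementation detail):

import gc
import sys


def trim_cache(cache, target_len):
    # Evict entries that nothing outside the cache references,
    # repeating until the target size is reached.
    while len(cache) > target_len:
        removable = []
        for key in list(cache):
            value = cache[key]
            # CPython-specific: 3 accounts for the dict's reference,
            # the local variable, and getrefcount's own argument.
            if sys.getrefcount(value) <= 3:
                removable.append(key)
        if not removable:
            break  # everything left is still referenced elsewhere
        for key in removable:
            del cache[key]
        gc.collect()  # collect any cycles the evictions released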

keichi commented Nov 9, 2021

I'd be happy to draft a PR to introduce a new caching mechanism with help from you guys.

I agree the capacity should be configurable, but a flushing API is also necessary for certain use cases. For example, I am writing a tool that parses every file in a repository for every Git commit, where I need to empty the cache when checking out a new commit.

My suggestion would be to introduce a cache with a user-configurable capacity and a function to flush all caches.

char101 commented Nov 9, 2021

All the module ASTs reference each other in what becomes a large graph. To trim a graph we need to start from the leaf nodes, so IMO the focus should not be on the cache mechanism alone, since the caches are only part of the graph.

Alternatively, the module ASTs could refer to each other using weakrefs, and when a weakref is dead, trigger a module reparse.

I think the cache is simply a key index (where the key is the module name) into an in-memory graph database.

keichi commented Nov 9, 2021

@char101 Are you talking about AstroidManager.astroid_cache? I don't mean to touch it.

The problems here are @lru_cache and @_cached_generator, because they create references from the global namespace that cannot be removed in the current design. Even AstroidManager.clear_cache() will not remove these references.

Any circular references between the graph nodes are fine, because the nodes will eventually be collected by the GC once they become unreachable. @lru_cache and @_cached_generator create references from the global namespace and are thus a problem.
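
The distinction is easy to demonstrate in isolation (an illustrative sketch; the function name infer is made up): a pure reference cycle is collected once unreachable, while an object held by a module-level unbounded cache survives until the cache is cleared.

import functools
import gc
import weakref


class Node:
    pass


# A reference cycle alone is not a leak: the GC reclaims it.
a, b = Node(), Node()
a.peer, b.peer = b, a
probe = weakref.ref(a)
del a, b
gc.collect()
assert probe() is None  # the cycle was collected


# A global unbounded cache IS a leak: it roots its arguments.
@functools.lru_cache(maxsize=None)
def infer(node):
    return id(node)


c = Node()
probe = weakref.ref(c)
infer(c)
del c
gc.collect()
assert probe() is not None  # still alive, held by the cache
infer.cache_clear()
gc.collect()
assert probe() is None      # freed only after clearing the cache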

@superbobry (Contributor)

I've found another suspicious bit:

https://github.com/PyCQA/astroid/blob/47e860f7d5c73d53de77e7c450535e709e5ef99e/astroid/interpreter/objectmodel.py#L95-L97

This mutates self. However, some object models seem to be class-level descriptors, e.g.

https://github.com/PyCQA/astroid/blob/47e860f7d5c73d53de77e7c450535e709e5ef99e/astroid/nodes/scoped_nodes.py#L2135-L2159

I was able to reproduce a leak, even with all caches explicitly flushed, by linting just "class A: ...".

@keichi as a slight aside, maybe a better way to solve the caching issue once and for all is to introduce the concept of a session and only allow caching things on the session object?
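
A minimal sketch of that idea (purely hypothetical names; nothing like this exists in astroid today): every cache lives on a session object, so dropping the session drops all cached state at once.

class Session:
    # All caches are instance state: no module-level dicts, no
    # unbounded lru_cache decorators rooted in the global namespace.
    def __init__(self):
        self.astroid_cache = {}       # module name -> module AST
        self.inference_cache = {}     # node -> inference results

    def clear_caches(self):
        self.astroid_cache.clear()
        self.inference_cache.clear()


# One session per lint run; when it goes out of scope,
# every cache it owned becomes collectable.
def lint_once(paths):
    session = Session()
    for path in paths:
        ...  # parse and check, passing `session` to anything that caches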

@superbobry (Contributor)

Changing ObjectModel.__call__ to return a new instance helps, but there still seem to be other leaks. So, even after doing this plus all of the steps outlined by @keichi above, memory still goes up (and you can see the number of AST nodes in gc.get_objects() roughly doubling on every lint call).

keichi commented Nov 26, 2021

Hmm, strange. I knew there was a small leak even after clearing the caches, but it was small enough that it was practically not a problem (I'm able to parse 100K files, as I wrote earlier).

The session approach would be ideal in the long term, though it would involve a lot of refactoring and breaking changes to the API. Maybe we can base it off AstroidManager?
