Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Astroid is leaking memory #1780

Closed
matusvalo opened this issue Sep 13, 2022 · 18 comments · Fixed by #2072 or #2303
Closed

Astroid is leaking memory #1780

matusvalo opened this issue Sep 13, 2022 · 18 comments · Fixed by #2072 or #2303
Assignees
Milestone

Comments

@matusvalo
Copy link

Astroid is leaking memory when is run multiple times over same file using API

Steps to reproduce

  1. Pick larger file, e.g. file from pylint project: pylint/checkers/variables.py
  2. Run following script:
from pylint.lint import Run
from pylint.lint import pylinter
while True:
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

Current behavior

After ~12 iteration memory goes over 1 GB and keeps growing
image

Expected behavior

Allocated memory should not rise indefinitelly

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.13.0-dev0

@matusvalo
Copy link
Author

matusvalo commented Sep 13, 2022

I did some investigation - I profiled memory using muppy:

from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

for _ in range(15):
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

I got following result:

                                  types |   # objects |   total size
======================================= | =========== | ============
                                   dict |      221794 |     60.35 MB
                                   list |      212895 |     14.44 MB
                                    str |       67663 |      5.82 MB
                                    int |      185941 |      4.97 MB
                                  tuple |       52200 |      2.78 MB
        astroid.nodes.node_classes.Name |       39949 |      1.83 MB
       astroid.nodes.node_classes.Const |       36755 |      1.68 MB
                                   code |        8153 |      1.56 MB
                     tokenize.TokenInfo |       16652 |      1.40 MB
                                   type |        1311 |      1.18 MB
                                 method |       13220 |    826.25 KB
   astroid.nodes.node_classes.Attribute |       12100 |    567.19 KB
  astroid.nodes.node_classes.AssignName |       11793 |    552.80 KB
        astroid.nodes.node_classes.Call |        9234 |    432.84 KB
                                    set |         555 |    389.32 KB

                                                types |   # objects |   total size
===================================================== | =========== | ============
                                                 dict |     2599736 |    764.26 MB
                                                 list |     2394537 |    171.48 MB
                                                  int |     2635012 |     70.37 MB
                                                  str |      387148 |     32.07 MB
                      astroid.nodes.node_classes.Name |      522253 |     23.91 MB
                     astroid.nodes.node_classes.Const |      375519 |     17.19 MB
                                               method |      201034 |     12.27 MB
                                                tuple |      231507 |     12.12 MB
                 astroid.nodes.node_classes.Attribute |      175345 |      8.03 MB
                astroid.nodes.node_classes.AssignName |      161694 |      7.40 MB
                      astroid.nodes.node_classes.Call |      127355 |      5.83 MB
                 astroid.nodes.node_classes.Arguments |       85926 |      3.93 MB
  astroid.nodes.scoped_nodes.scoped_nodes.FunctionDef |       84896 |      3.89 MB
                                      FunctionWrapper |       37788 |      3.46 MB
                                   tokenize.TokenInfo |       39786 |      3.34 MB

Clearly majority of allocated data are dictionaries and lists...

@Pierre-Sassoulas
Copy link
Member

It's hard to know what is inside the dict/list without being able to explore the data. But it's probably part of the AST that are not cleaned up properly, right ?

@matusvalo
Copy link
Author

matusvalo commented Sep 13, 2022

It's hard to know what is inside the dict/list without being able to explore the data.

I am able to get to that information. If you run following script:

from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary
from pympler import asizeof

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

obj = []
for o in all_objects:
    if not isinstance(o, dict):
        continue
    try:
        # some dictionary items are raising exception when accessed
        _ = str(o)
    except Exception:
        continue
    obj.append(o)

Please notify that cache was cleaned so, in theory the memory should not contain any extra astroid information about AST.

You get a lot of following dictionaries:

>>> pprint(obj[-5:])
[{'col_offset': 29,
  'end_col_offset': 33,
  'end_lineno': 1674,
  'fromlineno': 1674,
  'lineno': 1674,
  'name': 'node',
  'parent': <Attribute.name l.1674 at 0x1125b3490>,
  'position': None},
 {'col_offset': 37,
  'end_col_offset': 42,
  'end_lineno': 1436,
  'fromlineno': 1436,
  'lineno': 1436,
  'name': 'nodes',
  'parent': <Attribute.FunctionDef l.1436 at 0x112589fd0>,
  'position': None},
 {'col_offset': 46,
  'end_col_offset': 59,
  'end_lineno': 1876,
  'fromlineno': 1876,
  'lineno': 1876,
  'name': 'default_names',
  'parent': <Call l.1876 at 0x111e92fd0>,
  'position': None},
 {'col_offset': 64,
  'end_col_offset': 68,
  'end_lineno': 2105,
  'fromlineno': 2105,
  'lineno': 2105,
  'name': 'node',
  'parent': <Attribute.node_ancestors l.2105 at 0x11260cfa0>,
  'position': None},
 {'col_offset': 31,
  'end_col_offset': 39,
  'end_lineno': 2105,
  'fromlineno': 2105,
  'lineno': 2105,
  'name': 'ref_node',
  'parent': <Attribute.parent l.2105 at 0x11260ce80>,
  'position': None}]
>>> len(obj)
216564
>>> len([i for i in obj if 'position' in i])
160789
>>> len([i for i in obj if 'parent' in i])
161191

Could you help me understand where in the code base these dictionaries can be created?

@Pierre-Sassoulas
Copy link
Member

It does look like the AST, so probably astroid.parse (?) and/or possibly astroid.NodeNG.infer (?)

@DanielNoord
Copy link
Collaborator

Those dictionaries are the arguments that get passed to postinit functions I think. At least, that is where their reference is according to the gc module.

I tried to do some digging, but I really have no experience here. The only thing I could come up with is that something is sustaining a reference to a number of nodes.NodeNG objects which prevent them from being garbage collected.

import gc
from io import StringIO

from pylint.lint import Run, pylinter
from pylint.reporters.text import TextReporter
from pympler import muppy, summary

from astroid import nodes

prev_nodes = 0
prev_dicts = 0

for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py"], exit=False, reporter=TextReporter(StringIO())
    )
    pylinter.MANAGER.clear_cache()
    gc.collect()

    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)

    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)

This shows that the increase in both node and dict objects is constant after initial startup.

python test2.py

Nodes INC: 137841
Nodes NEW: 137841
Dicts INC: 68924
Dicts NEW: 68924

Nodes INC: 112234
Nodes NEW: 250075
Dicts INC: 51192
Dicts NEW: 120116

Nodes INC: 112234
Nodes NEW: 362309
Dicts INC: 51191
Dicts NEW: 171307

Nodes INC: 112234
Nodes NEW: 474543
Dicts INC: 51191
Dicts NEW: 222498

Nodes INC: 112234
Nodes NEW: 586777
Dicts INC: 51191
Dicts NEW: 273689

As far as my understanding of garbage collections goes this show that the persistence of these objects does not come from the garbage collection randomness but from actual references to these objects where they probably shouldn't be referenced.

My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible... Perhaps somebody has a good strategy to understanding why the gc doesn't collect certain objects?

@jacobtylerwalls
Copy link
Member

My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible...

I think that's exactly what's going on. The child nodes are still holding references to the module in .parent, so when the linter's refcount goes to zero, the module's refcount never has a chance to go to zero also. Seems like a classic use case for weakrefs.

@DanielNoord
Copy link
Collaborator

That would be a massive breaking change though... So not sure it is feasible. I would definitely want to help explore it though.

@jacobtylerwalls jacobtylerwalls added this to the 3.0 milestone Mar 4, 2023
@jacobtylerwalls
Copy link
Member

Yep, 3.0 or nothing! ;)

@DanielNoord
Copy link
Collaborator

I think I ran into some issues with the WeakRef PR though as it breaks some use cases. Not sure it would actually work..

@jacobtylerwalls
Copy link
Member

music21 has a very similar use case, with a set of weakref tools here, and then (in some cases) properties for syntactic sugar when setting / retrieving parents.

One track to explore: we could make NodeNG.parent a property, and have its getter and setter call similar utilities as I linked above. The BC implications would be so small we may not even need to wait for 3.0: calling .parent would still give NodeNG or None.

@DanielNoord
Copy link
Collaborator

Sounds like it is worth exploring. Is that something you would want to build?

@jacobtylerwalls
Copy link
Member

Yep, I'm already seeing some promising results. Looking to do a sequence of small PRs.

@jacobtylerwalls
Copy link
Member

Here's a version of Daniel's script from #1780 (comment) that takes care to delete the references created in the measurements:

import gc

from pylint.lint import Run
from pympler import muppy

from astroid import nodes

prev_nodes = 0
prev_dicts = 0

for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py", '--clear-cache-post-run=y'], exit=False
    )
    gc.collect()

    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)

    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
    del all_objects
    del o

@jacobtylerwalls
Copy link
Member

jacobtylerwalls commented Mar 12, 2023

Looks like we've solved the node leaking, just dicts are left:

$ python3 memory.py

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 102907
Nodes NEW: 102907
Dicts INC: 65091
Dicts NEW: 65091

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11472
Dicts NEW: 76563

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 88023

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 99483

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 110943

@jacobtylerwalls
Copy link
Member

jacobtylerwalls commented Mar 12, 2023

The remaining uses seem to derive from the custom caching decorator in astroid. Looking into converting those uses to just cached_property.

@jacobtylerwalls
Copy link
Member

jacobtylerwalls commented Mar 12, 2023

If we want to close this out, here are some next steps for 3.0.

The two remaining offenders are _get_assign_nodes() and slots(). We can covert them to cached_property, but we'll need to change how they're called, i.e. from slots() to slots, unless we create a wrapper around cachedproperty, but that got us into this trouble in the first place.

Unfortunately, there's some hacking around slots that may not make this possible, so we need to decide/benchmark if caching slots is ultimately even worth it (and we can avoid a deprecation in that case):

https://github.com/PyCQA/astroid/blob/cd0f896672049493b2e3cfc87a327c871f8dd329/astroid/brain/brain_typing.py#L169-L173

So, I suggest:

  • deprecate astroid.decorators.cached
  • remove astroid._cache.CacheManager
  • Benchmark if caching slots() is even worth it
  • Convert _get_assign_nodes() to cached_property

@DanielNoord
Copy link
Collaborator

We don't need to deprecate _cache. All else sounds good, I don't mind breaking APIs as long as pylint benefits from it.

@jacobtylerwalls
Copy link
Member

Oh, right, and _get_assign_nodes is private, duh! Edited to-do list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment