[BUG-REPORT] Refcount leak to underlying array when deleting dataframe #2323

Open
schwingkopf opened this issue Jan 5, 2023 · 8 comments

@schwingkopf

I'm trying to use vaex with numpy arrays that reference shared memory and am experiencing problems when trying to unlink the shared memory. Here is a minimal reproducing example:

import numpy as np
from multiprocessing import shared_memory
import time
import vaex

shm = shared_memory.SharedMemory(create=True, size=8)
arr = np.frombuffer(shm.buf, dtype="uint8", count=8)
df = vaex.from_dict(dict(x=arr))

del arr
del df
time.sleep(2)

shm.close()
shm.unlink()

Execution throws the following exception:

Traceback (most recent call last):
  File "<...>\memory_test.py", line 15, in <module>
    shm.close()
  File "<...>\.pyenv\pyenv-win\versions\3.9.10\lib\multiprocessing\shared_memory.py", line 227, in close
    self._mmap.close()
BufferError: cannot close exported pointers exist

It works fine when not creating the dataframe object.
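For reference, a minimal sketch of that control case, i.e. the same steps without the dataframe:

import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=8)
arr = np.frombuffer(shm.buf, dtype="uint8", count=8)

del arr        # drop the only exported pointer into the buffer

shm.close()    # succeeds here: no exported pointers remain
shm.unlink()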

It seems like vaex is still keeping a reference to the array/shm block after deleting the dataframe object. Is that a bug or is there a recommended way to delete all references?

Software information

  • Vaex version (import vaex; vaex.__version__): {'vaex': '4.16.0', 'vaex-core': '4.16.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}
  • Vaex was installed via: pip
  • OS: Windows 10
  • Python: 3.9.10
@schwingkopf (Author)

Just realized the problem from my initial post is much simpler to explain:
there seems to be a refcounting leak in the vaex dataframe:

import numpy as np
import vaex
import sys

arr = np.arange(10)
print(f"Refcount after array creation: {sys.getrefcount(arr)}")

df = vaex.from_dict(dict(x=arr))
print(f"Refcount after df creation: {sys.getrefcount(arr)}")

del df
print(f"Refcount after df deletion: {sys.getrefcount(arr)}")

prints:

Refcount after array creation: 2
Refcount after df creation: 3
Refcount after df deletion: 3

So dataframe deletion is not cleaning up its reference to the array.
Is that a bug or is there any other recommended way to release the array?

@schwingkopf schwingkopf changed the title [BUG-REPORT] Problems unlinking shared memory of arrays after dataframe deletion [BUG-REPORT] Refcount leak to underlying array when deleting dataframe Jan 15, 2023
@schwingkopf (Author)

OK, I think this is not a bug in vaex but rather related to delayed garbage collection in Python:
manually triggering garbage collection with gc.collect() after del df fixes the issue in both examples above.
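For concreteness, a sketch of the workaround applied to the shared-memory example from the first post:

import gc
from multiprocessing import shared_memory

import numpy as np
import vaex

shm = shared_memory.SharedMemory(create=True, size=8)
arr = np.frombuffer(shm.buf, dtype="uint8", count=8)
df = vaex.from_dict(dict(x=arr))

del arr
del df
gc.collect()   # force collection of the cyclic garbage that still holds the array

shm.close()    # no longer raises BufferError
shm.unlink()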

Although I do not understand why garbage collection is delayed after going through a vaex dataframe, I will close the issue as it is most likely not related to vaex internals.

@schwingkopf (Author)

schwingkopf commented Jan 21, 2023

After digging deeper into Python garbage collection internals, I think I closed this one too early.

Using tricks from https://rushter.com/blog/python-garbage-collector/ I can see that the dataframe object still has a non-zero refcount after del df:

import ctypes
import gc

import numpy as np
import vaex

# mirror the start of the CPython object header to read the refcount field directly
class PyObject(ctypes.Structure):
    _fields_ = [("refcnt", ctypes.c_long)]

def array_vaex_leak():
    N = int(0.5e9)
    arr = np.arange(N)
    df = vaex.from_dict(dict(x=arr))
    df_addr = id(df)
    print(f"Refcount before delete: {PyObject.from_address(df_addr).refcnt}")
    del df
    print(f"Refcount after delete: {PyObject.from_address(df_addr).refcnt}")
    gc.collect()
    print(f"Refcount after gc collect: {PyObject.from_address(df_addr).refcnt}")

array_vaex_leak()

Outputs:

Refcount before delete: 3
Refcount after delete: 2
Refcount after gc collect: 0

The fact that the object is removed when calling gc.collect() is a strong hint that a cyclic reference involving the object exists, preventing its immediate removal when calling del df. This needs fixing in the vaex code by removing the cyclic references or by using weakrefs!
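As a general (non-vaex) illustration of the mechanism, a self-referencing object shows the same pattern: del leaves the refcount non-zero and only the cycle detector reclaims the object:

import ctypes
import gc

class PyObject(ctypes.Structure):
    _fields_ = [("refcnt", ctypes.c_long)]

class Node:
    pass

n = Node()
n.self_ref = n   # create a reference cycle: the object references itself
addr = id(n)

del n            # the cycle keeps the refcount at 1
print(f"Refcount after delete: {PyObject.from_address(addr).refcnt}")

gc.collect()     # the cycle detector breaks the cycle and frees the object
print(f"Refcount after gc collect: {PyObject.from_address(addr).refcnt}")  # reads freed memory, for illustration only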

Effectively this behaves like a memory leak until the Python interpreter decides to run garbage collection, or until user code triggers it explicitly via gc.collect() (a relatively costly operation, ~30 ms in my example).
This becomes severe when working with large arrays. In the following example the automatic garbage collection does not run often enough, so the system runs out of memory (at least on my machine):

import numpy as np
import time
import vaex
import os
import psutil

def array_vaex_leak():
    N=int(0.5e9)
    arr = np.arange(N)
    df = vaex.from_dict(dict(x=arr))

for i in range(1000):
    array_vaex_leak()
    time.sleep(0.5)
    print(i)
    print(f"{round(psutil.Process(os.getpid()).memory_info().rss / (1024.**3), 3)} Gbyte") 
0
1.983 Gbyte
1
3.846 Gbyte
2
1.983 Gbyte
3
3.846 Gbyte
4
5.709 Gbyte
5
7.571 Gbyte
6
9.434 Gbyte
7
11.296 Gbyte
8
0.259 Gbyte
Traceback (most recent call last):
  File "<...>\gc_play.py", line 30, in <module>
    array_vaex_leak()
  File "<...>\gc_play.py", line 26, in array_vaex_leak
    arr = np.arange(N)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.86 GiB for an array with shape (500000000,) and data type int32

I tried to debug and locate the cyclic reference using objgraph, but have not succeeded yet.

Maybe someone more skilled, or with knowledge of vaex internals, could help here?
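One possible starting point that needs only the standard gc module: with gc.DEBUG_SAVEALL the collector keeps unreachable objects in gc.garbage instead of freeing them, so the members of the cycle can be listed. A sketch:

import gc

import numpy as np
import vaex

gc.collect()                     # flush unrelated garbage first
gc.set_debug(gc.DEBUG_SAVEALL)   # collected objects go to gc.garbage instead of being freed

df = vaex.from_dict(dict(x=np.arange(10)))
del df
gc.collect()

# every object listed here was reclaimable only via the cycle detector
for obj in gc.garbage:
    print(type(obj))

gc.set_debug(0)
gc.garbage.clear()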

@schwingkopf schwingkopf reopened this Jan 21, 2023
@maartenbreddels (Member)

It's a difficult topic for sure!
I have experimented with this in #1824 but I'm not sure why it failed. Maybe this is food for thought?
Let me rebase that PR to see what the failure was.
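For context, a minimal sketch of the weakref idea (hypothetical class names, not the actual code from #1824): if the child object holds only a weak back-reference to its parent, the pair no longer forms a cycle, and the parent is freed by refcounting alone:

import weakref

class Column:
    def __init__(self, df):
        self._df = weakref.ref(df)   # weak back-reference: forms no cycle, does not keep df alive

    @property
    def df(self):
        return self._df()            # may return None once the dataframe is gone

class DataFrame:
    def __init__(self):
        self.columns = [Column(self)]   # strong forward reference only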

@anthonycorletti (Contributor)

I'm running into a similar error as the one described in a previous issue, #2062. @schwingkopf, I'm curious whether downgrading numpy lets your code run successfully?

@anthonycorletti (Contributor)

numpy 1.23 had lots of changes, so if you're using 1.23+ there might be something in there that is related: https://github.com/numpy/numpy/releases/tag/v1.23.0

@schwingkopf (Author)

@anthonycorletti thanks for the hint. I just tried the example from my first post:

  • Numpy == 1.22: works
  • Numpy == 1.23: does not work

Interesting... any idea what that means? For the problem to appear, it still requires interaction with a vaex df.

@anthonycorletti (Contributor)

Interesting... any idea what that means? For the problem to appear, it still requires interaction with a vaex df.

Happy to hear this at least got something working for you. I'm not exactly sure what it means, unfortunately. I know that 1.22.4 has problems with mmap, which might be due to this change in numpy: numpy/numpy#21446
