
Memory leak #180

Closed
dmpetrov opened this issue Jul 4, 2016 · 19 comments · Fixed by #181

dmpetrov commented Jul 4, 2016

It looks like pywt has a memory leak. When I process more than 200K small images (<10 KB each), the process consumes all 16 GB of my memory and then stalls (or slows down dramatically).

  • Code

The whash() function from the imagehash library:
JohannesBuchner/imagehash@da9386d

Reproduction code snippet:

import os

import PIL.Image
import imagehash

# filedir is the image directory, onlyfiles is the list of file names in it
res = {}
for filename in onlyfiles:
    fname = os.path.join(filedir, filename)
    try:
        image = PIL.Image.open(fname)
        whash = imagehash.whash(image)
        value = whash.hash.flatten()
        # pack the boolean hash into a single integer, keyed by the numeric file name
        res[int(filename[:-4])] = sum(1 << i for i, b in enumerate(value) if b)
    except Exception as e:
        print("Error. File " + fname + ": ", e)

By the way, the imagehash.phash() function doesn't use pywt and doesn't have this issue.

  • Data

Image subset from Avito competition:
https://www.kaggle.com/c/avito-duplicate-ads-detection

@JohannesBuchner

Perhaps it would be more helpful to write a test case that does not use imagehash, to show that the issue is indeed in pywt.

@JohannesBuchner

Does the memory leak occur with all modes/methods, or only with one in particular?

@JohannesBuchner

Candidates could be these two functions, which allocate a Wavelet object; one should double-check that it is being de-allocated (a rough pywt-only check is sketched after the links).

wavedec2
https://github.com/PyWavelets/pywt/blob/416c697d612ac75acfc3a216a193b2582701b62b/pywt/_multilevel.py

idwtn
https://github.com/PyWavelets/pywt/blob/416c697d612ac75acfc3a216a193b2582701b62b/pywt/_multidim.py
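
A check along these lines (a sketch, assuming the leak is in the pywt calls themselves; not tested here) would loop over the suspect calls and watch the process's peak RSS. On Linux, resource.getrusage reports ru_maxrss in kilobytes, and the value only ever grows, so steady growth across batches points at a leak:

import resource

import numpy as np
import pywt

data = np.ones((32, 32))
for batch in range(10):
    for _ in range(10000):
        # both candidates take a wavelet name and build a Wavelet object internally
        pywt.wavedec2(data, 'haar', level=2)
        pywt.idwtn(pywt.dwtn(data, 'haar'), 'haar')
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("after %d iterations: %d kB peak RSS" % ((batch + 1) * 10000, rss))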

dmpetrov (Author) commented Jul 4, 2016

Reproduction code snippet without imagehash (files is a list of 200K+ input file paths):

import numpy
import PIL.Image
import pywt

mode = 'haar'
level = 3
image_scale = 32
ll_max_level = int(numpy.log2(image_scale))
dwt_level = ll_max_level - level

for fname in files:
    image = PIL.Image.open(fname)
    image = image.convert("L").resize((image_scale, image_scale), PIL.Image.ANTIALIAS)
    pixels = numpy.array(image.getdata(), dtype=numpy.float).reshape((image_scale, image_scale))
    pixels /= 255
    coeffs = pywt.wavedec2(pixels, mode, level=dwt_level)
    dwt_low = coeffs[0]
    med = numpy.median(dwt_low)
    diff = dwt_low > med
    # Ignore the result

I saw the leaking issue with both the 'haar' and 'db4' wavelets.

kwohlfahrt (Member) commented Jul 4, 2016

Can reproduce, will investigate.

import numpy as np
import pywt

pixels = np.ones(32, dtype=np.float)
w = pywt.Wavelet('haar')

for i in range(100000):
    pywt.dwtn(pixels, w)

Throwing in a gc.collect() doesn't help. pywt.dwt is OK (no leak).

Some trivial poking around shows about 0.4 kB leaked per iteration. The amount is not affected by the size of the wavelet or of the data, and it affects at least both the 'sym' and 'db' families; I haven't checked any others.

pympler.muppy shows a leaked array plus a memoryview containing a single integer. That could easily account for the ~512 bytes, so I think we've found our culprit; now to track down exactly where it's coming from. The integer depends on the wavelet type: 16 for 'haar' and 'db1', 20 for 'db5'.
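
For reference, the kind of before/after object summary pympler can produce looks roughly like this (a sketch of the approach, not the exact commands used here):

import numpy as np
import pywt
from pympler import muppy, summary

pixels = np.ones(32)
w = pywt.Wavelet('haar')

before = summary.summarize(muppy.get_objects())
for _ in range(10000):
    pywt.dwtn(pixels, w)
after = summary.summarize(muppy.get_objects())
# show which object types accumulated during the loop
summary.print_(summary.get_diff(before, after))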

Note the following C does not leak:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include "wt.h"

int main() {
    size_t N = 128;
    double * data = malloc(N * sizeof(*data));

    pywt_index_t strides[] = {sizeof(*data)};
    size_t data_shape[] = {N};
    ArrayInfo data_info = {
        .shape = data_shape,
        .strides = strides,
        .ndim = 1,
    };

    Wavelet * w = wavelet('d', 4);
    MODE mode = MODE_ZEROPAD;
    size_t O = dwt_buffer_length(N, w->dec_len, mode);
    double * output = malloc(O * sizeof(*output));
    size_t output_shape[] = {O};
    ArrayInfo output_info = {
        .shape = output_shape,
        .strides = strides,
        .ndim = 1,
    };

    int r = double_downcoef_axis(data, data_info, output, output_info,
                                 w, 0, COEF_APPROX, MODE_ZEROPAD);
    free_wavelet(w);
    free(data);
    free(output);
    return r;
}

kwohlfahrt added the bug label Jul 4, 2016
kwohlfahrt added this to the v0.5.0 milestone Jul 4, 2016
kwohlfahrt (Member) commented Jul 4, 2016

Gotcha. These two lines in dwt_axis are responsible; I'll figure out what to do about them in the morning.

    output_shape = (<size_t [:data.ndim]> <size_t *> data.shape).copy()
    output_shape[axis] = common.dwt_buffer_length(data.shape[axis], wavelet.dec_len, mode)

Fix ready, just looking for similar cases in the rest of the code. It seems an explicit input_shape object is necessary, otherwise things don't get freed when they should. I dislike Cython...

stuaxo commented Jul 5, 2016

Is that a known issue in Cython? It could be worth raising it there.

kwohlfahrt (Member) commented Jul 6, 2016

@stuaxo yes, I will raise it upstream; I have a minimal example. http://trac.cython.org is down at the moment, though, so I have posted to cython-users; I think the message is pending approval.

However, the workaround isn't too bad, so I'm happy using that for now.

stuaxo commented Jul 6, 2016

Cheers :) I only mentioned it because it gave me a horrible flashback to a similar bug I once hit, but never tracked down, in a different library.

kwohlfahrt (Member) commented Jul 6, 2016

@dmpetrov could you please double check that this is fixed now?

dmpetrov (Author) commented Jul 7, 2016

I've checked the master branch version. It works much better now: for one specific scenario it uses 400 MB instead of 1.2 GB of RAM.

However, memory usage is still increasing. The growth rate is 3-10 times lower than it used to be; I'm not sure whether it is another issue in the library or just regular Python behavior.

Anyway, we can now process a million images (one by one) with less than 1 GB of memory (less than 700 MB to be precise), so I think we can close the issue.

@kwohlfahrt (Member)

That still shouldn't be happening. I'll re-open this so I remember to chase it down.

kwohlfahrt reopened this Jul 7, 2016
@JohannesBuchner

@dmpetrov, could you try calling gc.collect() in your loop?

dmpetrov (Author) commented Jul 7, 2016

I'll try tonight.


dmpetrov (Author) commented Jul 8, 2016

I tried gc.collect() in the loop; same result.

@kwohlfahrt (Member)

I am unable to reproduce the issue with the following script: with 10,000,000 loop iterations, memory use stays constant. Perhaps the bug is in PIL.

#!/usr/bin/env python3

import numpy as np
import pywt
import sys

mode = 'haar'
level = 3
image_scale = 32
ll_max_level = int(np.log2(image_scale))
dwt_level = ll_max_level - level

for _ in range(int(sys.argv[1])):
    pixels = np.ones((image_scale, image_scale), dtype="float")
    coeffs = pywt.wavedec2(pixels, mode, level = dwt_level)
    dwt_low = coeffs[0]
    med = np.median(dwt_low)
    diff = dwt_low > med
    # Ignore result

dmpetrov (Author) commented Jul 8, 2016

@kwohlfahrt you are right. I replaced the image-reading part with an in-memory image copy, and now the memory usage is constant.
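
The change was roughly this (a sketch, not the exact code): reuse one in-memory image instead of opening a file on every iteration.

import numpy
import PIL.Image
import pywt

image_scale = 32
base = PIL.Image.new("L", (image_scale, image_scale))    # one image, reused

for _ in range(200000):
    image = base.copy()                                  # instead of PIL.Image.open(fname)
    pixels = numpy.array(image.getdata(), dtype=numpy.float).reshape((image_scale, image_scale))
    pixels /= 255
    coeffs = pywt.wavedec2(pixels, 'haar', level=2)
    dwt_low = coeffs[0]
    med = numpy.median(dwt_low)
    diff = dwt_low > med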

Do you know whether https://github.com/python-pillow/Pillow is the correct PIL repository for opening a bug?

@kwohlfahrt (Member)

That looks like the right place (if you are using Pillow rather than the old PIL); I haven't used it much, to be honest. Closing this now, thanks for reporting and following up :)

dmpetrov (Author) commented Jul 8, 2016

Thank you for the fix!
