Memory Usage Explosion #753

Closed · kuntoaji opened this issue Jun 13, 2016 · 45 comments

@kuntoaji

We are experiencing a memory usage explosion on Ubuntu 14.04.1 LTS with Thumbor v6.0.1 and Pillow 3.2.0. Memory usage climbs to ~97%, drops back to ~60%, and after some time climbs back to ~97% again.

After some investigation, we suspect the resize method in thumbor/engines/pil.py is the root cause: if we run gc.collect() in the resize method, our memory usage never hits 97% again.
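
The workaround we tried looks roughly like this (a minimal sketch, not thumbor's actual engine class; Image.ANTIALIAS matches the Pillow 3.x API we run):

import gc

from PIL import Image


class PilEngineSketch(object):
    """Illustration only -- not thumbor's real PIL engine."""

    def __init__(self, image):
        self.image = image  # a PIL.Image.Image instance

    def resize(self, width, height):
        self.image = self.image.resize((int(width), int(height)), Image.ANTIALIAS)
        # Workaround: force a full collection after each resize so garbage
        # left over from finished requests is reclaimed immediately instead
        # of piling up until the default GC thresholds are reached.
        gc.collect()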

Thumbor request URL

Sample URL: http://localhost:7001/unsafe/1920x1080/filename.jpg

Script to reproduce

#! /usr/bin/env ruby
# Hammer thumbor with randomly sized resize requests for a single image
# and report per-request and total timing.

require 'open-uri'

@total = 0
count = 100000

(5000..(5000+count)).each do |i|
  size = 2000 + rand(2000)
  wxh = [size, size].join('x')
  url = "http://localhost:7001/unsafe/#{wxh}/filename.jpg"
  start_time = Time.now
  open url
  end_time = Time.now
  duration = end_time - start_time
  @total += duration
  puts "got: #{url} in #{duration.round(2)}"
end

puts "That took: #{@total.round(2)} seconds"
puts "Average #{(@total / count).round(2)} per image"

Expected behaviour

Memory usage should stay below 90%

Actual behaviour

Memory usage hits ~97%

Operating system and versions

Ubuntu 14.04.1 LTS
Thumbor v6.0.1
Pillow 3.2.0

thumbor.conf

MAX_AGE = 31536000
ALLOW_ANIMATED_GIFS = False
LOADER = 'thumbor.loaders.file_loader'

STORAGE = 'thumbor.storages.mixed_storage'
SECURITY_KEY = 'example'

ALLOW_UNSAFE_URL = True
FILE_LOADER_ROOT_PATH = '/vagrant/images'

REDIS_STORAGE_SERVER_HOST = 'example.com'
REDIS_STORAGE_SERVER_PORT = 6379
REDIS_STORAGE_SERVER_DB = 0
REDIS_STORAGE_SERVER_PASSWORD = None

MIXED_STORAGE_CRYPTO_STORAGE = 'tc_redis.storages.redis_storage'
MIXED_STORAGE_DETECTOR_STORAGE = 'tc_redis.storages.redis_storage'

DETECTORS = [
   'thumbor.detectors.face_detector',
   'thumbor.detectors.profile_detector',
   'thumbor.detectors.glasses_detector',
   'thumbor.detectors.feature_detector'
]
@matclayton

We're seeing something similar. We haven't confirmed what's causing it; right now we just shoot and reboot the affected instances.

@lexx27

lexx27 commented Jun 19, 2016

I have a similar problem, even though actual usage is relatively low.

@damaestro

We are also experiencing this issue. Normal scaling activity usually cycles out the affected nodes, but we see it fairly regularly. We are going to lower the LRU threshold for Redis memory usage to investigate further.

screenshot from 2016-07-11 12-22-36

We currently have Redis tuned to use 80% of available memory and see a consistent leak from there. Notice the spike to ~80% and then the slow leak. Not all instances are affected by this, either.

@kkopachev
Member

There is also a fresh bug report in the Pillow repo: python-pillow/Pillow#2019

@damaestro

I'm going to spend some more time downgrading through different releases to try to find where this issue started. We regularly see memory leaks triggering scaling events, followed by manual intervention to remove the maxed-out nodes. The graph below shows a slow memory climb and then a burst of new nodes being put into service as the existing nodes' performance degrades due to memory contention:

screenshot from 2016-07-15 19-25-05

damaestro added a commit to damaestro/thumbor that referenced this issue Jul 19, 2016
damaestro referenced this issue in damaestro/thumbor Jul 19, 2016
@damaestro

Okay, I think we've confirmed that a gc.collect() does stabilize memory usage, but not in the resize method.

Testing methodology

  • randomly generated resize requests for one image, from 1 to 3000 px WxH
  • siege -b -i against those generated URLs
  • three AWS t2.medium instances behind an ELB running thumbor in a Docker container (4 GB of RAM per instance, build log: https://paste.fedoraproject.org/392667/96725214/)
  • Redis caching origin image requests (1 GB LRU, but only one image source being requested)
  • two thumbor processes, each running 8 threads, balanced via nginx
  • thumbor 6.1.2 vs. ~master plus damaestro@f6cba5e

Thumbor config:

ALLOWED_SOURCES = ['XXX_ORIGIN_XXX']
OPTIMIZERS = ['thumbor.optimizers.jpegtran']
SENTRY_DSN_URL = 'XXX_SENTRY_DSN_XXX'
STORAGE = 'tc_redis.storages.redis_storage'
REDIS_STORAGE_IGNORE_ERRORS = True
REDIS_STORAGE_SERVER_PORT = 6379
REDIS_STORAGE_SERVER_HOST = 'localhost'
REDIS_STORAGE_SERVER_DB = 0
REDIS_STORAGE_SERVER_PASSWORD = None
ENGINE_THREADPOOL_SIZE = 8
# Custom error handling is currently broken.
#USE_CUSTOM_ERROR_HANDLING = True
ERROR_HANDLER_MODULE = 'thumbor.error_handlers.sentry'
MAX_WIDTH = 3000
MAX_HEIGHT = 3000
MAX_SOURCE_SIZE = 3000
QUALITY = 90

Results

Without gc.collect():
screenshot from 2016-07-19 16-22-37

With patch to gc.collect() in Engine.read() from the pil interface:
screenshot from 2016-07-19 15-51-09

At least now we have somewhere more specific to look. The run without gc.collect() would eventually run out of memory; we just ended the test within the same timeframe as the patched test.

@JawsomeJason

@damaestro I included your patch at damaestro@f6cba5e, and I'm seeing a drop in memory usage by about 60%. Have you considered a PR for this?

@damaestro

@thejase it's a nasty hack to help track down where the leak is; by no means is it a fix. Something in Engine.read() is leaking, and that is all this attempts to show. I'm having to run this patch because otherwise we are constantly OOMing workers and auto-scaling. I figured others are running into the same issue, so a quick patch to keep things working while the upstream devs look into this was my plan.

My suspicion is file handle leaks from PIL, or that for some reason img_buffer (a local BytesIO) is leaking and just needs a del or some other magic. I've tested none of this beyond basic profiling and forcing gc.collect().
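
For anyone following along, the hack boils down to something like this (a rough sketch of the idea behind damaestro@f6cba5e, not the actual patch; the helper name is made up):

import gc
from io import BytesIO


def read_with_forced_gc(pil_image, fmt='JPEG', quality=90):
    # Serialize the image the way the engine's read() does, via a local
    # BytesIO buffer...
    img_buffer = BytesIO()
    pil_image.save(img_buffer, fmt, quality=quality)
    results = img_buffer.getvalue()
    img_buffer.close()
    # ...then force a collection so whatever read() leaves behind (buffers,
    # cycles, file handles) is reclaimed per request instead of accumulating.
    gc.collect()
    return results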

@JawsomeJason

That's a fair response. Thanks for further insight from your experiences thus far.

@mvdklip
Contributor

mvdklip commented Oct 20, 2016

Bumped into this while trying to run some JMeter performance tests. Note that I was still running into OOMs even though I was not hitting the Engine.read() code path (I was doing meta requests with file storage enabled). I patched it temporarily like so:

diff --git a/thumbor/handlers/__init__.py b/thumbor/handlers/__init__.py
index 61ea916..c18ffdc 100644
--- a/thumbor/handlers/__init__.py
+++ b/thumbor/handlers/__init__.py
@@ -8,6 +8,7 @@
 # http://www.opensource.org/licenses/mit-license
 # Copyright (c) 2011 globo.com timehome@corp.globo.com

+import gc
 import sys
 import functools
 import datetime
@@ -311,6 +312,10 @@ class BaseHandler(tornado.web.RequestHandler):
                 if should_store:
                     self._store_results(context, results)

+                collected = gc.collect()
+                if collected > 10:
+                    logger.warn('Garbage collector: collected %d objects.' % collected)
+
         self.context.thread_pool.queue(
             operation=functools.partial(self._load_results, context),
             callback=inner,

mvdklip added a commit to nrcmedia/thumbor that referenced this issue Oct 20, 2016
@philipbjorge

philipbjorge commented Nov 23, 2016

I'll add to this: we're currently running with this patch in production, and it causes a marked increase in CPU load, such that we've had to double the number of worker instances we normally require.

@mvdklip
Contributor

mvdklip commented Nov 24, 2016

I've improved the patch listed above to run garbage collection less often. Garbage collection is still triggered by finish_request() inside the handlers, but now it will not run more often than defined by the GC_INTERVAL config variable. The default is to run at most once every 60 seconds (less often if few requests come in).
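
Conceptually, the rate limiting works like this (a sketch only; the names are illustrative, not the exact code in the patch):

import gc
import time

GC_INTERVAL = 60  # seconds; the configurable knob described above

_last_gc = 0.0


def maybe_collect():
    # Called from finish_request(); runs gc.collect() at most once per
    # GC_INTERVAL no matter how many requests come in.
    global _last_gc
    now = time.time()
    if now - _last_gc < GC_INTERVAL:
        return 0
    _last_gc = now
    return gc.collect()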

@okor
Member

okor commented Mar 3, 2017

@damaestro Did you figure out at what version of Thumbor/Pillow this problem began to appear? Thanks

@damaestro

@okor no, I didn't. I unfortunately got distracted and just took the CPU hit and kept using my patch. I'm glad to see you taking an interest in this; it's been a long-standing issue.

@okor
Member

okor commented Mar 7, 2017

@damaestro So you did actually see an increase in CPU? In my local/staging/production environments I didn't actually notice any increase in CPU. Could just be our setup, I guess.

@mvdklip Did you notice a significant change in CPU usage after applying your patch?

@okor
Member

okor commented Mar 7, 2017

@scorphus What would you think of releasing an official patch for this issue? If we can't find a better solution, I think it would be a practical decision to include some variant of the manual GC. As expectations go, I think people would largely be happier to give up a little CPU than to have thumbor nodes swapping/crashing/etc.

I deployed a patch yesterday on our cluster, to one of our nodes. Here are the results:
thumbor_memory_post_gc

Before the deploy, the large changes in memory usage were from manual restarts : /

@scorphus
Member

scorphus commented Mar 7, 2017

@okor: I think we can include these patches in the upcoming release. I'm reviewing all changes since the last release, 6.2.1, and will review this as well.

@damaestro

@okor nothing terrible, but there is some overhead. It would be nice to find a more appropriate place to do the gc.collect(), more dynamically and less often.

@okor
Member

okor commented Mar 8, 2017

@damaestro I am certainly open to suggestions. Did you have somewhere else in mind?

I'm honestly not sure where would be best; I just used the location that @mvdklip used: nrcmedia@59b0595#diff-96c1f5e9855ad12779e42ade3a0c99e1R140

@damaestro

Likely the path of least resistance is to have a tunable, such as the proposed GC_INTERVAL configuration option, with a high default. If we got fancy, it would need to be based on a governor that watches active vs. available memory. Scheduling something for now based on a configurable interval seems like a good short-term compromise.

Where I hesitate is that this behavior was introduced at some point. It's a side effect of a bug in another location, so I've been waiting for the actual bug to be identified. However, this would not be the first time software needs to be defensive due to an external defect.

@okor
Member

okor commented Mar 8, 2017

Just for fun, I dug a little deeper, and it's a bit weirder than I thought it would be. According to the gc module there is no leak: all objects are "collectable". Logging objects before, at, and after a manual gc, everything looks great.

The only behavior that is even slightly suspicious is that when I hit Thumbor with a lot of concurrent requests, the number of objects which can and will be gc'ed grows more than it would if I just sent one request at a time. For debugging purposes, I put the gc at the end of the finish_request method. So it makes sense that some objects will pile up from multiple unfinished requests, since Tornado is using a non-blocking event loop.

If I were to suspect anything, I'd bet that the changes between thumbor ~5.2 and ~6, which if I recall were in part "more async / seriously not blocking this time", are possibly just giving the illusion that there is a leak, and that the garbage collector simply isn't tuned by default in a way that the newer, less blocking Thumbor needs. I think this is what I will believe unless I see evidence to the contrary.
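
If that theory holds, the collector's generation thresholds are the obvious knob to experiment with (the values below are illustrative, not a recommendation):

import gc

# CPython defaults are (700, 10, 10): a generation-0 collection runs once
# allocations minus deallocations of container objects exceed 700.
print(gc.get_threshold())

# Collect more aggressively so garbage from many in-flight, non-blocking
# requests is reclaimed sooner, trading a little CPU for a flatter RSS.
gc.set_threshold(200, 5, 5)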

I'll keep an eye on Thumbor mem/cpu in production over the next couple of weeks. If my theory is correct then Thumbor should stay fairly flat. If anyone else has info/stats to share, please don't be shy : )

@okor
Member

okor commented Mar 8, 2017

@damaestro I super appreciate the suggestions and feedback 😄

This is a side-effect of a bug in another location so I've been waiting for the actual bug to be identified.

I'm not sure that's actually the case. If you are referring to the Pillow issue in which unclosed files were raising an error/warning that was filling up some buffer (lol that bug, ping pong between parties), I don't think it can be the case for me, at least.

I tried to replicate the problem quite a few times using the same packages as production in several small scripts. I saw no indication this was an issue with Python 2.7 and Pillow 3.4.1/3.4.2, and one of the Pillow maintainers was only able to reproduce the issue with Python 3+. So as far as I can tell this is on Thumbor. I'm happy to be proven wrong, of course.

@damaestro

damaestro commented Mar 8, 2017

@okor Just trying to help. I've kept my eye on this for a while, so I'm glad you are engaged with figuring this one out; I ran out of steam. I was not suggesting where the memory hog is located (inside or outside of Thumbor), just that we are discussing a portion of code that I don't think is causing the issue. It goes back to the question of "where is the best place to force a garbage collection?". It does seem that we have a delta to look at, that being Thumbor 5.x to 6.x. Where this gets challenging is the external dependencies and build deps.

Someone with time should be able to step through versions, starting from a known non-memory-issue version, and keep testing with the methodologies proposed here. I would start with the last known version that does not show this memory profile and, rather than upgrading Thumbor, upgrade everything it depends on. If we can validate that the old version works with all the latest deps (yes, I'm assuming we'd have to backport certain API/ABI interfaces, but this could be managed), we can at least isolate whether this is something in Thumbor or not. Minimally, we'd find that a backported patch is the cause. </$0.02>

@mvdklip
Contributor

mvdklip commented Mar 8, 2017

@okor Yes, memory usage dropped (and stabilized) significantly after applying the patch.

@okor
Member

okor commented Apr 10, 2017

People on this thread are encouraged to leave comments on that PR ^

@savar
Contributor

savar commented Apr 10, 2017

image

This is the memory usage when using the OpenCV engine, so I guess it is not Pillow related?

(you see 8 thumbor instances and the drops are restarts during a deploy)

@okor
Member

okor commented Apr 10, 2017

@savar yeah, I couldn't find any correlation with Pillow, as I explained in PR #905

@SergioJorge
Member

Hi, guys.

This problem will be fixed in version 6.3.3.
Thank you all !

@kkopachev
Member

@SergioJorge how was it fixed?

@SergioJorge
Member

@kkopachev

In the PIL engine, the warnings module is called on each request, which makes the list of warning filters grow indefinitely.
You can see the code here.
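
In other words, the leak was the filter list itself: on Python 2 (which thumbor 6.x runs on), warnings.simplefilter() inserts a new entry into warnings.filters on every call without de-duplicating, so calling it per request grows the list forever. A sketch of the shape of the fix (not the actual commit; create_image here is a hypothetical stand-in for the engine method):

import warnings
from io import BytesIO

from PIL import Image

# Fixed: register the filter once, at import time.
warnings.simplefilter('error', Image.DecompressionBombWarning)


def create_image(buffer):
    # Previously something like the commented line below ran here on every
    # request, adding yet another identical entry to warnings.filters:
    # warnings.simplefilter('error', Image.DecompressionBombWarning)
    return Image.open(BytesIO(buffer))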

@SergioJorge SergioJorge reopened this Jun 23, 2017
@SergioJorge
Member

Sorry for closing this issue; I'm still not very accustomed to this development flow.
Can you test to see whether that solves your problems?

@matclayton

matclayton commented Jun 23, 2017 via email

@SergioJorge
Member

@matclayton

Let's wait for the community to see if this solution solves this problem.

@kkopachev
Member

That change definitely helps with memory usage, but I'm not sure it was the cause of the original issue.
Some people report above that the issue also happens when they use the OpenCV engine, which probably does not involve loading the PIL engine (not sure here). The other reason is that there are reports that a manual gc run fixes (mitigates) the issue, so I created this little test script to reproduce the issue and see if gc can help:

import gc
import warnings
import resource

from PIL import Image

for i in range(1000000):
    if (i % 1000 == 0):
        if (i % 10000 == 0):
            collected = gc.collect()
            print('Collected %s' % collected)
        print("Used %d" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    warnings.simplefilter("error", Image.DecompressionBombWarning)

RSS grows over time and each loop becomes slower than the previous one, but gc collects 0 objects and does not make any difference, i.e. removing gc.collect() does not change anything.
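
That matches the filter list simply growing: every entry added by simplefilter() stays reachable through warnings.filters, so it is not garbage and gc.collect() correctly finds nothing to free. On Python 2 this is easy to see (Python 3 de-duplicates, so the count stays flat there):

import warnings

from PIL import Image

for _ in range(5):
    warnings.simplefilter('error', Image.DecompressionBombWarning)
    # Grows by one per iteration on Python 2: simplefilter() there just does
    # filters.insert(0, item) with no duplicate check.
    print(len(warnings.filters))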

@matclayton

We've tested this in production (100+ instances) and the RAM usage looks better, but it's not perfect yet. We see processes going up to 500 MB of RAM, at which point our OOM killer kicks in and reboots them. This definitely seems to have slowed down a lot with this commit, but it's still happening a bit.

@masom
Contributor

masom commented Jun 27, 2017

https://github.com/stackimpact/stackimpact-python might be useful to test what is happening

@aabhassharma

Any progress here? We run a trivial cluster (20 instances), so I'm not sure we're the best load testers, but we can set up some benches if needed.

@matclayton

matclayton commented Aug 23, 2017 via email

@kkopachev
Member

I dug around with the garbage collector disabled and a TestRequestHandler that runs gc.collect() and outputs the number of objects collected, and found an interesting thing.
I created 2 handlers:

# server.py
import gc
gc.disable()
gc.set_debug(gc.DEBUG_LEAK)

# app.py (BaseHandler and ContextHandler are thumbor's own handler base
# classes; tornado.web is already imported there)
class TestHandler(BaseHandler):
    def get(self):
        self.write('{} objects collected'.format(gc.collect()))

class HealthHandler(ContextHandler):  # ContextHandler is important here
    @tornado.web.asynchronous
    def get(self):
        self.finish('works')

# handlers added like this:
(r'/health', HealthHandler, {'context': self.context}),
(r'/test', TestHandler),

Then I ran curl localhost:8888/health: it showed some objects collected the first time, then 0 for subsequent requests.
Next I ran curl localhost:8888/test and, right after, curl localhost:8888/health again: it showed that a lot of objects were collected.
That seemed weird: why are these objects not removed if there should not be any references to them?
So in ContextHandler::on_finish I added a few lines:

        self.context.filters_factory = None
        self.context.metrics = None
        self.context.modules = None
        self.context.thread_pool = None
        self.context.request = None
        self.context.transformer = None
        self.context = None

and that solved the issue for my simple /test request.

The problem with the context is that thumbor creates a Context object when it starts the Server, and then passes the Server's Context object to each ContextHandler descendant. ContextHandler, in initialize, picks some values from the Server's Context and creates a new Context, which also holds a reference to the request_handler.
Now the handler's context has a reference to the request_handler, and the request_handler has a reference to the context. On top of that, Context.modules contains a list of module objects, each of which holds a reference to the context.

It could all be solved by clearing self.context in the request handler's on_finish method. But... when result_storage is in use, the request handler, once done with image operations, serves the result image to the client, finishes the request (which calls on_finish) and then schedules a task on the IOLoop to store the result bytes in the result storage. If I clear self.context in on_finish, the result_storage module, which holds a reference to the same context, will fail when saving the result data, since it uses context.request.url as the filepath to store the data under, and the context object will already have been cleared at that point.

So this shows a couple of issues. The first is that thumbor creates a Context (which initializes modules, config and such) and then ContextHandler creates its own Context. Not a big deal in terms of leaks, but it is odd that it re-creates the same objects again, most of which do not depend on request-specific parameters.
The second is that this context object has circular references to many of the objects inside it.

I'm not quite sure how to solve that interleaved-references issue.
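
For readers less familiar with CPython internals: cycles like this are exactly what plain reference counting cannot free; they sit around until the cycle collector runs, which is why a manual gc.collect() looks like a fix. A tiny self-contained illustration (class names are made up; only the shape mirrors the Context/handler relationship described above):

import gc


class Context(object):
    def __init__(self):
        self.request_handler = None


class Handler(object):
    def __init__(self):
        self.context = Context()
        self.context.request_handler = self  # cycle: handler <-> context


gc.disable()  # mimic relying on reference counting alone
h = Handler()
del h
# The pair is now unreachable, but reference counting alone never frees it;
# only the cycle collector reclaims it:
print(gc.collect())  # prints a number > 0
gc.enable()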

@kkopachev
Member

It turns out that I can't remove self.context.request, since the request is used by result_storage, but I can unset self.context.request.engine, which holds a reference back to the context. That still lets the result storage figure out the request parameters, while making it possible to free the otherwise-unreferenced objects.
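
As a sketch, that amounts to something like the following in the handler (attribute names follow the comment above; the real class derives from thumbor's own handlers, so treat this as an outline rather than the shipped change):

import tornado.web


class ContextHandler(tornado.web.RequestHandler):  # outline only
    def on_finish(self):
        context = getattr(self, 'context', None)
        if context is not None and context.request is not None:
            # Drop only the engine, which holds a reference back to the
            # context; context.request itself must stay alive because
            # result_storage still reads context.request.url when it saves
            # the result bytes later on the IOLoop.
            context.request.engine = None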

@Kmaschta mentioned this issue Jan 5, 2018
@marceloboeira
Contributor

My two cents:

My experience is similar to @savar

In normal operation the server's memory usage is stable, distributed across 8 pods.

e.g.: (memory over an hour)

screen shot 2018-01-15 at 15 03 33

screen shot 2018-01-15 at 15 03 22

However, even though we have almost no requests during the night, memory consumption doesn't go down over time, which leads me to believe the GC is not working properly, or something like that. I only see a memory reduction if I force a recreation of the machines, which also happens during deployments.

e.g.: (deployment)

screen shot 2018-01-15 at 16 01 49

@heynemann
Member

@marceloboeira Have you tried the new release? It should improve things. Please let me know if you do.

@marceloboeira
Contributor

@heynemann I did, I also set the custom GC and now it’s better.


@heynemann
Member

That's good to know! :)

@heynemann
Member

Closing this issue since we had no activity for a while in the interest of keeping our Issues list leaner. If this is something we still need to pursue, please reopen and I'll look into it. Thanks!
