Memory Usage Explosion #753

Closed · kuntoaji opened this issue Jun 13, 2016 · 45 comments

@kuntoaji

We are experiencing a memory usage explosion on Ubuntu 14.04.1 LTS with Thumbor v6.0.1 and Pillow 3.2.0. Memory usage climbs to ~97%, drops back to ~60%, and after some time climbs back to ~97% again.

After some investigation, we suspect the resize method in thumbor/engines/pil.py is the root cause: if we run gc.collect() in the resize method, our memory usage never hits 97% again.
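
The workaround we tried looks roughly like this (a minimal sketch, not thumbor's actual engine class; Image.ANTIALIAS matches the Pillow 3.x API we run):

import gc

from PIL import Image


class PilEngineSketch(object):
    """Illustration only -- not thumbor's real PIL engine."""

    def __init__(self, image):
        self.image = image  # a PIL.Image.Image instance

    def resize(self, width, height):
        self.image = self.image.resize((int(width), int(height)), Image.ANTIALIAS)
        # Workaround: force a full collection after each resize so garbage
        # left over from finished requests is reclaimed immediately instead
        # of piling up until the default GC thresholds are reached.
        gc.collect()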

Thumbor request URL

Sample URL: http://localhost:7001/unsafe/1920x1080/filename.jpg

Script to reproduce

#! /usr/bin/env ruby
# Hammer thumbor with randomly sized resize requests for a single image
# and report per-request and total timing.

require 'open-uri'

@total = 0
count = 100000

(5000..(5000+count)).each do |i|
  size = 2000 + rand(2000)
  wxh = [size, size].join('x')
  url = "http://localhost:7001/unsafe/#{wxh}/filename.jpg"
  start_time = Time.now
  open url
  end_time = Time.now
  duration = end_time - start_time
  @total += duration
  puts "got: #{url} in #{duration.round(2)}"
end

puts "That took: #{@total.round(2)} seconds"
puts "Average #{(@total / count).round(2)} per image"

Expected behaviour

Memory usage should stay below 90%

Actual behaviour

Memory usage hits ~97%

Operating system and versions

Ubuntu 14.04.1 LTS
Thumbor v6.0.1
Pillow 3.2.0

thumbor.conf

MAX_AGE = 31536000
ALLOW_ANIMATED_GIFS = False
LOADER = 'thumbor.loaders.file_loader'

STORAGE = 'thumbor.storages.mixed_storage'
SECURITY_KEY = 'example'

ALLOW_UNSAFE_URL = True
FILE_LOADER_ROOT_PATH = '/vagrant/images'

REDIS_STORAGE_SERVER_HOST = 'example.com'
REDIS_STORAGE_SERVER_PORT = 6379
REDIS_STORAGE_SERVER_DB = 0
REDIS_STORAGE_SERVER_PASSWORD = None

MIXED_STORAGE_CRYPTO_STORAGE = 'tc_redis.storages.redis_storage'
MIXED_STORAGE_DETECTOR_STORAGE = 'tc_redis.storages.redis_storage'

DETECTORS = [
   'thumbor.detectors.face_detector',
   'thumbor.detectors.profile_detector',
   'thumbor.detectors.glasses_detector',
   'thumbor.detectors.feature_detector'
]
@matclayton

We're seeing something similar. We haven't confirmed what's causing it; right now we just shoot and reboot the affected instances.

@lexx27

lexx27 commented Jun 19, 2016

I have a similar problem, even though actual usage is relatively low.

@damaestro

We are also experiencing this issue. Normal scaling activity usually cycles out the affected nodes, but we see it fairly regularly. We are going to lower the LRU threshold for Redis memory usage to investigate further.

screenshot from 2016-07-11 12-22-36

We currently have Redis tuned to use 80% of available memory and see a consistent leak from there. Notice the spike to ~80% and then the slow leak. Not all instances are affected by this, either.

@kkopachev
Member

There is also a fresh bug report in the Pillow repo: python-pillow/Pillow#2019

@damaestro

I'm going to spend some more time downgrading through different releases to try to find where this issue started. We regularly see memory leaks triggering scaling events, followed by manual intervention to remove the maxed-out nodes. The graph below shows a slow memory climb and then a burst of new nodes being put into service as the existing nodes' performance degrades due to memory contention:

screenshot from 2016-07-15 19-25-05

damaestro added a commit to damaestro/thumbor that referenced this issue Jul 19, 2016
damaestro referenced this issue in damaestro/thumbor Jul 19, 2016
@damaestro

Okay, I think we've confirmed that a gc.collect() does stabilize memory usage, but not in the resize method.

Testing methodology

  • randomly generated resize requests for one image, from 1 to 3000 px WxH
  • siege -b -i against those generated URLs
  • three AWS t2.medium instances behind an ELB running thumbor in a Docker container (4 GB of RAM per instance, build log: https://paste.fedoraproject.org/392667/96725214/)
  • Redis caching origin image requests (1 GB LRU, but only one image source being requested)
  • two thumbor processes, each running 8 threads, balanced via nginx
  • thumbor 6.1.2 vs. ~master plus damaestro@f6cba5e

Thumbor config:

ALLOWED_SOURCES = ['XXX_ORIGIN_XXX']
OPTIMIZERS = ['thumbor.optimizers.jpegtran']
SENTRY_DSN_URL = 'XXX_SENTRY_DSN_XXX'
STORAGE = 'tc_redis.storages.redis_storage'
REDIS_STORAGE_IGNORE_ERRORS = True
REDIS_STORAGE_SERVER_PORT = 6379
REDIS_STORAGE_SERVER_HOST = 'localhost'
REDIS_STORAGE_SERVER_DB = 0
REDIS_STORAGE_SERVER_PASSWORD = None
ENGINE_THREADPOOL_SIZE = 8
# Custom error handling is currently broken.
#USE_CUSTOM_ERROR_HANDLING = True
ERROR_HANDLER_MODULE = 'thumbor.error_handlers.sentry'
MAX_WIDTH = 3000
MAX_HEIGHT = 3000
MAX_SOURCE_SIZE = 3000
QUALITY = 90

Results

Without gc.collect():
screenshot from 2016-07-19 16-22-37

With patch to gc.collect() in Engine.read() from the pil interface:
screenshot from 2016-07-19 15-51-09

At least now we have somewhere more specific to look. The run without gc.collect() would eventually run out of memory; we just ended the test within the same timeframe as the patched test.

@JawsomeJason

@damaestro I included your patch at damaestro@f6cba5e, and I'm seeing a drop in memory usage by about 60%. Have you considered a PR for this?

@damaestro

@thejase it's a nasty hack to help track down where the leak is; by no means is it a fix. Something in Engine.read() is leaking, and that is all this attempts to show. I'm having to run this patch because otherwise we are constantly OOMing workers and auto-scaling. I figured others are running into the same issue, so a quick patch to keep things working while the upstream devs look into this was my plan.

My suspicion is file handle leaks from PIL, or that for some reason img_buffer (a local BytesIO) is leaking and just needs a del or some other magic. I've tested none of this beyond basic profiling and forcing gc.collect().
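
For anyone following along, the hack boils down to something like this (a rough sketch of the idea behind damaestro@f6cba5e, not the actual patch; the helper name is made up):

import gc
from io import BytesIO


def read_with_forced_gc(pil_image, fmt='JPEG', quality=90):
    # Serialize the image the way the engine's read() does, via a local
    # BytesIO buffer...
    img_buffer = BytesIO()
    pil_image.save(img_buffer, fmt, quality=quality)
    results = img_buffer.getvalue()
    img_buffer.close()
    # ...then force a collection so whatever read() leaves behind (buffers,
    # cycles, file handles) is reclaimed per request instead of accumulating.
    gc.collect()
    return results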

@JawsomeJason

That's a fair response. Thanks for further insight from your experiences thus far.

@mvdklip
Contributor

mvdklip commented Oct 20, 2016

Bumped into this while trying to run some JMeter performance tests. Note that I was still running into OOMs even though I was not hitting the Engine.read() code path (I was doing meta requests with file storage enabled). I patched it temporarily like so:

diff --git a/thumbor/handlers/__init__.py b/thumbor/handlers/__init__.py
index 61ea916..c18ffdc 100644
--- a/thumbor/handlers/__init__.py
+++ b/thumbor/handlers/__init__.py
@@ -8,6 +8,7 @@
 # http://www.opensource.org/licenses/mit-license
 # Copyright (c) 2011 globo.com timehome@corp.globo.com

+import gc
 import sys
 import functools
 import datetime
@@ -311,6 +312,10 @@ class BaseHandler(tornado.web.RequestHandler):
                 if should_store:
                     self._store_results(context, results)

+                collected = gc.collect()
+                if collected > 10:
+                    logger.warn('Garbage collector: collected %d objects.' % collected)
+
         self.context.thread_pool.queue(
             operation=functools.partial(self._load_results, context),
             callback=inner,

mvdklip added a commit to nrcmedia/thumbor that referenced this issue Oct 20, 2016
@philipbjorge

philipbjorge commented Nov 23, 2016

I'll add to this: we're currently running with this patch in production, and it causes a marked increase in CPU load, such that we've had to double the number of worker instances we normally require.

@mvdklip
Contributor

mvdklip commented Nov 24, 2016

I've improved the patch listed above to run garbage collection less often. Garbage collection is still triggered by finish_request() inside the handlers, but now it will not run more often than defined by the GC_INTERVAL config variable. The default is to run at most once every 60 seconds (less often if few requests come in).
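
Conceptually, the rate limiting works like this (a sketch only; the names are illustrative, not the exact code in the patch):

import gc
import time

GC_INTERVAL = 60  # seconds; the configurable knob described above

_last_gc = 0.0


def maybe_collect():
    # Called from finish_request(); runs gc.collect() at most once per
    # GC_INTERVAL no matter how many requests come in.
    global _last_gc
    now = time.time()
    if now - _last_gc < GC_INTERVAL:
        return 0
    _last_gc = now
    return gc.collect()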

@okor
Member

okor commented Mar 3, 2017

@damaestro Did you figure out at what version of Thumbor/Pillow this problem began to appear? Thanks

@damaestro

@okor no, I didn't. I unfortunately got distracted and just took the CPU hit and kept using my patch. I'm glad to see you taking an interest in this; it's been a long-standing issue.

@okor
Member

okor commented Mar 7, 2017

@damaestro So you did actually see an increase in CPU? In my local/staging/production environments I didn't actually notice any increase in CPU. Could just be our setup, I guess.

@mvdklip Did you notice a significant change in CPU usage after applying your patch?

@okor
Member

okor commented Mar 7, 2017

@scorphus What would you think of releasing an official patch for this issue? If we can't find a better solution, I think it would be a practical decision to include some variant of the manual GC. As expectations go, I think people would largely be happier to give up a little CPU than to have thumbor nodes swapping/crashing/etc.

I deployed a patch yesterday on our cluster, to one of our nodes. Here are the results:
thumbor_memory_post_gc

Before the deploy, the large changes in memory usage were from manual restarts : /

@scorphus
Member

scorphus commented Mar 7, 2017

@okor: I think we can include these patches in the upcoming release. I'm reviewing all changes since the last release, 6.2.1, and will review this as well.

@damaestro

@okor nothing terrible, but there is some overhead. It would be nice to find a more appropriate place to do the gc.collect(), more dynamically and less often.

@okor
Member

okor commented Mar 8, 2017

@damaestro I am certainly open to suggestions. Did you have somewhere else in mind?

I'm honestly not sure where would be best; I just used the location that @mvdklip used: nrcmedia@59b0595#diff-96c1f5e9855ad12779e42ade3a0c99e1R140

@damaestro

Likely the path of least resistance is to have a tunable, such as the proposed GC_INTERVAL configuration option, with a high default. If we got fancy, it would need to be based on a governor that watches active vs. available memory. Scheduling something for now based on a configurable interval seems like a good short-term compromise.

Where I hesitate is that this behavior was introduced at some point. It's a side effect of a bug in another location, so I've been waiting for the actual bug to be identified. However, this would not be the first time software needs to be defensive due to an external defect.

@okor
Member

okor commented Mar 8, 2017

Just for fun, I dug a little deeper, and it's a bit weirder than I thought it would be. According to the gc module there is no leak: all objects are "collectable". Logging objects before, at, and after a manual gc, everything looks great.

The only behavior that is even slightly suspicious is that when I hit Thumbor with a lot of concurrent requests, the number of objects which can and will be gc'ed grows more than it would if I just sent one request at a time. For debugging purposes, I put the gc at the end of the finish_request method. So it makes sense that some objects will pile up from multiple unfinished requests, since Tornado is using a non-blocking event loop.

If I were to suspect anything, I'd bet that the changes between thumbor ~5.2 and ~6, which if I recall were in part "more async / seriously not blocking this time", are possibly just giving the illusion that there is a leak, and that the garbage collector simply isn't tuned by default in a way that the newer, less blocking Thumbor needs. I think this is what I will believe unless I see evidence to the contrary.
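
If that theory holds, the collector's generation thresholds are the obvious knob to experiment with (the values below are illustrative, not a recommendation):

import gc

# CPython defaults are (700, 10, 10): a generation-0 collection runs once
# allocations minus deallocations of container objects exceed 700.
print(gc.get_threshold())

# Collect more aggressively so garbage from many in-flight, non-blocking
# requests is reclaimed sooner, trading a little CPU for a flatter RSS.
gc.set_threshold(200, 5, 5)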

I'll keep an eye on Thumbor mem/cpu in production over the next couple of weeks. If my theory is correct then Thumbor should stay fairly flat. If anyone else has info/stats to share, please don't be shy : )

@okor
Member

okor commented Mar 8, 2017

@damaestro I super appreciate the suggestions and feedback 😄

This is a side-effect of a bug in another location so I've been waiting for the actual bug to be identified.

I'm not sure that's actually the case. If you are referring to the Pillow issue in which unclosed files were raising an error/warning that was filling up some buffer (lol that bug, ping pong between parties), I don't think it can be the case for me, at least.

I tried to replicate the problem quite a few times using the same packages as production in several small scripts. I saw no indication this was an issue with Python 2.7 and Pillow 3.4.1/3.4.2, and one of the Pillow maintainers was only able to reproduce the issue with Python 3+. So as far as I can tell this is on Thumbor. I'm happy to be proven wrong, of course.

@damaestro

damaestro commented Mar 8, 2017

@okor Just trying to help. I've kept my eye on this for a while, so I'm glad you are engaged with figuring this one out; I ran out of steam. I was not suggesting where the memory hog is located (inside or outside of Thumbor), just that we are discussing a portion of code that I don't think is causing the issue. It goes back to the question of "where is the best place to force a garbage collection?". It does seem that we have a delta to look at, that being Thumbor 5.x to 6.x. Where this gets challenging is the external dependencies and build deps.

Someone with time should be able to step through versions, starting from a known non-memory-issue version, and keep testing with the methodologies proposed here. I would start with the last known version that does not show this memory profile and, rather than upgrading Thumbor, upgrade everything it depends on. If we can validate that the old version works with all the latest deps (yes, I'm assuming we'd have to backport certain API/ABI interfaces, but this could be managed), we can at least isolate whether this is something in Thumbor or not. Minimally, we'd find that a backported patch is the cause. </$0.02>

@mvdklip
Contributor

mvdklip commented Mar 8, 2017

@okor Yes, memory usage dropped (and stabilized) significantly after applying the patch.

@okor
Member

okor commented Apr 10, 2017

People on this thread are encouraged to leave comments on that PR ^

@savar
Contributor

savar commented Apr 10, 2017

image

This is the memory usage when using the OpenCV engine, so I guess it is not Pillow related?

(you see 8 thumbor instances and the drops are restarts during a deploy)

@okor
Member

okor commented Apr 10, 2017

@savar yeah, I couldn't find any correlation with Pillow, as I explained in PR #905

@SergioJorge
Member

Hi, guys.

This problem will be fixed in version 6.3.3.
Thank you all !

@kkopachev
Member

@SergioJorge how was it fixed?

@SergioJorge
Member

@kkopachev

In the PIL engine, the warnings module is called on each request, which makes the list of warning filters grow indefinitely.
You can see the code here.
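
In other words, the leak was the filter list itself: on Python 2 (which thumbor 6.x runs on), warnings.simplefilter() inserts a new entry into warnings.filters on every call without de-duplicating, so calling it per request grows the list forever. A sketch of the shape of the fix (not the actual commit; create_image here is a hypothetical stand-in for the engine method):

import warnings
from io import BytesIO

from PIL import Image

# Fixed: register the filter once, at import time.
warnings.simplefilter('error', Image.DecompressionBombWarning)


def create_image(buffer):
    # Previously something like the commented line below ran here on every
    # request, adding yet another identical entry to warnings.filters:
    # warnings.simplefilter('error', Image.DecompressionBombWarning)
    return Image.open(BytesIO(buffer))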

@SergioJorge SergioJorge reopened this Jun 23, 2017
@SergioJorge
Member

Sorry for closing this issue; I'm still not very accustomed to this development flow.
Can you test to see whether that solves your problems?

@matclayton

matclayton commented Jun 23, 2017 via email

@SergioJorge
Member

@matclayton

Let's wait for the community to see if this solution solves this problem.

@kkopachev
Member

That change definitely helps with memory usage, but I'm not sure it was the cause of the original issue.
Some people report above that the issue also happens when they use the OpenCV engine, which probably does not involve loading the PIL engine (not sure here). The other reason is that there are reports that a manual gc run fixes (mitigates) the issue, so I created this little test script to reproduce the issue and see if gc can help:

import gc
import warnings
import resource

from PIL import Image

for i in range(1000000):
    if (i % 1000 == 0):
        if (i % 10000 == 0):
            collected = gc.collect()
            print('Collected %s' % collected)
        print("Used %d" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    warnings.simplefilter("error", Image.DecompressionBombWarning)

RSS grows over time and each loop becomes slower than the previous one, but gc collects 0 objects and does not make any difference, i.e. removing gc.collect() does not change anything.
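
That matches the filter list simply growing: every entry added by simplefilter() stays reachable through warnings.filters, so it is not garbage and gc.collect() correctly finds nothing to free. On Python 2 this is easy to see (Python 3 de-duplicates, so the count stays flat there):

import warnings

from PIL import Image

for _ in range(5):
    warnings.simplefilter('error', Image.DecompressionBombWarning)
    # Grows by one per iteration on Python 2: simplefilter() there just does
    # filters.insert(0, item) with no duplicate check.
    print(len(warnings.filters))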

@matclayton

We've tested this in production (100+ instances) and the RAM usage looks better, but it's not perfect yet. We see processes going up to 500 MB of RAM, at which point our OOM killer kicks in and reboots them. This definitely seems to have slowed down a lot with this commit, but it's still happening a bit.

@masom
Contributor

masom commented Jun 27, 2017

https://github.com/stackimpact/stackimpact-python might be useful to test what is happening

@aabhassharma

Any progress here? We run a trivial cluster (20 instances), so I'm not sure we're the best load testers, but we can set up some benches if needed.

@matclayton

matclayton commented Aug 23, 2017 via email

@kkopachev
Member

I dug around with the garbage collector disabled and a TestRequestHandler that runs gc.collect() and outputs the number of objects collected, and found an interesting thing.
I created 2 handlers:

# server.py
import gc
gc.disable()
gc.set_debug(gc.DEBUG_LEAK)

# app.py (BaseHandler and ContextHandler are thumbor's own handler base
# classes; tornado.web is already imported there)
class TestHandler(BaseHandler):
    def get(self):
        self.write('{} objects collected'.format(gc.collect()))

class HealthHandler(ContextHandler):  # ContextHandler is important here
    @tornado.web.asynchronous
    def get(self):
        self.finish('works')

# handlers added like this:
(r'/health', HealthHandler, {'context': self.context}),
(r'/test', TestHandler),

Then I ran curl localhost:8888/health: it showed some objects collected the first time, then 0 for subsequent requests.
Next I ran curl localhost:8888/test and, right after, curl localhost:8888/health again: it showed that a lot of objects were collected.
That seemed weird: why are these objects not removed if there should not be any references to them?
So in ContextHandler::on_finish I added a few lines:

        self.context.filters_factory = None
        self.context.metrics = None
        self.context.modules = None
        self.context.thread_pool = None
        self.context.request = None
        self.context.transformer = None
        self.context = None

and that solved the issue for my simple /test request.

The problem with the context is that thumbor creates a Context object when it starts the Server, and then passes the Server's Context object to each ContextHandler descendant. ContextHandler, in initialize, picks some values from the Server's Context and creates a new Context, which also holds a reference to the request_handler.
Now the handler's context has a reference to the request_handler, and the request_handler has a reference to the context. On top of that, Context.modules contains a list of module objects, each of which holds a reference to the context.

It could all be solved by clearing self.context in the request handler's on_finish method. But... when result_storage is in use, the request handler, once done with image operations, serves the result image to the client, finishes the request (which calls on_finish) and then schedules a task on the IOLoop to store the result bytes in the result storage. If I clear self.context in on_finish, the result_storage module, which holds a reference to the same context, will fail when saving the result data, since it uses context.request.url as the filepath to store the data under, and the context object will already have been cleared at that point.

So this shows a couple of issues. The first is that thumbor creates a Context (which initializes modules, config and such) and then ContextHandler creates its own Context. Not a big deal in terms of leaks, but it is odd that it re-creates the same objects again, most of which do not depend on request-specific parameters.
The second is that this context object has circular references to many of the objects inside it.

I'm not quite sure how to solve that interleaved-references issue.
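
For readers less familiar with CPython internals: cycles like this are exactly what plain reference counting cannot free; they sit around until the cycle collector runs, which is why a manual gc.collect() looks like a fix. A tiny self-contained illustration (class names are made up; only the shape mirrors the Context/handler relationship described above):

import gc


class Context(object):
    def __init__(self):
        self.request_handler = None


class Handler(object):
    def __init__(self):
        self.context = Context()
        self.context.request_handler = self  # cycle: handler <-> context


gc.disable()  # mimic relying on reference counting alone
h = Handler()
del h
# The pair is now unreachable, but reference counting alone never frees it;
# only the cycle collector reclaims it:
print(gc.collect())  # prints a number > 0
gc.enable()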

@kkopachev
Member

It turns out that I can't remove self.context.request, since the request is used by result_storage, but I can unset self.context.request.engine, which holds a reference back to the context. That still lets the result storage figure out the request parameters, while making it possible to free the otherwise-unreferenced objects.
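
As a sketch, that amounts to something like the following in the handler (attribute names follow the comment above; the real class derives from thumbor's own handlers, so treat this as an outline rather than the shipped change):

import tornado.web


class ContextHandler(tornado.web.RequestHandler):  # outline only
    def on_finish(self):
        context = getattr(self, 'context', None)
        if context is not None and context.request is not None:
            # Drop only the engine, which holds a reference back to the
            # context; context.request itself must stay alive because
            # result_storage still reads context.request.url when it saves
            # the result bytes later on the IOLoop.
            context.request.engine = None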

@Kmaschta mentioned this issue Jan 5, 2018
@marceloboeira
Contributor

My two cents:

My experience is similar to @savar

In normal operation the server's memory usage is stable, distributed across 8 pods.

e.g.: (memory over an hour)

screen shot 2018-01-15 at 15 03 33

screen shot 2018-01-15 at 15 03 22

However, even though we have almost no requests during the night, memory consumption doesn't go down over time, which leads me to believe the GC is not working properly, or something like that. I only see a memory reduction if I force a recreation of the machines, which also happens during deployments.

e.g.: (deployment)

screen shot 2018-01-15 at 16 01 49

@heynemann
Member

@marceloboeira Have you tried the new release? It should improve things. Please let me know if you do.

@marceloboeira
Contributor

@heynemann I did, I also set the custom GC and now it’s better.


@heynemann
Member

That's good to know! :)

@heynemann
Member

Closing this issue since we had no activity for a while in the interest of keeping our Issues list leaner. If this is something we still need to pursue, please reopen and I'll look into it. Thanks!
