Memory Usage Explosion #753
Comments
We're seeing something similar; we haven't confirmed what is causing it, so we just shoot and reboot right now.
I have a similar problem, and the actual usage is relatively low.
We are also experiencing this issue. Normal scaling activity usually cycles the affected nodes, but we see this fairly regularly. We are going to tune down our LRU threshold for Redis memory usage to confirm further. We currently have Redis tuned to use 80% of available memory, and see a consistent leak from there. Notice the spike to ~80% and then the slow leak. Not all instances become affected by this, either.
There is also a fresh bug report in the Pillow repo: python-pillow/Pillow#2019
I'm going to spend some more time downgrading through different updates to try to find the start of this issue. We are regularly seeing memory-leak issues causing scaling events, followed by manual intervention to remove the maxed-out nodes. This shows a slow memory climb and then a burst of new nodes being put into service as the existing nodes' performance degrades due to memory contention:
Okay, I think we've confirmed that a forced garbage collection keeps memory under control.

Testing methodology

Thumbor config:

Results

With the patch applied, memory usage stays flat. At least now we have somewhere to look more specifically. The non-patched instances keep leaking.
@damaestro I included your patch from damaestro@f6cba5e, and I'm seeing a drop in memory usage of about 60%. Have you considered a PR for this?
@thejase it's a nasty hack to help track down where the leak is. By no means is it a fix. Something in the request path is not being released. My suspicion is file handle leaks from PIL, or for some reason objects are only freed when a collection is forced.
That's a fair response. Thanks for further insight from your experiences thus far.
Bumped into this while trying to run some JMeter performance tests. Note that I was still running into OOMs even though I was not hitting the
I'll add to this: we're currently running with this patch in production, and it causes a marked increase in CPU load, such that we've had to double the number of worker instances we normally require.
Improved the patch listed above to run garbage collection less often. Garbage collection is still triggered by finish_request() inside handlers, but now it will not run more often than defined by the GC_INTERVAL config variable. By default it runs at most once every 60 seconds, or less often, depending on the number of requests coming in.
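For illustration, here's a minimal sketch of that throttling idea, assuming a GC_INTERVAL expressed in seconds; the function name and module-level state are my own, not the actual patch:

```python
# Sketch only: gc.collect() still piggybacks on request completion, but is
# skipped unless GC_INTERVAL seconds have passed since the last collection.
import gc
import time

GC_INTERVAL = 60  # seconds; the real patch reads this from the thumbor config
_last_collect = 0.0

def maybe_collect():
    """Run gc.collect() at most once every GC_INTERVAL seconds."""
    global _last_collect
    now = time.time()
    if now - _last_collect >= GC_INTERVAL:
        _last_collect = now
        return gc.collect()
    return 0

# call maybe_collect() from finish_request(); actual collection frequency then
# still depends on how many requests come in.
```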
@damaestro Did you figure out in which version of Thumbor/Pillow this problem began to appear? Thanks.
@okor no, I didn't. I unfortunately got distracted and just took the CPU hit and use my patch. I'm glad to see you taking an interest in this. It's been a long-standing issue.
@damaestro So you did actually see an increase in CPU? In my local/staging/production environments I didn't actually notice any increase in CPU. Could just be our setup, I guess. @mvdklip Did you notice a significant change in CPU usage after applying your patch?
@scorphus What would you think of releasing an official patch for this issue? If we can't find a better solution, I think it would be a practical decision to include some variant of the manual GC. And as expectations go, I think people would largely be happier to give up a little CPU than to have thumbor nodes swapping/crashing/etc. I deployed a patch yesterday on our cluster, to one of our nodes. Here are the results. Before the deploy, the large changes in memory usage were from manual restarts :/
@okor: I think we can include these patches in the upcoming release. I'm reviewing all changes since the last release, 6.2.1, and will review this as well.
@okor nothing terrible, but there is some overhead. It would be nice to find a more appropriate place to do the gc.collect() call.
@damaestro I am certainly open to suggestions. Did you have somewhere else in mind? I'm honestly not sure where would be best; I just used the location that @mvdklip used: nrcmedia@59b0595#diff-96c1f5e9855ad12779e42ade3a0c99e1R140
Likely the path of least resistance is to have a tunable, such as the proposed GC_INTERVAL. Where I hesitate is that this regression was introduced at some point. This is a side effect of a bug in another location, so I've been waiting for the actual bug to be identified. However, this would not be the first time software needs to be defensive due to an external defect.
Just for fun, I dug a little deeper, and it's a bit weirder than I thought it would be. According to the gc module there is no leak. All objects are "collectable". Logging objects before, at, and after a manual gc, everything looks great. The only behavior that is even slightly suspicious is that when I hit Thumbor with a lot of concurrent requests, the number of objects which can and will be gc'ed grows more than it would if I just sent one request at a time. For debugging purposes, I put the gc call at the end of the request handler.

If I were to suspect anything... I bet the changes between Thumbor ~5.2 and ~6, which if I recall were in part "more async / seriously not blocking this time", are possibly just giving the illusion that there is a leak. And perhaps the garbage collector just isn't tuned by default in a way that the newer, less blocking Thumbor needs. I think this is what I will believe unless I see evidence to the contrary. I'll keep an eye on Thumbor mem/cpu in production over the next couple of weeks. If my theory is correct then Thumbor should stay fairly flat. If anyone else has info/stats to share, please don't be shy :)
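If anyone wants to repeat that kind of inspection, here's a rough sketch of the gc logging described, using only the standard gc module (my own illustration, not the exact instrumentation used above):

```python
import gc

def log_gc_state(label):
    # get_count(): current per-generation collection counters;
    # get_objects(): every object currently tracked by the collector.
    print('%s: counts=%s tracked=%d' % (label, gc.get_count(), len(gc.get_objects())))

log_gc_state('before')
collected = gc.collect()
print('collected=%d uncollectable=%d' % (collected, len(gc.garbage)))
log_gc_state('after')
```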
@damaestro I super appreciate the suggestions and feedback 😄
I'm not sure that's actually the case. If you are referring to the Pillow issue in which unclosed files were raising an error/warning that was filling up some buffer (lol, that bug, ping pong between parties)... I don't think it can be the case for me, at least. I tried to replicate the problem quite a few times using the same packages as production in several small scripts. I saw no indication this was an issue with Python 2.7 and Pillow 3.4.1/3.4.2, and one of the Pillow maintainers was only able to reproduce the issue with Python 3+. So as far as I can tell this is on Thumbor. I'm happy to be proven wrong, of course.
@okor Just trying to help. I've kept my eye on this for a while, so I'm glad you are engaged with figuring this one out; I ran out of steam. I was not suggesting where the memory hog is located (inside or outside of Thumbor), just that we are discussing a portion of code that I don't think is causing the issue. It goes back to the question of "where is the best place to force a garbage collection?". It does seem that we have a delta to look at, that being Thumbor 5.x to 6.x. Where this gets challenging is the external dependencies and build deps. Someone with time should be able to increment through from a known non-memory-issue version and just keep testing with the methodologies proposed here. Where I would start is the last known version that does not show this memory profile, and rather than upgrading Thumbor, upgrade everything it depends on. If we can validate that the old version works with all the latest deps (yes, I assume we'd have to backport certain api/abi interfaces, but this could be managed), we can at least isolate whether this is something in Thumbor or not. Minimally, we'd find that a backported patch is the cause.
@okor Yes, memory usage dropped (and stabilized) significantly after applying the patch.
People on this thread are encouraged to leave comments on that PR ^
Hi, guys. This problem will be fixed in version 6.3.3.
@SergioJorge how was it fixed?
In the PIL engine, the warnings module is called for each request. That makes the list of warning filters grow without bound.
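To make that concrete, here's a small illustration of the mechanism (my own sketch with a stand-in warning class, not the thumbor code): on Python 2.7, warnings.simplefilter() inserts a new entry into warnings.filters on every call without de-duplicating, so calling it per request grows warnings.filters without bound and makes each subsequent call slower.

```python
import warnings

class FakeBombWarning(UserWarning):
    """Stand-in for PIL's DecompressionBombWarning."""

for _ in range(5):
    # what the engine effectively did once per request:
    warnings.simplefilter('error', FakeBombWarning)
    print(len(warnings.filters))  # grows by one per call on Python 2.7
                                  # (newer Pythons de-duplicate the entry)

# the fix is to install the filter once, at import/startup time, not per request
```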
Sorry to close this issue. I'm still not very accustomed to this development flow. Can you test to see if that solves your problems?
Happy to, but it'll likely be Monday for me. Any idea when 6.3.3 will ship?
Let's wait for the community to see if this solution solves the problem.
That change definitely helps with memory usage, but I'm not sure it was the cause of the original issue.

```python
import gc
import warnings
import resource
from PIL import Image  # needed for DecompressionBombWarning

for i in range(1000000):
    if (i % 1000 == 0):
        if (i % 10000 == 0):
            collected = gc.collect()
            print('Collected %s' % collected)
        print("Used %d" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    # simplefilter is called on every iteration, mimicking a per-request call
    warnings.simplefilter("error", Image.DecompressionBombWarning)
```

RSS grows over time and each loop becomes slower than the previous one, but gc collects 0 objects and does not make any difference; e.g. removing gc.collect() does not change anything.
We've tested this in production (100+ instances) and the RAM usage looks better, but it's not perfect yet! We see processes going up to 500 MB of RAM, where our OOM killer kicks in and reboots them. This definitely seems to have slowed down a lot with this commit, but it's still happening a bit.
https://github.com/stackimpact/stackimpact-python might be useful to test what is happening.
Any progress here? We run a trivial cluster (20 instances), so I'm not sure if we're the best load testers, but we can set up some benches if needed.
We're running a few hundred nodes and are still seeing problems (sha: 2e9b9095b5413eb6108030f67705f992236227dc). It does, however, seem better, but it's still not fixed.
Mat
I dug around with the garbage collector disabled and a TestHandler which runs gc.collect() manually.

```python
# server.py
import gc
gc.disable()
gc.set_debug(gc.DEBUG_LEAK)
```

```python
# app.py
import tornado.web
# BaseHandler and ContextHandler are thumbor's handler classes

class TestHandler(BaseHandler):
    def get(self):
        self.write('{} objects collected'.format(gc.collect()))

class HealthHandler(ContextHandler):  # ContextHandler is important here
    @tornado.web.asynchronous
    def get(self):
        self.finish('works')

# handlers added like this:
# (r'/health', HealthHandler, {'context': self.context}),
# (r'/test', TestHandler),
```

And ran curl against both endpoints. Clearing the context references by hand:

```python
self.context.filters_factory = None
self.context.metrics = None
self.context.modules = None
self.context.thread_pool = None
self.context.request = None
self.context.transformer = None
self.context = None
```

and that solved the issue for my simple HealthHandler. The problem with the context is that thumbor creates a Context object when it starts the Server, and then ContextHandler creates its own Context for each request.

It could all be solved by clearing self.context in the on_finish method of the request handler. But... when result_storage is in use, the request handler, once it is done with the image operations, serves the result image to the client, finishes the request (which calls on_finish), and then schedules a task on the IOLoop to store the result bytes in the result storage. If I clear self.context in on_finish, that scheduled task no longer has the context it needs.

So, this shows a couple of issues. The first is that thumbor creates a Context (which initializes modules, config and such) and then ContextHandler creates its own Context. Not a big deal in terms of leaks, but it's weird that it re-creates the same objects again, which do not necessarily depend on request-specific parameters. I'm not quite sure how to solve that interleaved-references issue.
Turns out that I can't remove
My two cents: my experience is similar to @savar's. In normal operation the server's memory usage is stable, distributed across 8 pods, e.g. (memory over an hour). However, over time, even though we have almost no requests during the night, memory consumption doesn't go down, which leads me to believe the GC is not working properly or something like that. I only see a memory reduction if I force a recreation of the machines, which also happens during deployments, e.g. (deployment).
@marceloboeira Have you tried the new release? It should improve things. Please let me know if you do.
@heynemann I did. I also set the custom GC and now it's better.
That's good to know! :)
Closing this issue since we've had no activity for a while, in the interest of keeping our issues list leaner. If this is something we still need to pursue, please reopen and I'll look into it. Thanks!
We are experiencing a memory usage explosion on Ubuntu 14.04.1 LTS with Thumbor v6.0.1 and Pillow 3.2.0. Our memory usage can hit ~97%. Then it returns to ~60% and after some time it hits ~97% again.
After some investigation, there is a possibility that the `resize` method in `thumbor/engines/pil.py` is the root cause. If we run `gc.collect()` in the `resize` method, our memory usage never hits 97% again.
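Roughly, the workaround looks like the sketch below; this only illustrates where the `gc.collect()` call goes, the class and attributes are assumptions and not the actual thumbor/engines/pil.py code (Image.ANTIALIAS is the Pillow 3.x-era constant).

```python
import gc
from PIL import Image

class PilEngineSketch(object):
    def __init__(self, image):
        self.image = image  # a PIL.Image.Image instance

    def resize(self, width, height):
        self.image = self.image.resize((int(width), int(height)), Image.ANTIALIAS)
        gc.collect()  # workaround: force a collection after each resize
```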
Thumbor request URL
Sample url:
http://localhost:7001/unsafe/1920x1080/filename.jpg
Script to reproduce
Expected behaviour
Memory usage should stay below 90%.
Actual behaviour
Memory usage hits ~97%.
Operating system
Ubuntu 14.04.1 LTS
Thumbor v6.0.1
Pillow 3.2.0
thumbor.conf