Memory leak when using short lived threads #651

Open
alan opened this issue Oct 27, 2017 · 17 comments

@alan

alan commented Oct 27, 2017

In the last few days I've been chasing a memory leak that was introduced as part of the fix for #640. The possibility of this leak was actually discussed in the conversation on that fix, and we've now run into it.

We use Excon to keep a persistent connection to an API in a web application that runs with Puma as the server. The Puma configuration allows for a variable number of threads between a minimum and a maximum (this is how Puma runs by default). With the new change the app used all of the memory on the server and had to be killed.

Before the change the sockets for each connection were cached in thread-local variables, so when a Puma thread got destroyed it allowed the Excon sockets to be garbage collected. With the new version the cache is stored on the connection itself. When a thread is destroyed, the cache does not clear the sockets that it holds for that thread, which causes a leak of Excon sockets.
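To make the leak concrete, here is a toy sketch (illustrative only, not Excon's actual code) of a cache stored on the connection and keyed by thread object_id, as described above — entries for dead threads are never removed:

```ruby
class LeakyConnection
  def initialize
    @_excon_sockets = {} # lives as long as the connection object does
  end

  # Simplified stand-in for the post-#640 cache:
  # one entry per thread, keyed by the thread's object_id.
  def socket
    @_excon_sockets[Thread.current.object_id] ||= Object.new # stand-in socket
  end

  def cached_sockets
    @_excon_sockets.size
  end
end

conn = LeakyConnection.new
threads = 5.times.map { Thread.new { conn.socket } }
threads.each(&:join)
# All five threads are dead, but their cache entries remain:
conn.cached_sockets # => 5
```

With many short-lived server threads (as under Puma), those stale entries accumulate for the lifetime of the connection.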

I'm not sure what the preferred fix is; there are a couple of solutions I can think of:

I prefer the first option because it is simpler and leads to less confusion about how an Excon connection really works. The second solution seems a bit hackish to me.

However, I may be missing a use case where the cache is necessary or useful?

@geemus
Contributor

geemus commented Oct 27, 2017

Thanks for the detailed report. Maybe it is in fact time to pull the plug on the "automagic" caching across connections and limit it to just caching within the context of a particular connection. I believe that is what 1 means, correct?

The idea with the cache was that it would give you a better default behavior even if you were lazy about reusing connections (because initially at least, people were not managing connection pools/etc). That is much more a nice-to-have than a requirement though, especially now that we have seen clear downsides for advanced users that seem to greatly outweigh the potential nominal benefit for beginners. Less advanced usage should still work in this case, though it might be slightly less performant. And there is a clear path to improved performance as well. As such, dropping it seems safe/reasonable enough to me.

As a bit more of an aside, and to take this to the next level, I do sometimes wonder if we shouldn't consider providing some connection pooling functionality as part of the gem, in order to make it easier for advanced users to do advanced things. It may be something where people's needs differ enough to make a general solution hard, but I would be interested to hear more about how you are managing your pool and how that's working for you (this issue aside).

/cc @mpalmer (as implementor/discusser of #640, where we discussed the leak possibility and also potentially dropping the connection cache altogether).

@alan
Author

alan commented Oct 27, 2017

Thanks for the reply.

Yes, dropping the cache is what I meant. I'm happy to help with the PR if you want.

Setting up a connection pool has been pretty painless thanks to the connection_pool gem, which is what Sidekiq uses. You can see how we use it here. I guess its suitability depends on what features you want from an embedded connection pool... An issue we came across was handling EOFErrors, but I think we'd have to handle those even if we weren't using a connection pool.
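For readers unfamiliar with the pattern: the key property of a pool like the connection_pool gem is that the socket count is bounded by the pool size, regardless of how many threads share it. A minimal stdlib-only sketch of that idea (SizedQueue is core Ruby; the names here are illustrative, not the gem's API):

```ruby
# Minimal sketch of the pattern the connection_pool gem provides:
# a fixed number of connections handed out one per checkout.
class TinyPool
  def initialize(size:, &factory)
    @queue = SizedQueue.new(size)
    size.times { @queue.push(factory.call) }
  end

  def with
    conn = @queue.pop # blocks when all connections are checked out
    yield conn
  ensure
    @queue.push(conn) if conn
  end
end

# Stand-in for something like Excon.new(url, persistent: true)
created = 0
pool = TinyPool.new(size: 5) { created += 1; Object.new }

threads = 8.times.map do
  Thread.new { pool.with { |conn| conn } }
end
threads.each(&:join)
created # => 5: never more connections than the pool size
```

In real use, the connection_pool gem additionally handles checkout timeouts and reentrancy, which this sketch omits.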

@mpalmer
Contributor

mpalmer commented Oct 27, 2017

I live by, "first make it work, then make it fast". I think there's enough cases where things aren't working, so I'm in favour of making things a bit slower in order to get back to "working".

@alan
Author

alan commented Oct 30, 2017

By removing the cache, Excon will no longer be thread-safe by default. This is quite a big change; are you OK with it?

I think it is as long as developers are made aware that they need to manage concurrency explicitly.

There is another alternative, which is to use thread variables as was done before, but not let Excon connections share the socket per thread. It was said in the other issue that this would leak memory, but I don't see how unless threads are being leaked in the first place.

The downsides I see with this are the number of sockets open when using many threads, and that this behaviour is not obvious to the user of Excon. There is no upper bound like there is when using a connection pool.

@mpalmer
Contributor

mpalmer commented Oct 30, 2017

Removing connection caching does not imply doing nothing to support multi-threaded use, if that is a feature Excon wants to support. For myself, I'm ambivalent on the subject of magic, hidden, multi-threaded support in libraries, because the contortions required can often cause as many problems as they solve (cf. #640), and naive approaches to solving the problem can preclude reasonable uses (what if you want to allow a single connection to be used by multiple threads, with appropriate application-level coordination?).

My take is that if you're using threads, you're already setting yourself up to manage many shared resources, so as long as Excon clearly says in the documentation, "connection objects are not thread-safe, and must not be concurrently accessed by multiple threads without appropriate coordination", the bases have been sufficiently covered. (Backwards-compatibility concerns are a whoooole other kettle of worms which I'm not going to touch on).

All of this, of course, is just my opinion. I'm not an Excon maintainer, nor do I play one on TV, so at the end of the day the call isn't mine to make. I'd normally keep my thoughts to myself, but @geemus pinged me, so I released the Opinion Kraken. 😀

@geemus
Contributor

geemus commented Oct 31, 2017

Thanks for the details and opinions. I need to think harder on it when I am fresher (a bit fried at the end of my current day). Will try to re-read and make a call soonish. Thanks!

@geemus
Contributor

geemus commented Nov 2, 2017

I wonder if maybe thread-local, but non-shared is the right middle ground?

Previously: Thread.current[:_excon_sockets] ||= {}
Currently: @_excon_sockets[Thread.current.object_id] ||= {}

Maybe we should move to something more like this?

Thread.current[Thread.current.object_id] ||= {}

Or, to maybe better namespace and avoid other possible collisions (something I hadn't really fully considered with the earlier change), something like:

Thread.current[:_excon_sockets] ||= {}
Thread.current[:_excon_sockets][Thread.current.object_id] ||= {}

This gets rid of the wonky optimization causing the issues (each connection gets its own sockets), but I think it also separates sockets per thread. If I understand correctly this would also solve the memory leak issue, as when a thread dies it should be able to garbage collect its thread-local variables.

What do you both think? Am I missing anything? Thanks!

@alan
Author

alan commented Nov 3, 2017

What do you mean by "non-shared"? Not shared by connections or threads?

I think the Thread.current.object_id key does nothing in this case, because the hash is already namespaced by being stored in a thread-local variable.

Since the socket is referenced in a thread-local variable, other threads can't access it; the thread reading the variable is always the same one that wrote it.

In other words:

Thread.current[:_excon_sockets][Thread.current.object_id] ||= {}

would effectively be the same as:

Thread.current[:_excon_sockets] ||= {}

This would be the same as before the change that introduced the memory leak:

  • it fixes the memory leak when threads are garbage collected
  • sockets would be shared between connections in the same thread
  • a socket would be created for each thread that uses a shared connection object (for example, a connection pool of size 5 shared between 4 threads would end up creating 4 sockets, whereas currently 20 sockets are created).

Does that sound right? I could be misunderstanding your idea...
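[As a toy check of the equivalence claimed above (illustrative Ruby, not Excon's code), note that within a single thread Thread.current.object_id never changes, so the inner key adds nothing over the plain thread-local hash:]

```ruby
# Reset the thread-local for a clean demonstration.
Thread.current[:_excon_sockets] = {}

# The same thread always produces the same inner key, so both lookups
# resolve to one and the same hash:
inner = (Thread.current[:_excon_sockets][Thread.current.object_id] ||= {})
again = (Thread.current[:_excon_sockets][Thread.current.object_id] ||= {})

inner.equal?(again)                       # => true, always the same hash
Thread.current[:_excon_sockets].keys.size # => 1, only ever one key per thread
```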

@geemus
Contributor

geemus commented Nov 3, 2017

I think you are right. I knew what I meant, but apparently I was less on top of explaining it. I was thinking about keying off the object_id of the connection object instead of the thread's, which I guess would look more like:

Thread.current[:_excon_sockets][self.object_id] ||= {}

Does that (hopefully) make more sense?

@alan
Author

alan commented Nov 3, 2017

Yes, this would solve the issue of ensuring that a connection uses its own socket, and different threads get their own socket.

The danger is that if there are long-lived threads that use short-lived connection objects, then when a connection object is garbage collected the thread-local cache needs to be updated as well. Otherwise it would cause a memory leak with a different usage pattern.
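[A toy sketch of that inverse leak (illustrative only, not Excon's code): with a thread-local hash keyed by the connection's object_id, a long-lived thread accumulates one entry per short-lived connection, and nothing removes entries when connections are discarded:]

```ruby
class ShortLivedConnection
  # Sketch of the proposed cache: a thread-local hash keyed by the
  # connection's object_id.
  def socket
    cache = (Thread.current[:_excon_sockets] ||= {})
    cache[object_id] ||= Object.new # stand-in for a real socket
  end
end

Thread.current[:_excon_sockets] = {}
conns = 10.times.map { ShortLivedConnection.new }
conns.each(&:socket) # ten short-lived connections on one long-lived thread
# Even after the connections are discarded, their entries stay behind
# until the thread itself dies:
Thread.current[:_excon_sockets].size # => 10
```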

Another issue, at least for me, is that it is still surprising default behaviour that a connection ends up having multiple sockets per thread. When used with a connection pool in a threaded environment it means there are more sockets than necessary sitting idle, which increases the chance of the server timing out a connection and causing an EOFError on the following request.

Related to this, we've seen a big reduction in memory when we disabled this cache to reduce the number of Excon sockets in use (from 25 to 5):

[Screenshot: memory usage graph, taken 2017-11-03 15:05]

A reduction of about 400MB.

For me this is another reason to have users configure, depending on their needs, how they want to manage the HTTP connections.

Overall, the issue for me is that to get an optimal setup with a connection pool or in a threaded environment, one might have to disable thread_safe_sockets, which is undocumented and whose name is misleading depending on what one wants to do.

@geemus
Contributor

geemus commented Nov 6, 2017

Thanks for the detailed response, I think I followed most of it. One point I was struggling with though, how would a connection end up with multiple sockets per thread with the described implementation?

Also, for the short-lived connection on a long-lived thread case, I presume the problem is that the thread locals wouldn't get collected until the thread terminated?

Thanks for continuing to discuss and talk through this. I'm glad you got a good reduction there, though I agree it is awkward and surprising.

@alan
Author

alan commented Nov 6, 2017

Sorry, that was a mistake! I meant multiple sockets across threads.

Yes, that would be the issue. The cache needs to be updated when a connection object is garbage collected, otherwise the sockets will leak. I guess it is the inverse of what happens now, which is a leak with short-lived threads using long-lived connection objects.

Is the reason for the cache to make connections thread-safe, or is there another use case?

@geemus
Contributor

geemus commented Nov 9, 2017

I was going back in time to try and refresh my memory on the context of how we got to where we are today. I'll try to summarize and link (for my own memory but also to give you context).

I believe when I was implementing it early on, I made it thread-safe by default because it seemed like a good idea in the abstract (if there was a specific issue or use case I didn't find it).

Then, somewhere along the line, people started wanting to integrate with their own managed connection pools. To be cautious, we added the non-thread-safe mode as a non-default option (to avoid possible backwards-incompatible breakage).

Which is to say, more explicitly: I don't recall exactly why I made them thread-safe to begin with, but the trend has been toward unsafe ever since. Perhaps it's time we just make that larger change (perhaps with a more obvious version bump) and change the default and/or remove the optionality.

Does that help paint a clearer picture? Also, if we remove thread-safety requirements altogether, what does it look like? Do we just use the existing non-threadsafe branch as the default?

@alan
Author

alan commented Nov 13, 2017

Yes, that makes sense. I was just curious to know if there was a use case for that feature that I was missing.

To me it makes sense. Personally, I would go for the option of removing thread safety completely. Then, if there is demand for it, the feature can be discussed and implemented according to the actual needs.

@alan
Author

alan commented Nov 21, 2017

Any update on this?

@geemus
Contributor

geemus commented Nov 21, 2017

Not really, I'm afraid; I got wrapped up in a number of other things and haven't gotten back to it. Sorry about that. Hopefully I can revisit soon (though with the holiday looming, probably not until after that at least).

@stale

stale bot commented Jul 30, 2018

This issue has been automatically marked stale due to inactivity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 30, 2018
@geemus geemus added pinned and removed wontfix labels Jul 30, 2018