Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

makara connection pool management is not thread-safe #151

Open
aks opened this issue Mar 14, 2017 · 32 comments
Open

makara connection pool management is not thread-safe #151

aks opened this issue Mar 14, 2017 · 32 comments

Comments

@aks
Copy link

aks commented Mar 14, 2017

The connection pool management is not thread-safe. So, when running makara in thread-intensive code, such as sidekiq, subtle errors creep in.

The fix is not entirely straightforward: AR 4 and 5 has reorganized the connection pool management to be thread-safe, with lots of mutex usage. Makara has none of that, and it's connection pool management is not thread-safe.

In fact, makara uses an array of connections for each pool, which is traversed each time a connection decision is being made. Connections are added up to the maximum connection, but there are no protections against simultaneous access and changes across threads.

In addition, in the master and replica (slave) connection pools, there are connections with different states in each array: blacklisted or not.

We could protect the connection pool array with a semaphore, but traversing an array behind a semaphore is a Bad Idea because it blocks all other threads from traversing that same array. If the semaphore supported read-locking in addition to write-locking then multi-thread traversals would work in parallel, except for adding or removing connections.

However, IMHO, it would be better to have multiple arrays of connection pools, with each kind of connection being queued into a separate pool (array) of connections. So, blacklisted connections would be maintained in a blacklisted connection pool, and the remaining set of available connections would be in the available connection pool. Then, each pool could be managed as a thread-safe queue object.

Makara should be updated to make best use of the latest AR connection pool code, becoming thread-safe in the process.

@bleonard
Copy link
Contributor

Thanks for taking the time to write this up @aks
It reads like a good summary of the issue at hand.

I don't suppose you want to give it a shot? 😄

@aks
Copy link
Author

aks commented Mar 15, 2017

Well, maybe.

I've already forked the makara repo, and have added a hijack on the connection method.

However, that change was insufficient. There are still AR timeouts occurring in our sidekiq processes, which are thread-full, and the stacktraces on those timeouts do not include makara code anywhere.

Whether or not the timeouts and lack-of-thread-awareness in makara are related is still TBD.

So, there appear to be two issues:

  1. makara has incomplete coverage of current AR methods (it doesn't hijack all the necessary methods)
  2. makara does not manage its connection correctly in a thread-full environment

I've reviewed the ActiveRecord 4.2.6 and 5 code connection adapters and connection handling, to get an idea of how hard it would be to update/rewrite makara to become an alternative connection handler for AR.

In the current AR 4.2.6+ or 5 code, each model has a connection handler, which is supposed to determine the appropriate connection using a connection spec. Once a connection is selected, it is cached.

In order to support dynamic query balancing, each model would have to use a dynamic connection handler, which would choose a connection on each query, instead of using a static, cached connection. This appears to require a change to AR itself.

Would you happen to be aware anyone seriously looking into these AR issues? Are you or anyone on your team doing so?

The alternative to using an AR "insert" like makara is a proxy service, like pg_bouncer or pg_pool which could also be reconfigured to perform dynamic load-balancing.

Since we are already using pg_bouncer, it's conceptually simpler and comparatively easier (in theory) to consider updating it to perform load-balancing dynamically, based on the SQL statement and its context.

As you might guess, right now I'm looking for the Path of Least Effort, much as the cat in Heinlein's book was always looking for the "Door Into Summer". I'm really hopeful that you might have a good answer to help us find that door.

@bf4
Copy link
Contributor

bf4 commented Apr 14, 2017

I kind of like how switch_point does it https://github.com/eagletmt/switch_point/blob/master/lib/switch_point/proxy.rb#L22-L33 by making an active record subclass just for getting access to a connection_pool, similar to https://github.com/customink/secondbase/blob/master/lib/second_base/base.rb

or https://github.com/instructure/shackles/blob/master/lib/shackles/connection_handler.rb kind of extends the connection handler

@aks
Copy link
Author

aks commented Apr 14, 2017

The fresh_connection gem (https://github.com/tsukasaoishi/fresh_connection) has an elegant way to insert itself into AR, without having to "hijack" lots of methods.

@jwg2s
Copy link

jwg2s commented Jun 8, 2017

@aks how have you guys mitigated this? We're evaluating makara with https://github.com/ankane/distribute_reads + sidekiq, and in our staging environment seeing that no Postgres data is being collected. My hunch is that this is related to makara's lack of thread safety.

This seems like a deal breaker for anyone using makara, right?

EDIT: I think this is just a reporting issue, what used to show up as Postgres is now showing up as ActiveRecord in NewRelic. Still interested to hear your thoughts on whether makara is really safe to use with sidekiq in a production setting.

@swordfish444
Copy link

@aks This prevented us from adopting Makara. Is there any movement on this? It would be great to see this fixed!

@bleonard
Copy link
Contributor

@jeremy with your knowledge of the AR internals, any thoughts on this?
Have you done anything to fix this issue in your environment?

@rajagopals
Copy link

@aks @jwg2s @bleonard We are evaluating using makara for read-write splitting and came across this thread. Do you have updates/suggestions/mitigation options for this issue?

I don't see other well-maintained gems for real-write splitting in production, anyone has other suggestions that worked well in production environment?

FYI We use AWS Aurora MySQL as our backing store and our application is hosted on Heroku.

@aks
Copy link
Author

aks commented Aug 2, 2018 via email

@dzunk
Copy link

dzunk commented Aug 2, 2018

@rajagopals recently went through the same thing, needed to migrate off of pgpool/pgbouncer and the thread-safety issue ultimately prevented us from adopting Makara.

We wound up using active_record_slave on top of an Aurora Postgres cluster. No issues yet in a multi-threaded production environment, handling ~40k queries per minute. YMMV

@ankane
Copy link
Contributor

ankane commented Sep 20, 2018

Hey @aks, can you explain more about how the issue manifests itself, and how to reproduce it?

@patrykk21
Copy link

Any updates here?

@jeremy
Copy link
Contributor

jeremy commented Oct 10, 2018

@patrykk21 Are you investigating it? Please do share your findings!

@patrykk21
Copy link

@jeremy I tried replicating the issue with both gems on our project locally without success. I'm generating a new clean Rails 5.1 app to have more control over the environment and trying to break it.

Do you have any insight or any specific tests that would break it?

@jeremy
Copy link
Contributor

jeremy commented Oct 12, 2018

@patrykk21 How'd you try to replicate it?

@patrykk21
Copy link

@jeremy Well, using our app with multiple simultaneous requests through sidekiq and some manual tests through threads, expecting to have some timeout errors but it actually never occurred.

@jeremy
Copy link
Contributor

jeremy commented Oct 12, 2018

@patrykk21 Hmm, that would be a matter of getting lucky to see an issue. We'd need to examine the code to see how to get concurrent threads into a state that causes specific problems. Considering you're using Sidekiq and there are other libraries that do suggest they're thread-safe (possibly by leaning more heavily on Active Record's connection handlers), I'd go with an alternative for now.

@tmiller
Copy link

tmiller commented Mar 20, 2019

Has anyone had any luck in reproducing the issues seen? We are also running with a similar configuration (puma / sidekiq). It would be nice to use this, but this issue concerns us. I looked at the alternatives and they do not seem to have a feature I like (failing over to master on high replica lag). If anyone has reproduced it we'd be willing to help resolve the issue.

@pinnymz
Copy link

pinnymz commented Mar 25, 2019

Consider that makara is using Thread.current, and that no Sidekiq middleware is in place to inject the correct context, this may be a thread reuse problem. See the RequestStore project (and the corresponding sidekiq solution).

@pinnymz
Copy link

pinnymz commented Mar 27, 2019

@aks @tmiller can you confirm if this sounds like the issue you're facing?

@tmiller
Copy link

tmiller commented Mar 27, 2019

We were not seeing it, just wary of moving. I plan on setting up some tests to validate that your change fixes it. This would be great if it does because it means we can move forward :).

@pinnymz
Copy link

pinnymz commented Mar 27, 2019

Reading through @aks description of the problem again, I believe my fix doesn't address this. It seems the root that issue is not related to thread reuse, but rather the pooling mechanism itself. My fix is solving a different problem.

@tmiller
Copy link

tmiller commented Mar 28, 2019

Oops, yup you are correct. Well, I guess I'll take a stab at reproducing the pooling issue.

@jeffdoering
Copy link

Is the basic write-up of the problem accurate? This issue generates concern about Makara thread-safety but hasn't reach any clear answer in 2 years.

In fact, makara uses an array of connections for each pool, which is traversed each time a connection decision is being made. Connections are added up to the maximum connection, but there are no protections against simultaneous access and changes across threads.

I believe Makara's @slave and @master pools are private to a single Makara Adapter; aka Active Record connection. This comes from inheriting from Makara::Proxy. As long as Active Record is limiting the connection to a single thread at a time - the Makara data structures inside that connection (including the actual underlying connections provided by real connection adapters) are never access concurrently.

Have I misunderstood? Does Makara actually share underlying pools across different MakaraAdapters and is therefore subject to multi-threading concerns for it's internal data structures?

@pinnymz
Copy link

pinnymz commented Jun 4, 2019

As long as Active Record is limiting the connection to a single thread at a time....

This is the first part of the issue that @aks is describing. ActiveRecord 3 does not limit access or changes to the pool - or any individual connection in the pool - to a single thread.

@jeffdoering
Copy link

As long as Active Record is limiting the connection to a single thread at a time....

This is the first part of the issue that @aks is describing. ActiveRecord 3 does not limit access or changes to the pool - or any individual connection in the pool - to a single thread.

Active Record (versions 4, 5, 6) does limit access to a connection to one thread as long as ActiveRecord connection access patterns are honored by applications (that's critical for safely using most database connection implementations). @aks mentioned this in his initial report.

See here for the model - note this pool has nothing to do with the term "pool" used inside Makara (which is an unfortunate naming collision):

https://github.com/rails/rails/blob/4dcb46182a4aaa57f44f3eb722c1db54fa0ff843/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L48

Given single-thread access to a MakaraAdapter at a time; I don't see any concern that it's internal data structures will be access concurrently.

Can you elaborate on where you think concurrent use is possible assuming application threads that correctly follow ActiveRecord access patterns?

@pinnymz
Copy link

pinnymz commented Jun 13, 2019

@jeffdoering Firstly, I didn't realize AR 3 was off the table for this discussion :) My use case happens to be with a legacy app that is still on AR 3.

Secondly, the problem is with the Makara pool itself (as you noted, this is distinct from AR connection pool). Unlike AR 4+, the Makara pool is managed as a singleton Array that is not thread safe. Individual connections provided by the AR pool (in a thread-safe manner), are then cached and held in the singleton Makara pool in an un-thread-safe manner. From an architectural standpoint, Makara doesn't inject itself downstream of the AR pool, rather it sits side-by-side and delegates calls to it as it sees fit.

@jeffdoering
Copy link

If this all applies equally to AR 3; great. @aks suggested in the original post that AR 3 itself might not be thread-safe in which case Makara thread-safety would be a side-note.

In terms of Makara architecture:

  • It does not get any connections from ActiveRecord::ConnectionPool.
  • It does use ActiveRecord::Base to instantiate connections but those are downstream from the ActiveRecord::ConnectionPool inside of the MakaraAbstractAdapter itself.
  • MakaraAbstractAdapter does delegate heavily to ActiveRecord::ConnectionAdapters::AbstractAdapter; but not to ActiveRecord::ConnectionPool
  • Makara::Pool is not a singleton. It's internal array is not thread-safe but 2x Makara::Pools belong to exactly one MakaraAbstractAdapter. The ActiveRecord::ConnectionAdapters::AbstractAdapter in the MakaraPool are not shared outside of the MakaraAbstractAdapter.

As long as Active Record and the consuming application use the ActiveRecord::ConnectionPool correctly; only one thread should interact with an instance of MakaraAbstractAdapter at a time and the internal data structures of the MakaraAbstractAdapter (including it's Makara::Pool objects) do not need to be thread-safe.

However; I wouldn't be following this issue or the Makara code in such detail if everything worked fine. We are seeing some problematic behavior with Makara that I believe is related to its interaction with the ActiveRecord::ConnectionPool.

It's not specifically a thread-safety issue but rather a problem where Makara "corrupts" the state of the ActiveRecord::ConnectionPool such that threads cannot safely acquire connections from the pool. I'll file a detailed issue shortly.

@philipbjorge
Copy link

philipbjorge commented Jul 12, 2019

@jeffdoering -- Have you filed a detailed issue yet?

We're seeing issues with our ConnectionPool becoming corrupted during database failovers in sidekiq only with makara and we are unable to recover without restarting the servers.

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use


I saw you had a patch into rails core... Does it resolve the issue?
rails/rails#36473

@jeffdoering
Copy link

I had not detailed the issue (oops); but now I have:

#242

Unfortunately my patch to rails core was a much narrower issue that I came across as a side-effect of this deep-dive. I have to eat crow some on the thread above. Makara's Proxy does have a thread-safety issue but the primary issue with failing to fully control real connector interactions with the ActiveRecord ConnectionPool can break things even without threading concerns. See the new issue for full details.

In terms of this issue; the thread safety should be fixed but without a comprehensive solution to the encapsulation problem the pattern in general is unsafe. It's pretty easy to call legal methods on connections and connection pools and break the pool when Makara is used (by throwing errors and/or leaking underlying real connections into the pool).

@kigster
Copy link

kigster commented Apr 29, 2020

@jeremy @aks @bleonard @sax I want to chime in here and say that I am very surprised by this entire thread. When I was at Wanelo, we ended up adopting Makara specifically because it was the only read/write splitting gem that appeared thread safe. I am CC-ing @sax here so that if I forget any issues we had — Eric, please chime in.

But I want to testify that I have successfully used Makara (with PostgreSQL adapter) on two very high traffic Rails applications that were both running sidekiq and puma (in a multi-threaded configuration).

• One site used 5 threads in Puma, 20 in Sidekiq, and reached 200K RPMs traffic to Rails app
• Second site used similar number of threads, and reached 60K RPMs

TL;DR

Neither of the high-traffic sites I mentioned had any issues with multi-threading, and both sites very actively used Makara's automatic recovery and blacklisting and connection pooling features.

Just my 2c.

@kigster
Copy link

kigster commented Apr 29, 2020

@jeffdoering -- Have you filed a detailed issue yet?

We're seeing issues with our ConnectionPool becoming corrupted during database failovers in sidekiq only with makara and we are unable to recover without restarting the servers.

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use

I saw you had a patch into rails core... Does it resolve the issue?
rails/rails#36473

Why do you think that all connections in use error is a result of Makara's corrupting connection pool? In my experience this was caused by the inadequate pool size setting.

When you use multiple threads you should create a much larger pool than the default. Rails defaults to pool size of 10 and I believe assumes a single thread. So create a pool of 30 if you use three threads and see how things go.

If you do this, beware, especially with PostgreSQL,. about too many connections hitting your server. The solution here is pgBouncer which will reduce number of connections by 10X and will allow you to not worry about configuring a large connection pool size in Rails.

Once again, I suspect Makara has nothing to do with this error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests