Fix #1627: prepared statement pool can be corrupted by exception #17607
Conversation
As stated in #1627 (comment), looks good to me.
Force-pushed from cbe153f to a7bdb71.
Even better with the added test.
Force-pushed from a7bdb71 to c0d21de.
* Make sure prepared statements cache is left in consistent state if exception is raised
  in PostgreSQL adapter (#1627).

  *Andrew Bezzub*
The newest entry for the CHANGELOG should go to the top. In addition, please add a new line for the issue that is being fixed to ensure that the format is consistent with the other entries. Thanks 😄
done :)
Force-pushed from c0d21de to f8c668a.
Haven't been following the discussion, but from what I read in the changelog I'd expect a test that raises an exception and checks for a consistent state.
@spastorino I copied the explanation to this PR. The problem with writing a test with an exception is that this bug is not deterministic; it happens only if the exception is raised at exactly the right time (which does not prevent it from happening regularly in production in my app). The test I added in this PR guarantees that the statement pool is going to be in a consistent state if the implementation of
And I would argue that the bug is in the implementation of the
Thoughts?
@tenderlove, it looks like you authored the original statement pool code; can you please take a look at this?
Ok, let's wait for Aaron then.
https://github.com/heroku/rack-timeout#timing-out-inherently-unsafe Your proposed change replaces this particular error with a memory leak. But it doesn't address the general problem: there are all sorts of places where it's assumed that if one line of code completes normally, the following line will then be executed. Rephrased as "If I monkey-patch one of these methods to raise a new and exciting exception...", does it sound like a reasonable thing for us to accommodate? I'm really not sure... but doubtful. I don't specifically object to this particular change, but I'm not convinced this is anywhere near the only place that needs help to work properly in spite of rack-timeout's sabotage.
By memory leak you mean that PostgreSQL can have a prepared statement that Rails does not know about? I think this is better than the crash, but I'll try to work on a better fix that does not cause a memory leak.
I agree that timeouts should be fixed (and then there would be no way to get into this state), but I am sure there are thousands of apps out there that time out regularly. Even if your app does not regularly time out, there is a chance that when it does time out for the first time (for example because of a new bug) Rails will crash the server. Plus, I am not monkey-patching Rails to get into this state; I am using the official Heroku gem that Heroku suggests everybody use. I am not saying we should go and spend all our time trying to fix these things, but having some of them fixed is better than having none fixed. As a developer I always ask "what would happen if this code raised an error?". There is no way to know all the places where a program can fail, so it might help to be a bit more defensive and clear critical resources if something can go wrong. I understand that there can be many places with this problem, but that is not a reason to ignore it. I think in the long run it would be nice to have all such bugs fixed. Developers expect Rails to leave the server in a consistent state, and in this case it does not. Any Rails app that runs on Heroku cannot use prepared statements until this problem is fixed.
I'm pretty sure that would be unmergeably ugly, given that it's protecting against an impossible situation. A minor rearrangement of existing code is palatable... extra behaviour, not so much.
If Heroku are recommending that everyone use a gem whose own readme describes it (quite rightly) as "Inherently Unsafe", that sounds like a matter for them to address. @schneems? @kch?
There's some discussion about this in previous rack-timeout issues and some allusions in the readme. There's no general solution to this problem, so it's left to users to fix their particular cases. That said, pg problems are fairly common so I'd love to eventually see a gem that addresses those. At the moment, the solution is to use the observer hooks to do state cleanup, like resetting database connections or whatever is required to fix prepared statements. Another option is to use
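The "observer hooks" cleanup mentioned here could look roughly like this in a Rails initializer. This is a hypothetical sketch: it assumes rack-timeout's documented `Rack::Timeout.register_state_change_observer` API and the `"rack-timeout.info"` env key, and uses `ActiveRecord::Base.clear_all_connections!` as one plausible cleanup action; verify all of these against the gem's README and your Rails version before relying on them.

```ruby
# config/initializers/rack_timeout_cleanup.rb (hypothetical sketch)
Rack::Timeout.register_state_change_observer(:clear_prepared_statements) do |env|
  info = env["rack-timeout.info"]
  if info && info.state == :timed_out
    # Throw away the possibly-corrupted connection so the next request
    # checks out a clean one without stale prepared-statement state.
    ActiveRecord::Base.clear_all_connections!
  end
end
```

The trade-off is a connection re-establishment cost on the request following a timeout, in exchange for never serving requests on a connection whose statement cache may be out of sync with the server.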
The fix is for the `prepare_statement` method:

```ruby
def prepare_statement(sql)
  sql_key = sql_key(sql)
  unless @statements.key? sql_key
    nextkey = @statements.next_key
    begin
      @connection.prepare nextkey, sql
    rescue => e
      raise translate_exception_class(e, sql)
    end
    # Clear the queue
    @connection.get_last_result
    @statements[sql_key] = nextkey
  end
  @statements[sql_key]
end
```

If an error happens anywhere after `@connection.prepare nextkey, sql` and before `@statements[sql_key] = nextkey`, the connection is left in an inconsistent state and every query that was not prepared will fail (it will be prepared in PostgreSQL but the cache won't know that the key is reserved). This fix changes `next_key` to always return a different value on every call, so if an error is raised before the statement is cached, the next statement will not fail.
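The failure mode and the fix can be reproduced end-to-end with a toy model of the pool. The class and method names below are hypothetical, not the actual adapter code: with `fresh_keys: false` the key is derived from the cache size (mimicking the buggy behavior), so an exception between the server-side prepare and the cache insert makes the pool hand out the same key forever after.

```ruby
require "timeout"

# Toy model of the statement-pool bug (hypothetical names, not Rails code).
class ToyStatementPool
  def initialize(fresh_keys:)
    @cache  = {}   # sql => statement name the pool knows about
    @server = {}   # statement names the "server" has actually prepared
    @counter = 0
    @fresh_keys = fresh_keys
  end

  def prepare(sql, interrupt: false)
    return @cache[sql] if @cache.key?(sql)
    key = next_key
    if @server.key?(key)
      raise ArgumentError, "prepared statement #{key} already exists"
    end
    @server[key] = sql                 # server-side prepare succeeds...
    raise Timeout::Error if interrupt  # ...then a timeout fires here...
    @cache[sql] = key                  # ...so the cache never records the key
  end

  private

  def next_key
    @fresh_keys ? "a#{@counter += 1}" : "a#{@cache.size + 1}"
  end
end

broken = ToyStatementPool.new(fresh_keys: false)
begin
  broken.prepare("SELECT 1", interrupt: true)
rescue Timeout::Error
end
# Cache and server are now out of sync: every later prepare collides.
begin
  broken.prepare("SELECT 2")
rescue ArgumentError => e
  puts e.message  # => prepared statement a1 already exists
end

fixed = ToyStatementPool.new(fresh_keys: true)
begin
  fixed.prepare("SELECT 1", interrupt: true)
rescue Timeout::Error
end
puts fixed.prepare("SELECT 2")  # => a2 (fresh key; a1 is merely leaked)
```

With fresh keys the interrupted prepare costs one orphaned server-side statement (the "memory leak" discussed below) instead of a permanently poisoned pool.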
Force-pushed from f8c668a to 8394c1f.
What hooks should be used? Can someone please point me to the docs or elaborate on the solution? What is the concern about the current fix? It solves the crash in the adapter that is described in the description of this PR.
Actually, this is the perfect case for

```ruby
Thread.handle_interrupt(ExceptionClass => :never) {
  # do statement stuff
}
```

I'm not sure what
Maybe we can add a hook point for the Rack runtime so that it can wrap the statement cache updates with a
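As a standalone illustration of the `Thread.handle_interrupt` behavior suggested above (a sketch, not the proposed adapter change): an exception injected with `Thread#raise`, which is the mechanism rack-timeout uses, is held back until the protected block finishes, so a critical section like a cache update cannot be torn apart mid-way.

```ruby
# Demonstrates Thread.handle_interrupt deferring an injected exception.
done_critical = false

t = Thread.new do
  begin
    Thread.handle_interrupt(RuntimeError => :never) do
      # An asynchronous RuntimeError raised into this thread is held
      # back until this block finishes.
      sleep 0.2
      done_critical = true  # the "cache update" runs to completion
    end
    sleep  # never reached: the pending interrupt fires at the block boundary
  rescue RuntimeError
    :interrupted
  end
end

sleep 0.05                         # let the thread enter the protected block
t.raise(RuntimeError, "timeout!")  # inject mid-critical-section
p t.value        # => :interrupted
p done_critical  # => true: the critical section was not interrupted
```

The timing here (0.05 s vs 0.2 s) is only to make the race deterministic for the demo; real code would wrap exactly the prepare-then-cache sequence.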
@abezzub this no longer cleanly applies; would you be willing to rebase if it's still relevant?
One thing I want to mention is that this isn't the only cache in Rails. We can fix this one, but since
Yeah, but what is the alternative? Is it OK to accept that Rails can randomly crash if a timeout happens?
Is it OK to accept that Rails can randomly crash if you deliberately force an arbitrary stack unwind by injecting an exception? Sounds fair to me. I absolutely expect us to handle, to the best of our ability, any reasonable action the operating system may perform -- up to and including a SIGKILL. But accepting that our own process may choose to actively sabotage our behaviour, and then inserting
Yeah, I see your point and I can agree that this is something Heroku should address. But... many people who run Rails on Heroku get this error (as you can see by reading the comments in #1627), and this error is hard to debug. So is there a reason not to apply this one-line fix for this particular problem?
Well, the "inherently" in "inherently unsafe" is important. There's not a lot that can be done about it. Forced timeouts should be an exceptional situation; if one relies on them consistently for application performance, then they're running a very broken application. Again, the solution is to clean up any broken state using rack-timeout's hooks. I'd go so far as to recommend only using rack-timeout while in the midst of debugging issues with long-running requests. Ideally, anything that can block or otherwise unpredictably delay a request should happen asynchronously in a background process, or alternatively be handled with a custom-tailored timeout mechanism that kicks in before rack-timeout.
Issue #1627 was opened back in 2011 to hopefully fix a strange corner case. After a long time of nobody knowing what was going on, the correct sequence of events was puzzled out by @abezzub, and a minimal patch was provided here for the specific issue observed. Everybody agrees this is probably not the last issue of this kind, as the

I'm not generally a fan of whack-a-mole patches, but in this particular case it's a trivial fix making it possible to use the statement cache (enabled by default) with one fewer error scenario. To me there are a few possible next steps:

Personally I think 1. is the sad outcome, and 3. is extra work with little benefit (and no clear design seems to be emerging). If more test cases are what's holding back a merge, I'd be happy to provide a "fails always" test case that this diff fixes. Would that at all be helpful?
FYI Heroku doesn't inject anything. The gem is just often recommended.
@kch my mistake, I did not verify that. Any thoughts on a good path forward?
@rud I don't feel particularly comfortable opining on what's best for Rails. I've commented before on the purpose of rack-timeout and potential avenues for recovering from state corruption. What rack-timeout does is basically evil and harmful, but often it's the lesser evil over an app being completely unresponsive and unusable.
@tenderlove In my experience these kinds of timeouts are not caused by timeouts when connecting to the database, but rather by slow queries and/or queries that run into lock contention. Especially with Rails' default handling of updated_at fields and counter_cache, you run into lock contention on heavily updated tables (e.g. users, posts, etc.) quite easily when operating at scale. While rack-timeout is an imperfect solution, it'd be nice if Rails could handle the worst case a bit better. We run http://www.producthunt.com/ on Heroku and have been bitten by this exact issue.
We have run into this issue several times a week since upgrading to Rails 4.1. I find it very unlikely that our application would coincidentally time out on that exact line of code. As @tenderlove speculates, it's possible that the call to
I'm experiencing the same behavior on Rails 4.2. Is there any new discussion about this issue going on?
@brodock I don't think this issue has gotten any attention. In my experience, you can simply turn off prepared statements, and the performance impact is negligible.
In case someone gets here and needs info on how to disable prepared statements: http://stackoverflow.com/questions/26146436/how-to-fix-pgduplicatepstatement-error |
The affected code changes a bit in Rails 5.0 (at least on 5.0.0.beta2): active_record/connection_adapters/postgresql_adapter.rb#L177

```ruby
def []=(sql, key)
  super.tap { @counter += 1 }
end
```

Apparently
Confirmed it fixes the issue at my project. |
There is a more recent PR #25827 (comment) and a comment on there about an alternative way we could fix the issue. Thinking we should close this PR as it's not going to get merged. If someone wants to take a stab at the alternative implementation mentioned by Sean, he would be the one reviewing and accepting the PR, so there's a pretty good chance going with his approach would be accepted.
@schneems my (pretty widely documented, at this point) core objection isn't to how we'd fix this, but to us attempting to fix it at all.
Still looks good - maybe update the test as suggested?
@rud I can update it if it is going to be applied; I was under the impression that the Rails team didn't want this to be merged.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This PR should fix #1627
Problem:
If an error is raised in the PostgreSQL adapter at the right time, the statement pool is left in an inconsistent state and the process has to be restarted. The only workaround without this fix is to turn off prepared statements.
Explanation:
My app runs on Heroku, and they suggest using their version of a timeout gem, which raises an exception in the main thread if a request takes too long:
https://github.com/heroku/rack-timeout/blob/master/lib/rack/timeout.rb#L91-L108
The "prepared statement already exists" error happens in following method in postgres adapter: https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L864-L878
Heroku's gem raised the timeout exception at just the right time and corrupted the state of the prepared statements cache. Let's assume the timeout exception happens right after `@connection.prepare nextkey, sql` finishes successfully: Postgres then has the prepared statement, but `@statements` was not updated and does not know that `nextkey` was already used. So the next time `prepare_statement` is called, it is guaranteed to fail, because `@statements.next_key` will return the same `nextkey` that was already used in PG. Current version of the code that determines what `next_key` is:

If `nextkey` is reserved before `@connection.prepare nextkey, sql` is started, the issue will be fixed. In this case my fix is to just generate a new `next_key` every time the adapter asks for it.

Timeout exception that caused problems, from my app's log: