Fix #1627: prepared statement pool can be corrupted by exception #17607

abezzub · 2014-11-13T10:06:40Z

This PR should fix #1627

Problem:

If error is raised in the PostgreSQL adapter at the right time then statement pool is left in the inconsistent state and process has to be restarted. The only solution without this fix is to turn off prepared statements.

Explanation:

My app uses Heroku and they suggest using their version of timeout gem, that raises exception in the main thread if request takes too long:

https://github.com/heroku/rack-timeout/blob/master/lib/rack/timeout.rb#L91-L108

The "prepared statement already exists" error happens in following method in postgres adapter: https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L864-L878

Heroku's gem created timeout exception just at the right time and it corrupted the state of the prepared statements cache. Let's assume timeout exception happens right after @connection.prepare nextkey, sql finishes successfully:

        def prepare_statement(sql)
          sql_key = sql_key(sql)
          unless @statements.key? sql_key
            nextkey = @statements.next_key
            begin
              @connection.prepare nextkey, sql
            rescue => e
              raise translate_exception_class(e, sql)
            end

            # timeout happens here!

            # Clear the queue
            @connection.get_last_result
            @statements[sql_key] = nextkey # next key is "reserved" here
          end
          @statements[sql_key]
        end

then Postgres has prepared statement, but @statements was not updated and does not know that nextkey was already used. So the next time when prepare_statement is going to be called it is guaranteed to fail because @statements.next_key will return the same nextkey that was already used in PG. Current version of code that determines what next_key is:

        def next_key
          "a#{@counter + 1}"
        end

        def []=(sql, key)
          while @max <= cache.size
            dealloc(cache.shift.last)
          end
          @counter += 1
          cache[sql] = key
        end

If nextkey is reserved before @connection.prepare nextkey, sql is started then the issue will be fixed. In this case my fix is to just generate new next_key every time adapter is asking for it.

Timeout exception that caused problems from my app's log:

Rack::Timeout::RequestTimeoutError (Request ran for longer than 25 seconds.): 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:537:in `dealloc' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:512:in `[]=' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:875:in `prepare_statement' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:826:in `exec_cache' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql/database_statements.rb:138:in `exec_query' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:954:in `select' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/database_statements.rb:24:in `select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `block in select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:83:in `cache_sql' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:265:in `execute_simple_calculation' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:227:in `perform_calculation' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:119:in `calculate' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:75:in `sum'

rud · 2014-11-13T10:14:17Z

As stated in #1627 (comment), looks good to me.

rud · 2014-11-13T21:44:38Z

Even better with the added test.

tgxworld · 2014-11-14T00:29:28Z

activerecord/CHANGELOG.md

+*   Make sure prepared statements cache is left in consistent state if exception is raised
+    in PostgreSQL adapter (#1627).
+
+    *Andrew Bezzub*


The newest entry for the CHANGELOG should go to the top. In addition, please add a new line for the issue that is being fixed to ensure that the format is consistent with the other entries. Thanks 😄

spastorino · 2014-11-14T14:24:13Z

Haven't been following the discussion but from what I read in the changelog I'd expect a test with an exception and checking for a consistent state.

abezzub · 2014-11-14T16:42:37Z

@spastorino I copied the explanation to this PR.

The problem with writing a test with exception is that this bug is not deterministic, it happens only if exception is raised at exactly the right time (which does not prevent it from happening regularly in production of my app). Test I added in this PR guarantees that the statement pool is going to be in consistent state if implementation of prepare_statement method does not change. This is not ideal, but I don't know how to write a better test because this bug is non-deterministic. Let me know if you have any suggestions.

abezzub · 2014-11-14T19:09:27Z

And I would argue that the bug is in implementation of the StatementPool and not in the postgres adapter. I think there might be a way to add a test, but it would still depend on internal implementation details. It could look like this:

      class PostgreSQLAdapter < AbstractAdapter
        class StatementPool < ConnectionAdapters::StatementPool
          def []=(sql, key)
            # do nothing to test a case when statement is not cached due to exception
          end
        end
      end

      def test_exec_with_prepared_statement_works_if_statemnt_is_not_cached
        @connection.exec_query('SELECT 1')
        @connection.exec_query('SELECT 1')
        assert true, 'this test should not raise error'
      end

Thoughts?

abezzub · 2014-11-18T05:19:10Z

@tenderlove, looks like you authored original statement pool code, can you please look at this?

spastorino · 2014-11-18T13:29:28Z

Ok, let's wait for Aaron then

matthewd · 2014-11-18T15:34:35Z

https://github.com/heroku/rack-timeout#timing-out-inherently-unsafe

Your proposed change replaces this particular error with a memory leak. But it doesn't address the general problem: there are all sorts of places where it's assumed that if one line of code completes normally, the following line will then be executed.

Rephrased as "If I monkey-patch one of these methods to raise a new and exciting exception..", does it sound like a reasonable thing for us to accommodate? I'm really not sure... but doubtful.

I don't specifically object to this particular change, but I'm not convinced this is anywhere near the only place that needs help to work properly in spite of rack-timeout's sabotage.

abezzub · 2014-11-18T16:22:12Z

Your proposed change replaces this particular error with a memory leak.

By memory leak you mean that PostgreSQL can have prepared statement that Rails would not know about? I think this is better than the crash, but I'll try to work on a better fix that does not cause memory leak.

there are all sorts of places where it's assumed that if one line of code completes normally, the following line will then be executed.

I agree that timeouts should be fixed (and then there would be no way to get into this state), but I am sure there are thousands of apps out there that timeout regularly. Even if your app does not regularly timeout there is a chance that when it does timeout for the first time (for example because of new bug) Rails will crash the server. Plus I am not monkey patching Rails to get into this state, I am using official Heroku gem that Heroku suggests everybody to use. I am not saying we should go and spend all our time trying to fix these things, but having some of them fixed is better than having none fixed. As a developer I always ask "what would happen if this code raised an error?". There is no way to know all places where program can fail, so it might help to be a bit more defensive and clear critical resources if something can go wrong.

I understand that there can be many places with such problem, but this is not a reason to ignore the problem. I think in the long run it would be nice to have all such bugs fixed. Developers expect Rails to leave server in the consistent state and in this use case it does not. Any Rails app that runs on Heroku cannot use prepared statements until this problem is fixed.

matthewd · 2014-11-18T16:45:15Z

I'll try to work on a better fix that does not cause memory leak.

I'm pretty sure that would be unmergeably ugly, given that it's protecting against an impossible situation. A minor rearrangement of existing code is palatable... extra behaviour, not so much.

I am using official Heroku gem that Heroku suggests everybody to use

If Heroku are recommending that everyone use a gem whose own readme describes it (quite rightly) as "Inherently Unsafe", that sounds like a matter for them to address. @schneems? @kch?

kch · 2014-11-18T19:32:23Z

There's some discussion about this in previous rack-timeout issues and some allusions in the readme. There's no general solution to this problem, so it's left to users to fix their particular cases. That said, pg problems are fairly common so I'd love to eventually see a gem that addresses those.

At the moment, the solution is to use the observer hooks to do state cleanup, like resetting database connections or whatever is required to fix prepared statements.

Another option is to use Thread.handle_interrupt to prevent exceptions from being raised inside critical sections (like around prepare_statement maybe). (It's a bit mind-bending at first but there are reasonable examples in the ruby docs.)

@Statements

The fix is for prepare_statement method: def prepare_statement(sql) sql_key = sql_key(sql) unless @statements.key? sql_key nextkey = @statements.next_key begin @connection.prepare nextkey, sql rescue => e raise translate_exception_class(e, sql) end # Clear the queue @connection.get_last_result @Statements[sql_key] = nextkey end @Statements[sql_key] end If error happens anywhere after "@connection.prepare nextkey, sql" and before "@Statements[sql_key] = nextkey" then connection is left in the inconsistent state and every query that was not prepared will fail (it will be prepared in PostgreSQL but cache won't know that key is reserved). This fix changes next_key to always return different value on every call, so if error is raised before statement is cached then next statemnt will not fail.

abezzub · 2014-11-29T05:59:12Z

At the moment, the solution is to use the observer hooks to do state cleanup, like resetting database connections or whatever is required to fix prepared statements.

What hooks should be used? Can someone please point me to the docs/elaborate on the solution?

What is the concern about current fix? It solves the crash in adapter that is described in description of this PR.

tenderlove · 2015-01-10T22:24:40Z

Actually, this is the perfect case for Thread.handle_interrupt IMO. Something like this:

Thread.handle_interrupt(ExeptionClass => :never) {
  # do statement stuff
}

I'm not sure what ExceptionClass should be though. The Rack timeout error eventually inherits from RuntimeError, and preventing a RuntimeError seems scary. (We can't directly reference the TimeoutError constant in Rack timeout without introducing a dependency on rack runtime).

Maybe we can add a hook point for Rack runtime so that it can wrap the statement cache updates with a handle_interrupt call?

rud · 2015-01-14T08:27:35Z

@abezzub this no longer cleanly applies, would you be willing to rebase if it's relevant?

tenderlove · 2015-01-14T22:21:16Z

One thing I want to mention is that this isn't the only cache in Rails. We can fix this one, but since Rack::Timeout can send a kill signal to your app at anytime, this will basically be a never ending game of whack-a-mole.

If Heroku are recommending that everyone use a gem whose own readme describes it (quite rightly) as "Inherently Unsafe", that sounds like a matter for them to address.

I have to agree with @matthewd here. /cc @schneems @hone

abezzub · 2015-01-15T00:43:27Z

this will basically be a never ending game of whack-a-mole.

Yeah, but what is the alternative? Is it OK to accept that Rails can randomly crash if timeout happens?

matthewd · 2015-01-15T07:08:08Z

Is it OK to accept that Rails can randomly crash if timeout happens?

Is it OK to accept that Rails can randomly crash if you deliberately force an arbitrary stack unwind by injecting an exception?

Sounds fair to me.

I absolutely expect us to handle, to the best of our ability, any reasonable action the operating system may perform -- up to and including a SIGKILL. But accepting that our own process may choose to actively sabotage our behaviour, and then inserting just_please_not_this_bit blocks? 😒

abezzub · 2015-01-15T08:09:38Z

Yeah, I see your point and I can agree that this is something Heroku should address. But... Many people who run Rails on Heroku get this error (as you can see by reading comments in #1627) and this error is hard to debug. So is there a reason to not apply this one line fix for this particular problem?

kch · 2015-01-18T01:08:57Z

Well, the "inherently" in "inherently unsafe" is important. There's not a lot that can be done about it. Forced timeouts should be an exceptional situation, if one relies on them consistently for application performance, then they're running a very broken application.

Again, the solution is to clean up any broken state using rackt-timeout's hooks.

I'd go so far as to recommend only using rack-timeout while in the midst of debugging issues with long-running requests. Ideally, anything that can block or otherwise unpredictably delay a request should happen asynchronously in a background process, or alternatively be handled with a custom-tailored timeout mechanism that kicks in before rack-timeout.

rud · 2015-01-18T15:11:41Z

Issue #1627 was opened back in 2011 to hopefully fix a strange corner-case. After a long time of nobody knowing what's going on the correct sequence of events was puzzled out by @abezzub and a minimal patch was provided here for the specific issue observed.

Everybody agrees this is probably not the last issue of this kind, as the timeout gem injected by Heroku at runtime is a terrible terrible idea.

I'm not generally a fan of whack-a-mole patches, but in this particular case it's a trivial fix making it possible to use the statement cache (enabled by default) with one fewer error-scenario.

To me there's a few possible next steps:

AR 3.1 / pgbouncer / PostgreSQL issue with prepared statements #1627 is unfixable in the Rails project as it's caused by an unsupported configuration (the inclusion of the timeout gem) and should be closed as wontfix 😞
or a variant of this small fix is included
alternatively, patch is rewritten using Thread.handle_interrupt but essentially having the same behaviour (ensure @count is updated before each prepared statement is sent to PostgreSQL.

Personally I think 1. is the sad outcome, 3. is extra work with little benefit (and no clear design seems to be emerging).

If more test-cases are what's holding back a merge I'd be happy to provide a "fails always" test-case that this diff fixes. Would that at all be helpful?

kch · 2015-01-18T18:33:59Z

FYI Heroku doesn't inject anything. The gem is just often recommended.

Sent from my iPhone

On 18 Jan 2015, at 07:12, Laust Rud Jacobsen notifications@github.com wrote:

Issue #1627 was opened back in 2011 to hopefully fix a strange corner-case. After a long time of nobody knowing what's going on the correct sequence of events was puzzled out by @abezzub and a minimal patch was provided here for the specific issue observed.

Everybody agrees this is probably not the last issue of this kind, as the timeout gem injected by Heroku at runtime is a terrible terrible idea.

I'm not generally a fan of whack-a-mole patches, but in this particular case it's a trivial fix making it possible to use the statement cache (enabled by default) with one fewer error-scenario.

To me there's a few possible next steps:

AR 3.1 / pgbouncer / PostgreSQL issue with prepared statements #1627 is unfixable in the Rails project as it's caused by an unsupported configuration (the inclusion of the timeout gem) and should be closed as wontfix

or a variant of this small fix is included

alternatively, patch is rewritten using Thread.handle_interrupt but essentially having the same behaviour (ensure @count is updated before each prepared statement is sent to PostgreSQL.

Personally I think 1. is the sad outcome, 3. is extra work with little benefit (and no clear design seems to be emerging).

If more test-cases are what's holding back a merge I'd be happy to provide a "fails always" test-case that this diff fixes. Would that at all be helpful?

—
Reply to this email directly or view it on GitHub.

rud · 2015-01-18T21:46:06Z

@kch my mistake, I did not verify that. Any thoughts on a good path forward?

kch · 2015-01-19T19:08:15Z

@rud I don't feel particularly comfortable opining on what's best for rails. I've commented before on the purpose of rack-timeout and potential avenues for recovering from state corruption.

What rack-timeout does is basically evil and harmful, but often it's the lesser evil over an app being completely unresponsive and unusable.

lfittl · 2015-01-19T22:41:03Z

@tenderlove In my experience these kind of timeouts are not caused by timeouts when connecting to the database, rather by slow queries and/or queries that run into lock contention.

Especially with Rails default handling of updated_at fields and counter_cache you run into lock contention on heavily updated tables (e.g. users, posts, etc) quite easily when operating at scale.

Whilst rack-timeout is an imperfect solution, it'd be nice if Rails could handle the worst case a bit better. We run http://www.producthunt.com/ on Heroku and have been bitten by this exact issue.

dwbutler · 2015-01-27T21:26:32Z

We have run into this issue several times a week since upgrading to Rails 4.1. I find it very unlikely that our application would coincidentally time out on that exact line of code. As @tenderlove speculates, it's possible that the call to @connection.prepare is not configured to time out properly.

brodock · 2015-08-06T17:55:53Z

I'm experiencing the same behavior on Rails 4.2. Are there any new discussion about this issue going on?

dwbutler · 2015-08-07T03:59:24Z

@brodock I don't think this issue has gotten any attention. In my experience, you can simply turn off prepared statements, and the performance impact is negligible.

yaneeka · 2015-08-26T21:02:06Z

In case someone gets here and needs info on how to disable prepared statements:

http://stackoverflow.com/questions/26146436/how-to-fix-pgduplicatepstatement-error

leonelgalan · 2016-02-16T15:17:48Z

The affected code changes a bit in Rails 5.0 (at least on 5.0.0.beta2):

active_record/connection_adapters/postgresql_adapter.rb#L177

        def []=(sql, key)
          super.tap { @counter += 1 }
        end

Apparently []= already does what it needs, simply dropping the .tap should do the trick, since @counter would be incremented in next_key.

genki · 2016-03-07T08:43:01Z

Confirmed it fixes the issue at my project.

schneems · 2016-09-22T16:57:01Z

There is a more recent PR #25827 (comment) and a comment on there about an alternative way we could fix the issue. Thinking we should close this PR as it's not going to get merged. If someone wants to take a stab at the alternative implementation mentioned by Sean, he would be the one reviewing and accepting the PR so there's a pretty good chance going with his approach would be accepted.

matthewd · 2016-09-22T17:02:56Z

@schneems my (pretty widely documented, at this point) core objection isn't to how we'd fix this, but to us attempting to fix it at all.

rud

Still looks good - maybe update the test as suggested?

abezzub · 2019-09-12T17:36:56Z

@rud I can update it if it is going to be applied, I was under the impression that Rails team didn't want this to be merged.

rails-bot · 2019-12-23T17:49:28Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

abezzub changed the title ~~Fix for prepared statement already exists~~ Fix for prepared statement already exists (Postgres) Nov 13, 2014

abezzub force-pushed the postgres-bug branch from cbe153f to a7bdb71 Compare November 13, 2014 21:26

abezzub force-pushed the postgres-bug branch from a7bdb71 to c0d21de Compare November 13, 2014 22:09

tgxworld reviewed Nov 14, 2014
View reviewed changes

abezzub force-pushed the postgres-bug branch from c0d21de to f8c668a Compare November 14, 2014 00:38

abezzub changed the title ~~Fix for prepared statement already exists (Postgres)~~ Fix #1627: prepared statement pool can be corrupted by exception Nov 18, 2014

abezzub force-pushed the postgres-bug branch from f8c668a to 8394c1f Compare November 29, 2014 05:47

mooreniemi mentioned this pull request May 7, 2015

oversized prepared statements cache rsim/oracle-enhanced#594

Closed

jdwolk mentioned this pull request Sep 14, 2015

Remove rack-timeout philosophie/suspenders#70

Closed

lfittl mentioned this pull request Sep 14, 2015

ActiveRecord: Protect PostgreSQL query execution against Timeout::Error #21628

Closed

rails-bot assigned rafaelfranca Oct 20, 2015

matthewd mentioned this pull request Dec 18, 2015

Prevent repeated DuplicatePstatement errors in prepared statement creation #22593

Closed

maclover7 added activerecord PostgreSQL labels Jan 10, 2016

ivanreese mentioned this pull request Mar 26, 2016

If DuplicatePstatement will happen once, activerecord goes into DuplicatePstatement loop #22408

Closed

rafaelfranca mentioned this pull request Jul 17, 2016

Use UUIDs as IDs for prepared statements in Postgresql #25827

Closed

matthewd mentioned this pull request Sep 22, 2016

Duplicate Prepared Statement with Thread#kill #26588

Closed

michaeldever mentioned this pull request Jun 11, 2018

Deallocate preexisting prepare statement identifiers #33116

Closed

rud approved these changes Sep 12, 2019

View reviewed changes

rails-bot bot added the stale label Dec 23, 2019

rails-bot bot closed this Dec 30, 2019

Matt-Yorkley mentioned this pull request Jun 30, 2020

Checkout addresses openfoodfoundation/openfoodnetwork#5703

Merged

mtsmfm mentioned this pull request Oct 21, 2020

Proposal to deprecate term_on_timeout: false zombocom/rack-timeout#169

Closed

rafaelfranca mentioned this pull request Jan 6, 2021

Prevent app crash when postgres prepared query keys become out of sync with Rails #41034

Closed

rafaelfranca mentioned this pull request Mar 24, 2021

Prevent app crash when postgres prepared query keys become out of sync with Rails #41356

Merged

Fix #1627: prepared statement pool can be corrupted by exception #17607

Fix #1627: prepared statement pool can be corrupted by exception #17607

Conversation

abezzub commented Nov 13, 2014

rud commented Nov 13, 2014

rud commented Nov 13, 2014

tgxworld Nov 14, 2014

Choose a reason for hiding this comment

abezzub Nov 14, 2014

Choose a reason for hiding this comment

spastorino commented Nov 14, 2014

abezzub commented Nov 14, 2014

abezzub commented Nov 14, 2014

abezzub commented Nov 18, 2014

spastorino commented Nov 18, 2014

matthewd commented Nov 18, 2014

abezzub commented Nov 18, 2014

matthewd commented Nov 18, 2014

kch commented Nov 18, 2014

abezzub commented Nov 29, 2014

tenderlove commented Jan 10, 2015

rud commented Jan 14, 2015

tenderlove commented Jan 14, 2015

abezzub commented Jan 15, 2015

matthewd commented Jan 15, 2015

abezzub commented Jan 15, 2015

kch commented Jan 18, 2015

rud commented Jan 18, 2015

kch commented Jan 18, 2015

rud commented Jan 18, 2015

kch commented Jan 19, 2015

lfittl commented Jan 19, 2015

dwbutler commented Jan 27, 2015

brodock commented Aug 6, 2015

dwbutler commented Aug 7, 2015

yaneeka commented Aug 26, 2015

leonelgalan commented Feb 16, 2016

genki commented Mar 7, 2016

schneems commented Sep 22, 2016

matthewd commented Sep 22, 2016

rud left a comment

Choose a reason for hiding this comment

abezzub commented Sep 12, 2019

rails-bot bot commented Dec 23, 2019