Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1627: prepared statement pool can be corrupted by exception #17607

Closed
wants to merge 1 commit into from

Conversation

abezzub
Copy link

@abezzub abezzub commented Nov 13, 2014

This PR should fix #1627

Problem:

If error is raised in the PostgreSQL adapter at the right time then statement pool is left in the inconsistent state and process has to be restarted. The only solution without this fix is to turn off prepared statements.

Explanation:

My app uses Heroku and they suggest using their version of timeout gem, that raises exception in the main thread if request takes too long:

https://github.com/heroku/rack-timeout/blob/master/lib/rack/timeout.rb#L91-L108

The "prepared statement already exists" error happens in following method in postgres adapter: https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L864-L878

Heroku's gem created timeout exception just at the right time and it corrupted the state of the prepared statements cache. Let's assume timeout exception happens right after @connection.prepare nextkey, sql finishes successfully:

        def prepare_statement(sql)
          sql_key = sql_key(sql)
          unless @statements.key? sql_key
            nextkey = @statements.next_key
            begin
              @connection.prepare nextkey, sql
            rescue => e
              raise translate_exception_class(e, sql)
            end

            # timeout happens here!

            # Clear the queue
            @connection.get_last_result
            @statements[sql_key] = nextkey # next key is "reserved" here
          end
          @statements[sql_key]
        end

then Postgres has prepared statement, but @statements was not updated and does not know that nextkey was already used. So the next time when prepare_statement is going to be called it is guaranteed to fail because @statements.next_key will return the same nextkey that was already used in PG. Current version of code that determines what next_key is:

        def next_key
          "a#{@counter + 1}"
        end

        def []=(sql, key)
          while @max <= cache.size
            dealloc(cache.shift.last)
          end
          @counter += 1
          cache[sql] = key
        end

If nextkey is reserved before @connection.prepare nextkey, sql is started then the issue will be fixed. In this case my fix is to just generate new next_key every time adapter is asking for it.

Timeout exception that caused problems from my app's log:

Rack::Timeout::RequestTimeoutError (Request ran for longer than 25 seconds.): 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:537:in `dealloc' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:512:in `[]=' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:875:in `prepare_statement' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:826:in `exec_cache' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql/database_statements.rb:138:in `exec_query' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/postgresql_adapter.rb:954:in `select' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/database_statements.rb:24:in `select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `block in select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:83:in `cache_sql' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `select_all' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:265:in `execute_simple_calculation' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:227:in `perform_calculation' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:119:in `calculate' 
vendor/bundle/ruby/2.1.0/gems/activerecord-4.1.6/lib/active_record/relation/calculations.rb:75:in `sum' 

@rud
Copy link
Contributor

rud commented Nov 13, 2014

As stated in #1627 (comment), looks good to me.

@abezzub abezzub changed the title Fix for prepared statement already exists Fix for prepared statement already exists (Postgres) Nov 13, 2014
@rud
Copy link
Contributor

rud commented Nov 13, 2014

Even better with the added test.

* Make sure prepared statements cache is left in consistent state if exception is raised
in PostgreSQL adapter (#1627).

*Andrew Bezzub*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newest entry for the CHANGELOG should go to the top. In addition, please add a new line for the issue that is being fixed to ensure that the format is consistent with the other entries. Thanks 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done :)

@spastorino
Copy link
Contributor

Haven't been following the discussion but from what I read in the changelog I'd expect a test with an exception and checking for a consistent state.

@abezzub
Copy link
Author

abezzub commented Nov 14, 2014

@spastorino I copied the explanation to this PR.

The problem with writing a test with exception is that this bug is not deterministic, it happens only if exception is raised at exactly the right time (which does not prevent it from happening regularly in production of my app). Test I added in this PR guarantees that the statement pool is going to be in consistent state if implementation of prepare_statement method does not change. This is not ideal, but I don't know how to write a better test because this bug is non-deterministic. Let me know if you have any suggestions.

@abezzub
Copy link
Author

abezzub commented Nov 14, 2014

And I would argue that the bug is in implementation of the StatementPool and not in the postgres adapter. I think there might be a way to add a test, but it would still depend on internal implementation details. It could look like this:

      class PostgreSQLAdapter < AbstractAdapter
        class StatementPool < ConnectionAdapters::StatementPool
          def []=(sql, key)
            # do nothing to test a case when statement is not cached due to exception
          end
        end
      end

      def test_exec_with_prepared_statement_works_if_statemnt_is_not_cached
        @connection.exec_query('SELECT 1')
        @connection.exec_query('SELECT 1')
        assert true, 'this test should not raise error'
      end

Thoughts?

@abezzub abezzub changed the title Fix for prepared statement already exists (Postgres) Fix #1627: prepared statement pool can be corrupted by exception Nov 18, 2014
@abezzub
Copy link
Author

abezzub commented Nov 18, 2014

@tenderlove, looks like you authored original statement pool code, can you please look at this?

@spastorino
Copy link
Contributor

Ok, let's wait for Aaron then

@matthewd
Copy link
Member

https://github.com/heroku/rack-timeout#timing-out-inherently-unsafe

Your proposed change replaces this particular error with a memory leak. But it doesn't address the general problem: there are all sorts of places where it's assumed that if one line of code completes normally, the following line will then be executed.

Rephrased as "If I monkey-patch one of these methods to raise a new and exciting exception..", does it sound like a reasonable thing for us to accommodate? I'm really not sure... but doubtful.

I don't specifically object to this particular change, but I'm not convinced this is anywhere near the only place that needs help to work properly in spite of rack-timeout's sabotage.

@abezzub
Copy link
Author

abezzub commented Nov 18, 2014

Your proposed change replaces this particular error with a memory leak.

By memory leak you mean that PostgreSQL can have prepared statement that Rails would not know about? I think this is better than the crash, but I'll try to work on a better fix that does not cause memory leak.

there are all sorts of places where it's assumed that if one line of code completes normally, the following line will then be executed.

I agree that timeouts should be fixed (and then there would be no way to get into this state), but I am sure there are thousands of apps out there that timeout regularly. Even if your app does not regularly timeout there is a chance that when it does timeout for the first time (for example because of new bug) Rails will crash the server. Plus I am not monkey patching Rails to get into this state, I am using official Heroku gem that Heroku suggests everybody to use. I am not saying we should go and spend all our time trying to fix these things, but having some of them fixed is better than having none fixed. As a developer I always ask "what would happen if this code raised an error?". There is no way to know all places where program can fail, so it might help to be a bit more defensive and clear critical resources if something can go wrong.

I understand that there can be many places with such problem, but this is not a reason to ignore the problem. I think in the long run it would be nice to have all such bugs fixed. Developers expect Rails to leave server in the consistent state and in this use case it does not. Any Rails app that runs on Heroku cannot use prepared statements until this problem is fixed.

@matthewd
Copy link
Member

I'll try to work on a better fix that does not cause memory leak.

I'm pretty sure that would be unmergeably ugly, given that it's protecting against an impossible situation. A minor rearrangement of existing code is palatable... extra behaviour, not so much.

I am using official Heroku gem that Heroku suggests everybody to use

If Heroku are recommending that everyone use a gem whose own readme describes it (quite rightly) as "Inherently Unsafe", that sounds like a matter for them to address. @schneems? @kch?

@kch
Copy link
Contributor

kch commented Nov 18, 2014

There's some discussion about this in previous rack-timeout issues and some allusions in the readme. There's no general solution to this problem, so it's left to users to fix their particular cases. That said, pg problems are fairly common so I'd love to eventually see a gem that addresses those.

At the moment, the solution is to use the observer hooks to do state cleanup, like resetting database connections or whatever is required to fix prepared statements.

Another option is to use Thread.handle_interrupt to prevent exceptions from being raised inside critical sections (like around prepare_statement maybe). (It's a bit mind-bending at first but there are reasonable examples in the ruby docs.)

  The fix is for prepare_statement method:

    def prepare_statement(sql)
      sql_key = sql_key(sql)
      unless @statements.key? sql_key
        nextkey = @statements.next_key
        begin
          @connection.prepare nextkey, sql
        rescue => e
          raise translate_exception_class(e, sql)
        end
        # Clear the queue
        @connection.get_last_result
        @Statements[sql_key] = nextkey
      end
      @Statements[sql_key]
    end

  If error happens anywhere after "@connection.prepare nextkey, sql"
  and before "@Statements[sql_key] = nextkey" then connection is
  left in the inconsistent state and every query that was not
  prepared will fail (it will be prepared in PostgreSQL but cache
  won't know that key is reserved). This fix changes next_key
  to always return different value on every call, so if error is
  raised before statement is cached then next statemnt will not fail.
@abezzub
Copy link
Author

abezzub commented Nov 29, 2014

At the moment, the solution is to use the observer hooks to do state cleanup, like resetting database connections or whatever is required to fix prepared statements.

What hooks should be used? Can someone please point me to the docs/elaborate on the solution?

What is the concern about current fix? It solves the crash in adapter that is described in description of this PR.

@tenderlove
Copy link
Member

Actually, this is the perfect case for Thread.handle_interrupt IMO. Something like this:

Thread.handle_interrupt(ExeptionClass => :never) {
  # do statement stuff
}

I'm not sure what ExceptionClass should be though. The Rack timeout error eventually inherits from RuntimeError, and preventing a RuntimeError seems scary. (We can't directly reference the TimeoutError constant in Rack timeout without introducing a dependency on rack runtime).

Maybe we can add a hook point for Rack runtime so that it can wrap the statement cache updates with a handle_interrupt call?

@rud
Copy link
Contributor

rud commented Jan 14, 2015

@abezzub this no longer cleanly applies, would you be willing to rebase if it's relevant?

@tenderlove
Copy link
Member

One thing I want to mention is that this isn't the only cache in Rails. We can fix this one, but since Rack::Timeout can send a kill signal to your app at anytime, this will basically be a never ending game of whack-a-mole.

If Heroku are recommending that everyone use a gem whose own readme describes it (quite rightly) as "Inherently Unsafe", that sounds like a matter for them to address.

I have to agree with @matthewd here. /cc @schneems @hone

@abezzub
Copy link
Author

abezzub commented Jan 15, 2015

this will basically be a never ending game of whack-a-mole.

Yeah, but what is the alternative? Is it OK to accept that Rails can randomly crash if timeout happens?

@matthewd
Copy link
Member

Is it OK to accept that Rails can randomly crash if timeout happens?

Is it OK to accept that Rails can randomly crash if you deliberately force an arbitrary stack unwind by injecting an exception?

Sounds fair to me.

I absolutely expect us to handle, to the best of our ability, any reasonable action the operating system may perform -- up to and including a SIGKILL. But accepting that our own process may choose to actively sabotage our behaviour, and then inserting just_please_not_this_bit blocks? 😒

@abezzub
Copy link
Author

abezzub commented Jan 15, 2015

Yeah, I see your point and I can agree that this is something Heroku should address. But... Many people who run Rails on Heroku get this error (as you can see by reading comments in #1627) and this error is hard to debug. So is there a reason to not apply this one line fix for this particular problem?

@kch
Copy link
Contributor

kch commented Jan 18, 2015

Well, the "inherently" in "inherently unsafe" is important. There's not a lot that can be done about it. Forced timeouts should be an exceptional situation, if one relies on them consistently for application performance, then they're running a very broken application.

Again, the solution is to clean up any broken state using rackt-timeout's hooks.

I'd go so far as to recommend only using rack-timeout while in the midst of debugging issues with long-running requests. Ideally, anything that can block or otherwise unpredictably delay a request should happen asynchronously in a background process, or alternatively be handled with a custom-tailored timeout mechanism that kicks in before rack-timeout.

@rud
Copy link
Contributor

rud commented Jan 18, 2015

Issue #1627 was opened back in 2011 to hopefully fix a strange corner-case. After a long time of nobody knowing what's going on the correct sequence of events was puzzled out by @abezzub and a minimal patch was provided here for the specific issue observed.

Everybody agrees this is probably not the last issue of this kind, as the timeout gem injected by Heroku at runtime is a terrible terrible idea.

I'm not generally a fan of whack-a-mole patches, but in this particular case it's a trivial fix making it possible to use the statement cache (enabled by default) with one fewer error-scenario.

To me there's a few possible next steps:

  1. AR 3.1 / pgbouncer / PostgreSQL issue with prepared statements #1627 is unfixable in the Rails project as it's caused by an unsupported configuration (the inclusion of the timeout gem) and should be closed as wontfix 😞
  2. or a variant of this small fix is included
  3. alternatively, patch is rewritten using Thread.handle_interrupt but essentially having the same behaviour (ensure @count is updated before each prepared statement is sent to PostgreSQL.

Personally I think 1. is the sad outcome, 3. is extra work with little benefit (and no clear design seems to be emerging).

If more test-cases are what's holding back a merge I'd be happy to provide a "fails always" test-case that this diff fixes. Would that at all be helpful?

@kch
Copy link
Contributor

kch commented Jan 18, 2015

FYI Heroku doesn't inject anything. The gem is just often recommended.

Sent from my iPhone

On 18 Jan 2015, at 07:12, Laust Rud Jacobsen notifications@github.com wrote:

Issue #1627 was opened back in 2011 to hopefully fix a strange corner-case. After a long time of nobody knowing what's going on the correct sequence of events was puzzled out by @abezzub and a minimal patch was provided here for the specific issue observed.

Everybody agrees this is probably not the last issue of this kind, as the timeout gem injected by Heroku at runtime is a terrible terrible idea.

I'm not generally a fan of whack-a-mole patches, but in this particular case it's a trivial fix making it possible to use the statement cache (enabled by default) with one fewer error-scenario.

To me there's a few possible next steps:

  1. AR 3.1 / pgbouncer / PostgreSQL issue with prepared statements #1627 is unfixable in the Rails project as it's caused by an unsupported configuration (the inclusion of the timeout gem) and should be closed as wontfix
  2. or a variant of this small fix is included
  3. alternatively, patch is rewritten using Thread.handle_interrupt but essentially having the same behaviour (ensure @count is updated before each prepared statement is sent to PostgreSQL.

Personally I think 1. is the sad outcome, 3. is extra work with little benefit (and no clear design seems to be emerging).

If more test-cases are what's holding back a merge I'd be happy to provide a "fails always" test-case that this diff fixes. Would that at all be helpful?


Reply to this email directly or view it on GitHub.

@rud
Copy link
Contributor

rud commented Jan 18, 2015

@kch my mistake, I did not verify that. Any thoughts on a good path forward?

@kch
Copy link
Contributor

kch commented Jan 19, 2015

@rud I don't feel particularly comfortable opining on what's best for rails. I've commented before on the purpose of rack-timeout and potential avenues for recovering from state corruption.

What rack-timeout does is basically evil and harmful, but often it's the lesser evil over an app being completely unresponsive and unusable.

@lfittl
Copy link

lfittl commented Jan 19, 2015

@tenderlove In my experience these kind of timeouts are not caused by timeouts when connecting to the database, rather by slow queries and/or queries that run into lock contention.

Especially with Rails default handling of updated_at fields and counter_cache you run into lock contention on heavily updated tables (e.g. users, posts, etc) quite easily when operating at scale.

Whilst rack-timeout is an imperfect solution, it'd be nice if Rails could handle the worst case a bit better. We run http://www.producthunt.com/ on Heroku and have been bitten by this exact issue.

@dwbutler
Copy link
Contributor

We have run into this issue several times a week since upgrading to Rails 4.1. I find it very unlikely that our application would coincidentally time out on that exact line of code. As @tenderlove speculates, it's possible that the call to @connection.prepare is not configured to time out properly.

@brodock
Copy link

brodock commented Aug 6, 2015

I'm experiencing the same behavior on Rails 4.2. Are there any new discussion about this issue going on?

@dwbutler
Copy link
Contributor

dwbutler commented Aug 7, 2015

@brodock I don't think this issue has gotten any attention. In my experience, you can simply turn off prepared statements, and the performance impact is negligible.

@yaneeka
Copy link

yaneeka commented Aug 26, 2015

In case someone gets here and needs info on how to disable prepared statements:

http://stackoverflow.com/questions/26146436/how-to-fix-pgduplicatepstatement-error

@leonelgalan
Copy link
Contributor

The affected code changes a bit in Rails 5.0 (at least on 5.0.0.beta2):

active_record/connection_adapters/postgresql_adapter.rb#L177

        def []=(sql, key)
          super.tap { @counter += 1 }
        end

Apparently []= already does what it needs, simply dropping the .tap should do the trick, since @counter would be incremented in next_key.

@genki
Copy link

genki commented Mar 7, 2016

Confirmed it fixes the issue at my project.

@schneems
Copy link
Member

There is a more recent PR #25827 (comment) and a comment on there about an alternative way we could fix the issue. Thinking we should close this PR as it's not going to get merged. If someone wants to take a stab at the alternative implementation mentioned by Sean, he would be the one reviewing and accepting the PR so there's a pretty good chance going with his approach would be accepted.

@matthewd
Copy link
Member

@schneems my (pretty widely documented, at this point) core objection isn't to how we'd fix this, but to us attempting to fix it at all.

Copy link
Contributor

@rud rud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good - maybe update the test as suggested?

@abezzub
Copy link
Author

abezzub commented Sep 12, 2019

@rud I can update it if it is going to be applied, I was under the impression that Rails team didn't want this to be merged.

@rails-bot
Copy link

rails-bot bot commented Dec 23, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AR 3.1 / pgbouncer / PostgreSQL issue with prepared statements