
Use UUIDs as IDs for prepared statements in Postgresql #25827

Closed
wants to merge 1 commit

Conversation

@dv (Contributor) commented Jul 14, 2016

Summary

Due to certain third-party interrupts (e.g. through Rack::Timeout), it's possible for a prepared statement to be generated in PostgreSQL but never "stored" in Rails. Since the code was interrupted before storing the statement, the @counter variable was never incremented, even though it was already used to generate a prepared statement.

The next time Rails tries to generate a prepared statement, the generated name is identical to that of the previous "orphaned" prepared statement, since the same @counter value was used. This culminates in hundreds of identical "prepared statement already exists" errors, as Rails is unable to move past this one already-in-use name.

This update fixes the problem by using a UUID as the prepared-statement name instead of relying on the @counter.
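The failure mode can be sketched like this (a minimal illustration with hypothetical class and method names, not the actual Rails source; the block stands in for the network round-trip to Postgres):

```ruby
# Illustrative sketch of the bug: the statement name is derived from
# @counter, which is only incremented AFTER the server-side PREPARE
# succeeds and the name is cached. An interrupt between those steps
# orphans the name on the server while Rails keeps regenerating it.
class CounterStatementPool
  def initialize
    @counter = 0
    @cache = {}
  end

  def next_key
    "a#{@counter + 1}"
  end

  # The block stands in for the PG round-trip; it may raise mid-flight
  # (e.g. a Rack::Timeout interrupt) after the server created the statement.
  def prepare(sql)
    key = next_key
    yield key            # server now owns the name...
    @cache[sql] = key    # ...never reached if the yield raised,
    @counter += 1        # so next_key returns the same name forever
    key
  end
end

pool = CounterStatementPool.new
begin
  pool.prepare("SELECT 1") { |key| raise "interrupted after PREPARE #{key}" }
rescue RuntimeError
end
pool.next_key # => "a1" again: "a1" already exists server-side, so the
              # next PREPARE fails with "prepared statement already exists"
```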

@rails-bot

Thanks for the pull request, and welcome! The Rails team is excited to review your changes, and you should hear from @schneems (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rafaelfranca (Member)

This looks good to me.

@sgrif @matthewd thoughts?

@matthewd (Member) commented Jul 17, 2016

It seems pointlessly complex to me*.

If we want to work in the face of arbitrary exceptions, we could just fix the method to increment the counter earlier (I'm pretty sure there's a PR open that does that)... but we haven't, because we don't want to claim that we expect or intend to function in such a hostile environment.

cc @tenderlove

* Yes, it's only one line of code... but a randomly generated UUID is substantially more complex than a counter, IMO.

@rafaelfranca (Member)

Found the PR #17607

@dv (Contributor, Author) commented Jul 18, 2016

My opinion:

In theory, yes: we can't fix every spot where a Timeout might create unrecoverable problems.

However, in practice I think this one spot is the cause of the majority of them. A lot of (most?) timeouts happen when the database is slow to respond. That one line of code, generating the prepared statement, is a spot where a lot of the timeouts will occur, since it's the first thing that happens when trying to run a query, and a stuck database will freeze the Rails app right there until the timeout fires.

This spot sits right in the intersection of the sets "code where a timeout happens a lot" and "code that generates unrecoverable errors when a timeout happens". Because of that, I would argue the slippery-slope argument does not hold: this one fix would solve the problem most of the time, and the other timeout problems cited as reasons to freeze work on this are far less likely to actually happen. Whack-a-mole is a silly argument to make when one mole causes 90% of the problems, in my opinion.

Isn't the complexity of generating a UUID insignificant compared to the RTT to the database?

@matthewd (Member)

@dv I think we're agreeing on the facts, but disagreeing on the conclusion. There is a slippery-slope argument to be made, but it's not the main one I'm looking at:

If we're not going to guarantee that Active Record works in the face of arbitrarily injected exceptions, then it seems more dangerous to have it work almost all the time -- that's how people end up with unsupported configurations deployed in production for a long time, and left utterly confused when it does eventually fail.

As for the slippery-slope: there is always going to be a slowest operation that breaks in the face of an unexpected exception, and it will be responsible for most of the observed incidences. This is just a prepare: it involves a network round-trip, but compared to the time spent actually running a query, it's going to be pretty diminutive.

@sgrif (Contributor) commented Jul 18, 2016

FWIW, a much simpler solution would be to attempt to deallocate any existing prepared statement with a given name before preparing it. AFAIK, deallocating a prepared statement that doesn't exist is not an error. (I'm assuming here: we'd have to do it through SQL with a DEALLOCATE command, since libpq, and therefore the pg gem, doesn't expose a function for this, and the SQL documentation doesn't spell out the behavior when the statement is undefined. However, the PG wire protocol documentation says "It is not an error to issue Close against a nonexistent statement or portal name", and there's no reason to expect the pure SQL form to behave differently.)
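A sketch of that idea (names are illustrative, not actual Rails or pg gem internals; `conn` would be a PG::Connection in real code, and whether SQL-level DEALLOCATE errors on a nonexistent name is exactly the open question above, hence the rescue):

```ruby
# Hypothetical sketch of the deallocate-first suggestion: issue a
# DEALLOCATE for the name before PREPARE so an orphaned server-side
# statement with the same name can't cause a collision.
def prepare_with_deallocate(conn, name, sql)
  begin
    conn.exec("DEALLOCATE #{conn.quote_ident(name)}")
  rescue StandardError
    # Statement didn't exist (or couldn't be deallocated); proceed anyway.
  end
  conn.prepare(name, sql)
end
```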

@trey-roadster

Does this issue still exist in Rails 5? Looking at the latest source, it seems like the @counter is still being used to generate the keys.

Having finally found this year-old thread which explains our intermittent production issues, I have to say the arguments for not merging this are pretty weak:

  1. whack-a-mole - This is basically saying that if we can't fix 100% of the problem, we don't want to fix 95%+ of the problem with a trivial change. This purity is hard to accept for someone trying to serve real customers in the real world with this software. I appreciate the argument could be valid in other cases (like the fix is a 6 page hack), but here we have a dirt-simple, arguably better change with low cost, low risk for immediate large benefit. When perfect isn't on the menu, choose better over worse.

  2. "More complex" - I really don't get this. Big numbers are harder than little numbers? In what cases are people dealing with these keys such that the "complexity" would be an issue? The additional code complexity is a nit. The conceptual complexity is negligible (UUIDs are common and easy to understand). Is this complexity really an issue that weighs well against fixing a confusing, intermittent problem faced by lots of users?

I really see no cost to just fixing it and making things better. Thank you for any reconsideration.

@dv (Contributor, Author) commented Jul 31, 2017

I don't understand the complexity argument at all: the patch removes code, makes it easier to read and reason about, and removes state.

@sgrif, your solution adds more code and more complexity, though; in my opinion, that's not simpler.

@machty (Contributor) commented Sep 12, 2018

@trey-roadster we're on Rails 5.0.6 (long overdue for an upgrade) and we just ran into this issue.

I haven't dug too deeply but I believe the UUID approach proposed in this PR would elegantly solve the issue with minimal complexity / corner cases.

@mhuggins

Almost 3 years later on this PR, and it looks like there's no input from the core team? This seems like an elegant solution to a common problem.

@xtagon commented Apr 24, 2019

This is happening to us on a Rails 5.1.6 project as well, and when it happens it's pretty devastating. I understand the argument that Rack::Timeout is unsafe, but in our case the alternative to using it is that timeouts frequently cascade, because they tie up Rails workers until they finish even if the client will never receive the response. If this isn't the right PR to solve this problem, then what is the recommended solution?

@trey-roadster

We still run into it on Rails 5.2, and have decided to try simply turning off prepared statements.
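For anyone needing that workaround: prepared statements can be turned off per connection in config/database.yml via Active Record's `prepared_statements` option (the database name below is illustrative):

```yaml
# config/database.yml (illustrative): disable server-side prepared
# statements for the PostgreSQL adapter. Queries are then sent as plain
# parameterized SQL, so orphaned-statement name collisions can't occur,
# at the cost of re-planning each query.
production:
  adapter: postgresql
  database: myapp_production   # hypothetical database name
  prepared_statements: false
```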

@mhuggins

We resolved this by monkey-patching in a Rails initializer. I don't understand the resistance to merging a simple fix that doesn't complicate the implementation while also addressing a legitimate issue hit by many users.

@xtagon commented Apr 24, 2019

@mhuggins Would you be willing to share your initializer patch?

@mhuggins

Here's my monkey-patch. Please note that this is for a Rails 4.2.x project, so changes might be needed for it to work in Rails 5.x or the 6 beta. I simply placed this in config/initializers/active_record.rb:

# Monkey-patch to prevent prepared statements from failing due to thread interruption
# caused by rack-timeout.
#
# Issue details:
#  * https://github.com/rails/rails/issues/22408
#  * https://github.com/rails/rails/pull/17607
#
# Code taken from:
#  * https://github.com/rails/rails/pull/25827/files
module ActiveRecord
  module ConnectionAdapters
    class PostgreSQLAdapter < AbstractAdapter
      class StatementPool < ConnectionAdapters::StatementPool
        def next_key
          "a#{SecureRandom.uuid.delete('-')}"
        end
      end
    end
  end
end

@xtagon commented Apr 24, 2019

Thanks! I will see if it can be adapted to Rails 5.

@andreynering (Contributor)

We also regularly face this problem.

Any chance on having a proper fix merged soon?

@gregmolnar (Member)

@schneems @tenderlove Any thoughts on this? It seems like quite an old issue and would be nice to have a fix for it.

@rails-bot bot commented Dec 17, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@rails-bot bot added the stale label Dec 17, 2019
@andreynering (Contributor)

@rails-bot I think it's still worth fixing this bug.

@rails-bot bot removed the stale label Dec 17, 2019
@rafaelfranca (Member)

After much consideration, I will back @matthewd's argument at #25827 (comment) and close this PR. We don't want to claim that we expect or intend to function in such a hostile environment.

@mhuggins

This feels really short-sighted.

@masterkain (Contributor)

I'm running into this using the AWS Aurora Serverless Postgres offering, where the database autoscales CUs.

@uberllama (Contributor)

> I'm running into this using the AWS Aurora Serverless Postgres offering, where the database autoscales CUs.

Is this also an issue when using multi-AZ databases in AWS? I wonder if Aurora's version of Postgres is doing replication differently.

@janko (Contributor) commented Oct 27, 2020

Why not just turn it around, and first increment the counter, then prepare the statement? That seems much more interrupt-safe to me.

FWIW, that's how Sequel does it. In the code, the counter is first incremented, then the prepared statement name is formed, then the prepared statement is created, and then it's stored in a hash.
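A sketch of that interrupt-safer ordering (hypothetical names, not Sequel's or Rails' actual code; the block stands in for the PG round-trip):

```ruby
# Increment-first ordering: the counter is bumped BEFORE the server
# round-trip, so an interrupt mid-PREPARE merely wastes one name
# instead of reusing an orphaned one.
class IncrementFirstPool
  def initialize
    @counter = 0
    @cache = {}
  end

  # The block stands in for the PG round-trip and may raise.
  def prepare(sql)
    @counter += 1        # claim the name up front
    key = "a#{@counter}"
    yield key            # if this raises, the name is orphaned on the
    @cache[sql] = key    # server, but it will never be generated again
    key
  end
end

pool = IncrementFirstPool.new
begin
  pool.prepare("SELECT 1") { |key| raise "interrupted after PREPARE #{key}" }
rescue RuntimeError
end
pool.prepare("SELECT 1") { |key| key } # => "a2", so no name collision
```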

@wbharding (Contributor)

I created PR #41034, which increments the counter before preparing the statement. This would still allow the possibility that a Postgres prepared statement could be "orphaned" (i.e., Rails wouldn't know to remove it), but at least that wouldn't leave the app in a state of indefinite failure. 🤞 that the Rails team is receptive to getting this fixed these days.

wbharding added a commit to wbharding/rails that referenced this pull request Jan 6, 2021
…he query, so if the query gets interrupted without Rails knowledge we don't end up with a crashed app in a state of perpetual failure to prepare the next query.

This commit directly inspired by the conversation at rails#25827 (and by a late night crash of our app due to this issue).
@jose-airspace commented Mar 17, 2021

@rafaelfranca, a hostile environment? Timeouts happen; we can attempt to improve performance, but even then they will still happen. This wouldn't be accounting for functioning in a hostile environment; it would be accounting for the reality that exceptions and timeouts happen. It seems completely ridiculous that a single request timeout can take down the entire application for all requests. The proposed solution is simple and removes the possibility of this error. I'm not sure how a counter that needs to be incremented is simpler than a UUID... @matthewd
