Broken Postgres clients are released back into the pool, but should be removed #5112
Comments
I just spent a couple of days researching the same issue, where after a DB failover the TypeORM connection would remain in an error state indefinitely. I came to the same conclusion: when releasing a connection back to the pool, any connection-level errors need to be passed to the pg.Pool in order for the connection to be removed from the pool. It seems to be possible to handle this locally within …
Hi @clarkdave, do you have a hint on how this can be mitigated without patching TypeORM or waiting on them to look into your patch?
@clarkdave Hey, thanks!
We're experiencing the same issue: a combination of "Connection terminated unexpectedly" and "Client has encountered a connection error and is not queryable". The PR by @clarkdave looks sound; giving it a try right now to validate it.
@michaelseibt Please update - we're having this issue in production. Please advise.
This didn't work out for us. The error is still being thrown and not caught. Knex is having a similar, maybe even the same, issue: knex/knex#3523. BTW, this is happening in our AWS Lambda environment, using Node 10 & 12. @seriousManual also mentioned a serverless environment. @592da, same for you? Head over to node-postgres for hot updates on the topic: brianc/node-postgres#2112
I'm looking at this a little bit too - there are kinda 2 issues here.
Anyways... I have a real quick POC patch for this here; I'm seeing if someone can apply it & check whether it fixes the issue.
@michaelseibt indeed. Serverless, with Node 12. This is extremely critical for me, since I just migrated the whole application from Docker to serverless... @brianc I will test it, hope it will work. Will update regarding.
We suffered the same running Node v12 on Docker (Fargate infrastructure).
Same issue here.
Any updates on this issue?
Yeah, the tl;dr is: without steps to reproduce and a way to write a test case to simulate the issue, I'm not going to be able to fix it. I responded here, but I don't use Lambda myself, so I'm going to need to look to the community to step up here & submit a patch. Or, at the very least, a reliable way to reproduce locally.
Trouble is, this issue is not reproducible using a classical unit test, as it does not occur on a single machine. The problem is triggered by a machine running Postgres getting disconnected from the network without cleanly shutting down the connection. This does not happen when both processes run under the same kernel. I could somewhat reliably reproduce the issue under Mac by running Postgres on Docker (in a Linux VM) and issuing docker kill postgres, which would trigger a dirty disconnect. That setup is a bit difficult to recreate as an automated test.
Yah mos def - ideally it's automatable... if not, I won't be able to be _sure_ of a fix... though I could potentially hack around until I got it working, nothing is going to guarantee it'll remain working. When you get it to reproduce, does the socket still think it's connected even though it's not? There are some flags (different depending on the version of Node) on net.Socket which indicate whether it's connected or not... are you able to inspect those & see if they're reporting the truth? If they are, I can inspect them inside PG (and likely put them into the same state via mocking or some other means in a test). If they're lying and saying they're still connected when they're not... that's... gonna be more problematic.
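For anyone who wants to poke at those flags after a dirty disconnect, here's a minimal debugging sketch. It assumes node-postgres's undocumented `client.connection.stream` internal (not a stable API), and exactly which flags exist varies by Node version:

```ts
import { Client } from "pg";
import type { Socket } from "net";

function inspectSocket(client: Client): void {
  // `connection.stream` is an undocumented internal of node-postgres,
  // so we cast through `any`; treat this as a debugging aid only.
  const stream = (client as any).connection?.stream as Socket | undefined;
  if (!stream) {
    console.log("no underlying stream found");
    return;
  }
  console.log({
    destroyed: stream.destroyed,   // true once the socket is torn down locally
    readyState: stream.readyState, // may still report 'open' after a dirty remote disconnect
    connecting: stream.connecting, // true only while the initial connect is in flight
  });
}
```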
I am not entirely convinced that 'pg' is the right layer for fixing this. One possible workaround is to have TypeORM pass any connection-level errors back to 'pg', so that the broken connections can be removed from the pool. This behavior is not universally correct, though: it would also close connections that hit, for example, a unique constraint violation, which can be the wrong thing to do if your code relies heavily on ignoring such errors at a high rate. Maybe there's a reliable way to tell 'connection interrupted' errors apart from more everyday database errors, but I haven't looked into it any deeper. Here's a link to my workaround for this particular issue. It successfully cleans up dead connections after a database failover on AWS Aurora Postgres; without the workaround, they hung around forever.
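As a starting point for that distinction, here's one possible heuristic. It is an assumption rather than anything pg guarantees: network-level failures carry syscall-style codes, server-side query errors carry five-character SQLSTATE codes, and SQLSTATE classes 08 (connection exception) and 57 (operator intervention, e.g. 57P01 admin_shutdown) mean the connection is no longer usable:

```ts
// Syscall-style codes attached to network failures by Node itself.
const NETWORK_CODES = new Set(["ECONNRESET", "EPIPE", "ETIMEDOUT", "ECONNREFUSED"]);
// SQLSTATE classes that imply the connection itself is dead or dying.
const FATAL_SQLSTATE_CLASSES = new Set(["08", "57"]);

function isConnectionError(err: { code?: string; message?: string }): boolean {
  if (!err.code) {
    // pg raises messages like this for clients that are no longer queryable.
    return /not queryable|Connection terminated/i.test(err.message ?? "");
  }
  if (NETWORK_CODES.has(err.code)) return true;
  return err.code.length === 5 && FATAL_SQLSTATE_CLASSES.has(err.code.slice(0, 2));
}
```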
I finally remembered why I thought that this is not a pg-layer issue. pg-pool explicitly removes any error handlers when it hands out a connection to a client, leaving error handling as the responsibility of TypeORM in this case. So it seems logical to assume that a client would report back any errors it got when it returns the connection, and there is a mechanism in place for that. Alternatively, for pg to reliably notice a broken connection, it would need to run a health-check command (like SELECT 1) on any idle connections returned to the pool. This is what most connection pools in java-land support, for example, so it would be very helpful.
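A minimal sketch of that java-land "validation query" pattern, layered on top of pg.Pool as a wrapper (pg itself doesn't do this; the helper name and retry bound are assumptions):

```ts
import { Pool, PoolClient } from "pg";

const pool = new Pool();

// Retry a few times so one dead idle client doesn't fail the whole request,
// but still fail loudly if the pool is entirely broken.
async function connectValidated(): Promise<PoolClient> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const client = await pool.connect();
    try {
      await client.query("SELECT 1"); // cheap validation query on the idle client
      return client;
    } catch (err) {
      // Passing the error to release() tells pg-pool to destroy this client
      // instead of putting it back into the pool.
      client.release(err as Error);
    }
  }
  throw new Error("could not obtain a healthy client from the pool");
}
```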
Any update on this issue? I have updated my code in production to Node 10.x using Serverless and AWS Aurora with a Postgres database, and it is giving me this error.
What worked for us is to check for broken connections at the beginning of every invocation. This will run a simple query, and if that fails, will reconnect to the database. Call this method from inside the event handler function (a sketch of the idea follows below).
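The commenter's original snippet isn't shown here; below is a hedged sketch of what such a check might look like, assuming TypeORM's pre-0.3 Connection API (getConnection / query / close / connect). The helper name is hypothetical, and this simplified variant probes a single pooled client rather than every client in the pool:

```ts
import { getConnection } from "typeorm";

export async function ensureDatabaseConnection(): Promise<void> {
  const connection = getConnection();
  try {
    // Probe query; throws quickly if the pooled client's socket is dead.
    await connection.query("SELECT 1");
  } catch {
    // Tear down whatever is left of the old connection and rebuild it.
    await connection.close().catch(() => undefined);
    await connection.connect();
  }
}

// Usage inside the Lambda handler, before any real queries:
// export const handler = async (event) => {
//   await ensureDatabaseConnection();
//   ...
// };
```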
(FYI, @brianc - maybe that helps for the node-postgres side of the story :-) )
Actually, for non-serverless too.
I've opened PR #7792, which passes the error back to the Postgres pool when releasing connections. However, for the reasons listed above, it's pretty difficult to test this beyond manually doing things. We also don't really do mocks. Anyone else want to test it, too?
This extends github PR typeorm#7792 to pass the error, if any, to the release callback from `pg-pool`. This should be done to ensure the connection is removed from the pool, as described in typeorm#5112.
Issue type:
[x] bug report

Database system/driver:
[x] postgres

It may impact other drivers if they have similar semantics/expectations as pg.Pool.

TypeORM version:
[x] latest
Explanation of the problem
Currently in TypeORM, if a client suffers an unrecoverable error - for example, if the underlying connection goes away, such as during a DB failover - there is no protection in place to stop that client being added back into the pg.Pool. This broken client will then be handed out in future even though it'll never be able to execute a successful query.

Although pg.Pool itself does listen for error events on clients within the pool - and actively removes any which do emit errors - it doesn't catch everything. It is considered the responsibility of the user of the pool to release known broken clients by calling client.release(true). The truthy argument tells the pool to destroy the connection instead of adding it back to the pool: https://node-postgres.com/api/pool#releaseCallback
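For reference, a minimal sketch of that pattern against the documented pg.Pool API (destroying on any query error, which is stricter than strictly necessary):

```ts
import { Pool } from "pg";

const pool = new Pool();

async function example(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("SELECT now()");
    client.release(); // healthy: hand the client back to the pool
  } catch (err) {
    client.release(true); // broken or suspect: destroy instead of reusing
    throw err;
  }
}
```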
If a client's connection does break, it's very difficult to debug, as a handful of queries will begin failing with an error like Error: Client has encountered a connection error and is not queryable, while others will continue executing fine.

There is further discussion about the impact of this in the node-postgres repo: brianc/node-postgres#1942

Steps to reproduce or a small repository showing the problem
Break a client's connection from outside the application - for example, find the backend process with ps aux | grep postgres | grep SELECT and kill it. When the resulting query fails, TypeORM should tell the pg.Pool the connection is broken so it can be removed from the pool, so the next query should get a new, unbroken, connection.

I believe the reason this hasn't been noticed before (at least not that I could see) is because it's really only likely to happen if the actual database connection breaks. The majority of QueryFailedErrors are caused by dodgy SQL etc, none of which will render a client unusable. And, usually, if your database is killing connections, you've got other problems to think about 😅

We only noticed it because we run PgBouncer in between TypeORM and our Postgres server. When we redeployed PgBouncer, it would kill some of the active client connections in the pool, but because pg.Pool never found out about it, those connections remained in the pool indefinitely, causing a steady stream of errors even though everything else was fine.

Fix
I have a working fix here: loyaltylion@6bd52e0
If this fix looks suitable, I'd be happy to create a PR to get this merged. It only applies to Postgres, but it could be extended to other drivers if we think they'd benefit.
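For readers who don't follow the commit link, here is a rough sketch of the general idea (an assumed shape, not the actual patch, which lives in TypeORM's Postgres driver): remember any 'error' event the client emits while checked out, and pass that error to pg-pool's release callback so the broken client is destroyed:

```ts
import { Pool, PoolClient } from "pg";

const pool = new Pool();

async function withClient<T>(work: (client: PoolClient) => Promise<T>): Promise<T> {
  const client = await pool.connect();
  let connectionError: Error | undefined;
  // pg clients emit 'error' for connection-level failures (e.g. a dropped
  // socket); query-level failures are rejected promises and won't fire this.
  const onError = (err: Error) => {
    connectionError = err;
  };
  client.on("error", onError);
  try {
    return await work(client);
  } finally {
    client.removeListener("error", onError);
    // A truthy argument makes pg-pool destroy the client;
    // `undefined` returns it to the pool as usual.
    client.release(connectionError);
  }
}
```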