
Broken Postgres clients are released back into the pool, but should be removed #5112

Closed
clarkdave opened this issue Nov 19, 2019 · 25 comments · Fixed by #7792

Comments

@clarkdave

clarkdave commented Nov 19, 2019

Issue type:

[x] bug report

Database system/driver:

[x] postgres

It may impact other drivers if they have similar semantics/expectations as pg.Pool.

TypeORM version:

[x] latest

Explanation of the problem

Currently in TypeORM, if a client suffers an unrecoverable error - for example, if the underlying connection goes away during a DB failover - there is no protection in place to stop that client from being added back into the pg.Pool. The broken client will then be handed out again in the future, even though it will never be able to execute a successful query.

Although pg.Pool itself does listen for error events on clients within the pool - and actively removes any which do emit errors - it doesn't catch everything. It is considered the responsibility of the user of the pool to release known broken clients by calling client.release(true). The truthy argument tells the pool to destroy the connection instead of adding it back to the pool.

https://node-postgres.com/api/pool#releaseCallback

The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.
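The destroy-versus-return behavior described there can be sketched with a toy stand-in (TinyPool below is purely illustrative, not the real pg.Pool API):

```javascript
// Toy stand-in for pg.Pool's release semantics: a truthy error passed to
// release() destroys the client instead of returning it to the idle pool.
class TinyPool {
  constructor() {
    this.idle = [];
  }
  acquire() {
    return this.idle.length > 0 ? this.idle.pop() : { destroyed: false };
  }
  release(client, err) {
    if (err) {
      client.destroyed = true; // disconnect & destroy, leaving room for a new client
    } else {
      this.idle.push(client); // healthy client goes back into the pool
    }
  }
}

const pool = new TinyPool();
const client = pool.acquire();
pool.release(client, new Error('Connection terminated unexpectedly'));
console.log(pool.idle.length, client.destroyed); // prints "0 true"
```

The point of the issue is that TypeORM never takes the `err` branch here, so a dead client keeps circulating.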

If a client's connection does break, it's very difficult to debug, as a handful of queries will begin failing with an error like "Error: Client has encountered a connection error and is not queryable" while others will continue executing fine.

There is further discussion about the impact of this in the node-postgres repo: brianc/node-postgres#1942

Steps to reproduce or a small repository showing the problem

  • Set pool size to 1
  • Run a small loop which repeatedly performs a query, e.g.

```javascript
// getManager comes from TypeORM; sleep is a small promise-based delay helper
while (true) {
  try {
    await getManager().query('select pg_sleep(4)')
    console.log('success')
  } catch (err) {
    console.log('error', err)
  }

  console.log('done')
  await sleep(1000)
}
```
  • Wait for the query to run at least once. This should allocate a connection from the pool. Subsequent invocations of the query should use the same connection.
  • Kill the connection while the query is running. On macOS or Linux, this is easily done by finding and killing the backend process, e.g. ps aux | grep postgres | grep SELECT
  • The loop continues to throw errors, because the broken connection has been returned to the pool and keeps being handed out. This shouldn't happen: we should have told pg.Pool the connection is broken so it can be removed from the pool, and the next query should get a new, unbroken connection

I believe the reason this hasn't been noticed before (at least not that I could see) is that it's really only likely to happen if the actual database connection breaks. The majority of QueryFailedErrors are caused by dodgy SQL etc., none of which will render a client unusable. And, usually, if your database is killing connections, you've got other problems to think about 😅

We only noticed it because we run PgBouncer in between TypeORM and our Postgres server. When we redeployed PgBouncer, it would kill some of the active client connections in the pool, but because pg.Pool never found out about it, those connections remained in the pool indefinitely, causing a steady stream of errors even though everything else was fine.

Fix

I have a working fix here: loyaltylion@6bd52e0

If this fix looks suitable, I'd be happy to create a PR to get this merged. It only applies to postgres, but could be extended to other drivers if we think they'd benefit.

@mlni

mlni commented Nov 25, 2019

I just spent a couple of days researching the same issue, where after a DB failover the typeorm connection would remain in error state indefinitely.

I came to the same conclusion: when releasing a connection back to the pool any connection-level errors need to be passed to the pg.Pool in order for the connection to be removed from the pool.

It seems possible to handle this locally within PostgresQueryRunner without changing the publicly visible API, by keeping track of any errors that occur during .query() and then passing them along to the release callback in .release().
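A rough sketch of that idea (names are illustrative, not TypeORM's actual internals): the runner remembers any error thrown during query() and forwards it to the pool's release callback.

```javascript
// Illustrative sketch only: track errors seen during query() and pass them
// to the pool's release callback so a broken client gets destroyed.
class TrackingQueryRunner {
  constructor(client, releaseCallback) {
    this.client = client;
    this.releaseCallback = releaseCallback; // pg.Pool's done(err) callback
    this.lastError = undefined;
  }

  async query(sql) {
    try {
      return await this.client.query(sql);
    } catch (err) {
      this.lastError = err; // remember it for release()
      throw err;
    }
  }

  release() {
    // A truthy argument tells the pool to destroy the client.
    this.releaseCallback(this.lastError);
  }
}
```

A refinement would be to store only connection-level errors, so that ordinary query failures (bad SQL, constraint violations) don't evict healthy clients.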

@seriousManual

Hi,
I'm seeing the same issue in a serverless environment: The error "Connection terminated unexpectedly" pops up and afterwards subsequent queries fail with "Client has encountered a connection error and is not queryable".

@clarkdave Do you have a hint on how this can be mitigated without patching typeorm or waiting on them to look into your patch?

@azigelman

@clarkdave Hey
Do you know if this has been fixed in the official typeorm package?
I can't manage to reproduce it using your instructions on my Mac.
Thought it might have to do with the version; it looks as if the pool is aware of the broken connection and dumps it.

Thanks

@michaelseibt

We're experiencing the same issue, a combination of "Connection terminated unexpectedly" and "Client has encountered a connection error and is not queryable".

The PR by @clarkdave looks sound, giving it a try right now in order to validate it.

@592da

592da commented Feb 26, 2020

@michaelseibt please update - we're having this issue in production
@pleerock this issue is serious
#2623 (comment)
#5387
#5223
#5112

Please advise.

@michaelseibt

michaelseibt commented Feb 26, 2020

This didn't work out for us. The error is still being thrown and not caught by the try-catch block in the fix mentioned above. Despite a lot of digging through code and GitHub issues, we're still trying to figure out what exactly is going on, but I have a hunch that this change could be causing the issue:

brianc/node-postgres#1324

Knex is having a similar, maybe even the same, issue: knex/knex#3523

BTW, this is happening in our AWS Lambda environment, using Node 10 & 12. @seriousManual also mentioned a serverless environment. @592da Same for you?

Head over to node-postgres for hot updates on the topic: brianc/node-postgres#2112

@brianc

brianc commented Feb 26, 2020

I'm looking at this a little bit too - there are kind of 2 issues:

  1. pg doesn't make it easy to ask a client "are you okay?" I'm going to fix this soon and write a bunch of tests for it so we can have a tested, supported, documented way to ask a client about its health. By 'health' I mostly mean: is it actually still connected?
  2. clients can get into broken states while they lie idle in the pool and not know about it themselves, since sockets can be closed by the OS (or, in the case I'm seeing, AWS Lambda) without any notification to the client that anything went wrong. I'm thinking I'll probably rely on the readyState field on a net.Socket to tell if the client is okay... but I'll need to do a bit more investigation here.

Anyways... I have a real quick POC patch for this here; I'm seeing if someone can apply it & check whether it fixes the issue.
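The readyState idea from point 2 might look something like this sketch (the connection.stream path mirrors pg's internals but is an assumption here, not a documented API):

```javascript
// Hypothetical health probe: inspect the underlying socket's readyState to
// decide whether an idle client is still usable. net.Socket.readyState is
// 'open' while the socket is both readable and writable.
function isClientUsable(client) {
  const stream = client && client.connection && client.connection.stream;
  return !!stream && stream.readyState === 'open';
}

console.log(isClientUsable({ connection: { stream: { readyState: 'open' } } }));   // true
console.log(isClientUsable({ connection: { stream: { readyState: 'closed' } } })); // false
```

As brianc notes, this only helps for sockets the OS actually transitioned out of 'open'; it can't detect a peer that vanished without closing the socket.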

@592da

592da commented Feb 26, 2020

@michaelseibt indeed - serverless, with Node 12.
This is causing fatal errors in my production env, and currently I need to re-deploy all the lambdas every few hours...

This is extremely critical for me since I just migrated the whole application from Docker to serverless...
and it hurts :(

@brianc I will test it and hope it works; I will post an update.
Hope you guys will be able to roll out a hotfix - let me know if I can help somehow :)

@592da

592da commented Feb 27, 2020

@brianc I tested, and I'm still having the same issue, even with the Node 8 runtime.
Please advise.

@azigelman

We suffered the same running Node v12 on Docker (Fargate infrastructure)

@GopherJ

GopherJ commented Mar 26, 2020

Same issue here

@osdiab
Contributor

osdiab commented Mar 31, 2020

Any updates on this issue?

@brianc

brianc commented Mar 31, 2020

Any updates on this issue?

yeah, the tl;dr is: without steps to reproduce and a way to write a test case to simulate the issue, I'm not going to be able to fix it. I responded here, but I don't use Lambda myself, so I'm going to need the community to step up & submit a patch - or at the very least a reliable way to reproduce this locally.

@mlni

mlni commented Mar 31, 2020

Trouble is, this issue is not reproducible with a classical unit test, as it does not occur on a single machine. The problem is triggered when the machine running Postgres gets disconnected from the network without cleanly shutting down the connection. This does not happen when both processes run under the same kernel.

I could somewhat reliably reproduce the issue on a Mac by running Postgres in Docker (in a Linux VM) and issuing docker kill postgres, which triggers a dirty disconnect. That setup is a bit difficult to recreate as an automated test.

@brianc

brianc commented Mar 31, 2020 via email

@mlni

mlni commented Mar 31, 2020

I am not entirely convinced that 'pg' is the right layer for fixing this.

One possible workaround for this issue is to have TypeORM pass any connection-level errors back to 'pg', so that the broken connections can be removed from the pool. This behavior is not universally correct, as it would also close connections that hit, for example, a unique constraint, which can be the wrong thing to do if your code relies on triggering and ignoring such errors at a high rate. Maybe there's a reliable way to tell 'connection interrupted' errors apart from everyday database errors, but I haven't looked into it any deeper.
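One heuristic for that distinction - an assumption to validate, not a vetted rule - is that errors raised by the Postgres server carry a SQLSTATE in err.code (class '08' being connection exceptions), while socket-level failures surface as Node system codes or bare "Connection terminated" messages:

```javascript
// Heuristic sketch only. Server-side errors from Postgres carry a SQLSTATE
// in err.code (e.g. '23505' unique_violation, '08006' connection_failure);
// socket-level failures typically have a Node system code like ECONNRESET
// or a message such as "Connection terminated unexpectedly".
const SYSTEM_CODES = new Set(['ECONNRESET', 'ECONNREFUSED', 'EPIPE', 'ETIMEDOUT']);

function looksLikeConnectionError(err) {
  if (!err) return false;
  if (SYSTEM_CODES.has(err.code)) return true;
  if (typeof err.code === 'string' && err.code.startsWith('08')) return true; // SQLSTATE class 08
  return /connection (error|terminated)|not queryable/i.test(err.message || '');
}
```

Under this heuristic, a unique-constraint violation ('23505') would leave the client in the pool, while an ECONNRESET or a class-08 error would evict it.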

Here's a link to my workaround for this particular issue.

It successfully cleans up dead connections after performing a database failover on AWS Aurora Postgres, which hung around forever without the workaround.

@mlni

mlni commented Mar 31, 2020

I finally remembered why I thought that this is not a pg-layer issue.

pg-pool explicitly removes any error handlers when it hands out a connection to a client, leaving error handling as the responsibility of Typeorm in this case. So it seems logical to assume that a client would report back any errors it got when it returns the connection, for which there is a mechanism in place.

Alternatively, for pg to reliably notice a broken connection, it would need to run a health-check command (like SELECT 1) on any idle connection returned to the pool. This is what most connection pools in Java-land support, for example, so it would be very helpful.
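That test-on-return pattern could be sketched like this (releaseToPool and destroyClient are placeholders for whatever the pool actually exposes, not pg-pool functions):

```javascript
// Sketch of "test on return", as in many Java-land pools: probe the client
// with a cheap query before re-pooling it, and discard it if the probe fails.
async function releaseWithValidation(client, releaseToPool, destroyClient) {
  try {
    await client.query('SELECT 1'); // cheap liveness probe
    releaseToPool(client);
  } catch (err) {
    destroyClient(client, err); // broken: remove instead of re-pooling
  }
}
```

The trade-off is an extra round-trip on every release, which is why Java pools typically make this validation optional or only run it after an idle timeout.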

@Shubhanshu1201

Any update on this issue? I have updated my code in production to Node 10.x using Serverless and AWS Aurora Postgres, and it is giving me this error.
Any help now is greatly appreciated.

@michaelseibt

michaelseibt commented Apr 10, 2020

Any help now is greatly appreciated.

What worked for us is to check for broken connections at the beginning of every invocation. This will run a simple query on every client of the pool, and if that fails, will reconnect to the database.

Call this method from inside the event handler function.

```typescript
const reconnectToDatabase = async () => {
  const connection = getConnection();
  // Note: master._clients is a private driver/pool internal, hence the cast
  const driver = connection.driver as any;
  for (const client of driver.master._clients) {
    try {
      await client.query('SELECT 1');
    } catch (error) {
      console.info('Reconnecting ...');
      await getConnection().driver.disconnect();
      await getConnection().driver.connect();
      break;
    }
  }
}
```

(FYI, @brianc - maybe that helps for node-postgres side of the story :-) )


@bifot

bifot commented Oct 8, 2020

actually for non-serverless too


@imnotjames
Contributor

imnotjames commented Jun 23, 2021

I've opened a PR #7792 which passes the error back to the postgres pool when releasing the connections.

However, for reasons listed above it's pretty difficult to test this beyond manually doing things. We also don't really do mocks.

Anyone else want to test it, too?

bryfox added a commit to bryfox/typeorm that referenced this issue Jul 16, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.
pleerock pushed a commit that referenced this issue Aug 24, 2022
This extends github PR #7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in #5112.
wirekang pushed a commit to wirekang/typeorm that referenced this issue Aug 25, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.
nordinh pushed a commit to nordinh/typeorm that referenced this issue Aug 29, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.