
Broken Postgres clients are released back into the pool, but should be removed #5112

Closed
clarkdave opened this issue Nov 19, 2019 · 25 comments · Fixed by #7792

Comments

@clarkdave

clarkdave commented Nov 19, 2019

Issue type:

[x] bug report

Database system/driver:

[x] postgres

It may impact other drivers if they have similar semantics/expectations as pg.Pool.

TypeORM version:

[x] latest

Explanation of the problem

Currently in TypeORM, if a client suffers an unrecoverable error - for example, if the underlying connection goes away during a DB failover - there is no protection in place to stop that client from being added back into the pg.Pool. The broken client will then be handed out again in the future, even though it will never be able to execute a successful query.

Although pg.Pool itself does listen for error events on clients within the pool - and actively removes any which do emit errors - it doesn't catch everything. It is considered the responsibility of the user of the pool to release known broken clients by calling client.release(true). The truthy argument tells the pool to destroy the connection instead of adding it back to the pool.

https://node-postgres.com/api/pool#releaseCallback

The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.
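The destroy-versus-return behavior described there can be sketched with a toy stand-in (TinyPool below is purely illustrative, not the real pg.Pool API):

```javascript
// Toy stand-in for pg.Pool's release semantics: a truthy error passed to
// release() destroys the client instead of returning it to the idle pool.
class TinyPool {
  constructor() {
    this.idle = [];
  }
  acquire() {
    return this.idle.length > 0 ? this.idle.pop() : { destroyed: false };
  }
  release(client, err) {
    if (err) {
      client.destroyed = true; // disconnect & destroy, leaving room for a new client
    } else {
      this.idle.push(client); // healthy client goes back into the pool
    }
  }
}

const pool = new TinyPool();
const client = pool.acquire();
pool.release(client, new Error('Connection terminated unexpectedly'));
console.log(pool.idle.length, client.destroyed); // prints "0 true"
```

The point of the issue is that TypeORM never takes the `err` branch here, so a dead client keeps circulating.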

If a client's connection does break, it's very difficult to debug, as a handful of queries will begin failing with an error like "Error: Client has encountered a connection error and is not queryable" while others will continue executing fine.

There is further discussion about the impact of this in the node-postgres repo: brianc/node-postgres#1942

Steps to reproduce or a small repository showing the problem

  • Set pool size to 1
  • Run a small loop which repeatedly performs a query, e.g.

```javascript
// getManager comes from TypeORM; sleep is a small promise-based delay helper
while (true) {
  try {
    await getManager().query('select pg_sleep(4)')
    console.log('success')
  } catch (err) {
    console.log('error', err)
  }

  console.log('done')
  await sleep(1000)
}
```
  • Wait for the query to run at least once. This should allocate a connection from the pool. Subsequent invocations of the query should use the same connection.
  • Kill the connection while the query is running. On macOS or Linux, this is easily done by finding and killing the backend process, e.g. ps aux | grep postgres | grep SELECT
  • The loop continues to throw errors, because the broken connection has been returned to the pool and keeps being handed out. This shouldn't happen: we should have told pg.Pool the connection is broken so it can be removed from the pool, and the next query should get a new, unbroken connection

I believe the reason this hasn't been noticed before (at least not that I could see) is that it's really only likely to happen if the actual database connection breaks. The majority of QueryFailedErrors are caused by dodgy SQL etc., none of which will render a client unusable. And, usually, if your database is killing connections, you've got other problems to think about 😅

We only noticed it because we run PgBouncer in between TypeORM and our Postgres server. When we redeployed PgBouncer, it would kill some of the active client connections in the pool, but because pg.Pool never found out about it, those connections remained in the pool indefinitely, causing a steady stream of errors even though everything else was fine.

Fix

I have a working fix here: loyaltylion@6bd52e0

If this fix looks suitable, I'd be happy to create a PR to get this merged. It only applies to postgres, but could be extended to other drivers if we think they'd benefit.

@mlni

mlni commented Nov 25, 2019

I just spent a couple of days researching the same issue, where after a DB failover the typeorm connection would remain in error state indefinitely.

I came to the same conclusion: when releasing a connection back to the pool any connection-level errors need to be passed to the pg.Pool in order for the connection to be removed from the pool.

It seems possible to handle this locally within PostgresQueryRunner without changing the publicly visible API, by keeping track of any errors that occur during .query() and then passing them along to the release callback in .release().
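A rough sketch of that idea (names are illustrative, not TypeORM's actual internals): the runner remembers any error thrown during query() and forwards it to the pool's release callback.

```javascript
// Illustrative sketch only: track errors seen during query() and pass them
// to the pool's release callback so a broken client gets destroyed.
class TrackingQueryRunner {
  constructor(client, releaseCallback) {
    this.client = client;
    this.releaseCallback = releaseCallback; // pg.Pool's done(err) callback
    this.lastError = undefined;
  }

  async query(sql) {
    try {
      return await this.client.query(sql);
    } catch (err) {
      this.lastError = err; // remember it for release()
      throw err;
    }
  }

  release() {
    // A truthy argument tells the pool to destroy the client.
    this.releaseCallback(this.lastError);
  }
}
```

A refinement would be to store only connection-level errors, so that ordinary query failures (bad SQL, constraint violations) don't evict healthy clients.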

@seriousManual

Hi,
I'm seeing the same issue in a serverless environment: The error "Connection terminated unexpectedly" pops up and afterwards subsequent queries fail with "Client has encountered a connection error and is not queryable".

@clarkdave Do you have a hint on how this can be mitigated without patching typeorm or waiting on them to look into your patch?

@azigelman

@clarkdave Hey
Do you know if this has been fixed in the official typeorm package?
I can't manage to reproduce it using your instructions on my Mac.
Thought it might have to do with the version; it looks as if the pool is aware of the broken connection and dumps it.

Thanks

@michaelseibt

We're experiencing the same issue, a combination of "Connection terminated unexpectedly" and "Client has encountered a connection error and is not queryable".

The PR by @clarkdave looks sound, giving it a try right now in order to validate it.

@592da

592da commented Feb 26, 2020

@michaelseibt please update - we're having this issue in production
@pleerock this issue is serious
#2623 (comment)
#5387
#5223
#5112

Please advise.

@michaelseibt

michaelseibt commented Feb 26, 2020

This didn't work out for us. The error is still being thrown and not caught by the try-catch block in the fix mentioned above. Despite a lot of digging through code and GitHub issues, we're still trying to figure out what exactly is going on, but I have a hunch that this change could be causing the issue:

brianc/node-postgres#1324

Knex is having a similar, maybe even the same, issue: knex/knex#3523

BTW, this is happening in our AWS Lambda environment, using Node 10 & 12. @seriousManual also mentioned a serverless environment. @592da Same for you?

Head over to node-postgres for hot updates on the topic: brianc/node-postgres#2112

@brianc

brianc commented Feb 26, 2020

I'm looking at this a little bit too - there are kind of 2 issues:

  1. pg doesn't make it easy to ask a client "are you okay?" I'm going to fix this soon and write a bunch of tests for it so we can have a tested, supported, documented way to ask a client about its health. By 'health' I mostly mean: is it actually still connected?
  2. clients can get into broken states while they lie idle in the pool and not know about it themselves, since sockets can be closed by the OS (or, in the case I'm seeing, AWS Lambda) without any notification to the client that anything went wrong. I'm thinking I'll probably rely on the readyState field on a net.Socket to tell if the client is okay... but I'll need to do a bit more investigation here.

Anyways... I have a real quick POC patch for this here; I'm seeing if someone can apply it & check whether it fixes the issue.
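The readyState idea from point 2 might look something like this sketch (the connection.stream path mirrors pg's internals but is an assumption here, not a documented API):

```javascript
// Hypothetical health probe: inspect the underlying socket's readyState to
// decide whether an idle client is still usable. net.Socket.readyState is
// 'open' while the socket is both readable and writable.
function isClientUsable(client) {
  const stream = client && client.connection && client.connection.stream;
  return !!stream && stream.readyState === 'open';
}

console.log(isClientUsable({ connection: { stream: { readyState: 'open' } } }));   // true
console.log(isClientUsable({ connection: { stream: { readyState: 'closed' } } })); // false
```

As brianc notes, this only helps for sockets the OS actually transitioned out of 'open'; it can't detect a peer that vanished without closing the socket.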

@592da

592da commented Feb 26, 2020

@michaelseibt indeed - serverless, with Node 12.
This is causing fatal errors in my production env, and currently I need to re-deploy all the lambdas every few hours...

This is extremely critical for me since I just migrated the whole application from Docker to serverless...
and it hurts :(

@brianc I will test it and hope it works; I will post an update.
Hope you guys will be able to roll out a hotfix - let me know if I can help somehow :)

@592da

592da commented Feb 27, 2020

@brianc I tested, and I'm still having the same issue, even with the Node 8 runtime.
Please advise.

@azigelman

We suffered the same running Node v12 on Docker (Fargate infrastructure)

@GopherJ

GopherJ commented Mar 26, 2020

Same issue here

@osdiab
Contributor

osdiab commented Mar 31, 2020

Any updates on this issue?

@brianc

brianc commented Mar 31, 2020

Any updates on this issue?

yeah, the tl;dr is: without steps to reproduce and a way to write a test case to simulate the issue, I'm not going to be able to fix it. I responded here, but I don't use Lambda myself, so I'm going to need the community to step up & submit a patch - or at the very least a reliable way to reproduce this locally.

@mlni

mlni commented Mar 31, 2020

Trouble is, this issue is not reproducible with a classical unit test, as it does not occur on a single machine. The problem is triggered when the machine running Postgres gets disconnected from the network without cleanly shutting down the connection. This does not happen when both processes run under the same kernel.

I could somewhat reliably reproduce the issue on a Mac by running Postgres in Docker (in a Linux VM) and issuing docker kill postgres, which triggers a dirty disconnect. That setup is a bit difficult to recreate as an automated test.

@brianc

brianc commented Mar 31, 2020 via email

@mlni

mlni commented Mar 31, 2020

I am not entirely convinced that 'pg' is the right layer for fixing this.

One possible workaround for this issue is to have TypeORM pass any connection-level errors back to 'pg', so that the broken connections can be removed from the pool. This behavior is not universally correct, as it would also close connections that hit, for example, a unique constraint, which can be the wrong thing to do if your code relies on triggering and ignoring such errors at a high rate. Maybe there's a reliable way to tell 'connection interrupted' errors apart from everyday database errors, but I haven't looked into it any deeper.
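One heuristic for that distinction - an assumption to validate, not a vetted rule - is that errors raised by the Postgres server carry a SQLSTATE in err.code (class '08' being connection exceptions), while socket-level failures surface as Node system codes or bare "Connection terminated" messages:

```javascript
// Heuristic sketch only. Server-side errors from Postgres carry a SQLSTATE
// in err.code (e.g. '23505' unique_violation, '08006' connection_failure);
// socket-level failures typically have a Node system code like ECONNRESET
// or a message such as "Connection terminated unexpectedly".
const SYSTEM_CODES = new Set(['ECONNRESET', 'ECONNREFUSED', 'EPIPE', 'ETIMEDOUT']);

function looksLikeConnectionError(err) {
  if (!err) return false;
  if (SYSTEM_CODES.has(err.code)) return true;
  if (typeof err.code === 'string' && err.code.startsWith('08')) return true; // SQLSTATE class 08
  return /connection (error|terminated)|not queryable/i.test(err.message || '');
}
```

Under this heuristic, a unique-constraint violation ('23505') would leave the client in the pool, while an ECONNRESET or a class-08 error would evict it.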

Here's a link to my workaround for this particular issue.

It successfully cleans up dead connections after performing a database failover on AWS Aurora Postgres, which hung around forever without the workaround.

@mlni

mlni commented Mar 31, 2020

I finally remembered why I thought that this is not a pg-layer issue.

pg-pool explicitly removes any error handlers when it hands out a connection to a client, leaving error handling as the responsibility of Typeorm in this case. So it seems logical to assume that a client would report back any errors it got when it returns the connection, for which there is a mechanism in place.

Alternatively, for pg to reliably notice a broken connection, it would need to run a health-check command (like SELECT 1) on any idle connection returned to the pool. This is what most connection pools in Java-land support, for example, so it would be very helpful.
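That test-on-return pattern could be sketched like this (releaseToPool and destroyClient are placeholders for whatever the pool actually exposes, not pg-pool functions):

```javascript
// Sketch of "test on return", as in many Java-land pools: probe the client
// with a cheap query before re-pooling it, and discard it if the probe fails.
async function releaseWithValidation(client, releaseToPool, destroyClient) {
  try {
    await client.query('SELECT 1'); // cheap liveness probe
    releaseToPool(client);
  } catch (err) {
    destroyClient(client, err); // broken: remove instead of re-pooling
  }
}
```

The trade-off is an extra round-trip on every release, which is why Java pools typically make this validation optional or only run it after an idle timeout.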

@Shubhanshu1201

Any update on this issue? I have updated my code in production to Node 10.x using Serverless and AWS Aurora Postgres, and it is giving me this error.
Any help now is greatly appreciated.

@michaelseibt

michaelseibt commented Apr 10, 2020

Any help now is greatly appreciated.

What worked for us is to check for broken connections at the beginning of every invocation. This will run a simple query on every client of the pool, and if that fails, will reconnect to the database.

Call this method from inside the event handler function.

```typescript
const reconnectToDatabase = async () => {
  const connection = getConnection();
  // Note: master._clients is a private driver/pool internal, hence the cast
  const driver = connection.driver as any;
  for (const client of driver.master._clients) {
    try {
      await client.query('SELECT 1');
    } catch (error) {
      console.info('Reconnecting ...');
      await getConnection().driver.disconnect();
      await getConnection().driver.connect();
      break;
    }
  }
}
```

(FYI, @brianc - maybe that helps for node-postgres side of the story :-) )


@bifot

bifot commented Oct 8, 2020

actually for non-serverless too


@imnotjames
Contributor

imnotjames commented Jun 23, 2021

I've opened a PR #7792 which passes the error back to the postgres pool when releasing the connections.

However, for reasons listed above it's pretty difficult to test this beyond manually doing things. We also don't really do mocks.

Anyone else want to test it, too?

bryfox added a commit to bryfox/typeorm that referenced this issue Jul 16, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.
pleerock pushed a commit that referenced this issue Aug 24, 2022
This extends github PR #7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in #5112.
wirekang pushed a commit to wirekang/typeorm that referenced this issue Aug 25, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.
nordinh pushed a commit to nordinh/typeorm that referenced this issue Aug 29, 2022
This extends github PR typeorm#7792 to pass the error, if any, to the release
callback from `pg-pool`. This should be done to ensure the connection is
removed from the pool, as described in typeorm#5112.