
Hydra write to database: broken pipe #1599

Closed
jacksontj opened this issue Oct 9, 2019 · 10 comments · Fixed by #1736, #1738 or #1739

Comments

@jacksontj

Describe the bug

Requests to hydra (running with at least postgres) can sometimes hit an "idle" (but closed) connection to the DB. In this scenario the caller gets back:

error_hint=write tcp SRC:37738->DST:5432: write: broken pipe

I've run into similar errors in other apps, and the issue is upstream in lib/pq as well (lib/pq#870).

I was able to work around this issue by setting max_conn_lifetime=10s in the DSN, since it prevents connections from being held open locally after they have already been closed on the remote end.
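For reference, the DSN looks roughly like this (host and credentials are placeholders; max_conn_lifetime is the parameter named above, passed as a query parameter):

```
postgres://hydra:secret@postgres.internal:5432/hydra?sslmode=disable&max_conn_lifetime=10s
```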

Reproducing the bug

This has been difficult to reproduce on demand. I've noticed it when tokens get refreshed from the IdP -- specifically after a long idle period (8 hours in this case), long enough for the connection to be closed out.

Expected behavior

I would expect that L4 errors such as this would be transparently retried instead of bubbling all the way to the client.

@aeneasr
Member

aeneasr commented Oct 9, 2019

Using these settings is the appropriate way of handling this. You did not provide the version string, but we're using best-effort defaults to work around a broken connection pool.

max_conn_lifetime forcibly terminates connections, which is why we don't use it by default and why Go has this disabled by default. It's only required when the network is extremely flaky (e.g. connecting to Google Cloud SQL from a local machine) or when Postgres is configured in such a way that idle connections are closed.

Unless the driver gives us an indicator that a connection should be retried, this would not make sense: errors such as 404, 409, or 400 should not be retried with (exponential) backoff, or else all of these errors will take forever to surface. On top of that, the SQL connection pool (from Go's standard library) is actually capable of handling retries when a connection goes bad:

> When this happens sql.DB handles it gracefully. Bad connections will automatically be retried twice before giving up, at which point Go will remove the connection from the pool and create a new one.

So this really looks like a driver issue to me, not something we want to paper over with code on top, especially since max_conn_lifetime seems to be working fine.

I would be open to setting max_conn_lifetime to a non-zero default, as the performance impact should be negligible when we're above something reasonable like 1h. However, I'm not sure it actually resolves the issue at hand when there's a lot of pressure on the server and, e.g., the PostgreSQL server restarts. If you can confirm that simply setting this to non-zero resolves the broken pipe error in all scenarios, I think this is a no-brainer to fix!

@jacksontj
Author

> So this really looks like a driver issue to me, not something we want to paper over with code on top, especially since max_conn_lifetime seems to be working fine.

Seems reasonable. From the upstream issue, it appears the driver isn't marking those connections as bad, so the sql.DB client never retries them.

> I would be open to setting max_conn_lifetime to a non-zero default, as the performance impact should be negligible when we're above something reasonable like 1h.

I'm not sure what a reasonable default would be. In my case I just have a Postgres box at a cloud provider, and I've run into this issue without finding the source of the connection closing yet. The value would need to be lower than whatever is closing the connections; in my case I went to the extreme (10s) because I need it to always work (and my load is pretty low right now). So again, I'm not sure we want to set a default timeout (maybe just add some docs explaining the situation?). It seems that pgx (another Postgres driver for Go -- https://github.com/jackc/pgx) has this retry logic implemented -- jackc/pgx@4868929

So maybe a simpler solution is swapping drivers :/

@aeneasr
Member

aeneasr commented Oct 11, 2019

Interesting, I didn't know this driver existed. Swapping the driver is definitely a bit of work because we have to handle error codes correctly etc.

But if lib/pq fails to deliver, it's a viable solution.

@aeneasr
Member

aeneasr commented Oct 14, 2019

It seems like pgx explicitly targets software that only connects to PostgreSQL, by providing additional methods and being able to work with types not supported by database/sql. If that holds true, we cannot use that driver here, as we need support for CockroachDB and MySQL.

@aeneasr
Member

aeneasr commented Oct 14, 2019

I checked around a bit, and it doesn't seem that what you're trying to do (retrying on e.g. broken pipe errors) is something most SQL driver implementations support, as it's not possible to distinguish whether the server accepted the request or not. From the official Go MySQL driver:

> @RobertGrantEllis I agree that checking connection is closed or not before sending command is preferable. But (a) it requires dedicated reader goroutine, and (b) SetConnMaxLifetime() is far better than it.
>
> And even MySQL Connector/C (official implementation of client) doesn't try detecting broken connection before sending packet.

Source: go-sql-driver/mysql#529 (comment)

It does appear that there's a PR for detecting pruned MySQL connections (go-sql-driver/mysql#934), but I'm not sure whether the same issue even affects PostgreSQL. If it does and remains unsolved, I'm sure we could convince the maintainers to accept a similar PR.

@jacksontj
Author

> It seems like pgx explicitly targets software that only connects to PostgreSQL, by providing additional methods and being able to work with types not supported by database/sql. If that holds true, we cannot use that driver here, as we need support for CockroachDB and MySQL.

pgx also has a stdlib interface (https://github.com/jackc/pgx/tree/master/stdlib) that exposes the same standard Go database/sql interface. TBH I haven't used it with CockroachDB, but I would expect it to work (since it speaks the Postgres wire protocol). As for MySQL: lib/pq doesn't support MySQL either, right?
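If it helps, the swap via the stdlib adapter should be roughly this small (a sketch, not tested here; the DSN is a placeholder, and it assumes pgx/stdlib registers itself under the driver name "pgx"):

```go
import (
	"database/sql"

	_ "github.com/jackc/pgx/stdlib" // registers the "pgx" driver name
)

// Instead of sql.Open("postgres", dsn) with lib/pq:
db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/hydra")
```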

> I checked around a bit, and it doesn't seem that what you're trying to do (retrying on e.g. broken pipe errors) is something most SQL driver implementations support, as it's not possible to distinguish whether the server accepted the request or not.

IIRC there's a way to check at the syscall level -- but I'm not sure if that's easily accessible on the Go side of things (I assume it could be). This is really only an issue for idle connections that turn out to be broken when used. Now that I've set my idle timeout very short, I just don't see this issue anymore.

It seems like we'd want lib/pq to basically do the same as here -- go-sql-driver/mysql#934

@aeneasr
Member

aeneasr commented Oct 15, 2019

Do you know whether the approach in go-sql-driver/mysql#934 is included in pgx/stdlib? I tried to revive a discussion in lib/pq that attempted to solve this another way, but so far no response.

@aeneasr
Member

aeneasr commented Oct 18, 2019

Anyway, I'm closing this. The proposed solution of retrying on failure is not an acceptable solution, as I've laid out in previous comments. This needs to be fixed in the driver itself.

We could switch to pgx/stdlib, assuming it performs better than lib/pq overall and specifically in these scenarios.

@aeneasr aeneasr closed this as completed Oct 18, 2019
@glerchundi
Contributor

@aeneasr
Member

aeneasr commented Feb 19, 2020

If switching to pgx/stdlib is a one-liner, I'd be up for that.

@aeneasr aeneasr reopened this Feb 19, 2020
glerchundi added a commit to glerchundi/hydra that referenced this issue Feb 20, 2020
glerchundi added a commit to glerchundi/hydra that referenced this issue Feb 20, 2020
glerchundi added a commit to glerchundi/hydra that referenced this issue Feb 20, 2020
glerchundi added a commit to glerchundi/hydra that referenced this issue Feb 20, 2020
glerchundi added a commit to glerchundi/hydra that referenced this issue Feb 21, 2020
aeneasr pushed a commit that referenced this issue Feb 23, 2020
aeneasr added a commit that referenced this issue Feb 23, 2020
Closes #1599

Co-authored-by: Gorka Lerchundi Osa <glertxundi@gmail.com>
aeneasr added a commit that referenced this issue Feb 28, 2020
Closes #1599

Co-authored-by: Gorka Lerchundi Osa <glertxundi@gmail.com>
eli-zh pushed a commit to eli-zh/hydra that referenced this issue Mar 22, 2020
eli-zh pushed a commit to eli-zh/hydra that referenced this issue Mar 22, 2020
Closes ory#1599

Co-authored-by: Gorka Lerchundi Osa <glertxundi@gmail.com>