RFC: Disable CheckConnLiveness by default. #1451

Closed
methane opened this issue Jun 21, 2023 · 9 comments

methane commented Jun 21, 2023

If Go supported a nonblocking readability test, we could fix #996 and #1392.
Until then, how about disabling the option by default in v1.8?
I'm not talking about removing it, so GitHubbers can keep using the option.
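
For clarity, here is a rough sketch of how a user could keep the current behaviour explicitly even if the default changed, assuming the existing checkConnLiveness DSN parameter stays available (the DSN values are placeholders):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Opt in to the liveness check explicitly in the DSN so a change of the
	// default would not alter this application's behaviour.
	// user, password, host, and dbname are placeholders.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/dbname?checkConnLiveness=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// ... use db as usual.
}
```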

owbone commented Jun 28, 2023

ER_CLIENT_INTERACTION_TIMEOUT is the only disconnection error that can be avoided through well-configured timeouts. There are other error packets, such as ER_DISCONNECTING_REMAINING_CLIENTS, that might be received regardless of any client-side setting. I think it would be prudent to avoid making assumptions about which error packets are being received and why.

Ignoring timeouts, I'm sure that CheckConnLiveness was valuable when dealing with ER_DISCONNECTING_REMAINING_CLIENTS. Some connections were hit by #1393 and #1394, which caused a large number of failed requests, but CheckConnLiveness ensured that the majority of the connections were discarded and replaced with healthy ones. We don't explicitly enable it, so if it had been disabled by default then I'm sure that a few thousand more user requests would have failed in the way described by #934.

It's not just intentional disconnections by the server that it deals with either. I believe CheckConnLiveness also gracefully handles other types of connection loss, such as server crashes and networking issues. I suspect we probably rely on that quite a lot without even knowing it, and this project is widely used enough that I'm sure many others do too.

So, my opinion is that disabling CheckConnLiveness by default could be an unhappy surprise for a lot of users, and most others probably wouldn't notice any performance benefits when compared to the overall round-trip time.

I understand how connCheck() adds complexity by reading directly from the fd, which messes with the net.Conn. Since conncheck.go only builds on Unix-like OSes, I wonder if that situation could be improved by using unix.Poll() instead of syscall.Read()?
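
Something like this rough sketch is what I have in mind, assuming a Unix-like OS and the golang.org/x/sys/unix package (this is not the driver's actual conncheck.go, and connAlive is just an illustrative name):

```go
package connpool

import (
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// connAlive reports whether an idle connection still looks usable by polling
// its fd for readability with a zero timeout, i.e. without blocking and
// without consuming bytes the way syscall.Read would.
func connAlive(c net.Conn) (bool, error) {
	sc, ok := c.(syscall.Conn)
	if !ok {
		return true, nil // cannot check; assume alive
	}
	rc, err := sc.SyscallConn()
	if err != nil {
		return false, err
	}
	alive := true
	var pollErr error
	if cerr := rc.Control(func(fd uintptr) {
		fds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN | unix.POLLPRI}}
		n, perr := unix.Poll(fds, 0) // zero timeout: return immediately
		if perr != nil && perr != unix.EINTR {
			pollErr = perr
			return
		}
		// Readability on an idle connection means the server closed it or sent
		// an unexpected packet; either way, don't reuse the connection.
		if n > 0 {
			alive = false
		}
	}); cerr != nil {
		return false, cerr
	}
	return alive, pollErr
}
```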

methane commented Jun 28, 2023

Some connections were hit by #1393 and #1394, which caused a large number of failed requests, but CheckConnLiveness ensured that the majority of the connections were discarded and replaced with healthy ones. We don't explicitly enable it, so if it had been disabled by default then I'm sure that a few thousand more user requests would have failed in the way described by #934.

Do you mean you counted a few thousand "unexpected read from socket" errors in your logs?

It's not just intentional disconnections by the server that it deals with either. I believe CheckConnLiveness also gracefully handles other types of connection loss, such as server crashes and networking issues. I suspect we probably rely on that quite a lot without even knowing it, and this project is widely used enough that I'm sure many others do too.

Do you mean many users don't see "unexpected read from socket" even though it happens quite a lot?
If so, we are taking away important information about network health from users. That would be a good reason to disable it by default.

So, my opinion is that disabling CheckConnLiveness by default could be an unhappy surprise for a lot of users, and most others probably wouldn't notice any performance benefits when compared to the overall round-trip time.

As far as I know, MySQL Connector/C doesn't check readiness before sending a query. Many users build robust applications without it. I don't know why only Go needs it by default. (I understand it helps GitHub's backends.)

Since conncheck.go only builds on Unix-like OS then I wonder if that situation could be improved by using unix.Poll() instead of syscall.Read()?

It is an interesting idea, although it would introduce an external dependency for this module.

owbone commented Jun 28, 2023

Do you mean you counted a few thousand "unexpected read from socket" errors in your logs?

Yes, exactly.

Do you mean many users don't see "unexpected read from socket" even though it happens quite a lot?
If so, we are taking away important information about network health from users. That would be a good reason to disable it by default.

I mean that I would guess that many users also see a lot of "unexpected EOF", but they just ignore it because everything still works as long as new connections are created silently and successfully.

I agree that it could mask underlying network problems, but only for idle connections. If a service loses all connections to the server, then you'd also expect active queries to fail, and maybe new connections too, both of which are the more important things that won't be hidden.

As far as I know, MySQL Connector/C doesn't check readiness before sending a query. Many users build robust applications without it. I don't know why only Go needs it by default.

I think the major difference is that go-sql-driver/mysql must work with database/sql's connection pooling.

As a counterexample, mysql-connector-python has a connection pool and does check is_connected() before getting a connection, which looks like it does something very similar in terms of checking for unread results.
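
On the Go side, database/sql does offer a hook for this kind of check: a driver connection can implement driver.Validator, and the pool discards connections that report themselves invalid instead of handing them out again. A minimal, hypothetical sketch (not necessarily how go-sql-driver/mysql wires it; the isBroken field is made up):

```go
package mydriver

import "database/sql/driver"

// conn is a hypothetical driver connection.
type conn struct {
	// isBroken is set when a liveness check or an earlier I/O error has
	// marked the underlying socket as unusable.
	isBroken bool
	// ... underlying net.Conn, read/write buffers, etc.
}

// IsValid implements driver.Validator; database/sql uses it to decide whether
// a pooled connection may be reused or must be discarded.
func (c *conn) IsValid() bool {
	return !c.isBroken
}

// Compile-time check that conn satisfies driver.Validator.
var _ driver.Validator = (*conn)(nil)
```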

(I understand it helps GitHub's backends.)

GitHub isn't doing anything special here other than using Go to interact with MySQL at a very large scale. That scale means that we hit weird edge cases more than most others, and those weird edge cases impact more users than most others, but that doesn't mean these things are unique to us. I think it would be better to think of our observations as examples rather than exceptions.

Just to be clear, it's also a coincidence that the original author of this code happened to be a GitHub employee at the time. #1392, #1393, and #1394 were raised independently and with no knowledge of that. It was actually a surprise to me.

So, the reason I'm trying to defend CheckConnLiveness isn't because it's special to us, but because I can say with plenty of empirical evidence that it has been hugely advantageous for us, and therefore I expect that it's been advantageous to many others too.

methane commented Jun 29, 2023

I agree that it could mask underlying network problems, but only for idle connections.

An unhealthy network kills not only idle connections but also connections that are waiting for results or sending queries.
So it is important to notice an unhealthy network sooner rather than later.

Just to be clear, it's also a coincidence that the original author of this code happened to be a GitHub employee at the time. #1392, #1393, and #1394 were raised independently and with no knowledge of that. It was actually a surprise to me.

So, the reason I'm trying to defend CheckConnLiveness isn't because it's special to us, but because I can say with plenty of empirical evidence that it has been hugely advantageous for us, and therefore I expect that it's been advantageous to many others too.

#1392, #1393, and #1394 are special because a connection lifetime shorter than wait_timeout is the recommended solution for most users. It is not a good solution for GitHub because GitHub is special.

As a counterexample, mysql-connector-python has a connection pool and does check is_connected() before getting a connection, which looks like it does something very similar in terms of checking for unread results.

OK, that makes sense to me. Checking readability is lighter than a pre-ping.

owbone commented Jun 29, 2023

#1392, #1393, and #1394 are special because a connection lifetime shorter than wait_timeout is the recommended solution for most users. It is not a good solution for GitHub because GitHub is special.

I think you've misunderstood our situation:

  1. We do now follow that recommendation and have still hit all of those bugs, because there are many other reasons why the server might disconnect, not just timeouts.

  2. There was no special reason why we weren't following that recommendation to begin with. It just didn't cause any issues until we upgraded to MySQL 8 (and hit "commands out of sync" errors on MySQL 8.0.24 #1393 and Connections stuck with "busy buffer" after ErrPktSync #1394).

This is why I say we're not special. Other users will not be following the recommended timeouts, because they haven't needed to (thanks to CheckConnLiveness) - a recommendation is not the same thing as an important requirement. Other users will still hit #1392, #1393, and #1394 even when following the recommendation, because other server errors and disconnections can happen to anyone.
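
For reference, the recommendation in question is usually expressed on the client like this (the 3-minute value is only illustrative; the point is that it must be shorter than the server's wait_timeout):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/dbname") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	// Recycle connections before the server's wait_timeout (and before any
	// proxy idle timeout) so that the client, not the server, closes them.
	db.SetConnMaxLifetime(3 * time.Minute)
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)
}
```

As the points above show, though, following this only reduces the chance of reusing a dead connection; it doesn't eliminate server-initiated disconnections.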

As a counterexample, mysql-connector-python has a connection pool and does check is_connected() before getting a connection, which looks like it does something very similar in terms of checking for unread results.

To expand on this example further, maybe it's worth considering why mysql-connector-python doesn't require similar client-side timeouts. I think the answer is just because it expects server-side disconnections and handles them gracefully. IMO, that's what go-sql-driver/mysql should also do instead of treating connection loss as something that can be avoided.

methane commented Jul 3, 2023

This is why I say we're not special. Other users will not be following the recommended timeouts, because they haven't needed to (thanks to CheckConnLiveness) - a recommendation is not the same thing as an important requirement. Other users will still hit #1392, #1393, and #1394 even when following the recommendation, because other server errors and disconnections can happen to anyone.

It must be very rare. There is no guarantee the disconnection happened while the connection was idle.
A user may lose the connection just after sending COMMIT, and then they cannot know whether the transaction was committed or not.

Noticing such a bad configuration or network when it happens on an idle connection is better for users. Network errors should be verbose and noisy for users.

When I said "special", I didn't mean the error is rare. I meant users should notice the error rather than have it hidden automatically.

methane commented Jul 6, 2023

I wonder if that situation could be improved by using unix.Poll() instead of syscall.Read()?

I tried to implement it: #1456.

If it works well, there is no need to choose between a better error message and CheckConnLiveness.

jsha commented Jul 19, 2023

Reading #934, can I ask: what command were you using to detect variables escaping to the heap? What is its output today? Also, does the impact show up on a benchmark that covers real query traffic? I know the volume of allocations is important for a Go program, but I am curious to know whether doing a query has a very low allocation count in general.

I work on https://github.com/letsencrypt/boulder/, and in my opinion CheckConnLiveness is a good default for us. One example of why: we use ProxySQL, which sometimes terminates connections for reasons other than timeout. Yes, we can experience errors when that happens in the middle of a query rather than an idle connection, so we need to implement application-level retries anyhow; but I think CheckConnLiveness is greatly reducing our error rate for not very much cost.

As one more point of reference, the Go HTTP client fires off a goroutine running a readLoop when a connection is first established. That has a similar effect to the CheckConnLiveness code in go-sql-driver/mysql. When a server closes a connection, the client detects that and removes it from the pool. That means in most cases the client can avoid sending a request on a connection that the OS already knows to be closed.
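
For anyone unfamiliar with it, the general shape of that pattern is something like the following sketch (this is not net/http's actual readLoop, and watchConn is a made-up name):

```go
package connpool

import (
	"net"
	"sync/atomic"
)

// watchConn starts a goroutine that blocks on a 1-byte read. For an idle
// connection the read only returns when the peer closes the socket or sends
// unexpected data; in either case the connection should not be reused, so the
// broken flag is set and the pool can skip it. A real implementation would
// also have to hand any byte it read back to the protocol layer.
func watchConn(c net.Conn, broken *atomic.Bool) {
	go func() {
		var buf [1]byte
		if _, err := c.Read(buf[:]); err != nil {
			broken.Store(true) // peer closed the connection, or a network error
			return
		}
		// Unsolicited data on an idle connection is also unexpected for a
		// request/response protocol like MySQL's, so mark it broken too.
		broken.Store(true)
	}()
}
```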

methane commented Jul 20, 2023

Reading #934, can I ask: what command were you using to detect variables escaping to the heap? What is its output today?

I don't remember. Google "golang compiler option escape".
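
(For anyone searching: the compiler prints its escape-analysis decisions with the -m flag, e.g.

```
go build -gcflags='-m' ./...
```

and -gcflags='-m -m' is more verbose.)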

Also, does the impact show up on a benchmark that covers real query traffic? I know the volume of allocations is important for a Go program, but I am curious to know whether doing a query has a very low allocation count in general.

I don't know. Performance is important, but maintenance cost and the additional external dependency are the main reasons I don't want to enable it by default.

One example of why: we use ProxySQL, which sometimes terminates connections for reasons other than timeout.

Why and when does ProxySQL close connections?

As one more point of reference, the Go HTTP client fires off a goroutine running a readLoop when a connection is first established.

I don't want to do that, because an additional goroutine makes maintenance harder.
This project already suffers from a lack of maintainers. (I am the only active maintainer, but I don't have admin rights on this repository.)
Don't expect us to remember the full architecture of our code. When we review code, we need to read and learn a lot of it. Additional complexity reduces our velocity and quality.

I want to spend my time and mental energy on other features, like compression.

methane closed this as not planned on Feb 5, 2024.