Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay closing connection on the server after all streams are completed #6957

Closed
wants to merge 4 commits into from

Conversation

s-matyukevich
Copy link
Contributor

We hit exactly the same bug that was previously reported in grpc-java grpc/grpc-java#9566

It was also fixed in http2 go standard lib golang/net@cd69bc3#diff-75c1d03aa4d9dfad1fe759ba9fce0462d6e88f2e33242425c4563d4ffb454951

TL;DR is that when server sends large responses and configures MaxConnectionAge, sometimes it closes the connection too early after finishing processing all active streams (the client is not done reading all data from the connection yet, the data is still stored inside TCP kernel buffer on the server) In such cases the client receives RST and gets the following error:

error while calling server error="rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: read tcp 10.131.194.232:41856->10.132.142.4:3000: read: connection reset by peer\", received prior goaway: code: NO_ERROR, debug data: \"max_age\""

The fix is very simple: sleep for 1 sec after finishing processing all streams and before closing the connection.

I can reliably reproduce this error in our staging environment, and I verified that the proposed fix actually works.

I am still working on a unit test that reproduces the error locally. My plan is to modify this test and make the server send large responses, but I'll appreciate any guidance on how we should test this.

I can also make the sleep duration configurable if you prefer.

Copy link

codecov bot commented Feb 1, 2024

Codecov Report

Merging #6957 (cac6dc2) into master (a3f5ed6) will increase coverage by 0.14%.
Report is 9 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6957      +/-   ##
==========================================
+ Coverage   83.61%   83.75%   +0.14%     
==========================================
  Files         287      287              
  Lines       30929    30938       +9     
==========================================
+ Hits        25860    25911      +51     
+ Misses       3998     3960      -38     
+ Partials     1071     1067       -4     
Files Coverage Δ
server.go 81.32% <100.00%> (-0.74%) ⬇️

... and 17 files with indirect coverage changes


if testing.Testing() {
goAwayTimeout = 0
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests that use grpc.NewServer and define very short timeouts should not be affected.

@@ -83,6 +86,10 @@ func init() {
internal.BinaryLogger = binaryLogger
internal.JoinServerOptions = newJoinServerOption
internal.RecvBufferPool = recvBufferPool

if testing.Testing() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like testing.Testing() was only made available since go.1.21.0 https://pkg.go.dev/testing#Testing

We maintain a backward compatibility to the last 3 versions of go. (i.e., v1.19+)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestions what should I do about it? Options I can think of:

  • add a hack to determine if this is a testing build, something like this https://stackoverflow.com/questions/14249217/how-do-i-know-im-running-within-go-test
  • Don't enable the delay by default, add a server option to activate it
  • Enable delay by default and ignore the fact that it could break some tests that expect the server to be shut-down faster then 1 sec. Add a server option so that users can disable (or maybe configure?) the delay in their tests.

@dfawley
Copy link
Member

dfawley commented Feb 9, 2024

Oh this is especially nasty, and is likely the root cause behind #5358, which has confounded me for some time.

So the root cause of the problem is that if the server's receive buffer is not empty, then closing the connection results in RST instead of FIN.

https://www.freesoft.org/CIE/RFC/1122/99.htm

If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.

I have a slightly different idea here, for a way to fix this without incurring a fixed delay on everything. Let me play around with it a bit.

@dfawley
Copy link
Member

dfawley commented Feb 9, 2024

Thank you for the PR. Please see #6977 which has a preferred fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants