Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server sometimes returns "rpc error: code = Unknown" instead of "code = codees.DeadlineExceeded" #5155

Closed
jeremija opened this issue Jan 20, 2022 · 4 comments · Fixed by #5156

Comments

@jeremija
Copy link

What version of gRPC are you using?

Tested on 1.41.0 and master commit f93e8e6.

What version of Go are you using (go version)?

1.17.2

What operating system (Linux, Windows, …) and version?

Linux, Ubuntu 20.04 LTS

What did you do?

I discovered a flaky unit test in a private code base. The test performs the following steps:

  1. Starts a GRPC Server.
  2. Initializes a GRPC Client that connects to this server.
  3. Client calls a method with ctx, _ = context.WithTimeout(context.Background(), 50 * time.Millisecond). This method returns a stream.
  4. Client keeps calling stream.Recv in a loop until it receives an error.

What did you expect to see?

In >99% of the cases we can successfully assert that status.Code(err) == codes.DeadlineExceeded.

The error string is: rpc error: code = DeadlineExceeded desc = context deadline exceeded.

What did you see instead?

In <1% of the cases we see status.Code(err) is equal to codes.Unknown.

The error string is: rpc error: code = Unknown desc = context deadline exceeded.


It's very hard to reproduce this bug, I think it's because there are a few different places where the errors are handled concurrently, and it probably also depends on the Go's scheduler.

However, I believe I was able to track down the cause in the code. After I replace one single line I'm no longer able to reproduce this issue, even after over 1000 test runs:

diff --git a/server.go b/server.go
index eadf9e05..330154b2 100644
--- a/server.go
+++ b/server.go
@@ -1549,7 +1549,7 @@ func (s *Server) processStreamingRPC(t transport.ServerTransport, stream *transp
 	if appErr != nil {
 		appStatus, ok := status.FromError(appErr)
 		if !ok {
-			appStatus = status.New(codes.Unknown, appErr.Error())
+			appStatus, _ = status.FromError(toRPCErr(appErr))
 			appErr = appStatus.Err()
 		}
 		if trInfo != nil {

grpc-go/server.go

Line 1552 in f93e8e6

appStatus = status.New(codes.Unknown, appErr.Error())

@menghanl
Copy link
Contributor

I believe the reason is that the appErr returned by your service handlers is the error context.DeadlineExceeded (which is also part of why it's hard to reproduce)

We expect the errors returned by service handlers are status errors (errors produced by the status package).
But I also agree that context errors might need some special handling.

I will update this issue later when we have a decision.

@menghanl
Copy link
Contributor

Please see fix in #5156

@jeremija
Copy link
Author

Wow thanks for a prompt response and fix @menghanl! 🙂

I'll take a look and post back first thing tomorrow!

@jeremija
Copy link
Author

jeremija commented Jan 21, 2022

I just ran your commit with the affected test in my codebase with --count=10000 --failfast with context timeout of 50ms. It would previously error out pretty quickly, but now I don't see the errors anymore, so LGTM! Thanks!

We expect the errors returned by service handlers are status errors (errors produced by the status package).
But I also agree that context errors might need some special handling.

TIL - it looks like we have a ChainStreamInterceptor that converts the context errors to GRPC status errors for us, but these interceptors weren't being used in the tests. If I add this interceptor to our test suite, I can no longer reproduce the flaky test, even with the old commit.

Thanks again for all your help!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants