server: v1.45 of grpc now correctly returns context cancelled error instead of unknown #78197

Closed
DarrylWong opened this issue Mar 21, 2022 · 0 comments · Fixed by #78490
Labels
A-server-start-drain: Pertains to server startup and shutdown sequences
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-server-and-security: DB Server & Security

Comments

DarrylWong (Contributor) commented Mar 21, 2022

Upgrading google.golang.org/grpc to v1.45.0 causes the test TestDirectoryConnect/drain_connection to fail because of a change in how context.Canceled errors returned by server-side request handlers are surfaced to clients.

The test starts a sqlproxy server and a test directory server. It then shuts down the test directory server, starts a second one, and calls Drain() on the second server, expecting that SQL connections will eventually be drained.

The test fails because Drain() on the second TestDirectoryServer does nothing: no event listener is ever added to it. No event listener is added because the goroutine in the sqlproxy server responsible for calling WatchPods on the TestDirectoryServer

err := stopper.RunAsyncTask(ctx, "watch-pods-client", func(ctx context.Context) {

exits in response to the first TestDirectoryServer shutting down, rather than attempting to connect to the newly started TestDirectoryServer and registering a listener.

The goroutine exits because of the following lines in watchPods():

if grpcutil.IsContextCanceled(err) {
	break
}

where IsContextCanceled() is:

// IsContextCanceled reports whether err is a gRPC status error whose code is
// Canceled and whose message matches context.Canceled.
func IsContextCanceled(err error) bool {
	if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
		return s.Code() == codes.Canceled && s.Message() == context.Canceled.Error()
	}
	return false
}

In v1.44 of gRPC and earlier, when a server-side handler returned a context.Canceled error, gRPC returned a gRPC error with status Unknown. As of v1.45 (grpc/grpc-go#5156), it returns an error with status Canceled. As a result, IsContextCanceled() now returns true when the server-side request handler returns a context.Canceled error, whereas it previously returned false.
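To make the difference concrete, here is a small self-contained illustration of how the check inside IsContextCanceled evaluates the two error shapes; the statuses are constructed by hand to stand in for errors that would normally come back from a live RPC:

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// What a client saw before gRPC v1.45 when the server handler returned
	// context.Canceled: a status error with code Unknown.
	pre145 := status.Error(codes.Unknown, context.Canceled.Error())

	// What a client sees from v1.45 on: code Canceled, the same code a
	// locally cancelled RPC produces.
	post145 := status.Error(codes.Canceled, context.Canceled.Error())

	for _, err := range []error{pre145, post145} {
		s, _ := status.FromError(err)
		matchesCheck := s.Code() == codes.Canceled && s.Message() == context.Canceled.Error()
		fmt.Printf("code=%v satisfies IsContextCanceled=%v\n", s.Code(), matchesCheck)
	}
}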

In the TestDirectoryServer, we currently return context.Canceled in response to a quiescing stopper:
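The issue links to the relevant source rather than quoting it. Roughly, the handler logic being described looks like the following sketch (hypothetical; the function name and its surroundings are illustrative, not the actual cockroach code):

package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// waitForShutdown is a hypothetical stand-in for the relevant part of the
// TestDirectoryServer's WatchPods handler: once the stopper begins
// quiescing, the handler returns context.Canceled to its client.
func waitForShutdown(ctx context.Context, stopper *stop.Stopper) error {
	select {
	case <-stopper.ShouldQuiesce():
		// With gRPC v1.45 this error reaches the proxy as codes.Canceled
		// and trips its IsContextCanceled check.
		return context.Canceled
	case <-ctx.Done():
		return ctx.Err()
	}
}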

It appears that the IsContextCanceled check in the proxy server was likely intended to catch only cancellations of the local context, since every other error encountered at that point results in restarting the watchPods() handler.

This functionality is now broken: stopping the server at line 707 of TestDirectoryConnect/drain_connection returns a context.Canceled error, which incorrectly stops the watchPods() handler.
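For reference, the pre-fix retry structure being described is roughly the following (a hypothetical reconstruction; preFixWatchLoop and watchOnce are illustrative names, not the real sqlproxy code):

package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/grpcutil"
)

// preFixWatchLoop is a hypothetical reconstruction of the retry structure in
// the watch-pods-client task before the fix. watchOnce stands in for a
// single WatchPods subscription that blocks until it fails.
func preFixWatchLoop(ctx context.Context, watchOnce func(context.Context) error) {
	for {
		err := watchOnce(ctx)
		if grpcutil.IsContextCanceled(err) {
			// Meant to detect "our own context was cancelled, shut down",
			// but under gRPC v1.45 a context.Canceled returned by the
			// directory server also satisfies this check, so the loop
			// exits instead of reconnecting.
			return
		}
		// Any other error: loop around, reconnect, and call WatchPods again.
	}
}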

Jira issue: CRDB-14012

DarrylWong added the C-bug and A-server-start-drain labels Mar 21, 2022
stevendanna added the T-server-and-security label Mar 22, 2022
blathers-crl bot added this to To do in DB Server & Security Mar 22, 2022
stevendanna added a commit to stevendanna/cockroach that referenced this issue Mar 25, 2022
Previously, we used grpcutil.IsContextCanceled to detect when a
returned gRPC error was the result of a context cancellation.

I believe that the intent of this code was to detect when the _local_
context was cancelled, indicating that we are shutting down and thus
the watch-pods-client goroutine should exit.

This works because the gRPC library converts a local context.Canceled
error into a gRPC error. And, in gRPC before 1.45, if a server handler
returned context.Canceled, the returned gRPC error would have
status.Unknown, and thus not trigger this exit behavior.

As of gRPC 1.45, however, a context.Canceled error returned by a
server handler will also result in a gRPC error with status.Canceled [0],
meaning that the previous code will force the goroutine to exit in
response to a server-side error. From my reading of this code, it
appears we want to retry all server-side errors.

To account for this, we now only break out of the retry loop if our
local context is done.

Further, I've changed the test directory server implementation to
return an arguably more appropriate error when it is shutting down.

Fixes cockroachdb#78197

Release note: None
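A minimal sketch of the change this commit message describes (hypothetical; the function and parameter names are illustrative and the real diff is in PR #78490): only a done local context ends the task, and every other error, including a remote context.Canceled, is retried.

package sketch

import (
	"context"
	"log"
)

// postFixWatchLoop is a hypothetical sketch of the behavior after the fix.
// watchOnce again stands in for one WatchPods subscription attempt; it is
// not a real cockroach function.
func postFixWatchLoop(ctx context.Context, watchOnce func(context.Context) error) {
	for {
		err := watchOnce(ctx)
		if ctx.Err() != nil {
			// Only a cancelled local context ends the task: the proxy
			// itself is shutting down.
			return
		}
		// Every other error, including a server-side context.Canceled that
		// gRPC v1.45+ surfaces as codes.Canceled, is retried by looping
		// around and resubscribing.
		if err != nil {
			log.Printf("watching pods failed, retrying: %v", err)
		}
	}
}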
stevendanna added a commit to stevendanna/cockroach that referenced this issue Mar 25, 2022
craig bot pushed a commit that referenced this issue Mar 28, 2022
78241: kvserver: de-flake TestReplicaCircuitBreaker_RangeFeed r=erikgrinaker a=tbg

Fixes #76856.

Release note: None


78312: roachtest: improve debugging in transfer-leases r=erikgrinaker a=tbg

This test failed once and we weren't able to figure out why; having
the range status used by the test would've been useful.

Now this is saved and so the next time it fails we'll have more to
look at.

Closes #75438.

Release note: None


78422: roachtest: bump max wh for weekly tpccbench/nodes=12/cpu=16 r=srosenberg a=tbg

[It was maxing out, reliably.](https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D12%2Fcpu%3D16&tab=gce)

Release note: None


78490: sqlproxyccl: exit pod-watcher-client on local context cancellation r=jaylim-crl,darinpp a=stevendanna

Fixes #78197

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
craig bot closed this as completed in aea1915 Mar 28, 2022
DB Server & Security automation moved this from To do to Done 21.2 Mar 28, 2022