server: v1.45 of grpc now correctly returns context cancelled error instead of unknown #78197

Closed
DarrylWong opened this issue Mar 21, 2022 · 0 comments · Fixed by #78490
Labels
A-server-start-drain: Pertains to server startup and shutdown sequences
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-server-and-security: DB Server & Security

Comments

DarrylWong (Contributor) commented Mar 21, 2022

Upgrading google.golang.org/grpc to v1.45.0 causes the test TestDirectoryConnect/drain_connection to fail because of a change in how context.Canceled errors returned by server-side request handlers are surfaced to clients.

The test starts a sqlproxy server and a test directory server. It then shuts down the test directory server, starts a second one, and calls Drain() on the second server, expecting that SQL connections will eventually be drained.

The test fails because Drain() on the second TestDirectoryServer does nothing: no event listener is ever added to it. No event listener is added because the goroutine in the sqlproxy server responsible for calling WatchPods on the TestDirectoryServer

err := stopper.RunAsyncTask(ctx, "watch-pods-client", func(ctx context.Context) {

exits in response to the first TestDirectoryServer shutting down, rather than attempting to connect to the newly started TestDirectoryServer and registering a listener.

The goroutine exits because of the following lines in watchPods():

if grpcutil.IsContextCanceled(err) {
	break
}

where IsContextCanceled() is:

// IsContextCanceled reports whether err is a gRPC status error whose code is
// Canceled and whose message matches context.Canceled.
func IsContextCanceled(err error) bool {
	if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
		return s.Code() == codes.Canceled && s.Message() == context.Canceled.Error()
	}
	return false
}

In v1.44 of gRPC and earlier, when a server-side handler returned a context.Canceled error, gRPC returned a gRPC error with status Unknown. As of v1.45 (grpc/grpc-go#5156), it returns an error with status Canceled. As a result, IsContextCanceled() now returns true when the server-side request handler returns a context.Canceled error, whereas it previously returned false.
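To make the difference concrete, here is a small self-contained illustration of how the check inside IsContextCanceled evaluates the two error shapes; the statuses are constructed by hand to stand in for errors that would normally come back from a live RPC:

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// What a client saw before gRPC v1.45 when the server handler returned
	// context.Canceled: a status error with code Unknown.
	pre145 := status.Error(codes.Unknown, context.Canceled.Error())

	// What a client sees from v1.45 on: code Canceled, the same code a
	// locally cancelled RPC produces.
	post145 := status.Error(codes.Canceled, context.Canceled.Error())

	for _, err := range []error{pre145, post145} {
		s, _ := status.FromError(err)
		matchesCheck := s.Code() == codes.Canceled && s.Message() == context.Canceled.Error()
		fmt.Printf("code=%v satisfies IsContextCanceled=%v\n", s.Code(), matchesCheck)
	}
}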

In the TestDirectoryServer, we currently return context.Canceled in response to a quiescing stopper:
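The issue links to the relevant source rather than quoting it. Roughly, the handler logic being described looks like the following sketch (hypothetical; the function name and its surroundings are illustrative, not the actual cockroach code):

package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// waitForShutdown is a hypothetical stand-in for the relevant part of the
// TestDirectoryServer's WatchPods handler: once the stopper begins
// quiescing, the handler returns context.Canceled to its client.
func waitForShutdown(ctx context.Context, stopper *stop.Stopper) error {
	select {
	case <-stopper.ShouldQuiesce():
		// With gRPC v1.45 this error reaches the proxy as codes.Canceled
		// and trips its IsContextCanceled check.
		return context.Canceled
	case <-ctx.Done():
		return ctx.Err()
	}
}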

It appears that the IsContextCanceled check in the proxy server was likely intended to catch only cancellations of the local context, since every other error encountered at that point results in restarting the watchPods() handler.

This functionality is now broken: stopping the server at line 707 of TestDirectoryConnect/drain_connection returns a context.Canceled error, which incorrectly stops the watchPods() handler.
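For reference, the pre-fix retry structure being described is roughly the following (a hypothetical reconstruction; preFixWatchLoop and watchOnce are illustrative names, not the real sqlproxy code):

package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/grpcutil"
)

// preFixWatchLoop is a hypothetical reconstruction of the retry structure in
// the watch-pods-client task before the fix. watchOnce stands in for a
// single WatchPods subscription that blocks until it fails.
func preFixWatchLoop(ctx context.Context, watchOnce func(context.Context) error) {
	for {
		err := watchOnce(ctx)
		if grpcutil.IsContextCanceled(err) {
			// Meant to detect "our own context was cancelled, shut down",
			// but under gRPC v1.45 a context.Canceled returned by the
			// directory server also satisfies this check, so the loop
			// exits instead of reconnecting.
			return
		}
		// Any other error: loop around, reconnect, and call WatchPods again.
	}
}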

Jira issue: CRDB-14012

DarrylWong added the C-bug and A-server-start-drain labels Mar 21, 2022
stevendanna added the T-server-and-security label Mar 22, 2022
blathers-crl bot added this to To do in DB Server & Security Mar 22, 2022
stevendanna added a commit to stevendanna/cockroach that referenced this issue Mar 25, 2022
Previously, we used grpcutil.IsContextCanceled to detect when a
returned gRPC error was the result of a context cancellation.

I believe that the intent of this code was to detect when the _local_
context was cancelled, indicating that we are shutting down and thus
the watch-pods-client goroutine should exit.

This works because the gRPC library converts a local context.Canceled
error into a gRPC error. And, in gRPC before 1.45, if a server handler
returned context.Canceled, the returned gRPC error would have
status.Unknown, and thus not trigger this exit behavior.

As of gRPC 1.45, however, a context.Canceled error returned by a
server handler will also result in a gRPC error with status.Canceled [0],
meaning that the previous code will force the goroutine to exit in
response to a server-side error. From my reading of this code, it
appears we want to retry all server-side errors.

To account for this, we now only break out of the retry loop if our
local context is done.

Further, I've changed the test directory server implementation to
return an arguably more appropriate error when it is shutting down.

Fixes cockroachdb#78197

Release note: None
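A minimal sketch of the change this commit message describes (hypothetical; the function and parameter names are illustrative and the real diff is in PR #78490): only a done local context ends the task, and every other error, including a remote context.Canceled, is retried.

package sketch

import (
	"context"
	"log"
)

// postFixWatchLoop is a hypothetical sketch of the behavior after the fix.
// watchOnce again stands in for one WatchPods subscription attempt; it is
// not a real cockroach function.
func postFixWatchLoop(ctx context.Context, watchOnce func(context.Context) error) {
	for {
		err := watchOnce(ctx)
		if ctx.Err() != nil {
			// Only a cancelled local context ends the task: the proxy
			// itself is shutting down.
			return
		}
		// Every other error, including a server-side context.Canceled that
		// gRPC v1.45+ surfaces as codes.Canceled, is retried by looping
		// around and resubscribing.
		if err != nil {
			log.Printf("watching pods failed, retrying: %v", err)
		}
	}
}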
stevendanna added a commit to stevendanna/cockroach that referenced this issue Mar 25, 2022
craig bot pushed a commit that referenced this issue Mar 28, 2022
78241: kvserver: de-flake TestReplicaCircuitBreaker_RangeFeed r=erikgrinaker a=tbg

Fixes #76856.

Release note: None


78312: roachtest: improve debugging in transfer-leases r=erikgrinaker a=tbg

This test failed once and we weren't able to figure out why; having
the range status used by the test would've been useful.

Now this is saved and so the next time it fails we'll have more to
look at.

Closes #75438.

Release note: None


78422: roachtest: bump max wh for weekly tpccbench/nodes=12/cpu=16 r=srosenberg a=tbg

[It was maxing out, reliably.](https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D12%2Fcpu%3D16&tab=gce)

Release note: None


78490: sqlproxyccl: exit pod-watcher-client on local context cancellation r=jaylim-crl,darinpp a=stevendanna

Fixes #78197

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
craig bot closed this as completed in aea1915 Mar 28, 2022
DB Server & Security automation moved this from To do to Done 21.2 Mar 28, 2022