
Unhandled ExecuteBatchError leaves gRPC AsyncIO API in a permanently degraded state #31570

Closed
jecknig opened this issue Nov 7, 2022 · 10 comments

Comments

@jecknig

jecknig commented Nov 7, 2022

What version of gRPC and what language are you using?

grpcio version 1.47.0
Python version 3.8.10

What operating system (Linux, Windows,...) and version?

Docker image: python:3.8-slim-bullseye

What runtime / compiler are you using (e.g. python version or version of gcc)

Docker container is running on Google Kubernetes Engine, Version 1.22.15-gke.100, in zone europe-west4-a.

What did you do?

We have a gRPC API deployed on Kubernetes that uses the gRPC AsyncIO API and defines two simple RPCs: one serves around 200 requests per minute per replica, and the other is only called by a readiness probe, once every 5 seconds per replica. Two replicas are deployed.
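
For context, a minimal sketch of that setup (the service, RPC, and message names here are hypothetical placeholders, not our real proto) looks roughly like this:

import asyncio
import grpc

# example_pb2 / example_pb2_grpc stand in for the generated proto modules.
import example_pb2
import example_pb2_grpc


class ExampleService(example_pb2_grpc.ExampleServiceServicer):
    async def Predict(self, request, context):
        # Main RPC, roughly 200 requests per minute per replica.
        return example_pb2.PredictReply(result="ok")

    async def HealthCheck(self, request, context):
        # Called only by the Kubernetes readiness probe, every 5 seconds.
        return example_pb2.HealthReply(healthy=True)


async def serve() -> None:
    server = grpc.aio.server()
    example_pb2_grpc.add_ExampleServiceServicer_to_server(ExampleService(), server)
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())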

The API had been running fine for about a year. But recently we had an incident in which both replicas logged the same error message at almost the same time:

Task exception was never retrieved
future: <Task finished name='Task-105440' coro=<<coroutine without __name__>()> exception=ExecuteBatchError('Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x7fe69cc0c4a0>, <grpc._cython.cygrpc.SendMessageOperation object at 0x7fe69cb6f5b0>, <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x7fe69ca84ee0>)')>
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 705, in _handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 796, in _handle_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 452, in _finish_handler_with_unary_response
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 98, in execute_batch
grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x7fe69cc0c4a0>, <grpc._cython.cygrpc.SendMessageOperation object at 0x7fe69cb6f5b0>, <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x7fe69ca84ee0>)

What did you expect to see?

I would have expected one of these two things to happen:

  • The exception is re-raised by grpcio to the surrounding Python code, so that the application can decide how to handle it (e.g. not catch it and let Kubernetes restart the pod as a result; see the sketch after this list).
  • The exception is handled gracefully by grpcio, and the API continues serving requests as before, with the same response times as before.
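
Not part of grpcio, just a workaround sketch for the first option: install an asyncio exception handler so that an otherwise-swallowed ExecuteBatchError crashes the process and lets Kubernetes restart the pod. The handler name is ours, and the exception is matched by class name on purpose, to avoid importing the private grpc._cython module.

import asyncio
import os
import sys


def crash_on_execute_batch_error(loop, context):
    # Keep the normal "Task exception was never retrieved" log line.
    loop.default_exception_handler(context)
    exc = context.get("exception")
    if exc is not None and type(exc).__name__ == "ExecuteBatchError":
        print("Unhandled ExecuteBatchError, exiting so the pod gets restarted",
              file=sys.stderr)
        # os._exit is used because this handler can run during garbage
        # collection, where a raised SystemExit would not terminate the process.
        os._exit(1)


async def serve() -> None:
    asyncio.get_running_loop().set_exception_handler(crash_on_execute_batch_error)
    ...  # start the grpc.aio server as usual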

What did you see instead?

  • The API did continue serving requests, but at permanently degraded performance (i.e. much slower response times). The incident could only be resolved by restarting the API pods manually.

Anything else we should know about your project / environment?

Unfortunately, I can't give detailed instructions on how to reproduce the exact circumstances of this bug, since it happened more or less randomly for us after the API had already been running fine for around a year. However, the bug occurred almost simultaneously in both replicas of the API. This leads us to believe it was some infrastructure-related issue that triggered the exception.

Possibly related to #31527 or #31043.

@gnossen
Contributor

gnossen commented Dec 16, 2022

@XuanWang-Amos Ping on this.

@XuanWang-Amos
Contributor

I'm currently investigating this issue. It looks like this error is on the server side; I will need some time to dig into this.

@njhill
Contributor

njhill commented Dec 27, 2022

Just a note to say that I also encountered this recently with 1.51.1 and python 3.9.

@XuanWang-Amos
Contributor

The error message Failed "execute_batch" indicates that we encountered some issue in gRPC core: it failed to process the operations passed from gRPC Python down to gRPC core. In order to debug this further, we'll need additional logs from core.

To everyone who has a similar issue: please set the following two environment variables when starting gRPC and paste the logs so we can help debug further:
GRPC_VERBOSITY=debug GRPC_TRACE=all,-timer,-timer_check
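
For anyone unsure where to put these: they are plain environment variables read by gRPC core, so exporting them in the container spec or shell is enough. If you prefer to set them from Python, set them before the first import of grpc to be safe, roughly:

import os

# Set these before gRPC core is initialized, i.e. before "import grpc".
os.environ.setdefault("GRPC_VERBOSITY", "debug")
os.environ.setdefault("GRPC_TRACE", "all,-timer,-timer_check")

import grpc  # noqa: E402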

In the meantime, I'll create a PR to enhance the error message so we have more information on the Python layer too.

@njhill
Contributor

njhill commented Jan 20, 2023

@XuanWang-Amos I was able to repro with the extra debug turned on; please see the attached log. I'm pretty sure this is related to the client closing/abandoning the call early.

pygrpc-debug.log.gz
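
Roughly the kind of client behaviour I mean, sketched with placeholder stub/message names rather than the real ones:

import asyncio
import grpc

import example_pb2
import example_pb2_grpc


async def cancel_early() -> None:
    async with grpc.aio.insecure_channel("localhost:50051") as channel:
        stub = example_pb2_grpc.ExampleServiceStub(channel)
        call = stub.Predict(example_pb2.PredictRequest())
        await asyncio.sleep(0.01)  # let the RPC reach the server
        call.cancel()              # abandon the call before the response arrives


asyncio.run(cancel_early())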

@AMontagu

I can confirm that we have this issue with grpc-web when the client reloads the web page while the stream is running.

@XuanWang-Amos
Contributor

Thanks for the log. It looks like the client received RST_STREAM with error code 8:

UNKNOWN:Error received from peer unix: {grpc_message:"Received RST_STREAM with error code 8", grpc_status:1, created_time:"2023-01-19T23:19:28.385297862+00:00"}

RST_STREAM with error code 8 should be mapped to CANCELLED when sent by a server. In our code we handle it by throwing an ExecuteBatchError here, which looks correct so far, but when we process this error in server.pyx.pxi we only considered client-side cancellation and re-raise it otherwise, so this exception was left unhandled in the event loop.
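
A self-contained illustration of that failure mode, independent of gRPC (the names here are made up for the demo): an exception raised inside a task that nobody awaits is only reported once the task is garbage-collected.

import asyncio


async def doomed() -> None:
    raise RuntimeError("simulated ExecuteBatchError")


async def main() -> None:
    asyncio.ensure_future(doomed())  # fire-and-forget; nobody awaits the result
    await asyncio.sleep(0.1)


asyncio.run(main())
# When the abandoned task is garbage-collected, asyncio logs
# "Task exception was never retrieved", just like the server traceback above.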

We'll discuss internally to see how should we proceed from here.

@cprajakta

@XuanWang-Amos - Any updates on this? We are also facing a similar issue on our end, as mentioned in the description.

@XuanWang-Amos
Contributor

A PR was merged so that we'll no longer throw ExecuteBatchError: #32551.

But it's unclear whether that will also fix the degraded-performance issue. I'm adding the requires reporter action label, so the issue will be automatically closed in one month, but feel free to comment here if the performance issue still exists.

@XuanWang-Amos
Contributor

Looks like it's not happening anymore; closing this issue now.

Again, feel free to comment here if the performance issue still exists.
