
Unhandled ExecuteBatchError leaves gRPC AsyncIO API in a permanently degraded state #31570

Closed
jecknig opened this issue Nov 7, 2022 · 10 comments

Comments

@jecknig

jecknig commented Nov 7, 2022

What version of gRPC and what language are you using?

grpcio version 1.47.0
Python version 3.8.10

What operating system (Linux, Windows,...) and version?

Docker image: python:3.8-slim-bullseye

What runtime / compiler are you using (e.g. python version or version of gcc)

Docker container is running on Google Kubernetes Engine, Version 1.22.15-gke.100, in zone europe-west4-a.

What did you do?

We have a gRPC API deployed on Kubernetes that uses the gRPC AsyncIO API and defines two simple RPCs: one serves around 200 requests per minute per replica, and the other is only called by a readiness probe, once every 5 seconds per replica. Two replicas are deployed.
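
For context, a minimal sketch of that setup (the service, RPC, and message names here are hypothetical placeholders, not our real proto) looks roughly like this:

import asyncio
import grpc

# example_pb2 / example_pb2_grpc stand in for the generated proto modules.
import example_pb2
import example_pb2_grpc


class ExampleService(example_pb2_grpc.ExampleServiceServicer):
    async def Predict(self, request, context):
        # Main RPC, roughly 200 requests per minute per replica.
        return example_pb2.PredictReply(result="ok")

    async def HealthCheck(self, request, context):
        # Called only by the Kubernetes readiness probe, every 5 seconds.
        return example_pb2.HealthReply(healthy=True)


async def serve() -> None:
    server = grpc.aio.server()
    example_pb2_grpc.add_ExampleServiceServicer_to_server(ExampleService(), server)
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())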

The API had been running fine for about a year. But recently we had an incident in which both replicas logged the same error message at almost the same time:

Task exception was never retrieved
future: <Task finished name='Task-105440' coro=<<coroutine without __name__>()> exception=ExecuteBatchError('Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x7fe69cc0c4a0>, <grpc._cython.cygrpc.SendMessageOperation object at 0x7fe69cb6f5b0>, <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x7fe69ca84ee0>)')>
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 705, in _handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 796, in _handle_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 452, in _finish_handler_with_unary_response
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 98, in execute_batch
grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x7fe69cc0c4a0>, <grpc._cython.cygrpc.SendMessageOperation object at 0x7fe69cb6f5b0>, <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x7fe69ca84ee0>)

What did you expect to see?

I would have expected one of these two things to happen:

  • The exception is re-raised by grpcio to the surrounding Python code, so that the application can decide how to handle it (e.g. not catch it and let Kubernetes restart the pod as a result; see the sketch after this list).
  • The exception is handled gracefully by grpcio, and the API continues serving requests as before, with the same response times as before.
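
Not part of grpcio, just a workaround sketch for the first option: install an asyncio exception handler so that an otherwise-swallowed ExecuteBatchError crashes the process and lets Kubernetes restart the pod. The handler name is ours, and the exception is matched by class name on purpose, to avoid importing the private grpc._cython module.

import asyncio
import os
import sys


def crash_on_execute_batch_error(loop, context):
    # Keep the normal "Task exception was never retrieved" log line.
    loop.default_exception_handler(context)
    exc = context.get("exception")
    if exc is not None and type(exc).__name__ == "ExecuteBatchError":
        print("Unhandled ExecuteBatchError, exiting so the pod gets restarted",
              file=sys.stderr)
        # os._exit is used because this handler can run during garbage
        # collection, where a raised SystemExit would not terminate the process.
        os._exit(1)


async def serve() -> None:
    asyncio.get_running_loop().set_exception_handler(crash_on_execute_batch_error)
    ...  # start the grpc.aio server as usual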

What did you see instead?

  • The API did continue serving requests, but at permanently degraded performance (i.e. much slower response times). The incident could only be resolved by restarting the API pods manually.

Anything else we should know about your project / environment?

Unfortunately, I can't give detailed instructions on how to reproduce the exact circumstances of this bug, since it happened more or less randomly for us after the API had already been running fine for around a year. However, the bug occurred almost simultaneously in both replicas of the API. This leads us to believe it was some infrastructure-related issue that triggered the exception.

Possibly related to #31527 or #31043.

@gnossen
Contributor

gnossen commented Dec 16, 2022

@XuanWang-Amos Ping on this.

@XuanWang-Amos
Contributor

I'm currently investigating this issue. It looks like this error is on the server side; I will need some time to dig into this.

@njhill
Contributor

njhill commented Dec 27, 2022

Just a note to say that I also encountered this recently with 1.51.1 and python 3.9.

@XuanWang-Amos
Contributor

The error message Failed "execute_batch" indicates that we encountered some issue in gRPC core: it failed to process the operations passed from gRPC Python down to gRPC core. In order to debug this further, we'll need additional logs from core.

To everyone who has a similar issue: please set the following two environment variables when starting gRPC and paste the logs so we can help debug further:
GRPC_VERBOSITY=debug GRPC_TRACE=all,-timer,-timer_check
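
For anyone unsure where to put these: they are plain environment variables read by gRPC core, so exporting them in the container spec or shell is enough. If you prefer to set them from Python, set them before the first import of grpc to be safe, roughly:

import os

# Set these before gRPC core is initialized, i.e. before "import grpc".
os.environ.setdefault("GRPC_VERBOSITY", "debug")
os.environ.setdefault("GRPC_TRACE", "all,-timer,-timer_check")

import grpc  # noqa: E402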

In the meantime, I'll create a PR to enhance the error message so we have more information on the Python layer too.

@njhill
Contributor

njhill commented Jan 20, 2023

@XuanWang-Amos I was able to repro with the extra debug turned on; please see the attached log. I'm pretty sure this is related to the client closing/abandoning the call early.

pygrpc-debug.log.gz
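
Roughly the kind of client behaviour I mean, sketched with placeholder stub/message names rather than the real ones:

import asyncio
import grpc

import example_pb2
import example_pb2_grpc


async def cancel_early() -> None:
    async with grpc.aio.insecure_channel("localhost:50051") as channel:
        stub = example_pb2_grpc.ExampleServiceStub(channel)
        call = stub.Predict(example_pb2.PredictRequest())
        await asyncio.sleep(0.01)  # let the RPC reach the server
        call.cancel()              # abandon the call before the response arrives


asyncio.run(cancel_early())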

@AMontagu

I can confirm that we have this issue with grpc-web when the client reloads the web page while the stream is running.

@XuanWang-Amos
Contributor

Thanks for the log. It looks like the client received RST_STREAM with error code 8:

UNKNOWN:Error received from peer unix: {grpc_message:"Received RST_STREAM with error code 8", grpc_status:1, created_time:"2023-01-19T23:19:28.385297862+00:00"}

RST_STREAM with error code 8 should be mapped to CANCELLED when sent by a server. In our code we handle it by throwing an ExecuteBatchError here, which looks correct so far, but when we process this error in server.pyx.pxi we only considered client-side cancellation and re-raise it otherwise, so this exception was left unhandled in the event loop.
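
A self-contained illustration of that failure mode, independent of gRPC (the names here are made up for the demo): an exception raised inside a task that nobody awaits is only reported once the task is garbage-collected.

import asyncio


async def doomed() -> None:
    raise RuntimeError("simulated ExecuteBatchError")


async def main() -> None:
    asyncio.ensure_future(doomed())  # fire-and-forget; nobody awaits the result
    await asyncio.sleep(0.1)


asyncio.run(main())
# When the abandoned task is garbage-collected, asyncio logs
# "Task exception was never retrieved", just like the server traceback above.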

We'll discuss internally to see how should we proceed from here.

@cprajakta

@XuanWang-Amos - Any updates on this? We are also facing a similar issue on our end, as mentioned in the description.

@XuanWang-Amos
Contributor

A PR was merged so that we'll no longer throw ExecuteBatchError: #32551.

But it's unclear whether that will also fix the degraded-performance issue. I'm adding the requires reporter action label, so the issue will be automatically closed in one month, but feel free to comment here if the performance issue still exists.

@XuanWang-Amos
Contributor

Looks like it's not happening anymore; closing this issue now.

Again, feel free to comment here if the performance issue still exists.
