Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2 ERR_HTTP2_SESSION_ERROR #2647
Comments
This message means that the server abruptly closed the connection with error code 2, which means "internal error". This in turn caused the requests to fail. The client is just reporting the error from the server. My guess is that the server gets overloaded by the number of concurrent requests you are sending.
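If overload is the cause, one way to test that hypothesis is to cap the number of in-flight RPCs on the client. The sketch below is not from this issue: the stub method name (getBookList) and the callback-style unary API are assumptions about the reporter's setup, and the limit value is illustrative.

```js
// Minimal sketch: run `totalRequests` unary calls while keeping at most
// `limit` of them in flight at any time. `client` is assumed to be a
// grpc-js stub with a callback-style unary method `getBookList`.
function runWithLimit(client, totalRequests, limit) {
  return new Promise((resolve, reject) => {
    let inFlight = 0;
    let started = 0;
    let finished = 0;
    const launchNext = () => {
      // Start new calls until we hit the concurrency cap or the total.
      while (inFlight < limit && started < totalRequests) {
        inFlight++;
        started++;
        client.getBookList({}, (err) => {
          inFlight--;
          finished++;
          if (err) return reject(err);
          if (finished === totalRequests) return resolve();
          launchNext();
        });
      }
    };
    launchNext();
  });
}
```

Usage would be something like `await runWithLimit(client, 300, 50)`; if the error disappears when the cap is low and reappears as it grows, that points at load rather than a protocol bug.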
@murgatroid99 Hi, from what I can see the problem comes from the client, because I did the same tests with the server deployed in Kubernetes with several pods with the capacity to take the high load. I see the server replying to all the requests, but suddenly the client shuts down with this error. When I have a concurrency problem, I receive the error "RPC limit exceeded" in the client, but in this case I always get a response, never that error. It looks like the problem appears when the client increases the number of requests it sends.
First, before we investigate this further, you should verify that this error still occurs on the latest @grpc/grpc-js 1.9.x version. The error message says that the HTTP/2 session ended with error code 2. The gRPC client code is not deliberately ending sessions with that code, so the default assumption is that that error comes from the server/proxy, but it is possible that the Node runtime on the client is generating that error. This can be verified with a tcpdump of the traffic the client is sending and receiving. That will show definitively whether or not the server sent that error.
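For reference, a capture along these lines, run in the client container or pod network namespace, would settle where the GOAWAY and error code originate. The interface choice and the port 8081 (taken from the logs later in this thread) are assumptions about the setup:

```sh
# Capture the client's gRPC traffic to a pcap file for offline inspection.
tcpdump -i any -w grpc-client.pcap 'tcp port 8081'
```

The resulting pcap can then be opened in Wireshark, which can decode plaintext HTTP/2 frames and show which side actually sent the GOAWAY or RST_STREAM.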
@murgatroid99 Yes, this error is also happening with node 18.16.0, 20.11.0, 21.6.0, and also with grpc-js 1.9.14 |
@murgatroid99 Here I have more logs. The Envoy sidecar in the client shows:
Which looks like it is saying the client is sending the goaway because a callback function failed, and then proceeds to reset the stream. Then in the application I see these relevant lines:
Where I also see the line "Http2Session client: submitting goaway", which looks like it shows the client is sending the goaway, confirming the Envoy sidecar log. This log line comes from the Node.js HTTP/2 internals. In the relevant Envoy logs from the client and server I see the following (there are a lot of requests like these):

ENVOY CLIENT LOG - Request 7448123a-6637-4550-80d0-ea519c22cfa0
2024-01-27T07:18:14.861767964Z stdout F {"path":"/rpc.bookshop.v1.InventoryService/GetBookList","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","downstream_local_address":"10.96.13.98:8081","upstream_host":"10.244.1.35:8081","start_time":"2024-01-27T07:18:09.524Z","downstream_peer_uri_san":null,"request_id":"7448123a-6637-4550-80d0-ea519c22cfa0","dd.trace_id":null,"upstream_transport_failure_reason":null,"upstream_service_time":null,"requested_server_name":null,"response_code_details":"codec_error:The_user_callback_function_failed","bytes_received":5,"route_name":"default","response_flags":"DPE","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","duration":5333,"user_agent":"grpc-node-js/1.9.14","x_forwarded_for":null,"response_code":0,"x_envoy_external_address":null,"x_forwarded_proto":"http","connection_termination_details":null,"protocol":"HTTP/2","upstream_local_address":"10.244.1.36:56188","x_forwarded_client_cert":null,"downstream_remote_address":"10.244.1.36:34518","method":"POST","bytes_sent":0}

The client error seems to say there is a codec error. The Envoy flag DPE means "The downstream request had an HTTP protocol error", and because this is the client Envoy, the downstream is the Node.js client application container using grpc-js, which is having the protocol error.

ENVOY SERVER LOG - Request 7448123a-6637-4550-80d0-ea519c22cfa0
2024-01-27T07:18:14.842076464Z stdout F {"upstream_service_time":"5175","x_envoy_external_address":null,"downstream_peer_uri_san":"spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","dd.trace_id":null,"protocol":"HTTP/2","route_name":"default","response_code_details":"via_upstream","x_forwarded_for":null,"response_flags":"-","duration":5227,"requested_server_name":"outbound_.8081_._.grpc-server.mcaruso-poc.svc.cluster.local","start_time":"2024-01-27T07:18:09.599Z","path":"/rpc.bookshop.v1.InventoryService/GetBookList","connection_termination_details":null,"response_code":200,"user_agent":"grpc-node-js/1.9.14","downstream_remote_address":"10.244.1.36:56188","bytes_received":5,"upstream_cluster":"inbound|8081||","downstream_local_address":"10.244.1.35:8081","method":"POST","upstream_transport_failure_reason":null,"upstream_local_address":"127.0.0.6:42447","x_forwarded_client_cert":"By=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-server;Hash=31094800fb20c490cceba2bd1ea0b3ed257f16934f4d2fc8858b32d79ad90e9e;Subject="";URI=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","bytes_sent":106,"upstream_host":"10.244.1.35:8081","request_id":"7448123a-6637-4550-80d0-ea519c22cfa0","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","x_forwarded_proto":"http"}

ENVOY CLIENT LOG - Request 9664814f-e11d-43e4-ba61-df8293f63552
2024-01-27T07:18:14.825414881Z stdout F {"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","protocol":"HTTP/2","path":"/rpc.bookshop.v1.InventoryService/GetBookList","dd.trace_id":null,"route_name":"default","response_code_details":"codec_error:The_user_callback_function_failed","bytes_received":5,"requested_server_name":null,"upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","start_time":"2024-01-27T07:18:09.576Z","downstream_peer_uri_san":null,"user_agent":"grpc-node-js/1.9.14","x_forwarded_proto":"http","bytes_sent":0,"x_envoy_external_address":null,"request_id":"9664814f-e11d-43e4-ba61-df8293f63552","x_forwarded_for":null,"upstream_transport_failure_reason":null,"method":"POST","response_flags":"DPE","duration":5243,"x_forwarded_client_cert":null,"response_code":0,"upstream_service_time":null,"upstream_host":"10.244.1.35:8081","downstream_local_address":"10.96.13.98:8081","downstream_remote_address":"10.244.1.36:34518","upstream_local_address":"10.244.1.36:56188","connection_termination_details":null}

ENVOY SERVER LOG - Request 9664814f-e11d-43e4-ba61-df8293f63552
2024-01-27T07:18:14.894895339Z stdout F {"connection_termination_details":null,"method":"POST","request_id":"9664814f-e11d-43e4-ba61-df8293f63552","upstream_service_time":null,"dd.trace_id":null,"upstream_host":"10.244.1.35:8081","bytes_received":5,"route_name":"default","requested_server_name":"outbound_.8081_._.grpc-server.mcaruso-poc.svc.cluster.local","x_forwarded_proto":"http","bytes_sent":0,"downstream_remote_address":"10.244.1.36:56188","upstream_transport_failure_reason":null,"start_time":"2024-01-27T07:18:09.641Z","response_code":0,"upstream_local_address":"127.0.0.6:42447","x_forwarded_for":null,"response_flags":"-","response_code_details":"http2.remote_reset","x_forwarded_client_cert":"By=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-server;Hash=31094800fb20c490cceba2bd1ea0b3ed257f16934f4d2fc8858b32d79ad90e9e;Subject="";URI=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","x_envoy_external_address":null,"downstream_peer_uri_san":"spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","downstream_local_address":"10.244.1.35:8081","upstream_cluster":"inbound|8081||","path":"/rpc.bookshop.v1.InventoryService/GetBookList","protocol":"HTTP/2","user_agent":"grpc-node-js/1.9.14","duration":5251,"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081"}

Here we see in the server response that the remote client reset the stream.
As I commented on the corresponding Istio issue, the first line of the client log shows that the client received a goaway before it sent one. So the primary cause of the connection closing is the client receiving a goaway, not sending one.
@murgatroid99 That is not the issue; we can see there that the Envoy on the client side is generating the goaway. But the goaway here is not the problem. The goaway happens because there are a bunch of requests with errors. I had thousands of requests with the same error on the Envoy client side: "response_code_details":"codec_error:The_user_callback_function_failed","response_flags":"DPE". The goaway is the end result of this: because there are several errors, the client proceeds to reset all the streams. And we know it is the client resetting the streams, because the Envoy server sidecar shows "response_code_details":"http2.remote_reset". We need to focus on the codec error, not on the goaway.
The client log shows that the client process received a goaway and the cited error message indicates that the requests failed because the client process received that goaway. Do you have any logs that show that any requests failed on the client before the client received that goaway?
@murgatroid99 First, I see a lot of requests like this on the Envoy client side, where they are ok:

2024-01-27T07:18:14.817417547Z stdout F {"bytes_received":5,"connection_termination_details":null,"downstream_peer_uri_san":null,"upstream_local_address":"10.244.1.36:56188","method":"POST","x_envoy_external_address":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","upstream_host":"10.244.1.35:8081","requested_server_name":null,"start_time":"2024-01-27T07:18:09.537Z","path":"/rpc.bookshop.v1.InventoryService/GetBookList","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","request_id":"4d2ba9ed-e3dd-4cd2-8ef4-87369be87a99","protocol":"HTTP/2","downstream_remote_address":"10.244.1.36:34518","x_forwarded_client_cert":null,"route_name":"default","user_agent":"grpc-node-js/1.9.14","x_forwarded_proto":"http","response_code":200,"upstream_service_time":"5215","bytes_sent":106,"duration":5269,"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","response_flags":"-","x_forwarded_for":null,"upstream_transport_failure_reason":null,"response_code_details":"via_upstream"}

Then, at some point after a lot of successful requests like the one above, I see the goaway is sent by the Envoy client side:

2024-01-27T07:18:14.817867Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1208 [Tags: "ConnectionId":"7"] sent goaway code=2 thread=25

And after that I see a lot of requests on the Envoy client side with the codec error, like these:

2024-01-27T07:18:14.825341714Z stdout F {"bytes_sent":0,"response_flags":"DPE","user_agent":"grpc-node-js/1.9.14","upstream_service_time":null,"protocol":"HTTP/2","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","duration":5240,"upstream_host":"10.244.1.35:8081","x_envoy_external_address":null,"method":"POST","downstream_peer_uri_san":null,"bytes_received":5,"x_forwarded_for":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","start_time":"2024-01-27T07:18:09.577Z","upstream_local_address":"10.244.1.36:56188","requested_server_name":null,"connection_termination_details":null,"response_code_details":"codec_error:The_user_callback_function_failed","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","downstream_remote_address":"10.244.1.36:34518","upstream_transport_failure_reason":null,"path":"/rpc.bookshop.v1.InventoryService/GetBookList","response_code":0,"request_id":"23cf4d24-9caa-4d24-b3ca-a136b81ae46d","route_name":"default","x_forwarded_proto":"http","x_forwarded_client_cert":null}

And after that, the client app receives the goaway and closes the connection (here are the app container logs):

2024-01-27T07:18:14.951034047Z stderr F HTTP2 18: Http2Session client: goaway 2 received [last stream id: 3999]

The server logs don't show anything about a goaway; this all happens inside the client pod, between the Envoy and the app container.
This is the repository for the |
Ok, makes sense. If I try this with Istio 1.18.5, this does not happen. It only happens with Istio 1.19.X and up. |
The only thing that catches my attention in the grpc-js library is that when I search for this codec_error: The_user_callback_function_failed, I get this:
And I see the client Node app erroring in the callback function:
2024-01-27T07:18:14.951058131Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
I don't think there is any relation between those things. The Node log there shows the code providing an error to a callback, not an error occurring in a callback. |
The reason is this silly rate limiter in Node.js (50121):
#define NGHTTP2_DEFAULT_STREAM_RESET_BURST 1000
@bbit8 I don't see any relation between that rate limit and this bug. This bug is about requests closing because of GOAWAYs, not any large volume of RST_STREAMs. |
@murgatroid99 In fact it is the opposite of what you say: I see tons of RST_STREAMs. I think it is related to what @bbit8 is mentioning.
v1.9.14 18 | subchannel_call | [7722] HTTP/2 stream closed with code 2
Those lines are everywhere. The goaway happens because there was an error previously, and it is related to the RST_STREAMs. The goaway log happens after I see a huge volume of RST_STREAM errors.
The GOAWAY is a prevention mechanism of the Envoy sidecar in Istio: when it receives a lot of RST_STREAMs from the gRPC client, it considers it an attack, so it sends a goaway to the client. But the initial problem here is why the client is sending a lot of RST_STREAMs, and it is because there is an error on the client side; it might be grpc-js or Node.js, or what @bbit8 is mentioning.
This is the property of Envoy that does this: "envoy.restart_features.send_goaway_for_premature_rst_streams". When it sees a lot of premature stream resets, it sends a goaway. And this is what is happening here.
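For anyone who wants to experiment with that behavior directly, outside of Istio that runtime guard can be toggled in a plain Envoy bootstrap, roughly as sketched below. The layer name is arbitrary, and how to surface runtime flags through an Istio sidecar is a separate question; this also only masks the symptom and does not explain the flood of resets from the client.

```yaml
# Sketch: disabling the premature-RST_STREAM GOAWAY guard via Envoy's
# layered runtime in the bootstrap config.
layered_runtime:
  layers:
    - name: static_layer_0
      static_layer:
        envoy.restart_features.send_goaway_for_premature_rst_streams: false
```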
To be clear, those logs do not correspond to actual RST_STREAM frames on the wire. They are synthetic RST_STREAM events generated by Node that correspond to the GOAWAY.
@mcaruso85 @murgatroid99 I didn't dive deep. I just downgraded the Node.js version from 18.18.2 to 18.18.1. Problem solved! Something is wrong with the rate limiter in nghttp2.
For the record, the purpose of that linked patch in Node.js was to fix an HTTP/2 protocol vulnerability. I would not advise downgrading below that patch. @bbit8 the original issue reports that this problem occurs on Node 18.16.0, so if your problem was fixed by downgrading that way, it is likely different than the error originally reported here. |
Yes @bbit8, that didn't work for me; it still fails with that version.
This is a very general error: a connection closed because of an internal server error. There are probably a variety of causes that result in the same error. |
This problem may be related to #2625. We are trying out a fix for that in grpc-js 1.10.4, so I recommend checking that out. |
@murgatroid99 agree, initial testing shows that We can also resolve it by switching envoy to use the
@murgatroid99 my initial tests show that 1.10.4 also resolves this for us. I'll report back too once I have it in production. Thanks for your help!
This is a very general error. It says "Internal server error" in the text because one of the causes is an internal server error. I don't know why you're still seeing it, but the cause of your errors may be entirely unrelated to the one the patch in 1.10.4 was trying to address. |
I am also still receiving |
Problem description
What I'm doing?
I have a basic gRPC JS client and a gRPC Python server.
The gRPC client does 60 iterations; in each iteration it fires 300 gRPC requests in parallel to the gRPC server. The server processing is pretty simple: it sleeps 5 seconds and responds with success.
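A minimal sketch of that load pattern with @grpc/grpc-js is shown below. This is not the reporter's actual code: the proto file name, the request shape, and the target address are assumptions, and the service path rpc.bookshop.v1.InventoryService/GetBookList is taken from the Envoy logs earlier in this thread.

```js
// Sketch of the described load loop: 60 iterations x 300 parallel unary calls.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Assumed proto file defining rpc.bookshop.v1.InventoryService/GetBookList.
const packageDef = protoLoader.loadSync('bookshop.proto');
const proto = grpc.loadPackageDefinition(packageDef);
const client = new proto.rpc.bookshop.v1.InventoryService(
  'grpc-server.mcaruso-poc.svc.cluster.local:8081',
  grpc.credentials.createInsecure()
);

// Promisified unary call; the server is expected to sleep ~5s and reply.
function getBookList() {
  return new Promise((resolve, reject) => {
    client.getBookList({}, (err, res) => (err ? reject(err) : resolve(res)));
  });
}

async function main() {
  for (let i = 0; i < 60; i++) {
    // 300 requests in flight at once per iteration.
    await Promise.all(Array.from({ length: 300 }, () => getBookList()));
    console.log(`iteration ${i + 1} done`);
  }
}

main().catch((err) => console.error('failed:', err));
```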
What I expected to happen?
I expect to finish all 60 iterations.
What is actually happening?
The script does not reach the 60 iterations. It starts out completing iterations with all responses successful, but at approximately iteration 9 it fails with the following error:
Environment
Docker image: node:18.15.0-slim
Package name and version: "@grpc/grpc-js": "1.8.21"
EKS 1.25 - Istio 1.19.4
This error also happens with Node 18.16.0, 20.11.0, and 21.6.0, and also with grpc-js 1.9.14.
Additional context
If I do 100 requests in parallel, it finishes all 60 iterations; the problem starts when I increase the number of parallel requests. If I do 600 requests in parallel instead of 300, it throws the error even earlier, at approximately iteration 4. If I do 1000, it fails with the same error in iteration 1.