When I perform performance testing, the grpc client receives Call canceled. Why is this? #2641
Comments
First, the cancelled calls are almost certainly unrelated to the idle timeout setting, or to the large number of timers you are seeing. They may instead be caused by the server getting overloaded and shedding load by cancelling requests. Based on the numbers in your report, this is causing errors at a rate of less than 0.1%, so I am not concerned.

Second, there are a couple of possible causes for the accumulation of timers. Your "Timeout object" image shows a keepalive timer. There is only one keepalive timer per connection, so if that object is representative of all of them, that likely means you are creating a lot of connections. If you are creating a separate client object for each request, that could cause this behavior; it would be better to create a single client and use it to make every request.

On the other hand, if that keepalive timer is not representative, and most of the timers are idle timers, then you may instead be encountering a known inefficiency in how idle timers are handled. Specifically, if your test always makes a single request, waits for it to complete, and then makes another request, that will cause a lot of timers to be cancelled.
This could also be caused by creating many clients. Again, if you are doing that, you should instead create a single client and use it to make all requests.
Thank you. I would like to confirm whether creating a service client through `new` essentially means establishing a connection.
For instance, in my scenario, the client needs to connect to a dozen or even several tens of services. Could this cause any issues? Are there best-practice solutions, or code examples, for creating clients for multiple services? These observations do suggest that a large number of clients might be created, so your response is greatly appreciated.
There is nothing wrong with creating a separate client object for each service you want to connect to. The problem is creating a separate client object for each request to the same service. I was only guessing that you were doing that, based on the information you provided. If you share your test code, we will have a better chance of finding the causes of what you are observing.
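The recommended pattern can be sketched as follows. This is a minimal illustration, not code from this issue; `getClient`, `ClientCtor`, and the address strings are hypothetical names, with `ClientCtor` standing in for a generated `@grpc/grpc-js` client class:

```javascript
// Sketch: cache at most one client object per target service address
// and reuse it for every request, so connections and their
// keepalive/idle timers are shared instead of recreated per call.
const clientCache = new Map();

function getClient(address, ClientCtor, credentials) {
  let client = clientCache.get(address);
  if (client === undefined) {
    // Constructed once per service, then reused for every request.
    client = new ClientCtor(address, credentials);
    clientCache.set(address, client);
  }
  return client;
}
```

With this pattern, connecting to several dozen services creates several dozen clients in total, independent of request volume.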
In addition, there are no grpc options set here.
Those objects you see are socket channelz reference objects. There is one for each active connection, and with the way you construct your clients, they should be able to share connections, so there should be only one such object. But the sizes seem to indicate that there are 2000 of them, and I don't see why that would happen with the code you shared. Can you share trace logs from running your test? You can get them by running the client with the environment variables `GRPC_TRACE=all` and `GRPC_VERBOSITY=DEBUG`.
This is the log when the service starts:

This is the log from the stress test request:

I hope this is useful to you.
First, those logs contain interspersed logs from two different processes (PIDs 62 and 69), which makes them a little harder to follow. The two logs are snapshots separated by an hour, so I can only guess what happened in between, but the object ID numbers in the logs seem to indicate that both processes reconnected to the server several hundred times in that hour. This is very unusual behavior, so I recommend double-checking your server configuration to see why it might be causing that. I did find a code path that could occasionally leak those handles, but I don't think it accounts for 2000 of them.
Because my service starts 2 processes, you will see logs from both; this log covers a period of time during the stress test, not all of it. How can you tell that they have reconnected hundreds of times? Does that mean the server connection is being dropped, causing the client to reconnect? If it reconnects, will that cause the connection handle to leak? My understanding is that creating a lot of connections leads to leaks; I don't know if that is right, and if I am wrong, please correct me. Thank you very much. My server config options:
Also, which code path were you referring to that may leak?
You could enable logging on only one of them, or put each log into a different file.
Each channel, subchannel, and transport has an ID incrementing from 0 from the beginning of the process. Reconnecting involves creating new transports, and sometimes new subchannels. The logs from after an hour show transports with an ID of 530 in one process and 569 in the other process, indicating that new IDs have been generated hundreds of times, corresponding to hundreds of connections, or at least hundreds of connection attempts.
The logs you provided include lines that say "connection closed by GOAWAY with code 0 and data undefined", which indicates that the server deliberately closed the connection. In fact, the two processes together saw that happen 4 times in the 1 minute that the log spans.
The client does all of the relevant connection management. It cleans up connections that are no longer in use, so in general this behavior should not cause any leaks.
These options have the server send a ping to each connected client every 2 seconds, and it drops the connection if it doesn't get a response within 1 second. That is very aggressive, and it may result in the server frequently closing connections to clients that are just a little slow to respond. The recommended value of the keepalive time is 1 minute (60000 ms), and the default value of the keepalive timeout is 20 seconds (20000 ms). Also, the third and fourth options there have no effect on the server.
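As a sketch, the recommended values translate into the following server options. The option keys are real `@grpc/grpc-js` channel options; the surrounding setup is illustrative, not taken from this issue:

```javascript
// Recommended keepalive settings from the comment above:
// ping each connected client every 60 s, allow 20 s for the response.
const serverOptions = {
  'grpc.keepalive_time_ms': 60000,    // recommended; was 2000 in the reported config
  'grpc.keepalive_timeout_ms': 20000, // default; was 1000 in the reported config
};

// These would be passed to the server constructor, e.g.:
// const server = new grpc.Server(serverOptions);
```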
There is a race: if a subchannel's state changes while it is in the process of establishing a connection, the handle for that connection will be leaked. But this should be a rare event.
I fixed the bug I mentioned in version 1.9.14, so I recommend using that version for future testing.
Thank you very much for your answer!
Hi, did the fix help with leaking memory? |
@eegiazarov There are a lot of details in this issue that may not be relevant to your situation. I recommend opening a new issue with the details of the problem you are experiencing.
Thanks for your reply. The only thing I can say for sure at the moment is that we're making thousands of connections in rapid succession using different channels, each destroyed immediately (along with all the structures that come with it) after the exchange is made, and the library still seems to be leaking memory. Though there is a chance we're experiencing a protocol mismatch between different platforms.
Not sure if this is directly related to the issue, but we are also seeing some percentage of cancelled gRPC calls (connection closed by GOAWAY). Is this expected behavior? It would not be a problem, but the server actually processes the gRPC call while giving back a "call cancelled" error.
The `grpc.max_connection_age_grace_ms` server option may be relevant here; it bounds how long in-flight calls are allowed to continue after the server sends a GOAWAY.
We don't have the max_connection_age_grace_ms option set and only use unary calls that take < 250 ms. We will try max_connection_age_grace_ms to see if it helps.
I recommend trying again with grpc-js version 1.10.4, in case your issue is related to #2625. |
Problem description
When I perform performance testing, the grpc client receives "Call canceled".
I found that the memory of my online client service keeps growing and is not released. This service makes requests to a lot of gRPC services. I captured a heap memory snapshot and found many Timeout objects. Could this be caused by connections not being released?
Looking at grpc-js, I found the client option `grpc.client_idle_timeout_ms`, so I set it to 30s and ran a stress test. I found that the memory slowly decreases after it grows, perhaps because connections are being released. But I also found that during the stress test, out of 154,000 total requests, 98 were cancelled with "Call canceled". Is this because the client channel changed to the IDLE state and the clients threw "Call canceled"? If not, please explain. Thank you.
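For reference, the idle-timeout setting described above would look like this. The option key `grpc.client_idle_timeout_ms` is the one named in this issue; the client class and address in the comment are placeholders:

```javascript
// Client channel options used in this test: close the connection
// after 30 s with no active calls, so idle connections are released.
const clientOptions = {
  'grpc.client_idle_timeout_ms': 30000,
};

// Passed as the third argument of a generated client constructor, e.g.:
// const client = new MyServiceClient('my-server:50051', credentials, clientOptions);
```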
Before pressure test:
After pressure test:
Reproduction steps
Environment
Additional context