
When I perform performance testing, the grpc client receives Call canceled. Why is this? #2641

lwmxiaobei opened this issue Jan 8, 2024 · 22 comments

@lwmxiaobei

lwmxiaobei commented Jan 8, 2024

Problem description

When I perform performance testing, the grpc client receives Call canceled.
I found that the memory usage of my online client service keeps growing and does not come back down. This service makes requests to many grpc services. I took a heap snapshot and found a large number of Timeout objects. Does this mean the connections are not being released?
Looking at grpc-js, I found the channel option grpc.client_idle_timeout_ms, so I set it to 30s and ran another stress test. This time the memory slowly decreased after growing, maybe because the connections were being released? But during the stress test, 98 out of 154,000 total requests failed with Call canceled. Is this because the client channel changed to the IDLE state, causing the clients to throw Call canceled? If not, please explain. Thank you.
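
For reference, this is roughly how I set that option when constructing a client (a minimal sketch; the proto file, package path, and endpoint here are placeholders, not my real configuration):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Placeholder proto file and address, for illustration only
const packageDef = protoLoader.loadSync('my_service.proto', { keepCase: true });
const proto = grpc.loadPackageDefinition(packageDef);

// Let the channel go IDLE after 30 seconds without activity
const client = new proto.mypackage.MyService('localhost:50051',
    grpc.credentials.createInsecure(),
    { 'grpc.client_idle_timeout_ms': 30000 });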

Before pressure test:
[screenshot]

After pressure test:
[screenshot]

Reproduction steps

Environment

  • OS name, version and architecture: Alpine Linux v3.16(docker)
  • Node version: v14.21.2
  • Grpc version: v1.9.13

Additional context

@lwmxiaobei
Author

Timeout object:
[screenshot]

@murgatroid99
Member

First, the cancelled calls are almost certainly unrelated to the idle timeout setting, or to the large number of timers you are seeing. The cancelled calls may be caused by the server getting overloaded and shedding load by cancelling requests. Based on the numbers in your report, this is causing errors at a rate of less than 0.1%, so I am not concerned.

Second, there are a couple of possible causes for the accumulation of timers that you are seeing. Your "Timeout object" image shows a keepalive timer. There is only one keepalive timer per connection, so if that object is representative of all of them, that likely means that you are creating a lot of connections. If you are creating a separate client object for each request, that could cause that behavior. It would be better to create a single client and use it to make every request.
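
For example, something along these lines (just a sketch; the proto file, package, method, and endpoint are placeholders, not your actual code):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const packageDef = protoLoader.loadSync('my_service.proto', { keepCase: true });
const proto = grpc.loadPackageDefinition(packageDef);

// Create the client once, at startup...
const client = new proto.mypackage.MyService('localhost:50051',
    grpc.credentials.createInsecure());

// ...and reuse it for every request, instead of constructing a new client
// (and therefore new connections and timers) for each request.
function callMyMethod(request) {
    return new Promise((resolve, reject) => {
        client.myMethod(request, (err, res) => (err ? reject(err) : resolve(res)));
    });
}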

On the other hand, if that keepalive timer is not representative, and most of the timers are idle timers, then you may instead be encountering a known inefficiency in how idle timers are handled. Specifically, if your test always makes a single request, waits for it to complete, and then makes another request, then that will cause a lot of timers to be cancelled.

@lwmxiaobei
Author

I also found a lot of objects that are all getInfo closures. As shown in the picture, these objects occupy nearly 50% of my heap memory. I don't know why they still exist in the heap after the requests have finished. Can you help me? Thanks
[screenshot]

@murgatroid99
Member

This could also be caused by creating many clients. Again, if you are doing that, you should instead create a single client and use it to make all requests.

@lwmxiaobei
Author

Thank you. I would like to confirm: does creating a client with new essentially establish a connection?

var client = new routeguide.RouteGuide('localhost:50051', grpc.credentials.createInsecure());

For instance, in my scenario the client needs to connect to a dozen or even several dozen services. Could this cause any issues? Are there best practices, or perhaps code examples, for creating clients for multiple services? These observations do suggest that a large number of clients might be getting created, so your response is greatly appreciated.

@murgatroid99
Member

There is nothing wrong with creating a separate client object for each service you want to connect to. The problem is creating a separate client object for each request to the same service. I was only guessing that you were doing that, based on the information you provided.
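
As a rough sketch of that pattern (the proto paths, package paths, and endpoints here are placeholders): create each client once at startup, keep them in a map, and reuse them for all requests to those services.

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// One client per target service, created once and reused for every request
function createClients(serviceConfigs) {
    const clients = {};
    for (const { name, protoPath, servicePath, endpoint } of serviceConfigs) {
        const def = grpc.loadPackageDefinition(
            protoLoader.loadSync(protoPath, { keepCase: true }));
        // Walk the package path (e.g. 'mypackage.MyService') to the client constructor
        const ClientCtor = servicePath.split('.').reduce((obj, key) => obj[key], def);
        clients[name] = new ClientCtor(endpoint, grpc.credentials.createInsecure());
    }
    return clients;
}

// Usage with placeholder values:
// const clients = createClients([
//     { name: 'order', protoPath: 'order.proto',
//       servicePath: 'mypackage.Order', endpoint: 'order-service:50051' },
// ]);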

If you share your test code, we will have a better chance of finding the causes of the things you are observing.

@lwmxiaobei
Author

lwmxiaobei commented Jan 10, 2024

Thanks! I'm happy to share my test code.
This is the code that builds the client: it reads the configuration, then loads and parses the proto files, and finally builds the client:

const protoLoader = require('@grpc/proto-loader');
const grpcCore = require('@grpc/grpc-js');
const path = require('path');
const fs = require('fs');

module.exports = function loadGrpcClient(app) {
    app.addSingleton('grpcClient', async (config, app) => {
        config.default = app.config.grpcClient;
        return await createClient(config, app);
    });
}

async function createClient(config, app) {
    const service = {};
    await getAllServices(path.join(app.baseDir, config.proto), config, service);
    return service;
}

// Load every .proto file in the configured directory and build clients from it
async function getAllServices(protoPath, config, service) {
    if (!fs.existsSync(protoPath)) {
        throw new Error('no proto file');
    }
    const protoFileList = fs.readdirSync(protoPath);
    for (const protoName of protoFileList) {
        const protoFilePath = path.join(protoPath, protoName);
        const stats = fs.statSync(protoFilePath);
        if (stats.isFile() && path.extname(protoName) === '.proto') {
            const protoObj = await protoLoader.load(protoFilePath, {
                keepCase: true,
                longs: String,
                enums: String,
                defaults: true,
                oneofs: true
            });
            const packageDefinition = grpcCore.loadPackageDefinition(protoObj);
            loadClients(packageDefinition, service, config);
        }
    }
}

// Walk the package definition recursively and build a client for each service constructor
function loadClients(protoDescriptor, service, config) {
    if (!protoDescriptor) {
        return;
    }
    if (protoDescriptor.fileDescriptorProtos) {
        return;
    }
    for (const key in protoDescriptor) {
        if (protoDescriptor[key].name === 'ServiceClientImpl') {
            buildClient(protoDescriptor[key].service, protoDescriptor[key], service, config);
        } else {
            loadClients(protoDescriptor[key], service, config);
        }
    }
}

// Create one client per service and expose each RPC as a promise-returning method
function buildClient(serviceDefinition, ClientConstructor, service, config) {
    const client = new ClientConstructor(config.endpoint, grpcCore.credentials.createInsecure());
    for (const method in serviceDefinition) {
        service[method] = async (params) => {
            return await new Promise((resolve, reject) => {
                client.waitForReady(Date.now() + config.deadline, (error) => {
                    if (error) {
                        reject(error);
                    } else {
                        client[method](params, (e, res) => {
                            if (e) {
                                reject(e);
                            } else {
                                resolve(res);
                            }
                        });
                    }
                });
            });
        };
    }
}

proto file:

syntax = "proto3";
 
option java_multiple_files = true;
option java_package = "io.grpc.service.order";
option objc_class_prefix ="RTG";

package gd.pts.order;

import "google/protobuf/timestamp.proto";

service Order {
  rpc getDetailByCode(GetDetailByCodeRequest) returns (OrderDetail) {};
}
message GetDetailByCodeRequest {
    string code = 1;
}

message OrderDetail {
    string order_no = 2;
    uint64 user_id = 3;
    uint32 business_id = 6;
}

and then call the method:

const res = await this.app.grpcClient.getDetailByCode({code: 'xxxxx'});

The load test ran with 50 concurrent requests, about 117,000 requests in total over 5 minutes.

Some time after the stress test, I performed a heap analysis and found that these objects occupied most of the heap memory. At that point my service was receiving no requests, but these objects still had not been garbage collected.

[screenshots]

@lwmxiaobei
Author

In addition, there are no grpc options set here.

@murgatroid99
Member

Those objects you see are socket channelz reference objects. There is one for each active connection. And with the way you construct your clients, they should be able to share connections, so there should only be one object in there. But the sizes seem to indicate that there are 2000 of them, and I don't see why that would happen from the code you shared.

Can you share trace logs from running your test? You can get them by running the client with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.
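
For example, assuming the client process is started with node (the entry point name here is just a placeholder), something like:

GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node client.js 2> grpc-trace.log

grpc-js writes these logs to stderr, so redirecting stderr captures them in a file.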

@lwmxiaobei
Author

This is the log when the service starts:
logs.txt

This is the log from the stress test request:
logs (2).txt

I hope this is useful to you

@murgatroid99
Member

First, those logs contain interspersed logs from two different processes, specifically with PIDs 62 and 69, which makes it a little harder to understand.

The two logs are snapshots separated by an hour, so I can only guess what happened in between, but the object ID numbers in the logs seem to indicate that both processes reconnected to the server several hundred times in that hour. This is very unusual behavior, so I would recommend double checking your server configuration to see why it might be causing that.

I found a code path that could occasionally leak those handles, but I don't think it accounts for 2000 of them.

@lwmxiaobei
Author

lwmxiaobei commented Jan 15, 2024

Because my service starts 2 processes, you will see logs from both of them. Also, this log covers only a period of time during the stress test, not the whole run. How can you tell that the client reconnected hundreds of times? Does it mean that the server disconnected the connection, causing the client to reconnect? If it reconnects, won't that leak the connection handle? My understanding is that creating a lot of connections leads to leaks. I don't know if I am right; if I am wrong, please correct me. Thank you very much.

My server config options:

 'grpc.keepalive_time_ms': 2000,
 'grpc.keepalive_timeout_ms': 1000,
 'grpc.keepalive_permit_without_calls': 1,
 'grpc.use_local_subchannel_pool': 1,

Also, which code path are you referring to that might leak?

@murgatroid99
Member

Because my service starts 2 processes, you will see logs from both of them

You could enable logging on only one of them, or put each log into a different file.

How can you tell that the client reconnected hundreds of times?

Each channel, subchannel, and transport has an ID incrementing from 0 from the beginning of the process. Reconnecting involves creating new transports, and sometimes new subchannels. The logs from after an hour show transports with an ID of 530 in one process and 569 in the other process, indicating that new IDs have been generated hundreds of times, corresponding to hundreds of connections, or at least hundreds of connection attempts.

Does it mean that the server disconnected the connection, causing the client to reconnect?

The logs you provided included lines that say "connection closed by GOAWAY with code 0 and data undefined", which indicates that the server deliberately closed the connection. In fact, the two processes together saw that happen 4 times in the 1 minute that the log spans.

If it reconnects, won't that leak the connection handle? My understanding is that creating a lot of connections leads to leaks.

The client does all of the relevant connection management. It cleans up connections that are no longer in use, so in general this behavior should not cause any leaks.

My server config options:

'grpc.keepalive_time_ms': 2000,
'grpc.keepalive_timeout_ms': 1000,
'grpc.keepalive_permit_without_calls': 1,
'grpc.use_local_subchannel_pool': 1,

These options have the server send a ping to each connected client every 2 seconds, and it drops the connection if it doesn't get a response within 1 second. That is very aggressive, and it may result in the server frequently closing connections to clients that are just a little slow to respond. The recommended value of the keepalive time is 1 minute (60000 ms), and the default value of the keepalive timeout is 20 seconds (20000 ms). Also, the third and fourth options there have no effect on the server.
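
As a rough sketch of what I mean on the server side (service registration and binding omitted; the values follow the recommendation above):

const grpc = require('@grpc/grpc-js');

const server = new grpc.Server({
    // Ping each connected client at most once per minute
    'grpc.keepalive_time_ms': 60000,
    // Allow 20 seconds for the ping response before dropping the connection
    'grpc.keepalive_timeout_ms': 20000,
});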

Also, which code path are you referring to that might leak?

There is a race: if a subchannel's state changes while it is in the process of establishing a connection, the handle for that connection will be leaked. But this should be a rare event.

@murgatroid99
Member

I fixed the bug I mentioned in version 1.9.14, so I recommend using that version for future testing.

@lwmxiaobei
Author

Thank you very much for your answer!

@eegiazarov

Hi, did the fix help with the memory leak?
I'm experiencing a similar problem.

@murgatroid99
Member

@eegiazarov There are a lot of details in this issue that may not be relevant to your situation. I recommend opening a new issue with the details of the problem you are experiencing.

@eegiazarov

Thanks for your reply. The only thing I can say for sure at the moment is that we're making thousands of connections in rapid succession, each on a different channel that is destroyed immediately (along with all the structures that come with it) once the exchange is done, and the library still seems to be leaking memory. Though there is a chance we're experiencing a protocol mismatch between different platforms.
On top of that, we have several unrelated cases of weird behavior...
So I'm not quite ready to file an issue. There are too many unknowns yet.

@dmitriadn

Not sure if this is directly related to the issue, but we are also seeing some percentage of cancelled grpc calls (connection closed by GOAWAY).
We are using grpc.max_connection_age_ms set to 60000 to trigger a DNS resolver refresh on the clients.
The lower the max_connection_age_ms value, the higher the percentage of cancelled errors.
We tried version 1.10.1 and the issue is still there.

Is this expected behavior? It would not be a problem, except that the server actually processes the grpc call while returning a Call cancelled error.

@murgatroid99
Member

The grpc.max_connection_age_ms option is generally paired with the grpc.max_connection_age_grace_ms option. If you set that second option, it limits the maximum amount of time that a preexisting stream can continue after the max connection age timer triggers, and after that streams will be cancelled. If you commonly have streams that run longer than the grpc.max_connection_age_grace_ms setting, then you should expect some cancellations. If you do not commonly have streams run that long, or you do not set that option at all, there may be a bug in the grpc.max_connection_age_ms code.
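
For example, on the server (just a sketch; the values are illustrative, and service registration is omitted):

const grpc = require('@grpc/grpc-js');

const server = new grpc.Server({
    // Start gracefully draining each connection after about a minute
    'grpc.max_connection_age_ms': 60000,
    // Cancel any streams still running 5 seconds after the drain begins
    'grpc.max_connection_age_grace_ms': 5000,
});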

@dmitriadn

We don't have the max_connection_age_grace_ms option set, and we only use unary calls that take < 250 ms.

We will try max_connection_age_grace_ms to see if it changes anything.

@murgatroid99
Member

I recommend trying again with grpc-js version 1.10.4, in case your issue is related to #2625.
