feat: channelz improvements, idle timeout implementation #2677

Conversation

@AVVS (Contributor) commented Feb 27, 2024

Hopefully you'll find the following code useful. It covers two areas:

  1. channelz perf improvements
  2. session idle timeout implementation

channelz perf improvements

In the original implementation, arrays are used to track active servers, channels, subchannels, and sockets. While this generally performs OK for the first three, using delete on the sockets array creates holey arrays, and iterating over them is extremely slow when, say, you perform a few thousand queries across many sessions and are left with sessions whose ids have gaps between them: you have to go over each "undefined" item in the array and check it, so the time complexity grows with the array length. To deal with that I've used an ordered map based on red-black trees, which lets us iterate over many sockets with no problems and also perform deletes fairly fast.
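
A minimal sketch of the idea (the PR uses an ordered map backed by a red-black tree; a plain Map is used here purely to illustrate keyed deletes and dense iteration, and the entry shape is hypothetical):

```ts
// Illustration only: sockets keyed by channelz id instead of a sparse array.
interface SocketEntry {
  id: number;
  name: string;
}

const sockets = new Map<number, SocketEntry>();

function registerSocket(entry: SocketEntry): void {
  sockets.set(entry.id, entry);
}

function unregisterSocket(id: number): void {
  // No holes are left behind, unlike `delete arr[id]` on an array.
  sockets.delete(id);
}

function listSockets(startId: number, maxResults: number): SocketEntry[] {
  // An RB-tree map can seek straight to startId; even a plain Map iterates
  // only live entries, with no "undefined" slots to skip over.
  const results: SocketEntry[] = [];
  for (const entry of sockets.values()) {
    if (entry.id >= startId) {
      results.push(entry);
      if (results.length >= maxResults) break;
    }
  }
  return results;
}
```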

Furthermore, I've added "stubbed" tracker implementations. They don't currently cover all the calls, but they cover nearly everything; more improvements could be made to the non-channelz paths. With the stubbed calls, which do nothing, we can have less branching in areas shared between channelz and non-channelz code.
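
For illustration, a stub could look roughly like this (the method names mirror grpc-js's ChannelzCallTracker, but treat the exact shape as an assumption rather than the PR's actual code):

```ts
// Real tracker: counts calls and records when the last one started.
class ChannelzCallTracker {
  callsStarted = 0;
  callsSucceeded = 0;
  callsFailed = 0;
  lastCallStartedTimestamp: Date | null = null;

  addCallStarted(): void {
    this.callsStarted += 1;
    this.lastCallStartedTimestamp = new Date();
  }
  addCallSucceeded(): void {
    this.callsSucceeded += 1;
  }
  addCallFailed(): void {
    this.callsFailed += 1;
  }
}

// Stub: same interface, does nothing, so shared code can call it
// unconditionally instead of branching on `channelzEnabled`.
class ChannelzCallTrackerStub extends ChannelzCallTracker {
  override addCallStarted(): void {}
  override addCallSucceeded(): void {}
  override addCallFailed(): void {}
}
```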

Added channelz/non-channelz session handlers.

idle session termination

More efficient than sending keepalive pings: a timer is created when a session has no active streams and destroyed when a stream is created.
Sessions are now tracked via their references.
Alongside that change I found a bug introduced in 1.10.x where sessions were cleaned up based on the "allSessions" ref rather than the specific server ref (now that there can be multiple servers, this matters).

TODO:

  • tests covering correct behavior of idle session termination

dependency updates

  • Updated to require Node >= 12 for queueMicrotask support; nextTick is fairly expensive due to C++ MakeCallback overhead and is ideally reserved for JS<->C++ interop
  • tsconfig targets very old environments. I haven't touched that, but is there any chance it can get an update? I understand there is a browser component to it, but maybe it makes sense to have separate builds for browser and server? Node 16 is EOL and 18 is in maintenance, so surely something can be done about it.
  • There are more outdated deps; I haven't touched those.

prettier changes

To get the tests going I had to "fix" linting issues from Prettier.

Let me know your thoughts on the proposed changes.

@murgatroid99 (Member)

I just submitted a change on the 1.10.x branch to run the code formatter and fix the one lint error. Can you rebase onto that so that we can focus on the substantive changes in this PR?

@AVVS force-pushed the feat/channelz-perf+stream-idle-timeout branch from b2da92a to e0b900d on February 27, 2024 21:28
@AVVS (Contributor, Author) commented Feb 27, 2024

done

@murgatroid99 (Member)

Also, can you rearrange the modified functions in server.ts to match the original order, to make the diff clearer? Right now, it's kind of hard to read.

@AVVS (Contributor, Author) commented Feb 27, 2024

Cleaned up server.ts more: reverted the closure (it originally needed the server ref, but I was able to make it work without it), so there are fewer lines of code to go through now.

@murgatroid99 (Member) left a comment

I think there is a way of implementing the session idle timeout that is both simpler and more efficient than what you have here: for each session, track only the number of active streams, and call session.setTimeout with a handler that closes the session if the number of active streams is 0.

> Alongside that change I found a bug introduced in 1.10.x where sessions were cleaned up based on the "allSessions" ref rather than the specific server ref (now that there can be multiple servers, this matters).

I don't believe that was a bug. The relevant change I see was in tryShutdown, which is supposed to be shutting down all sessions anyway. Also, there could always be multiple servers, but the internal representation was different in earlier versions.

packages/grpc-js/src/channelz.ts (review thread resolved)
@@ -598,27 +585,34 @@ function GetTopChannels(
call: ServerUnaryCall<GetTopChannelsRequest__Output, GetTopChannelsResponse>,
callback: sendUnaryData<GetTopChannelsResponse>
): void {
- const maxResults = Number.parseInt(call.request.max_results);
+ const maxResults = parseInt(call.request.max_results, 10) || 100;
murgatroid99 (Member):
Where did this 100 come from?

AVVS (Contributor, Author):
The protofiles say there is a sane default that will be selected by the server if max_results is not provided in the request; 100 is my choice for that sane default. The problem with 0 is that no results would be returned with the current implementation. It could be a much higher number, but I think 100 per iteration is fine.

murgatroid99 (Member):
Especially because this is used in multiple places, it would be better to have a constant named something like DEFAULT_MAX_REQUESTS. Then it's clearer what the number is, and if we decide that a different value is more appropriate in the future we can change it in only one place.
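
A sketch of the suggestion (the constant name and its placement are illustrative; the value 100 comes from the discussion above):

```ts
// One named default shared by every handler that reads max_results.
const DEFAULT_MAX_RESULTS = 100;

const maxResults =
  parseInt(call.request.max_results, 10) || DEFAULT_MAX_RESULTS;
```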

sessionClosedByServer = true;

this.trace(
`Connection dropped by max connection age: ${session.socket?.remoteAddress}`
murgatroid99 (Member):
Existing trace logs use string concatenation instead of string interpolation because of a previous report that it is faster. Please follow that convention here and in other instances.
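
For example, a hedged sketch of the convention (not the exact line from the PR):

```ts
// Build trace messages with concatenation rather than a template literal,
// matching the existing trace-log convention.
function formatMaxAgeDrop(remoteAddress: string | undefined): string {
  return 'Connection dropped by max connection age: ' + remoteAddress;
}
```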

packages/grpc-js/src/server.ts (outdated review thread, resolved)
sessionClosedByServer = true;
this.channelzTrace.addTrace(
'CT_INFO',
`Connection dropped due to error of a ping frame ${err.message} return in ${duration}`
murgatroid99 (Member):
Similar to the regular trace logs, I think we should avoid string interpolation in channelz trace logs, especially now that the string will always be created even if channelz is disabled.

packages/grpc-js/src/server.ts (two outdated review threads, resolved)
@AVVS (Contributor, Author) commented Feb 28, 2024

> I think there is a way of implementing the session idle timeout that is both simpler and more efficient than what you have here: for each session, track only the number of active streams, and call session.setTimeout with a handler that closes the session if the number of active streams is 0.

I believe session.setTimeout would not be very useful, as it treats activity differently from having no active in-flight requests: keepalive pings would keep the session open forever, whereas a timeout on no active requests would force those sessions to be closed. There should be very little performance drop when an active session has many ongoing requests and rarely reaches 0 tracked streams, simply because setTimeout would almost never be called.

Where this comes in handy is when there is no edge proxy actively forcing all streams into a few sessions. Recently I've started using https://cloud.google.com/load-balancing/docs/https with gke-l7-global-external-managed, and for some reason the HTTP/2 configuration doesn't really like to reuse existing connections, so I easily end up with 6000-8000 "stale" sessions. They are all perfectly fine (i.e. keepalives are returned for hours at a time, so the connections are active), except that they are idle with no active requests, often for a long time. To conserve memory I'd like these sessions to close, which is what I want the idle timeout to handle.

@murgatroid99 (Member)

> I believe session.setTimeout would not be very useful, as it treats activity differently from having no active in-flight requests: keepalive pings would keep the session open forever, whereas a timeout on no active requests would force those sessions to be closed. There should be very little performance drop when an active session has many ongoing requests and rarely reaches 0 tracked streams, simply because setTimeout would almost never be called.

The performance concern is the case where the number of active streams quickly switches between 0 and 1. I understand why session.setTimeout won't work; instead I think you should implement the same optimization as in #2643: track the number of active streams and the timestamp of the last time the number became 0. Set the timer for the full idle timeout, then when it fires, if there are 0 active streams, and it has been long enough since it became 0, close the connection. If there are 0 active streams but it became 0 too recently, set the timer to fire at the idle timeout after the last recorded timestamp. If there are more than 0 active streams, set the timer for the full time.

> Where this comes in handy is when there is no edge proxy actively forcing all streams into a few sessions. Recently I've started using https://cloud.google.com/load-balancing/docs/https with gke-l7-global-external-managed, and for some reason the HTTP/2 configuration doesn't really like to reuse existing connections, so I easily end up with 6000-8000 "stale" sessions. They are all perfectly fine (i.e. keepalives are returned for hours at a time, so the connections are active), except that they are idle with no active requests, often for a long time. To conserve memory I'd like these sessions to close, which is what I want the idle timeout to handle.

There is another problem here, independent of gRPC: a client/proxy should not be creating many connections to the same backend and keeping them around without using them. There may be a configuration error or bug in your LB.

@AVVS (Contributor, Author) commented Feb 28, 2024

https://cloud.google.com/load-balancing/docs/https#http2_support

> HTTP/2 between the load balancer and the instance can require significantly more TCP connections to the instance than HTTP(S). Connection pooling, an optimization that reduces the number of these connections with HTTP(S), is not currently available with HTTP/2. As a result, you might see high backend latencies because backend connections are made more frequently.

Just a drawback of the current HTTP/2 implementation that Google has ^. It could be solved with another edge proxy in between, but I'm trying to keep the setup without a managed proxy in between.

> The performance concern is the case where the number of active streams quickly switches between 0 and 1. I understand why session.setTimeout won't work; instead I think you should implement the same optimization as in #2643: track the number of active streams and the timestamp of the last time the number became 0. Set the timer for the full idle timeout, then when it fires, if there are 0 active streams, and it has been long enough since it became 0, close the connection. If there are 0 active streams but it became 0 too recently, set the timer to fire at the idle timeout after the last recorded timestamp. If there are more than 0 active streams, set the timer for the full time.

Take a look at the latest version of the implementation. I think it might work even better with .refresh() on the timer; let me know.

@murgatroid99 (Member)

I think the new timer strategy can be further simplified: set the timeout to the full max connection idle time, and then the handler logic can just be "if there are 0 active streams, close the connection". Refreshing whenever the number of active connections becomes 0 guarantees that the timer only runs with 0 active connections if the max connection idle time has passed.
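
A minimal sketch of that strategy (the names and the timeout value here are illustrative, not the PR's actual code):

```ts
import * as http2 from 'http2';

const MAX_CONNECTION_IDLE_MS = 5 * 60 * 1000; // illustrative value

function setupIdleTimeout(session: http2.ServerHttp2Session): void {
  let activeStreams = 0;

  // Armed for the full idle window; the handler only acts if the session is
  // idle when it fires.
  const idleTimer = setTimeout(() => {
    if (activeStreams === 0) {
      session.close();
    }
  }, MAX_CONNECTION_IDLE_MS);
  idleTimer.unref?.();

  session.on('stream', stream => {
    activeStreams += 1;
    stream.once('close', () => {
      activeStreams -= 1;
      if (activeStreams === 0) {
        // Restart the full idle window whenever the session becomes idle, so
        // the timer can only fire on a session that stayed idle the whole time.
        idleTimer.refresh();
      }
    });
  });

  // Avoid leaking the timer when the session goes away for other reasons.
  session.once('close', () => clearTimeout(idleTimer));
}
```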

@AVVS (Contributor, Author) commented Feb 28, 2024

You are right. Fixed a bug as well: I forgot to actually update the idle date, and also had a wrong `this` ref.

@murgatroid99 (Member) left a comment
With the channelz class stub changes, I wonder if the split between channelz and non-channelz versions of the stream and session handlers is still worth it. That split is an optimization, to avoid doing any channelz work when it's disabled, but it results in duplicated code, and even more duplicated code in this change.

@@ -972,7 +1006,7 @@ export class Server {
for (const session of allSessions) {
session.destroy(http2.constants.NGHTTP2_CANCEL as any);
}
- }, graceTimeMs).unref?.();
+ }, graceTimeMs).unref();
murgatroid99 (Member):
Please revert this, here and anywhere else you made this change. We need a null check on unref because that method doesn't exist on Electron.

this.sessionHalfIdleTimeout,
this,
session
).unref(),
murgatroid99 (Member):
Similarly, change this to unref?.().

AVVS (Contributor, Author):
Now that makes sense, though on Electron it would result in the stored timer ref being undefined, so we need to keep the timer and unref afterwards :/

murgatroid99 (Member):
That's a good point. That behavior might be incorrect elsewhere.

AVVS (Contributor, Author):
Pretty much every timer wouldn't work correctly as of now. Is the server actually being used on Electron? It's no problem for me to fix the related bugs everywhere, but if we can avoid the extra logic I'd be happy :)
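
A sketch of the pattern under discussion (names are illustrative): store the timer before unreffing, so the handle is not lost on runtimes where unref is missing.

```ts
function armTimer(handler: () => void, timeoutMs: number): NodeJS.Timeout {
  // Chaining `setTimeout(...).unref?.()` evaluates to undefined on Electron,
  // where timers have no unref, so the caller would lose the handle.
  const timer = setTimeout(handler, timeoutMs);
  timer.unref?.();
  return timer;
}
```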


private onIdleTimeout(ctx: Server, session: http2.ServerHttp2Session) {
const { socket } = session;
ctx.trace(
murgatroid99 (Member):
As currently written, this trace log seems potentially noisy, without a lot of information relevant to debugging a problem with the idle timeout behavior. I would remove this log, or I would want to at least include the number of active streams, and either the lastIdle timestamp or the computed idleFor time.

packages/grpc-js/src/server.ts (review thread resolved)
- streamTracker: new ChannelzCallTracker(),
+ streamTracker: this.channelzEnabled
+   ? new ChannelzCallTracker()
+   : new ChannelzCallTrackerStub(),
murgatroid99 (Member):
If we are in the _channelzSessionHandler, that implies that this.channelzEnabled === true. Branching on that again here is redundant.

@AVVS (Contributor, Author) commented Feb 29, 2024

Tried to address all the existing concerns about code duplication, while keeping perf optimizations in mind:

  1. Introduced reusify (https://npmjs.com/package/reusify) for context objects that are tied to the Http2Session lifecycle; this lets us avoid GC on those objects and also have optimized event handlers (a rough sketch follows at the end of this comment)
  2. Moved setting the max age, keepalive, and idle timeouts out to their own functions, and prepared handlers for them so that closure creation is reduced to a minimum
  3. Noticed a potentially leaking keepalive timeout handler (in the try/catch block when the session gets destroyed); made sure to clean it up
  4. Cleaned up all unref?.() calls
  5. The channelz / non-channelz session handlers are now shorter, with much less code duplication

LMK if the changes make sense and if you spot more bugs or logical issues
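
A rough sketch of the reusify idea mentioned in item 1, assuming the package's default export and available typings (the context shape, field names, and wiring are hypothetical, not the PR's actual code):

```ts
import reusify from 'reusify';
import type { ServerHttp2Session } from 'http2';

// reusify pools instances that expose a `next` field for its free list.
class SessionContext {
  next: SessionContext | null = null;
  session: ServerHttp2Session | null = null;

  // One handler per pooled object, reused across many sessions, so no new
  // closure is allocated on every connection.
  onSessionClose = () => {
    this.session = null;
    contextPool.release(this);
  };
}

const contextPool = reusify(SessionContext);

function trackSession(session: ServerHttp2Session): void {
  const ctx = contextPool.get();
  ctx.session = session;
  session.once('close', ctx.onSessionClose);
}
```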

@murgatroid99 (Member)

I don't like this SessionCtx change. It seems like it's addressing a separate issue from what this PR was covering before, and I don't like having any instances of // @ts-expect-error in the code base (including the one you added earlier in the channelz code).

@AVVS (Contributor, Author) commented Feb 29, 2024

> I don't like this SessionCtx change. It seems like it's addressing a separate issue from what this PR was covering before, and I don't like having any instances of // @ts-expect-error in the code base (including the one you added earlier in the channelz code).

It's true, but how do we DRY the codebase and keep it performant? This change sort of addresses both issues, but I understand the expect-error is not ideal. It's fairly easy to "fix" it by changing the visibility level on the property.

The larger issue is whether or not we are OK with duplicated code and a perf hit for channelz (I'd prefer not to have any perf hit due to that). We could use binds, or create closures each time like we currently do, but that adds up to GC pressure and the code paths can't be optimized properly; ideally we want all functions to be monomorphic and inlined, with no GC where it's not needed.

If we don't want to do anything about code duplication in this PR, I can revert the latest change, keep the unref?.() fixes, and push the introduction of the session ctx updates out to another PR?

@murgatroid99 (Member)

To be honest, I had forgotten that you were the one who originally split up _streamHandler and _channelzHandler in #2249, until I just checked. This PR continues that pattern with the split between _sessionHandler and _channelzSessionHandler. That is the code duplication I was talking about: each of those pairs has two variations of the same functionality. This SessionCtx change reduces the value of that split by making the two functions converge on the same code path anyway, but it still leaves some duplicated code. If we are willing to converge the code paths anyway, why not just have only one version of the function?

Please do remove the SessionCtx change from this PR. You can send another PR where we can focus discussion on that change.

@AVVS (Contributor, Author) commented Feb 29, 2024

> To be honest, I had forgotten that you were the one who originally split up _streamHandler and _channelzHandler in #2249, until I just checked. This PR continues that pattern with the split between _sessionHandler and _channelzSessionHandler. That is the code duplication I was talking about: each of those pairs has two variations of the same functionality. This SessionCtx change reduces the value of that split by making the two functions converge on the same code path anyway, but it still leaves some duplicated code. If we are willing to converge the code paths anyway, why not just have only one version of the function?

I don't particularly like the *Ref objects that are used for tracking channelz stuff; to fully converge, these paths also need to be covered, and refs shouldn't be created when channelz is not enabled. I'll think more about how to converge both listeners into a single one with little perf hit.

I will only keep the relevant timer changes in this PR and make a new PR regarding the code convergence.

@murgatroid99 (Member)

At this point, channelz refs are used as general purpose process-unique identifiers for various kinds of objects. When channelz is disabled, they're very lightweight, so I don't see that it's important to remove them.

@AVVS force-pushed the feat/channelz-perf+stream-idle-timeout branch from 8f49ea8 to 62e8ea9 on March 2, 2024 15:59
@AVVS (Contributor, Author) commented Mar 2, 2024

Reverted the sessionCtx change, kept the cleaned-up unref?.() calls, and added tests.

One thing I noticed is that there is no way for the client to stop reconnecting except for .close() being called on it. Is this intended? Shouldn't it lazily connect on the first outgoing call after the disconnect instead of connecting the channel right away?

@murgatroid99 (Member)

Actually, you're right, that is what it's supposed to do. It looks like that behavior got screwed up a little while ago.


if (trackedChild === undefined) {
tracker.setElement(child.id, {
// @ts-expect-error union issues
murgatroid99 (Member):
I don't like having this. What exactly is the error here? Are there any changes that could be made to the function or the data type to avoid that error?

AVVS (Contributor, Author):
Just TypeScript being silly and not generalizing the type well enough; I'll see if I can work around that.

AVVS (Contributor, Author):
Adjusted this a little, but it's give or take the same, just not an expect-error, so it will be more noticeable if there is something other than the typing bug.

@@ -334,65 +370,61 @@ export class Server {

private getChannelzSessionInfoGetter(
murgatroid99 (Member):
With this modified signature, this method is more accurately named getChannelzSessionInfo.

const id = getNextId();
const ref = { id, name, kind } as RefByType<R>;
if (channelzEnabled) {
// @ts-expect-error typing issues
murgatroid99 (Member):
This is another instance that I didn't notice before. Please change this too.

AVVS (Contributor, Author):
I’ll fix it up similarly.

@AVVS (Contributor, Author) commented Mar 5, 2024

Noticed the tests failed. Should I increase the deadline for the state change? Or maybe we can look at the traces for this failure?

@murgatroid99 (Member)

First, I'm just rerunning them, to check if that failure is consistent. You could try just increasing the timeout, to see if that works. Otherwise, you could add a .only to the test group you added, and set the tracing environment variables in the gulpfile so that we can get the trace logs.

@murgatroid99 (Member) left a comment
I think this is in a reasonable state now. I would like to have less flakiness in the new tests, but I don't think that needs to be a blocker.

@murgatroid99 murgatroid99 merged commit 400147c into grpc:@grpc/grpc-js@1.10.x Mar 6, 2024
4 of 5 checks passed
@murgatroid99 (Member)

This is out in 1.10.2, along with a fix for the "client reconnects even when there are no active requests" behavior.
