feat: channelz improvements, idle timeout implementation #2677

Conversation

@AVVS (Contributor) commented Feb 27, 2024

Hopefully you'll find the following code useful. It covers two areas:

  1. channelz perf improvements
  2. session idle timeout implementation

channelz perf improvements

In the original implementation, arrays are used to track active servers, channels, subchannels, and sockets. While this generally performs OK for the first three, using delete on the sockets array creates holey arrays, and iterating over them is extremely slow when, say, you perform a few thousand queries across many sessions and are left with sessions whose ids have gaps between them: you have to go over each "undefined" item in the array and check it, so the time complexity grows with the array length. To deal with that I've used an ordered map based on red-black trees, which lets us iterate over many sockets with no problems and also perform deletes fairly fast.
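
A minimal sketch of the idea (the PR uses an ordered map backed by a red-black tree; a plain Map is used here purely to illustrate keyed deletes and dense iteration, and the entry shape is hypothetical):

```ts
// Illustration only: sockets keyed by channelz id instead of a sparse array.
interface SocketEntry {
  id: number;
  name: string;
}

const sockets = new Map<number, SocketEntry>();

function registerSocket(entry: SocketEntry): void {
  sockets.set(entry.id, entry);
}

function unregisterSocket(id: number): void {
  // No holes are left behind, unlike `delete arr[id]` on an array.
  sockets.delete(id);
}

function listSockets(startId: number, maxResults: number): SocketEntry[] {
  // An RB-tree map can seek straight to startId; even a plain Map iterates
  // only live entries, with no "undefined" slots to skip over.
  const results: SocketEntry[] = [];
  for (const entry of sockets.values()) {
    if (entry.id >= startId) {
      results.push(entry);
      if (results.length >= maxResults) break;
    }
  }
  return results;
}
```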

Furthermore, I've added "stubbed" tracker implementations. They don't currently cover all the calls, but they cover nearly everything; more improvements could be made to the non-channelz paths. With the stubbed calls, which do nothing, we can have less branching in areas shared between channelz and non-channelz code.
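
For illustration, a stub could look roughly like this (the method names mirror grpc-js's ChannelzCallTracker, but treat the exact shape as an assumption rather than the PR's actual code):

```ts
// Real tracker: counts calls and records when the last one started.
class ChannelzCallTracker {
  callsStarted = 0;
  callsSucceeded = 0;
  callsFailed = 0;
  lastCallStartedTimestamp: Date | null = null;

  addCallStarted(): void {
    this.callsStarted += 1;
    this.lastCallStartedTimestamp = new Date();
  }
  addCallSucceeded(): void {
    this.callsSucceeded += 1;
  }
  addCallFailed(): void {
    this.callsFailed += 1;
  }
}

// Stub: same interface, does nothing, so shared code can call it
// unconditionally instead of branching on `channelzEnabled`.
class ChannelzCallTrackerStub extends ChannelzCallTracker {
  override addCallStarted(): void {}
  override addCallSucceeded(): void {}
  override addCallFailed(): void {}
}
```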

Added channelz/non-channelz session handlers.

idle session termination

More efficient than sending keepalive pings: a timer is created when a session has no active streams and destroyed when a stream is created.
Sessions are now tracked via their references.
Alongside that change I found a bug introduced in 1.10.x where sessions were cleaned up based on the "allSessions" ref rather than the specific server ref (now that there can be multiple servers, this matters).

TODO:

  • tests covering correct behavior of idle session termination

dependency updates

  • Updated to require Node >= 12 for queueMicrotask support; nextTick is fairly expensive due to C++ MakeCallback overhead and is ideally reserved for JS<->C++ interop
  • tsconfig targets very old environments. I haven't touched that, but is there any chance it can get an update? I understand there is a browser component to it, but maybe it makes sense to have separate builds for browser and server? Node 16 is EOL and 18 is in maintenance, so surely something can be done about it.
  • There are more outdated deps; I haven't touched those.

prettier changes

To get the tests going I had to "fix" linting issues from Prettier.

Let me know your thoughts on the proposed changes.

@murgatroid99 (Member)

I just submitted a change on the 1.10.x branch to run the code formatter and fix the one lint error. Can you rebase onto that so that we can focus on the substantive changes in this PR?

@AVVS force-pushed the feat/channelz-perf+stream-idle-timeout branch from b2da92a to e0b900d on February 27, 2024 21:28
@AVVS (Contributor, Author) commented Feb 27, 2024

done

@murgatroid99 (Member)

Also, can you rearrange the modified functions in server.ts to match the original order, to make the diff clearer? Right now, it's kind of hard to read.

@AVVS (Contributor, Author) commented Feb 27, 2024

Cleaned up server.ts more: reverted the closure (it originally needed the server ref, but I was able to make it work without it), so there are fewer lines of code to go through now.

@murgatroid99 (Member) left a comment

I think there is a way of implementing the session idle timeout that is both simpler and more efficient than what you have here: for each session, track only the number of active streams, and call session.setTimeout with a handler that closes the session if the number of active streams is 0.

> Alongside that change I found a bug introduced in 1.10.x where sessions were cleaned up based on the "allSessions" ref rather than the specific server ref (now that there can be multiple servers, this matters).

I don't believe that was a bug. The relevant change I see was in tryShutdown, which is supposed to be shutting down all sessions anyway. Also, there could always be multiple servers, but the internal representation was different in earlier versions.

packages/grpc-js/src/channelz.ts (review thread resolved)
@@ -598,27 +585,34 @@ function GetTopChannels(
call: ServerUnaryCall<GetTopChannelsRequest__Output, GetTopChannelsResponse>,
callback: sendUnaryData<GetTopChannelsResponse>
): void {
- const maxResults = Number.parseInt(call.request.max_results);
+ const maxResults = parseInt(call.request.max_results, 10) || 100;
murgatroid99 (Member):
Where did this 100 come from?

AVVS (Contributor, Author):
The protofiles say there is a sane default that will be selected by the server if max_results is not provided in the request; 100 is my choice for that sane default. The problem with 0 is that no results would be returned with the current implementation. It could be a much higher number, but I think 100 per iteration is fine.

murgatroid99 (Member):
Especially because this is used in multiple places, it would be better to have a constant named something like DEFAULT_MAX_REQUESTS. Then it's clearer what the number is, and if we decide that a different value is more appropriate in the future we can change it in only one place.
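
A sketch of the suggestion (the constant name and its placement are illustrative; the value 100 comes from the discussion above):

```ts
// One named default shared by every handler that reads max_results.
const DEFAULT_MAX_RESULTS = 100;

const maxResults =
  parseInt(call.request.max_results, 10) || DEFAULT_MAX_RESULTS;
```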

sessionClosedByServer = true;

this.trace(
`Connection dropped by max connection age: ${session.socket?.remoteAddress}`
murgatroid99 (Member):
Existing trace logs use string concatenation instead of string interpolation because of a previous report that it is faster. Please follow that convention here and in other instances.
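
For example, a hedged sketch of the convention (not the exact line from the PR):

```ts
// Build trace messages with concatenation rather than a template literal,
// matching the existing trace-log convention.
function formatMaxAgeDrop(remoteAddress: string | undefined): string {
  return 'Connection dropped by max connection age: ' + remoteAddress;
}
```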

packages/grpc-js/src/server.ts (outdated review thread, resolved)
sessionClosedByServer = true;
this.channelzTrace.addTrace(
'CT_INFO',
`Connection dropped due to error of a ping frame ${err.message} return in ${duration}`
murgatroid99 (Member):
Similar to the regular trace logs, I think we should avoid string interpolation in channelz trace logs, especially now that the string will always be created even if channelz is disabled.

packages/grpc-js/src/server.ts (two outdated review threads, resolved)
@AVVS (Contributor, Author) commented Feb 28, 2024

> I think there is a way of implementing the session idle timeout that is both simpler and more efficient than what you have here: for each session, track only the number of active streams, and call session.setTimeout with a handler that closes the session if the number of active streams is 0.

I believe session.setTimeout would not be very useful, as it treats activity differently from having no active in-flight requests: keepalive pings would keep the session open forever, whereas a timeout on no active requests would force those sessions to be closed. There should be very little performance drop when an active session has many ongoing requests and rarely reaches 0 tracked streams, simply because setTimeout would almost never be called.

Where this comes in handy is when there is no edge proxy actively forcing all streams into a few sessions. Recently I've started using https://cloud.google.com/load-balancing/docs/https with gke-l7-global-external-managed, and for some reason the HTTP/2 configuration doesn't really like to reuse existing connections, so I easily end up with 6000-8000 "stale" sessions. They are all perfectly fine (i.e. keepalives are returned for hours at a time, so the connections are active), except that they are idle with no active requests, often for a long time. To conserve memory I'd like these sessions to close, which is what I want the idle timeout to handle.

@murgatroid99 (Member)

> I believe session.setTimeout would not be very useful, as it treats activity differently from having no active in-flight requests: keepalive pings would keep the session open forever, whereas a timeout on no active requests would force those sessions to be closed. There should be very little performance drop when an active session has many ongoing requests and rarely reaches 0 tracked streams, simply because setTimeout would almost never be called.

The performance concern is the case where the number of active streams quickly switches between 0 and 1. I understand why session.setTimeout won't work; instead I think you should implement the same optimization as in #2643: track the number of active streams and the timestamp of the last time the number became 0. Set the timer for the full idle timeout, then when it fires, if there are 0 active streams, and it has been long enough since it became 0, close the connection. If there are 0 active streams but it became 0 too recently, set the timer to fire at the idle timeout after the last recorded timestamp. If there are more than 0 active streams, set the timer for the full time.

> Where this comes in handy is when there is no edge proxy actively forcing all streams into a few sessions. Recently I've started using https://cloud.google.com/load-balancing/docs/https with gke-l7-global-external-managed, and for some reason the HTTP/2 configuration doesn't really like to reuse existing connections, so I easily end up with 6000-8000 "stale" sessions. They are all perfectly fine (i.e. keepalives are returned for hours at a time, so the connections are active), except that they are idle with no active requests, often for a long time. To conserve memory I'd like these sessions to close, which is what I want the idle timeout to handle.

There is another problem here, independent of gRPC: a client/proxy should not be creating many connections to the same backend and keeping them around without using them. There may be a configuration error or bug in your LB.

@AVVS (Contributor, Author) commented Feb 28, 2024

https://cloud.google.com/load-balancing/docs/https#http2_support

> HTTP/2 between the load balancer and the instance can require significantly more TCP connections to the instance than HTTP(S). Connection pooling, an optimization that reduces the number of these connections with HTTP(S), is not currently available with HTTP/2. As a result, you might see high backend latencies because backend connections are made more frequently.

Just a drawback of the current HTTP/2 implementation that Google has ^. It could be solved with another edge proxy in between, but I'm trying to keep the setup without a managed proxy in between.

> The performance concern is the case where the number of active streams quickly switches between 0 and 1. I understand why session.setTimeout won't work; instead I think you should implement the same optimization as in #2643: track the number of active streams and the timestamp of the last time the number became 0. Set the timer for the full idle timeout, then when it fires, if there are 0 active streams, and it has been long enough since it became 0, close the connection. If there are 0 active streams but it became 0 too recently, set the timer to fire at the idle timeout after the last recorded timestamp. If there are more than 0 active streams, set the timer for the full time.

Take a look at the latest version of the implementation. I think it might work even better with .refresh() on the timer; let me know.

@murgatroid99 (Member)

I think the new timer strategy can be further simplified: set the timeout to the full max connection idle time, and then the handler logic can just be "if there are 0 active streams, close the connection". Refreshing whenever the number of active connections becomes 0 guarantees that the timer only runs with 0 active connections if the max connection idle time has passed.
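
A minimal sketch of that strategy (the names and the timeout value here are illustrative, not the PR's actual code):

```ts
import * as http2 from 'http2';

const MAX_CONNECTION_IDLE_MS = 5 * 60 * 1000; // illustrative value

function setupIdleTimeout(session: http2.ServerHttp2Session): void {
  let activeStreams = 0;

  // Armed for the full idle window; the handler only acts if the session is
  // idle when it fires.
  const idleTimer = setTimeout(() => {
    if (activeStreams === 0) {
      session.close();
    }
  }, MAX_CONNECTION_IDLE_MS);
  idleTimer.unref?.();

  session.on('stream', stream => {
    activeStreams += 1;
    stream.once('close', () => {
      activeStreams -= 1;
      if (activeStreams === 0) {
        // Restart the full idle window whenever the session becomes idle, so
        // the timer can only fire on a session that stayed idle the whole time.
        idleTimer.refresh();
      }
    });
  });

  // Avoid leaking the timer when the session goes away for other reasons.
  session.once('close', () => clearTimeout(idleTimer));
}
```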

@AVVS (Contributor, Author) commented Feb 28, 2024

You are right. Fixed a bug as well: I forgot to actually update the idle date, and also had a wrong `this` ref.

@murgatroid99 (Member) left a comment
With the channelz class stub changes, I wonder if the split between channelz and non-channelz versions of the stream and session handlers is still worth it. That split is an optimization, to avoid doing any channelz work when it's disabled, but it results in duplicated code, and even more duplicated code in this change.

@@ -972,7 +1006,7 @@ export class Server {
for (const session of allSessions) {
session.destroy(http2.constants.NGHTTP2_CANCEL as any);
}
- }, graceTimeMs).unref?.();
+ }, graceTimeMs).unref();
murgatroid99 (Member):
Please revert this, here and anywhere else you made this change. We need a null check on unref because that method doesn't exist on Electron.

this.sessionHalfIdleTimeout,
this,
session
).unref(),
murgatroid99 (Member):
Similarly, change this to unref?.().

AVVS (Contributor, Author):
Now that makes sense, though on Electron it would result in the stored timer ref being undefined, so we need to keep the timer and unref afterwards :/

murgatroid99 (Member):
That's a good point. That behavior might be incorrect elsewhere.

AVVS (Contributor, Author):
Pretty much every timer wouldn't work correctly as of now. Is the server actually being used on Electron? It's no problem for me to fix the related bugs everywhere, but if we can avoid the extra logic I'd be happy :)
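
A sketch of the pattern under discussion (names are illustrative): store the timer before unreffing, so the handle is not lost on runtimes where unref is missing.

```ts
function armTimer(handler: () => void, timeoutMs: number): NodeJS.Timeout {
  // Chaining `setTimeout(...).unref?.()` evaluates to undefined on Electron,
  // where timers have no unref, so the caller would lose the handle.
  const timer = setTimeout(handler, timeoutMs);
  timer.unref?.();
  return timer;
}
```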


private onIdleTimeout(ctx: Server, session: http2.ServerHttp2Session) {
const { socket } = session;
ctx.trace(
murgatroid99 (Member):
As currently written, this trace log seems potentially noisy, without a lot of information relevant to debugging a problem with the idle timeout behavior. I would remove this log, or I would want to at least include the number of active streams, and either the lastIdle timestamp or the computed idleFor time.

packages/grpc-js/src/server.ts (review thread resolved)
- streamTracker: new ChannelzCallTracker(),
+ streamTracker: this.channelzEnabled
+   ? new ChannelzCallTracker()
+   : new ChannelzCallTrackerStub(),
murgatroid99 (Member):
If we are in the _channelzSessionHandler, that implies that this.channelzEnabled === true. Branching on that again here is redundant.

@AVVS (Contributor, Author) commented Feb 29, 2024

Tried to address all the existing concerns about code duplication, while keeping perf optimizations in mind:

  1. Introduced reusify (https://npmjs.com/package/reusify) for context objects that are tied to the Http2Session lifecycle; this lets us avoid GC on those objects and also have optimized event handlers (a rough sketch follows at the end of this comment)
  2. Moved setting the max age, keepalive, and idle timeouts out to their own functions, and prepared handlers for them so that closure creation is reduced to a minimum
  3. Noticed a potentially leaking keepalive timeout handler (in the try/catch block when the session gets destroyed); made sure to clean it up
  4. Cleaned up all unref?.() calls
  5. The channelz / non-channelz session handlers are now shorter, with much less code duplication

LMK if the changes make sense and if you spot more bugs or logical issues
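
A rough sketch of the reusify idea mentioned in item 1, assuming the package's default export and available typings (the context shape, field names, and wiring are hypothetical, not the PR's actual code):

```ts
import reusify from 'reusify';
import type { ServerHttp2Session } from 'http2';

// reusify pools instances that expose a `next` field for its free list.
class SessionContext {
  next: SessionContext | null = null;
  session: ServerHttp2Session | null = null;

  // One handler per pooled object, reused across many sessions, so no new
  // closure is allocated on every connection.
  onSessionClose = () => {
    this.session = null;
    contextPool.release(this);
  };
}

const contextPool = reusify(SessionContext);

function trackSession(session: ServerHttp2Session): void {
  const ctx = contextPool.get();
  ctx.session = session;
  session.once('close', ctx.onSessionClose);
}
```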

@murgatroid99 (Member)

I don't like this SessionCtx change. It seems like it's addressing a separate issue from what this PR was covering before, and I don't like having any instances of // @ts-expect-error in the code base (including the one you added earlier in the channelz code).

@AVVS (Contributor, Author) commented Feb 29, 2024

> I don't like this SessionCtx change. It seems like it's addressing a separate issue from what this PR was covering before, and I don't like having any instances of // @ts-expect-error in the code base (including the one you added earlier in the channelz code).

It's true, but how do we DRY the codebase and keep it performant? This change sort of addresses both issues, but I understand the expect-error is not ideal. It's fairly easy to "fix" it by changing the visibility level on the property.

The larger issue is whether or not we are OK with duplicated code and a perf hit for channelz (I'd prefer not to have any perf hit due to that). We could use binds, or create closures each time like we currently do, but that adds up to GC pressure and the code paths can't be optimized properly; ideally we want all functions to be monomorphic and inlined, with no GC where it's not needed.

If we don't want to do anything about code duplication in this PR, I can revert the latest change, keep the unref?.() fixes, and push the introduction of the session ctx updates out to another PR?

@murgatroid99 (Member)

To be honest, I had forgotten that you were the one who originally split up _streamHandler and _channelzHandler in #2249, until I just checked. This PR continues that pattern with the split between _sessionHandler and _channelzSessionHandler. That is the code duplication I was talking about: each of those pairs has two variations of the same functionality. This SessionCtx change reduces the value of that split by making the two functions converge on the same code path anyway, but it still leaves some duplicated code. If we are willing to converge the code paths anyway, why not just have only one version of the function?

Please do remove the SessionCtx change from this PR. You can send another PR where we can focus discussion on that change.

@AVVS (Contributor, Author) commented Feb 29, 2024

> To be honest, I had forgotten that you were the one who originally split up _streamHandler and _channelzHandler in #2249, until I just checked. This PR continues that pattern with the split between _sessionHandler and _channelzSessionHandler. That is the code duplication I was talking about: each of those pairs has two variations of the same functionality. This SessionCtx change reduces the value of that split by making the two functions converge on the same code path anyway, but it still leaves some duplicated code. If we are willing to converge the code paths anyway, why not just have only one version of the function?

I don't particularly like the *Ref objects that are used for tracking channelz stuff; to fully converge, these paths also need to be covered, and refs shouldn't be created when channelz is not enabled. I'll think more about how to converge both listeners into a single one with little perf hit.

I will only keep the relevant timer changes in this PR and make a new PR regarding the code convergence.

@murgatroid99 (Member)

At this point, channelz refs are used as general purpose process-unique identifiers for various kinds of objects. When channelz is disabled, they're very lightweight, so I don't see that it's important to remove them.

@AVVS force-pushed the feat/channelz-perf+stream-idle-timeout branch from 8f49ea8 to 62e8ea9 on March 2, 2024 15:59
@AVVS (Contributor, Author) commented Mar 2, 2024

Reverted the sessionCtx change, kept the cleaned-up unref?.() calls, and added tests.

One thing I noticed is that there is no way for the client to stop reconnecting except for .close() being called on it. Is this intended? Shouldn't it lazily connect on the first outgoing call after the disconnect instead of connecting the channel right away?

@murgatroid99 (Member)

Actually, you're right, that is what it's supposed to do. It looks like that behavior got screwed up a little while ago.


if (trackedChild === undefined) {
tracker.setElement(child.id, {
// @ts-expect-error union issues
murgatroid99 (Member):
I don't like having this. What exactly is the error here? Are there any changes that could be made to the function or the data type to avoid that error?

AVVS (Contributor, Author):
Just TypeScript being silly and not generalizing the type well enough; I'll see if I can work around that.

AVVS (Contributor, Author):
Adjusted this a little, but it's give or take the same, just not an expect-error, so it will be more noticeable if there is something other than the typing bug.

@@ -334,65 +370,61 @@ export class Server {

private getChannelzSessionInfoGetter(
murgatroid99 (Member):
With this modified signature, this method is more accurately named getChannelzSessionInfo.

const id = getNextId();
const ref = { id, name, kind } as RefByType<R>;
if (channelzEnabled) {
// @ts-expect-error typing issues
murgatroid99 (Member):
This is another instance that I didn't notice before. Please change this too.

AVVS (Contributor, Author):
I’ll fix it up similarly.

@AVVS (Contributor, Author) commented Mar 5, 2024

Noticed the tests failed. Should I increase the deadline for the state change? Or maybe we can look at the traces for this failure?

@murgatroid99 (Member)

First, I'm just rerunning them, to check if that failure is consistent. You could try just increasing the timeout, to see if that works. Otherwise, you could add a .only to the test group you added, and set the tracing environment variables in the gulpfile so that we can get the trace logs.

@murgatroid99 (Member) left a comment
I think this is in a reasonable state now. I would like to have less flakiness in the new tests, but I don't think that needs to be a blocker.

@murgatroid99 murgatroid99 merged commit 400147c into grpc:@grpc/grpc-js@1.10.x Mar 6, 2024
4 of 5 checks passed
@murgatroid99 (Member)

This is out in 1.10.2, along with a fix for the "client reconnects even when there are no active requests" behavior.
