
fix: track aborted status, do not stringify for tracing #2249

Merged
2 commits merged into grpc:@grpc/grpc-js@1.7.x on Oct 20, 2022

Conversation

@AVVS (Contributor) commented Oct 17, 2022

I've been working on improving the performance of grpc-js and a few things stood out to me:

  1. Each Http2ServerCallStream creates an empty timer, which adds overhead. I assume this was done in an effort to keep the same backing map, but performance is gained by not creating and clearing timeouts needlessly.
  2. Track the aborted event and emit 'cancelled' only when the stream was actually aborted; a graceful stream end shouldn't result in what is ultimately an exception. This allows for improved performance around the cancellation logic.
  3. Wrap the tracing calls, since JSON.stringify + toJSON on metadata is an expensive operation and is currently performed regardless of whether tracing is actually on (see the sketch at the end of this comment).

Locally I've had all the tests pass on Node 18, fingers crossed they pass everywhere else. I'm still mostly concerned about (2), as I don't yet fully grasp the logic around this call (checked blame, and it has been like that for 3 years).

Anyhow, hopefully these changes are useful, or can at least be a starting point for further improvements.
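
A minimal sketch of the idea in item 3. The helpers here are stand-ins, not grpc-js's actual logging API (the library does read tracer names from the GRPC_TRACE environment variable, but isTraceEnabled/trace below are illustrative names):

// Illustrative stand-ins for the internal tracing helpers.
const enabledTracers = new Set(
  (process.env.GRPC_TRACE ?? '').split(',').filter(Boolean)
);

function isTraceEnabled(tracer: string): boolean {
  return enabledTracers.has(tracer) || enabledTracers.has('all');
}

function trace(tracer: string, text: string): void {
  console.error(`${new Date().toISOString()} | ${tracer} | ${text}`);
}

// Before: the log string (and the Metadata serialization inside it) was built
// on every request. After: JSON.stringify only runs when the tracer is on.
function traceReceivedMetadata(
  methodPath: string,
  metadata: { toJSON(): Record<string, unknown> }
): void {
  if (isTraceEnabled('server_call')) {
    trace(
      'server_call',
      `Request to ${methodPath} with metadata ${JSON.stringify(metadata.toJSON())}`
    );
  }
}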

@linux-foundation-easycla (bot) commented Oct 17, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@AVVS (Contributor, Author) commented Oct 17, 2022

To give more perspective: grpc-js typically benches around 23-25k rps for unary calls (I know that's not the best use of it and streams are much more efficient), with 1.7 ms average latency and min/max latency ranging wildly from 0.4 ms to 25 ms (!) as the loop gets blocked by GC and other native-side code. Compared to fastify, which handles around 100k rps on my machine with ~0 ms average latency and 0.1 ms max latency, there is a lot of room for improvement.

@AVVS (Contributor, Author) commented Oct 18, 2022

In the latest commit, several pieces have been refactored:

  1. channelz - the associated session stream handler is selected at startup, which removes a lot of checks whose outcome we already know ahead of time; some code is duplicated, but not too much (see the sketch after this list).
  2. Commented out the validate checks on get / remove - the respective keys would not be there in the first place, so there is no reason to validate, especially on a hot path with plenty of .remove calls.
  3. receiveUnaryMessage refactored to queue fewer microtasks, and to avoid them completely when compression isn't used. This is a little messy, as we introduce a few try/catch blocks and work on a promise directly in the decompression case, but errors are handled correctly and overall it's much more performant.
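
A minimal sketch of the handler-selection idea from item 1; the names and handler bodies are illustrative, not grpc-js internals:

// Pick the stream handler once, at server start, instead of re-checking the
// channelz flag on every incoming request.
type StreamHandler = (stream: object, headers: Record<string, string>) => void;

function selectStreamHandler(channelzEnabled: boolean): StreamHandler {
  if (channelzEnabled) {
    return (_stream, _headers) => {
      // update channelz call counters, then handle the request
    };
  }
  return (_stream, _headers) => {
    // handle the request with no channelz bookkeeping at all
  };
}

// At startup (channelz is toggled via the 'grpc.enable_channelz' option):
// const handler = selectStreamHandler(options['grpc.enable_channelz'] !== 0);
// http2Server.on('stream', handler);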

With everything in here, I managed to get this to about 43k RPS for unary requests, reliably:

ghz --insecure --proto ./src/hello.proto --call helloworld.Greeter.SayHello -d '{"who":"nodeconf"}' -n 3000000 --connections 2 -c 400 --count-errors --timeout 0 0.0.0.0:8080 --cpus=8 --skipFirst=10000

Summary:
  Count:	2990000
  Total:	68.92 s
  Slowest:	37.58 ms
  Fastest:	0.09 ms
  Average:	7.02 ms
  Requests/sec:	43383.58

Response time histogram:
  0.094  [1]      |
  3.842  [135888] |∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.590  [427175] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  11.338 [375947] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  15.087 [49140]  |∎∎∎∎∎
  18.835 [10126]  |∎
  22.583 [789]    |
  26.331 [526]    |
  30.079 [11]     |
  33.827 [183]    |
  37.575 [214]    |

Latency distribution:
  10 % in 3.38 ms
  25 % in 5.13 ms
  50 % in 7.17 ms
  75 % in 8.89 ms
  90 % in 10.45 ms
  95 % in 11.77 ms
  99 % in 15.43 ms

Status code distribution:
  [OK]   2990000 responses

Associated flame graph:

[Screenshot: flame graph, Oct 18 2022]

There are 2 areas of concern:

  • Node.js stream.respond is relatively slow
  • Lots of transforms between http2 headers <-> grpc.Metadata

Not sure what exactly could be done here except trying out micro-optimizations.

@AVVS (Contributor, Author) commented Oct 18, 2022

One of the tests failed due to the node: import prefix; reverted this.

@AVVS (Contributor, Author) commented Oct 18, 2022

In terms of switch vs if: I believe if is faster up to about 10 clauses, based on what I've heard, but that shouldn't matter much here, so I used it anyway.

stream.on('error', onEnd);

function onData(chunk: Buffer) {
  receivedLength += chunk.length;
Member:
This changes byteLength to length. I'm not sure exactly what the difference is, but I think this may be causing the out of bounds access error in one of the tests.
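
For what it's worth, a quick illustration of when the two can differ (my reading, not a claim about what the failing test hits): on a Buffer they agree, but on a string chunk .length counts UTF-16 code units while Buffer.byteLength counts bytes.

const buf = Buffer.from('héllo');
console.log(buf.length);              // 6 (bytes)
console.log(buf.byteLength);          // 6 (bytes) - identical for a Buffer

const str = 'héllo';
console.log(str.length);              // 5 (UTF-16 code units)
console.log(Buffer.byteLength(str));  // 6 (UTF-8 bytes) - differs for strings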

@AVVS (Contributor, Author) commented Oct 18, 2022

@murgatroid99 updated based on the suggestions, let's see how the tests do!

@AVVS (Contributor, Author) commented Oct 18, 2022

Seems like it's crashing at const compressed = requestBytes.readUInt8(0) === 1; meaning the message is empty. I'll see what changed and why this isn't handled anymore.
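
A minimal sketch of the kind of guard this implies (my illustration, not the fix that actually went in): readUInt8(0) throws ERR_OUT_OF_RANGE on an empty buffer, so the empty-message case has to be checked first.

// The first byte of the assembled request is read as the compressed flag;
// guard against an empty buffer before reading it.
function isCompressed(requestBytes: Buffer): boolean {
  if (requestBytes.length === 0) {
    return false;
  }
  return requestBytes.readUInt8(0) === 1;
}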

@AVVS (Contributor, Author) commented Oct 18, 2022

Seems like the tests are generally OK. In terms of test targets, it's kind of odd that they run on Node 10/12, as both are EOL. Any plans to update that?

@murgatroid99 (Member):

Those test results look pretty good. The Windows failure is old and unrelated to this change. #2246 is an open PR to test on more recent versions. Also, it's only Windows that currently doesn't test on any version newer than 12.

If you would like this to be published as a patch release on 1.7.x, you'll have to rebase it onto the 1.7.x branch and then you or I can change the branch target. Otherwise it looks good to go.

@AVVS (Contributor, Author) commented Oct 18, 2022

I have one more experiment with https://www.npmjs.com/package/reusify - it's used in the underlying modules for fastify and should yield significant improvements if I can wrap Metadata in a way that doesn't break things. It's kind of tricky to get right, since I don't always know when it's safe to release an object. Maybe it would be a good idea to have a non-pooled implementation for client calls and a pooled one for the server side, as there we can generally release once the request has ended.

I'll work on that tomorrow and will rebase to 1.7.x in either case
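
A rough sketch of what pooling with reusify looks like; PooledMetadata here is a hypothetical wrapper, not the real grpc-js Metadata class:

const reusify = require('reusify');

// Hypothetical pooled wrapper - reusify chains free instances through `next`.
class PooledMetadata {
  next: PooledMetadata | null = null;
  map = new Map<string, string[]>();
  reset(): void {
    this.map.clear();
  }
}

const pool = reusify(PooledMetadata);

// Acquire an instance for the lifetime of one RPC...
const md = pool.get();
md.map.set('x-request-id', ['abc']);

// ...and hand it back once the HTTP/2 stream is fully closed, so the next
// request reuses the allocation instead of adding GC pressure.
md.reset();
pool.release(md);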

@AVVS (Contributor, Author) commented Oct 18, 2022

[Screenshot: flame graph with kernel tracing, Oct 18 2022]

if you look at the bottom right - this is with kernel tracing on - you can see how much time GC is taking

@murgatroid99 (Member):

It is never safe to release objects that have been passed to user application code. We never know if they have stored it for later reference, and there is no API for them to release it.

@AVVS (Contributor, Author) commented Oct 19, 2022

I agree, but ultimately the only "quick" way to speed up further is to significantly reduce GC and/or speed up the conversion functions. The Metadata class specifically is the one creating the most pressure, and pooling it would eliminate a lot of GC; on each unary call I can see at least 3 such objects allocated.

This could be an opt-in on the part of the user via some sort of flag. The idea is that the Metadata lifecycle is tied to the RPC lifecycle - whenever the http streams are fully closed, the Metadata objects get released.

@murgatroid99 (Member):

There is quite a bit of potential complexity here (How should we implement this? Should it only cover Metadata? Should we also consider the client? etc.), so we should definitely address that in a different PR. It would also be a good idea to have a gRFC proposal to flesh out the details, because it seems like this will require a new API somewhere.

@AVVS AVVS changed the base branch from master to @grpc/grpc-js@1.7.x October 19, 2022 20:54
@AVVS (Contributor, Author) commented Oct 19, 2022

eh gotta rebase now ❌

@murgatroid99 (Member):

I don't think you want to keep that merge commit. It's probably not going to make the rebase easier.

1. no unused timers, wrap tracing calls to avoid stringifying
2. track graceful end of the call and avoid emitting 'cancelled' in such cases
3. remove validate calls in metadata on operations where it's not needed
4. refactor server session stream handlers into separate channelz enabled/disabled handlers
5. refactor message request logic - reduce amount of microtasks generated
6. improve sendStatus a little when there is no metadata involved
@AVVS (Contributor, Author) commented Oct 19, 2022

Eh, had to squash it all up. There is a new part around .emit('close') + .sendStatus accepting null as Metadata. This is to avoid creating these objects, as there were quite a few places where they aren't needed.

  request: RequestType;
  end: (metadata?: Metadata) => void;
};
export type ServerReadableStream<RequestType, ResponseType> =
Contributor (Author):

This got reformatted; I think it's prettier from the project that forced the styling.

}

// to compensate for a fact that cancelled is not always called
this.emit('close');
Member:
I don't understand this change. I am not aware of any code that is listening for this close event. In addition, in what situation would an application care about this event if it has already sent a status?

Contributor (Author):

This is sort of similar to 'cancelled', but it allows you to catch all cases and do unconditional cleanup, via stream.finished for instance.

Before this, I had to listen to the 'cancelled' event and manually call call.destroy() - without that, 'close' wouldn't trigger.

I might have gone too far with the change though? Happy to revert it, as I'm not 100% sure that's how it's supposed to be.
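
For illustration, the kind of unconditional cleanup being described, assuming the call emits 'close' as this change proposed (the handler is my own example, not code from the PR):

import { finished } from 'stream';
import type { ServerWritableStream } from '@grpc/grpc-js';

function handleStream(call: ServerWritableStream<unknown, unknown>): void {
  const timer = setInterval(() => call.write({ tick: Date.now() }), 1000);

  // With a reliable 'close', stream.finished fires on graceful end, error,
  // and cancellation alike, so cleanup can be attached once.
  finished(call, () => clearInterval(timer));
}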

Contributor (Author):

[Two screenshots, Oct 19 2022]

Contributor (Author):

Reverted this; if needed we can do a follow-up, so it's just a separate, unrelated issue.

Member:
If you're handling a stream of messages from the remote end, you should be listening for the end event to know when you have received all messages. And conversely, if you are sending a stream of messages, you should call end() to indicate that you have finished writing. You shouldn't generally need to listen for the cancelled event unless you are doing a long asynchronous operation. I don't know why you would need to delete the call at all.
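
As a quick illustration of that usage (my example, not code from the PR):

import type { ServerDuplexStream } from '@grpc/grpc-js';

// 'end' tells us the client has finished writing; calling end() tells the
// client that we have finished writing.
function echoHandler(call: ServerDuplexStream<unknown, unknown>): void {
  call.on('data', (message) => call.write(message));
  call.on('end', () => call.end());
}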

@AVVS (Contributor, Author) commented Oct 19, 2022 via email

@murgatroid99 (Member):

In that case you should still always see the cancelled event after your change.

Anyway, it looks like the PR is in a good state now, and the tests are passing. Is there anything else you would like to do before we merge it?

@AVVS (Contributor, Author) commented Oct 20, 2022

LGTM in this current state ;) If you could add the hacktoberfest-accepted label after that, it would be grand, too. T-shirts <3

@murgatroid99 (Member) left a review comment:

Thank you for the contribution. I am happy to see some performance improvements in this library.

Regarding Hacktoberfest, repositories opt in to that event this year, and this repository was not opted in until just now, so you almost didn't get credit for this PR. According to the website, you don't actually need that label. It is also sufficient for the PR to be approved or merged.

@murgatroid99 murgatroid99 merged commit b364135 into grpc:@grpc/grpc-js@1.7.x Oct 20, 2022
@murgatroid99 (Member):

Can you share the profiling setup you used?

@AVVS (Contributor, Author) commented Oct 27, 2022

@murgatroid99 pretty much everything is in there; you are likely going to be most interested in this:

https://github.com/AVVS/talks/blob/main/nodeconfeu2022/scaling-nodejs/src/grpc-raw.ts

For serialization I used ts-proto + buf to generate the code (committed to the repo); for flamegraphs I'm using clinic.js - there are a bunch of scripts you can meddle with (e.g. enable/disable kernel tracing).

For the load producer I used ghz, but I think you can also use something like nghttp2; it's just a little more involved.
