
Sending large complex message limits throughput #1879

Closed
snowzach opened this issue Feb 21, 2018 · 20 comments
Labels
P2 Type: Performance Performance improvements (CPU, network, memory, etc)

Comments

@snowzach

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.10

What version of Go are you using (go version)?

1.9.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I have a bidirectional stream opened sending large complicated messages.

What did you expect to see?

Super Fast

What did you see instead?

Only kinda fast

So I've been looking at the code, and for streams you can only call Send() one at a time. The message I am sending is very large and complex (it includes things like Structs with many levels). As part of the Send operation it encodes the message, which takes some time. Meanwhile, I cannot call Send from any other threads, so my operations look like Send, Encode, Send, Encode...

I think this may be severely limiting how much data I can send with the grpc stream. It would be nice if I could somehow bypass the encoding stage (so I can do it in parallel) and then when I call send it's literally only sending the message and not tying up the sender with encoding.

Perhaps there could be some mechanism that serializes only the final send operation, so that Send could be called from multiple threads and the encoding and compression could be handled in parallel?

@menghanl
Contributor

My understanding is that you want to do message encoding on your own, and send the encoded bytes directly.

One trivial way to do this is to define proto messages that contain bytes only. There will still be some encoding going on in protobuf, but the overhead should be low.

Another way is to create and register your own byte-slice codec (how to use codec); the codec you define should be similar to this.
To send and receive messages, you will need to assert the stream to grpc.ClientStream and call SendMsg and RecvMsg directly.

For example:

bytes := encode(msg)
stream.(grpc.ClientStream).SendMsg(&bytes)
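A minimal sketch of such a byte-slice codec (names are illustrative; in a real program this would implement the `encoding.Codec` interface from `google.golang.org/grpc/encoding` and be registered with `encoding.RegisterCodec`, but this stand-alone version just shows the method set):

```go
package main

import "fmt"

// rawCodec passes pre-encoded bytes through untouched, so marshaling can be
// done by the caller, in parallel, before Send is ever called.
type rawCodec struct{}

func (rawCodec) Marshal(v interface{}) ([]byte, error) {
	b, ok := v.(*[]byte)
	if !ok {
		return nil, fmt.Errorf("rawCodec.Marshal: expected *[]byte, got %T", v)
	}
	return *b, nil // no protobuf encoding here: the caller already did it
}

func (rawCodec) Unmarshal(data []byte, v interface{}) error {
	b, ok := v.(*[]byte)
	if !ok {
		return fmt.Errorf("rawCodec.Unmarshal: expected *[]byte, got %T", v)
	}
	*b = data // hand the raw frame back to the caller untouched
	return nil
}

func (rawCodec) Name() string { return "raw" }

func main() {
	payload := []byte("pre-encoded protobuf bytes")
	out, err := rawCodec{}.Marshal(&payload)
	fmt.Println(len(out), err) // prints: 26 <nil>
}
```

In real use you would then select this codec per call (for example via a call option such as grpc.CallContentSubtype, assuming your grpc-go version supports it) and pass `*[]byte` values to SendMsg/RecvMsg.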

@snowzach
Author

Thanks @menghanl, are there other people who have done this? Maybe this isn't really the problem. When I close the stream from the other side, my benchmark code speeds up dramatically.

Is there no notion of buffering on SendMsg, or is it tying up the machine? I will try marshalling to bytes and see if this makes it faster in testing.

@snowzach
Author

Okay, I implemented a custom codec that overrides the proto codec. Basically, if it detects a []byte it passes it through untouched; anything else it sends to the real proto codec. This did have some effect, but not really as much as I was hoping. It just seems like there's a limit to the throughput of the stream. Is this normal? Should I only expect so much and open multiple streams?

@menghanl menghanl added P2 Type: Performance Performance improvements (CPU, network, memory, etc) labels Feb 22, 2018
@snowzach
Author

snowzach commented Feb 22, 2018

I'm streaming lots of 4k messages. Do I need to pack more into the same message in order to make it faster? I was hoping using a stream would negate the need for larger messages. Encoding/compression also seems to cause performance issues: if I enable compression, I get approximately a 20% decrease in message throughput.

I looked at the code for a little bit... I'm certainly not an expert, and I'm sure it's not this simple (it never is), but would it be possible to make Send thread safe, handle message marshaling/compression asynchronously, and have maybe a small channel/buffer so that messages can get queued up and sent more efficiently on the wire?

@snowzach
Author

To illustrate my point, I did some more testing (I can provide code examples if you need) but basically I created a simple bidirectional stream and the message had one field which was a single google.protobuf.Value. I opened the stream and the server sent that single message as fast as it could.

I used the jsonpb library to fill that value with a single string (from JSON) that, when marshalled, was about 4k. I was able to send this message 36k times per second.

I then used the jsonpb library to fill that value with a complex struct generated from https://www.json-generator.com/. When converted to protobuf(bytes) it was also about 4k. I was only able to send this message about 3.5k times per second. Same message size on the wire, 10 times slower.

This is having a dramatic effect on my throughput. Opening multiple connections to the same server doesn't help much either: beyond about 2 connections, it makes no difference in throughput.

With a complex protobuf message generating about a 4.3k message (after marshaling to []byte), I can only hit about 6k mps, or 25 MB/s, from a client/server on the same host over multiple streams.

@MakMukhi
Contributor

Hey @snowzach, can you benchmark how long it takes to marshal/unmarshal both kinds of messages that you have? From gRPC's point of view, performance should be the same for a 4k-byte message. If the bottleneck boils down to the cost of serialization and deserialization, then it might be a more suitable question for the protobuf team.

As a workaround, you may try the following if it helps:

  • Have multiple goroutines serialize these messages into bytes.
  • If the order doesn't matter, put them in a pool.
  • If the order does matter, have them wait for their turn.
  • Have another goroutine which only does stream.(grpc.ClientStream).SendMsg(&bytes).
  • This sending goroutine may either read serialized messages from the common pool (if order doesn't matter) or signal goroutines for their turns. (This synchronization will be costly!)
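The unordered ("pool") variant of the steps above can be sketched with plain channels. Here `encode` and `send` are stand-ins for proto.Marshal and stream.SendMsg; ordering is not preserved:

```go
package main

import (
	"fmt"
	"sync"
)

// pipeline fans serialization out to `workers` goroutines and funnels the
// encoded bytes into a single sender goroutine, since only one goroutine may
// call SendMsg on a stream at a time.
func pipeline(msgs []string, encode func(string) []byte, send func([]byte) error, workers int) error {
	jobs := make(chan string)
	encoded := make(chan []byte, workers) // the "pool" of serialized messages

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range jobs {
				encoded <- encode(m) // CPU-heavy work happens in parallel
			}
		}()
	}
	go func() {
		for _, m := range msgs {
			jobs <- m
		}
		close(jobs)
		wg.Wait()
		close(encoded)
	}()

	// Single consumer: the only goroutine that touches the stream.
	for b := range encoded {
		if err := send(b); err != nil {
			return err // a real program should also cancel the workers here
		}
	}
	return nil
}

func main() {
	sent := 0
	encode := func(s string) []byte { return []byte(s) }
	send := func(b []byte) error { sent++; return nil }
	pipeline([]string{"a", "b", "c", "d"}, encode, send, 2)
	fmt.Println(sent) // prints: 4
}
```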

@snowzach
Author

snowzach commented Feb 23, 2018

@MakMukhi I have been playing around with exactly what you suggest, and it makes a huge difference in performance. I get that the protobuf I am using is complicated. I guess the point I am trying to make is that the Send function of gRPC shouldn't be single-threaded (for a stream especially) if it's going to include serialization (or compression).

I also have the same problem with compression. Since it takes a small amount of time to compress each message, my messages-per-second throughput drops dramatically, and I suspect it's because the send queue is getting starved for data.

If Send were multi-threaded and handled everything it could (encoding/compression) simultaneously before sending the data, it could be much faster.

@steve-gray

@snowzach - I've implemented a change because I've observed the same behaviour - hoping it gets picked up. What I found was that locking around Send() became a bottleneck because of the compression and serialisation inside Send - I found this particularly vexing for high-volume bidirectional streaming. The change is at #2356 (Issue #2355) if you're keen to take a look and see if there are any commonalities in our respective situations.

@snowzach
Author

snowzach commented Oct 9, 2018

@steve-gray I ended up implementing my own stream wrapper along with my own intelligent protobuf codec. Essentially, my protobuf codec, when handed a []byte, passes it through on the assumption that it has already been marshaled from a protobuf struct to []byte. I then implemented a buffered wrapper around the send function that handles marshaling and unmarshaling protobuf structs to/from []byte in parallel and then uses SendMsg to move them on. Using this, I've massively increased the throughput.
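One way such a wrapper can overlap marshaling with sending while still preserving message order is a channel-of-channels pattern. This is only a sketch of the idea, not the code from the thread; `encode` and `send` stand in for the codec's Marshal and the stream's SendMsg:

```go
package main

import "fmt"

// orderedSend encodes messages in parallel but delivers them to the single
// sender in the original order: each message gets its own result channel, and
// the channels themselves are queued in order.
func orderedSend(msgs []string, encode func(string) []byte, send func([]byte) error) error {
	results := make(chan chan []byte, 16) // bounds how far encoding runs ahead

	go func() {
		for _, m := range msgs {
			ch := make(chan []byte, 1)
			results <- ch
			go func(m string, ch chan []byte) {
				ch <- encode(m) // runs concurrently with other encodes
			}(m, ch)
		}
		close(results)
	}()

	for ch := range results { // receive in queue order = original order
		if err := send(<-ch); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var got []string
	encode := func(s string) []byte { return []byte(s) }
	send := func(b []byte) error { got = append(got, string(b)); return nil }
	orderedSend([]string{"m1", "m2", "m3"}, encode, send)
	fmt.Println(got) // order preserved even though encoding ran in parallel
}
```

The buffered `results` channel acts as the small send queue: encoding of later messages proceeds while earlier ones are still on the wire, which is exactly what keeps the sender from being starved.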

@dfawley
Member

dfawley commented May 10, 2019

@snowzach, @steve-gray, @suyashkumar, or anyone stumbling on this issue:

We have an experimental feature implemented in #2560 (from the proposal in #2432) that may alleviate the related concerns here, and we'd be happy if you could test it out. Specifically, we're interested in:

  1. Does this API work for your use cases?
  2. Does it allow your applications to stream at the expected rate?
  3. General thoughts on the API?

@stale

stale bot commented Sep 6, 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@stale stale bot added the stale label Sep 6, 2019
@dfawley dfawley removed the stale label Sep 6, 2019
@knuesel

knuesel commented Mar 2, 2020

@dfawley here's a different use case for which I can provide feedback: we have an embedded device that publishes sensor data as a server-streaming RPC. A microcontroller reads the sensors, serializes the values into a protobuf message and writes the result on a serial interface. The gRPC server (running on a single-core CPU) reads the messages from the serial port and passes them to all connected clients. Unfortunately the server wastes a lot of resources doing unnecessary work:

  1. After reading from the serial port, the server already has the messages in serialized protobuf format. Yet it must deserialize them just so that Send can reserialize them.
  2. In case of multiple clients, the same message must be serialized several times.

This limits the frequency at which the server can publish sensor data.

The PreparedMsg API goes in the right direction for us but fails on the following points:

  • The API doesn't let us skip deserialization/reserialization: the only way to initialize a PreparedMsg is through Encode.
  • It could still help with problem 2, but Encode takes a stream and the server has a different stream for each client.
  • I think PreparedMsg support is only implemented for messages sent from the client? In my case Encode returns the error "unable to get rpcInfo".

It would be great if there was a way to wrap existing serialized data in a PreparedMsg (and support for sending PreparedMsg from the server).

@dfawley
Member

dfawley commented Mar 2, 2020

@knuesel you should look into a custom codec instead of PreparedMsg for this use case.

https://github.com/grpc/grpc-go/blob/master/Documentation/encoding.md

If you need the same server to be able to handle both pre-encoded and un-encoded data, the codec could first do a type assertion. If the message passed in is already a []byte, return it directly; otherwise, encode it with proto.Marshal.
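That type assertion could look roughly like this. This is a hedged sketch, not grpc-go's API: `fallback` stands in for the real proto marshaler (e.g. proto.Marshal), a full codec would also implement Unmarshal, and the Name must be "proto" so it overrides the default registered codec:

```go
package main

import "fmt"

// passthroughCodec reports itself as "proto" so it replaces the default codec
// on the server; clients need no changes. Pre-encoded []byte values skip
// re-serialization; anything else falls through to the real proto marshaler.
type passthroughCodec struct {
	fallback func(interface{}) ([]byte, error)
}

func (c passthroughCodec) Marshal(v interface{}) ([]byte, error) {
	if b, ok := v.([]byte); ok {
		return b, nil // already serialized: send as-is
	}
	return c.fallback(v) // normal message: encode with protobuf
}

func (c passthroughCodec) Name() string { return "proto" }

func main() {
	c := passthroughCodec{fallback: func(v interface{}) ([]byte, error) {
		return []byte(fmt.Sprint(v)), nil // toy fallback for this sketch
	}}
	raw, _ := c.Marshal([]byte{0x08, 0x01}) // pre-encoded: passed through
	other, _ := c.Marshal(42)               // not []byte: hits the fallback
	fmt.Println(len(raw), string(other))    // prints: 2 42
}
```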

@knuesel

knuesel commented Mar 2, 2020

@dfawley I did make a quick attempt at using a custom codec on the server, but found out that I would need to implement the codec in the clients too. That would be significant work in our case as we have many different clients, currently in C++, Go, Dart, Python, Java and C#. Certainly doable but a lot of code to write/maintain for an optimization that only concerns the server. Hence my interest in the PreparedMsg solution :-).

@dfawley
Member

dfawley commented Mar 2, 2020

You shouldn't need to implement a custom codec on the client, as long as the message is of the right type & encoding when it is sent from the server.

If you're worried about the name of the codec affecting the client's behavior, you should override the "proto" codec on the server.

@snowzach
Author

snowzach commented Mar 2, 2020

@knuesel check out https://github.com/snowzach/protosmart
I just pushed that code. I haven't actually used it in a really long time, but it does what you want. Essentially, if you receive or send using a byte buffer, it assumes you either don't want to convert to a proto message or have already converted a proto message struct to bytes, and it bypasses the conversion. I used this to convert thousands of very large messages to bytes in parallel before sending, thus increasing my throughput. You could use this to greatly decrease CPU usage.

If you have trouble with it, let me know. I'll try to help. Like I said, it's been a while since I used it.

@knuesel

knuesel commented Mar 2, 2020

@dfawley many thanks! It works indeed when I override the proto codec using the CustomCodec ServerOption. This applies to the whole server, though, while I only have pre-serialized messages for one service (and only in the outbound direction). Is there a way to use the custom codec for only one service? If not, a PreparedMsg solution would still be nicer, but this is already quite good.

@knuesel

knuesel commented Mar 2, 2020

@snowzach this looks very nice, thanks! I'll try it as soon as possible.

@knuesel

knuesel commented Mar 2, 2020

@snowzach this is very elegant and flexible and it works like a charm! (@dfawley this completely addresses my use case so you can disregard it). Thanks again to both of you.

@dfawley
Member

dfawley commented May 3, 2021

I think this can be resolved. If there is more to do here, please let us know.

@dfawley dfawley closed this as completed May 3, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021