Feature request

Part of #14663

We currently don't collect any statistics on the environmentd-clusterd communication. Such statistics would help identify network-related problems and let us measure any progress we make in optimizing message sizes. In essence, they should provide a breakdown of the envd-computed bandwidth used.
What metrics?

Since we create a network connection to each process in a replica, it makes the most sense to collect these statistics per connection:
For each command type, a size histogram.
For each response type, a size histogram.
From these metrics we can then calculate the total number of messages, the message rate, and the total number of commands sent and responses received.
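As a rough illustration of how little state each histogram needs, here is a minimal sketch of a per-connection size histogram; the type, field names, and bucket layout are hypothetical, not a proposal:

```rust
/// Minimal sketch of a per-connection size histogram. The struct
/// and method names here are hypothetical stand-ins.
struct SizeHistogram {
    bounds: Vec<usize>, // bucket upper bounds, ascending
    counts: Vec<u64>,   // one counter per bucket, plus an overflow bucket
    total_bytes: u64,
}

impl SizeHistogram {
    fn new(bounds: Vec<usize>) -> Self {
        let buckets = bounds.len() + 1; // last bucket catches oversized messages
        SizeHistogram { bounds, counts: vec![0; buckets], total_bytes: 0 }
    }

    fn observe(&mut self, size: usize) {
        // First bucket whose upper bound fits the message, else overflow.
        let idx = self
            .bounds
            .iter()
            .position(|&bound| size <= bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.total_bytes += size as u64;
    }

    /// The total message count falls out of the bucket counts, so the
    /// derived metrics above need no extra bookkeeping.
    fn message_count(&self) -> u64 {
        self.counts.iter().sum()
    }
}
```

Note that the derived totals (message count, bytes sent) come for free from the bucket counters, so only the histograms themselves need to be stored per connection.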
Open questions
What buckets for command/response sizes?
Check whether our gRPC library, tonic, already exposes some of these stats.
Implementation
The lowest-overhead approach would be to instrument our gRPC library, tonic, to report the size of each encoded/decoded message and store it.
tonic allows plugging in custom Decoders (hyperium/tonic#999). While writing a custom prost-based Protobuf encoder is not very hard, the API makes it impossible to access (or pass in) the decoder instance from the outside, so we'd have to rely on a global variable or other hacks to access the statistics. Even if we write a custom encoder, we can't avoid some duplicate work, because the encode function of a prost message always calls encoded_len internally and there is no clean way to access that result (unless we rely on the hidden API encode_raw).
1) Two encoded_len calls
However, I think calling encoded_len an extra time is acceptable. It will traverse the Protobuf message twice, but given that we create two copies anyway (for example ComputeCommand --> ProtoComputeCommand --> encoded Protobuf buffer), an extra traversal seems not too expensive (and it will have perfect locality).
If we accept that we have to call encoded_len twice, the implementation becomes easier: we can extend our GrpcClient to collect statistics about messages and responses.
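A minimal sketch of what that extension could look like on the send path; EncodedLen stands in for prost's Message::encoded_len, and the GrpcStats type and method names are hypothetical:

```rust
/// Hypothetical stand-in for prost's `Message::encoded_len`.
trait EncodedLen {
    fn encoded_len(&self) -> usize;
}

/// Hypothetical stats container the GrpcClient could own per connection.
#[derive(Default)]
struct GrpcStats {
    commands_sent: u64,
    command_bytes: u64,
}

impl GrpcStats {
    fn record_command<M: EncodedLen>(&mut self, msg: &M) {
        // This is the extra encoded_len call: one more traversal of the
        // message, in exchange for not touching tonic's codec internals.
        self.commands_sent += 1;
        self.command_bytes += msg.encoded_len() as u64;
    }
}
```

Responses would get a mirrored record_response on the receive path; the actual encoding still happens inside tonic, untouched.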
2) Use tower_http::trace

This crate intercepts at the HTTP layer. Because we are using a gRPC message stream, we only initiate a single long-lived request. The sent body contents can be inspected via the OnBodyChunk callback, which makes it possible to count the total bytes sent.
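The counting such an OnBodyChunk callback would do boils down to bumping a shared counter per chunk; a self-contained sketch (the wiring into the actual tracing layer is omitted, and make_chunk_counter is a hypothetical helper):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Returns a shared byte counter and a callback that adds each body
/// chunk's length to it, mimicking what an OnBodyChunk hook would do.
fn make_chunk_counter() -> (Arc<AtomicU64>, impl Fn(&[u8])) {
    let total = Arc::new(AtomicU64::new(0));
    let counter = Arc::clone(&total);
    let on_body_chunk = move |chunk: &[u8]| {
        counter.fetch_add(chunk.len() as u64, Ordering::Relaxed);
    };
    (total, on_body_chunk)
}
```

The downside relative to option 1 is visible here: we only see opaque HTTP body bytes, so we get total bandwidth but no per-command-type breakdown.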
3) Don't use generated code for API calls
Since we have only one RPC in the service definition, command_response_stream, the tonic-generated code is actually not very complicated. The wrapper around command_response_stream is 25 lines and we could inline it, which would allow us to inject a custom Codec object.
I think this issue dovetails with #16026, which is about exposing metrics for the compute controller. Note that PR #16009 only introduced monitoring for computed replicas, not for the corresponding state in the compute controller, so AFAICT nothing about the compute controller's internal state is being exposed right now.
It seems to me there are two different categories: per-instance metrics and per-replica metrics. For response size/type I'd like to collect those per replica, so that we can identify slow/misbehaving replicas.
For the ones outlined in #16026, I'm pretty sure per-instance metrics suffice:
Raw length of compute command history size per Instance.
Dataflow count in compute command history per Instance.
Number of collections in collections metadata map per Instance.
The command size/type distribution is very likely the same across the replicas of an instance, even though we specialize the commands a bit per replica. So it might make sense to also collect those per replica.