
compute: Add protocol metrics #16415

Closed
Tracked by #14663
lluki opened this issue Dec 1, 2022 · 2 comments · Fixed by #17224
Labels
A-COMPUTE Topics related to the Compute layer C-feature Category: new feature or request

Comments

lluki (Contributor) commented Dec 1, 2022

Feature request

We currently don't collect any statistics on the environmentd-clusterd communication. These statistics would help to identify network-related problems and let us measure any progress we make in optimizing message sizes. In essence, they should provide a breakdown of the envd-computed bandwidth used.

Part of #14663

What metrics?

Since we create a network connection to each process in a replica, it makes the most sense to collect these statistics per connection.

  • For each command type, a size histogram.
  • For each response type, a size histogram.

From these metrics we can then derive the total number of messages, the message rate, and the total number of commands sent and responses received.
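The shape of these metrics can be sketched without committing to a metrics library. Below is a minimal, dependency-free Rust illustration of a per-connection, per-message-type size histogram with exponential buckets; `SizeHistogram`, the bucket layout, and the `"CreateDataflow"` label are all hypothetical, not what was eventually implemented.

```rust
use std::collections::HashMap;

/// Illustrative size histogram (hypothetical, not the shipped implementation).
struct SizeHistogram {
    /// Upper bounds of the buckets, in bytes, ascending.
    bounds: Vec<u64>,
    /// counts[i] counts observations <= bounds[i]; the last slot is +Inf.
    counts: Vec<u64>,
}

impl SizeHistogram {
    /// Exponential buckets: start, start*factor, start*factor^2, ...
    fn exponential(start: u64, factor: u64, count: usize) -> Self {
        let mut bounds = Vec::with_capacity(count);
        let mut b = start;
        for _ in 0..count {
            bounds.push(b);
            b *= factor;
        }
        SizeHistogram { bounds, counts: vec![0; count + 1] }
    }

    fn observe(&mut self, size: u64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| size <= b)
            .unwrap_or(self.bounds.len()); // overflow goes to the +Inf slot
        self.counts[idx] += 1;
    }

    fn total(&self) -> u64 {
        self.counts.iter().sum()
    }
}

fn main() {
    // One histogram per (connection, message type), keyed here by type label.
    let mut per_type: HashMap<&str, SizeHistogram> = HashMap::new();
    let h = per_type
        .entry("CreateDataflow")
        .or_insert_with(|| SizeHistogram::exponential(64, 4, 8));
    h.observe(100);   // lands in the 64..=256 bucket
    h.observe(5_000); // lands in the 4096..=16384 bucket
    assert_eq!(h.total(), 2);
    println!("total observations: {}", h.total());
}
```

Summing the bucket counts gives the total message count per type, which answers the "total number of commands sent and responses received" part directly.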

Open questions

  • What buckets for command/response sizes?
  • Check if our gRPC library tonic exposes some of these stats already.

Implementation

The least-overhead approach would be to instrument our gRPC library tonic to report the size of each encoded/decoded message and store it.

tonic allows plugging in custom Decoders (hyperium/tonic#999). While writing a custom prost-based protobuf encoder is not very hard, the API makes it impossible to access (or pass in) the decoder instance from the outside. Thus we'd have to rely on some global variable or other hacks to access the statistics. Even if we write a custom encoder, we can't avoid some duplicate work, because the encode function of the prost message will always call encoded_len internally and there is no clean way to access the result (unless we rely on the hidden API encode_raw).

1) Two encoded_len calls

However, I think calling an extra encoded_len is acceptable. It will traverse the Protobuf twice, but given that we create two copies anyway (for example, `ComputeCommand --> ProtoComputeCommand --> protobuf-encoded buffer`), an extra traversal seems not too expensive (it will also have perfect locality).

If we accept that encoded_len is called twice, the implementation becomes easier: we can extend our GrpcClient to collect statistics about messages and responses.
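The idea in option (1) can be sketched as follows. In real code, `encoded_len` would be prost's `Message::encoded_len`; here a `ProtoMessage` trait and a `FakeCommand` type stand in so the sketch is self-contained, and `GrpcClient`/`sent_sizes` are hypothetical names.

```rust
/// Stand-in for prost's `Message` trait (illustrative only).
trait ProtoMessage {
    fn encoded_len(&self) -> usize;
    fn encode(&self, buf: &mut Vec<u8>);
}

/// Hypothetical client wrapper that records command sizes per connection.
struct GrpcClient {
    /// Sizes of every message sent; a real implementation would feed these
    /// into a histogram instead of a Vec.
    sent_sizes: Vec<usize>,
}

impl GrpcClient {
    fn send<M: ProtoMessage>(&mut self, msg: &M, wire: &mut Vec<u8>) {
        // This is the extra traversal the issue discusses: prost's `encode`
        // calls `encoded_len` again internally, so the message is walked twice.
        self.sent_sizes.push(msg.encoded_len());
        msg.encode(wire);
    }
}

/// Fake message whose encoding is just its payload bytes.
struct FakeCommand(Vec<u8>);

impl ProtoMessage for FakeCommand {
    fn encoded_len(&self) -> usize {
        self.0.len()
    }
    fn encode(&self, buf: &mut Vec<u8>) {
        buf.extend_from_slice(&self.0);
    }
}

fn main() {
    let mut client = GrpcClient { sent_sizes: Vec::new() };
    let mut wire = Vec::new();
    client.send(&FakeCommand(vec![0u8; 42]), &mut wire);
    assert_eq!(client.sent_sizes, vec![42]);
    assert_eq!(wire.len(), 42);
}
```

The appeal of this variant is that all changes stay inside our own client wrapper; no tonic internals need to be touched.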

2) Use tower_http::trace

This crate intercepts traffic at the HTTP layer. Because we use a gRPC message stream, we only initiate one single long-lived request. The sent body contents can be inspected with the `OnBodyChunk` callback, which makes it possible to count the total bytes sent.
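The counting logic of option (2) amounts to a callback that sees each body chunk and adds its length to a shared counter. The sketch below mocks away the tower-http layer machinery (in real code the closure would be registered via the trace layer's body-chunk hook and receive the chunk bytes); only the stdlib counting part is shown.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

fn main() {
    // Shared byte counter for one connection.
    let bytes_sent = Arc::new(AtomicU64::new(0));

    // Hypothetical stand-in for an on-body-chunk callback: in real code this
    // closure would be handed to the tower-http trace layer.
    let counter = Arc::clone(&bytes_sent);
    let on_body_chunk = move |chunk: &[u8]| {
        counter.fetch_add(chunk.len() as u64, Ordering::Relaxed);
    };

    // Simulate the single long-lived gRPC stream emitting two chunks.
    on_body_chunk(&[0u8; 128]);
    on_body_chunk(&[0u8; 64]);

    assert_eq!(bytes_sent.load(Ordering::Relaxed), 192);
    println!("bytes sent: {}", bytes_sent.load(Ordering::Relaxed));
}
```

Note the limitation this implies: at the HTTP body level we only see opaque chunks, so this approach yields total bytes per connection but no per-message-type breakdown.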

3) Don't use generated code for API calls

Since we have only one RPC, command_response_stream, in the service definition, the tonic-generated code is actually not very complicated. The wrapper around command_response_stream is 25 lines, and we could inline those. That would allow us to inject a custom Codec object.

lluki added the C-feature (Category: new feature or request) and A-COMPUTE (Topics related to the Compute layer) labels on Dec 1, 2022
vmarcos (Contributor) commented Dec 5, 2022

I think this issue dovetails with #16026, which is about exposing metrics for the compute controller. Note that PR #16009 only introduced monitoring for computed replicas, not for the corresponding state in the compute controller, so AFAICT nothing about the compute controller's internal state is currently exposed.

lluki (Contributor, Author) commented Dec 12, 2022

It seems to me there are two different categories: per-instance metrics and per-replica metrics. For response size/type I'd like to collect those per replica, so that we can identify slow/misbehaving replicas.

For the ones outlined in #16026, I'm pretty sure per-instance metrics suffice:

  • Raw length of the compute command history per instance.
  • Dataflow count in the compute command history per instance.
  • Number of collections in the collection metadata map per instance.

The command size/type distribution is very likely the same across the replicas of an instance, even though we specialize the commands a bit per replica. Still, it might make sense to collect those per replica as well.
