
compute: Add protocol metrics #16415

Closed
Tracked by #14663
lluki opened this issue Dec 1, 2022 · 2 comments · Fixed by #17224
Labels
A-COMPUTE Topics related to the Compute layer C-feature Category: new feature or request

Comments

lluki (Contributor) commented Dec 1, 2022

Feature request

We currently don't collect any statistics on the environmentd-clusterd communication. These statistics would help to identify network-related problems and let us measure any progress we make in optimizing message sizes. In essence, they should provide a breakdown of the envd-computed bandwidth used.

Part of #14663

What metrics?

Since we create a network connection to each process in a replica, it makes the most sense to collect these statistics per connection.

  • For each command type, a size histogram.
  • For each response type, a size histogram.

From these metrics we can then derive the total number of messages, the message rate, and the total number of commands sent and responses received.
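The shape of these metrics can be sketched without committing to a metrics library. Below is a minimal, dependency-free Rust illustration of a per-connection, per-message-type size histogram with exponential buckets; `SizeHistogram`, the bucket layout, and the `"CreateDataflow"` label are all hypothetical, not what was eventually implemented.

```rust
use std::collections::HashMap;

/// Illustrative size histogram (hypothetical, not the shipped implementation).
struct SizeHistogram {
    /// Upper bounds of the buckets, in bytes, ascending.
    bounds: Vec<u64>,
    /// counts[i] counts observations <= bounds[i]; the last slot is +Inf.
    counts: Vec<u64>,
}

impl SizeHistogram {
    /// Exponential buckets: start, start*factor, start*factor^2, ...
    fn exponential(start: u64, factor: u64, count: usize) -> Self {
        let mut bounds = Vec::with_capacity(count);
        let mut b = start;
        for _ in 0..count {
            bounds.push(b);
            b *= factor;
        }
        SizeHistogram { bounds, counts: vec![0; count + 1] }
    }

    fn observe(&mut self, size: u64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| size <= b)
            .unwrap_or(self.bounds.len()); // overflow goes to the +Inf slot
        self.counts[idx] += 1;
    }

    fn total(&self) -> u64 {
        self.counts.iter().sum()
    }
}

fn main() {
    // One histogram per (connection, message type), keyed here by type label.
    let mut per_type: HashMap<&str, SizeHistogram> = HashMap::new();
    let h = per_type
        .entry("CreateDataflow")
        .or_insert_with(|| SizeHistogram::exponential(64, 4, 8));
    h.observe(100);   // lands in the 64..=256 bucket
    h.observe(5_000); // lands in the 4096..=16384 bucket
    assert_eq!(h.total(), 2);
    println!("total observations: {}", h.total());
}
```

Summing the bucket counts gives the total message count per type, which answers the "total number of commands sent and responses received" part directly.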

Open questions

  • What buckets for command/response sizes?
  • Check if our gRPC library tonic exposes some of these stats already.

Implementation

The least-overhead approach would be to instrument our gRPC library tonic to report the size of each encoded/decoded message and store it.

tonic allows plugging in custom Decoders (hyperium/tonic#999). While writing a custom prost-based protobuf encoder is not very hard, the API makes it impossible to access (or pass in) the decoder instance from the outside. Thus we'd have to rely on some global variable or other hacks to access the statistics. Even if we write a custom encoder, we can't avoid some duplicate work, because the encode function of the prost message will always call encoded_len internally and there is no clean way to access the result (unless we rely on the hidden API encode_raw).

1) Two encoded_len calls

However, I think calling an extra encoded_len is acceptable. It will traverse the Protobuf twice, but given that we create two copies anyway (for example, `ComputeCommand --> ProtoComputeCommand --> protobuf-encoded buffer`), an extra traversal seems not too expensive (it will also have perfect locality).

If we accept that encoded_len is called twice, the implementation becomes easier: we can extend our GrpcClient to collect statistics about messages and responses.
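The idea in option (1) can be sketched as follows. In real code, `encoded_len` would be prost's `Message::encoded_len`; here a `ProtoMessage` trait and a `FakeCommand` type stand in so the sketch is self-contained, and `GrpcClient`/`sent_sizes` are hypothetical names.

```rust
/// Stand-in for prost's `Message` trait (illustrative only).
trait ProtoMessage {
    fn encoded_len(&self) -> usize;
    fn encode(&self, buf: &mut Vec<u8>);
}

/// Hypothetical client wrapper that records command sizes per connection.
struct GrpcClient {
    /// Sizes of every message sent; a real implementation would feed these
    /// into a histogram instead of a Vec.
    sent_sizes: Vec<usize>,
}

impl GrpcClient {
    fn send<M: ProtoMessage>(&mut self, msg: &M, wire: &mut Vec<u8>) {
        // This is the extra traversal the issue discusses: prost's `encode`
        // calls `encoded_len` again internally, so the message is walked twice.
        self.sent_sizes.push(msg.encoded_len());
        msg.encode(wire);
    }
}

/// Fake message whose encoding is just its payload bytes.
struct FakeCommand(Vec<u8>);

impl ProtoMessage for FakeCommand {
    fn encoded_len(&self) -> usize {
        self.0.len()
    }
    fn encode(&self, buf: &mut Vec<u8>) {
        buf.extend_from_slice(&self.0);
    }
}

fn main() {
    let mut client = GrpcClient { sent_sizes: Vec::new() };
    let mut wire = Vec::new();
    client.send(&FakeCommand(vec![0u8; 42]), &mut wire);
    assert_eq!(client.sent_sizes, vec![42]);
    assert_eq!(wire.len(), 42);
}
```

The appeal of this variant is that all changes stay inside our own client wrapper; no tonic internals need to be touched.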

2) Use tower_http::trace

This crate intercepts traffic at the HTTP layer. Because we use a gRPC message stream, we only initiate one single long-lived request. The sent body contents can be inspected with the `OnBodyChunk` callback, which makes it possible to count the total bytes sent.
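The counting logic of option (2) amounts to a callback that sees each body chunk and adds its length to a shared counter. The sketch below mocks away the tower-http layer machinery (in real code the closure would be registered via the trace layer's body-chunk hook and receive the chunk bytes); only the stdlib counting part is shown.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

fn main() {
    // Shared byte counter for one connection.
    let bytes_sent = Arc::new(AtomicU64::new(0));

    // Hypothetical stand-in for an on-body-chunk callback: in real code this
    // closure would be handed to the tower-http trace layer.
    let counter = Arc::clone(&bytes_sent);
    let on_body_chunk = move |chunk: &[u8]| {
        counter.fetch_add(chunk.len() as u64, Ordering::Relaxed);
    };

    // Simulate the single long-lived gRPC stream emitting two chunks.
    on_body_chunk(&[0u8; 128]);
    on_body_chunk(&[0u8; 64]);

    assert_eq!(bytes_sent.load(Ordering::Relaxed), 192);
    println!("bytes sent: {}", bytes_sent.load(Ordering::Relaxed));
}
```

Note the limitation this implies: at the HTTP body level we only see opaque chunks, so this approach yields total bytes per connection but no per-message-type breakdown.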

3) Don't use generated code for API calls

Since we have only one RPC, command_response_stream, in the service definition, the tonic-generated code is actually not very complicated. The wrapper around command_response_stream is 25 lines, and we could inline those. That would allow us to inject a custom Codec object.

lluki added the C-feature (Category: new feature or request) and A-COMPUTE (Topics related to the Compute layer) labels on Dec 1, 2022
vmarcos (Contributor) commented Dec 5, 2022

I think this issue dovetails with #16026, which is about exposing metrics for the compute controller. Note that PR #16009 only introduced monitoring for computed replicas, not for the corresponding state in the compute controller, so AFAICT nothing about the compute controller's internal state is currently exposed.

lluki (Contributor, Author) commented Dec 12, 2022

It seems to me there are two different categories: per-instance metrics and per-replica metrics. For response size/type I'd like to collect those per replica, so that we can identify slow/misbehaving replicas.

For the ones outlined in #16026, I'm pretty sure per-instance metrics suffice:

  • Raw length of the compute command history per instance.
  • Dataflow count in the compute command history per instance.
  • Number of collections in the collection metadata map per instance.

The command size/type distribution is very likely the same across the replicas of an instance, even though we specialize the commands a bit per replica. Still, it might make sense to collect those per replica as well.
