
How to plot latency and requests per second with opentelemetry's Histogram type? (Kind: Cumulative) #528

Closed
liufuyang opened this issue Nov 4, 2022 · 16 comments
Labels: priority: p1, question (Further information is requested)

Comments

@liufuyang

liufuyang commented Nov 4, 2022

As you may know, a change was recently merged on the opentelemetry-go-contrib side to start reporting rpc.server.duration with an instrument created as

`c.meter.SyncInt64().Histogram("rpc.server.duration", instrument.WithUnit(unit.Milliseconds))`

On our backend, we have a similar implementation. But when the data is exported to Google Cloud Monitoring, we cannot seem to find a good way to plot a latency graph.

The generated metric has Kind: CUMULATIVE, as shown in picture 1 below, while a built-in Google Cloud Run latency graph has data with Kind: DELTA, as shown in picture 2.

So is it expected that the kind is CUMULATIVE when GoogleCloudPlatform/opentelemetry-operations-go is used? And if so, how can I plot a latency graph in Google Cloud Monitoring?

Thank you :)


Extra info:

I am not sure what the aligner really means here, but when I choose our metric exported from this package, there is only a single option, delta, to choose.

@dashpole added the "question (Further information is requested)" and "priority: p1" labels on Nov 4, 2022
@damemi
Member

damemi commented Nov 4, 2022

It looks like delta aggregation for int64 values isn't permitted for custom metrics: https://cloud.google.com/monitoring/api/v3/kinds-and-types#kind-type-combos. @dashpole, do you have any context on that?

@dashpole
Contributor

dashpole commented Nov 4, 2022

> @dashpole do you have any context on that?

I don't.

@liufuyang what options are available for aggregation?

@liufuyang
Author

liufuyang commented Nov 4, 2022

Do you mean which options I see in the aggregator field?


Also, I noticed the UI looks different when I try to draw a graph on the Dashboard, even though the same metric is selected and it is on the same "advanced" tab.

@liufuyang
Author

Hey there, sorry for the inconvenience. I think we fixed our issue by updating to the newest version: einride/cloudrunner-go#340

Thank you for the help; I will close this for now. We can reopen it if we find other problems related to using the opentelemetry exporter with Histogram metrics.

@dashpole
Contributor

dashpole commented Nov 4, 2022

Ah, glad to hear that resolved things.

@liufuyang
Author

Thanks. By the way, since you are on top of this now, do you know how to use MQL to draw or derive the request rate from the CUMULATIVE duration Histogram data?

I know that in PromQL something like this would do it:

rate(workload_googleapis_com:rpc_server_duration_count{monitored_resource="generic_task"}[1m])

But on the MQL side, I am not sure how to do it. Thank you :)

@dashpole
Contributor

dashpole commented Nov 4, 2022

Try the count aggregator?

@liufuyang
Author

Aha, nice thank you very much :D

@liufuyang
Author

Hmmm... it does not seem to work when I use the UI tool to set the aggregator to count? 🤔

[image]

@dashpole
Contributor

dashpole commented Nov 4, 2022

Based on https://cloud.google.com/monitoring/charts/charting-distribution-metrics, it seems like maybe sum is what you want (though I would've expected sum to be the total time taken by requests). I may be mistaken.

Alternatively, you can use PromQL to query these metrics if you want: https://cloud.google.com/stackdriver/docs/managed-prometheus/promql
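
For the latency question in the issue title, the MQL documentation on distribution metrics suggests something along these lines should chart a percentile. This is only a sketch I haven't verified against your metric; it assumes the percentile aggregator accepts distribution values, and it reuses the metric and label names from your earlier queries:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| align delta(1m)
| every 1m
# Take the 99th percentile per rpc_service.
# Assumption: percentile() accepts distribution values, per the MQL distribution examples.
| group_by [metric.rpc_service],
    [p99_latency_ms: percentile(val(), 99)]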

@dashpole
Contributor

dashpole commented Nov 4, 2022

(But sum doesn't seem to do what I want either.)

@dashpole
Contributor

dashpole commented Nov 4, 2022

Actually, I think I found it. count_from seems to give the number of events in the distribution.

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| every 1m

Runs for me

@liufuyang
Author

Aha, thank you very much. I did it like that and it indeed works:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
| group_by [metric.rpc_service, metric.rpc_grpc_code, resource.location],
    [value_duration_aggregate: aggregate(value_duration_count_from)]
| every 1m 

It plots the same graph (right) compared with one counted from our other custom metric rpc_count (left), which is requests per minute:
[image]

@liufuyang
Author

@dashpole Sorry to bother you again; I think I need one last bit of help here so we can use these metrics nicely in production. The question I have is how to plot the ratio between two groups' request rates.

As shown above, by using count_from and rate we can view the request rate; now we would very much like to plot the error ratio.

I've tried it like this:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
| filter_ratio_by [metric.rpc_service, resource.location], metric.rpc_grpc_code != 'OK'
| group_by sliding(5m), sum(val())
| condition val() > .05 '10^2.%'

But it gives a rather wrong-looking graph. What we need is basically: during a time window, say 5 minutes, what percentage of the requests have an rpc_grpc_code other than OK?

It would be much appreciated if you could give us a hand with this. I've tried to read the docs but could not understand MQL well, and I also asked on Stack Overflow, but not many people know the answer, I'm afraid.

Thank you in advance.

@dashpole
Contributor

dashpole commented Nov 21, 2022

@liufuyang I'm quite a bit out of my MQL depth, but I think you might want to do your group_by before you do your filter_ratio. If all of your ratios are 10% but you have 5 streams, sum(val()) will output 50%. If you switch the order, you are summing the rates and errors (e.g. 10 req/sec + 10 req/sec and 1 err/sec + 1 err/sec = 20 req/sec and 2 err/sec) first, and then computing a ratio.

When I tried your query on the rtt metric above:

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| filter_ratio_by [resource.instance_id], metric.remote_zone != 'us-central1-a'
| group_by sliding(5m), sum(val())

It gave me a graph with values between 0 and 5.

But if I changed it to:

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| group_by sliding(5m), sum(val())
| filter_ratio_by [resource.instance_id], metric.remote_zone != 'us-central1-a'

It gave a graph with values between 0 and 1, which is what I expected to see from a ratio.
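
I haven't run this against your project, but plugging the names from your query into that reordered form would presumably look like:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
# Same pipeline as my rtt example above, with the metric and labels from your earlier comments (untested on my end).
| group_by sliding(5m), sum(val())
| filter_ratio_by [metric.rpc_service, resource.location], metric.rpc_grpc_code != 'OK'

with your condition line added back at the end if you need the alerting threshold.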

@liufuyang
Author

Aha, thank you so much @dashpole, switching the group_by and filter_ratio_by indeed gives us correct-looking results 👍

I really appreciate your help on this 🙏
