feat: add eventloop utilization default metric #518

ivan-tymoshenko · 2022-10-04T13:02:58Z

Close #506

ivan-tymoshenko · 2022-10-04T13:55:23Z

@Eomm, I found you listed as a contributor to fastify-elu-scaler. If you have some time and are interested, PTAL.

mcollina

lgtm

ivan-tymoshenko · 2022-10-05T10:59:11Z

To measure ELU, you need to take a first utilization reading, wait for some timeout, take a second reading, and then compute the difference between the two. There are at least two ways to implement this.

1. The implementation you can see in this PR. It's pretty straightforward: when you call the collect method, it starts the measurement, waits for a timeout, and returns the result. It is an async metric, which means you have to wait for the timeout to get the result, and because all other default metrics are sync, it might be confusing if you set up a big timeout.
2. You can find this implementation here. It samples the ELU metric continuously in a setInterval, which means the last measurement can be read synchronously. But it adds side-effect behavior (a running timer) and makes observations without an explicit call to the collect function.
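For readers comparing the two options, here is a minimal sketch of both, using only `performance.eventLoopUtilization()` from Node's `perf_hooks`. The helper names `measureEluOnce` and `startEluSampler` are hypothetical and are not the code in this PR or in the linked implementation.

```javascript
const { performance } = require('node:perf_hooks');

// Option 1 (on demand): take a reading, wait, then ask for the delta.
// The caller has to wait `timeoutMs` before a value is available.
function measureEluOnce(timeoutMs) {
  return new Promise(resolve => {
    const start = performance.eventLoopUtilization();
    setTimeout(() => {
      // Passing the earlier reading makes `utilization` cover only the window.
      resolve(performance.eventLoopUtilization(start).utilization);
    }, timeoutMs);
  });
}

// Option 2 (continuous): sample in a setInterval and keep the last value,
// so it can be read synchronously at any time.
function startEluSampler(intervalMs) {
  let previous = performance.eventLoopUtilization();
  let lastUtilization = 0; // no full window has been measured yet
  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    lastUtilization = performance.eventLoopUtilization(current, previous).utilization;
    previous = current;
  }, intervalMs);
  timer.unref(); // the sampler alone should not keep the process alive
  return {
    get: () => lastUtilization,
    stop: () => clearInterval(timer),
  };
}
```

With option 1 a collector would `await measureEluOnce(100)`; with option 2 it would call `sampler.get()` and pay no waiting cost, at the price of a timer that runs between scrapes.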

I don't know which way is more acceptable from the user's perspective.

zbjornson · 2022-10-05T16:43:07Z

Thank you both for the PR and review.

@trevnorris you wrote a nice blog post on this (and I think did a lot of the impl?), any recommendation whether we should use setInterval to measure continuously or setTimeout to only measure when probed? One possible difference is that on-probe will be biased by other prom-client collectors.

trevnorris · 2022-10-10T18:26:30Z

@zbjornson I'd recommend using a setInterval. The ELU is meant to be recorded as regular time series data, the same as CPU.

ivan-tymoshenko · 2022-10-11T09:41:06Z

@zbjornson Does the summary's sliding window apply to the sum and count values? It looks like it doesn't. How can I compute the average value over some time period?
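One possible answer, sketched under the assumption that the summary's sum and count are cumulative (which is what the question above suggests): the average over a period is the change in sum divided by the change in count between two scrapes. The helper name and the scrape objects below are hypothetical.

```javascript
// Hypothetical helper: average of the values observed between two scrapes,
// assuming `sum` and `count` are cumulative and never reset in between.
function averageBetweenScrapes(prev, curr) {
  const deltaCount = curr.count - prev.count;
  if (deltaCount <= 0) return NaN; // nothing observed, or the counters were reset
  return (curr.sum - prev.sum) / deltaCount;
}

// Example: 10 new observations whose values add up to 3.
const earlier = { sum: 2, count: 10 };
const later = { sum: 5, count: 20 };
console.log(averageBetweenScrapes(earlier, later)); // 0.3
```

On the Prometheus side the same idea is usually written as the rate of the `_sum` series divided by the rate of the `_count` series over the chosen range.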

zeldrinn · 2022-12-05T19:33:33Z

@zbjornson what's the status on getting this merged? thanks!

johnytiago · 2023-10-16T11:48:16Z

Hey @zbjornson @ivan-tymoshenko is this PR blocked? Anything we can do to help get this through? 🙏🏽

kibertoad · 2024-01-02T14:40:30Z

@glensc @SimenB Is there anything we can do to help push this over the finish line?

ivan-tymoshenko · 2024-01-02T17:51:02Z

I don't remember why it was blocked. I will take a look this week.

ivan-tymoshenko · 2024-01-10T13:03:51Z

@trevnorris @glensc @SimenB @johnytiago Please take a look and see whether this implementation makes sense and works for you.

simon-paris · 2024-01-12T05:56:23Z

Hey, looking forward to seeing this merged! 🚀

I've been running my own implementation of this metric, using an interval that updates a gauge, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

Also, I think this code (the loop quoted at the end of this comment) could give incorrect data:

It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.
It could mistakenly report a small utilization value multiple times if a long blocking event happens near the end of an interval period. E.g. if it's idle for 99ms, then blocks for 101ms, you'd report [0.505, 0.505] when you should report [0.01, 1]

I'd appreciate it if you could also expose a gauge version of this metric as well as the histogram, like how eventLoopLag does.

const blockedIntervalsNumber = Math.round(timeMs / intervalTimeout);
for (let i = 0; i < blockedIntervalsNumber; i++) {
	summary.observe(value);
	histogram.observe(value);
}
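The second concern above can be checked with plain arithmetic. The sketch below is not the PR's code: `repeatedObservations` is a hypothetical stand-in for the quoted loop, and `perIntervalUtilization` assumes the idle stretch comes first, followed by the blocking stretch.

```javascript
// Hypothetical stand-in for the quoted loop: one utilization value is
// repeated once per interval period that the elapsed time rounds to.
function repeatedObservations(idleMs, busyMs, intervalMs) {
  const timeMs = idleMs + busyMs;
  const value = busyMs / timeMs;
  const count = Math.round(timeMs / intervalMs);
  return Array(count).fill(value);
}

// What each fixed-length interval would report if measured on its own,
// assuming the loop is idle for `idleMs` and then blocked for `busyMs`.
function perIntervalUtilization(idleMs, busyMs, intervalMs) {
  const totalMs = idleMs + busyMs;
  const result = [];
  for (let start = 0; start < totalMs; start += intervalMs) {
    const end = Math.min(start + intervalMs, totalMs);
    const busy = Math.max(0, end - Math.max(start, idleMs));
    result.push(busy / (end - start));
  }
  return result;
}

console.log(repeatedObservations(99, 101, 100));   // [ 0.505, 0.505 ]
console.log(perIntervalUtilization(99, 101, 100)); // [ 0.01, 1 ]
```

The first concern follows from `Math.round` as well: a steady ratio of 1.4 always rounds to 1 observation and a steady ratio of 1.6 always rounds to 2.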

ivan-tymoshenko · 2024-01-12T13:12:41Z

> I've been running my own implementation of this metric, using an interval that updates a gauge, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

I don't have much experience measuring ELU, but I can see in articles and examples that people use a relatively small timeout for measuring it, around 50-100ms. Maybe @mcollina @trevnorris can help here.
https://nodesource.com/blog/event-loop-utilization-nodejs/
https://github.com/nearform/fastify-elu-scaler/blob/42efb4bf84ed1ffe7373e5f0d1ede7c92c2a7683/plugins/elu.js#L26

> It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.

The only possible situation in which timeMs > intervalTimeout is when the event loop is completely blocked for timeMs and timeMs is bigger than intervalTimeout. In this case the ELU metric equals 1. Of course, rounding this value is an approximation, but I simply don't see a better way to cover this case.

> I'd appreciate it if you could also expose a gauge version of this metric as well as the histogram, like how eventLoopLag does.

The main question for me here is when we should start and stop measuring ELU in this case.

simon-paris · 2024-01-15T05:37:32Z

> I don't have much experience measuring ELU, but I can see in articles and examples that people use a relatively small timeout for measuring it, around 50-100ms. Maybe @mcollina @trevnorris can help here.

You're right, I tried it out at 100ms and it looks good.

> The only possible situation in which timeMs > intervalTimeout is when the event loop is completely blocked for timeMs and timeMs is bigger than intervalTimeout. In this case the ELU metric equals 1.

Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback; it doesn't happen when it's in an I/O callback.

setInterval(() => {
  setTimeout(() => {
    const t1 = Date.now();
    while (Date.now() < t1 + 110) {}
  }, 50);
}, 100);

ivan-tymoshenko · 2024-01-15T11:53:21Z

> Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback; it doesn't happen when it's in an I/O callback.

I understand. If you have a suggestion for measuring ELU more accurately, it would be welcome.

kibertoad · 2024-04-02T07:34:28Z

@ivan-tymoshenko is this mostly complete and just needs conflicts resolved, or is something still missing?

trevnorris · 2024-04-19T23:51:15Z

@ivan-tymoshenko Just want to make sure it's understood, ELU as measured by Node isn't approximated. I patched libuv so it tracks it down to the system call. The interval length you choose to get the ELU won't change what it returns. It's always precise.
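To illustrate this point, here is a small sketch with made-up readings. It assumes ELU readings carry cumulative `idle` and `active` times in milliseconds, as the objects returned by `performance.eventLoopUtilization()` do. The helper `utilizationBetween` is hypothetical and only mirrors the delta arithmetic.

```javascript
// Hypothetical helper: utilization over the window between two readings,
// where each reading holds cumulative idle and active time in milliseconds.
function utilizationBetween(earlier, later) {
  const idle = later.idle - earlier.idle;
  const active = later.active - earlier.active;
  return active / (idle + active);
}

// Made-up cumulative readings taken at three points in time.
const a = { idle: 0, active: 0 };
const b = { idle: 30, active: 20 };
const c = { idle: 40, active: 60 };

console.log(utilizationBetween(a, b)); // 0.4 over the first 50ms
console.log(utilizationBetween(b, c)); // 0.8 over the next 50ms
console.log(utilizationBetween(a, c)); // 0.6 over the whole 100ms
```

In this sketch the interval only selects which stretch of time the exact idle and active totals are divided over. A longer window gives the time-weighted average of the shorter ones and no less precise a number.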

feat: add eventloop utilization default metric

f9243e1

ivan-tymoshenko mentioned this pull request Oct 4, 2022

feat(db): add eventloop utilization to the metrics platformatic/platformatic#80

Merged

mcollina approved these changes Oct 5, 2022

ivan-tymoshenko added 2 commits October 5, 2022 20:16

docs: add eventloop utilization to CHANGELOG.md

0f099d0

test: skip eventloop utilization test for nodejs v10

1806a70

ivan-tymoshenko added 3 commits January 6, 2024 19:23

Merge branch 'master' into add-event-loop-utilization-metric

91be71c

feat: add elu metric to the default metric list

b619eff

feat: use histogram and summary for elu calculation

f4fac17
