Implement some means to figure out where Zeebe's time is spent #9282

pihme · 2022-05-04T07:26:32Z

Is your feature request related to a problem? Please describe.
While investigating #8991 I wanted to figure out where Zeebe spends its time. I didn't really find an efficient way to do it. I added some System.out statements to get at least some insight. From those I could see that Zeebe was busy for some time and then did not process anything for about 20 seconds. I couldn't figure out what it was doing during that time.

I tried using VisualVM and its sampler, but the information I got from it was not helpful.

Describe the solution you'd like

Either some sort of monitoring built into the ActorScheduler that would log tasks that take longer than half a second to run,
Or a histogram of which tasks are run how often and take how much time,
Or a documented way to profile the application and how to read the profiling result
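To make the first option concrete, here is a minimal sketch of a wrapper that times each task and reports the ones that exceed a threshold. Everything in it (`SlowTaskMonitor`, `monitored`, the injectable clock and warning sink) is a hypothetical name chosen for illustration; it is not part of the ActorScheduler, and a real integration would have to hook into wherever the scheduler actually executes tasks.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration; it is not part of Zeebe's ActorScheduler. */
final class SlowTaskMonitor {
  private final Duration threshold;
  private final LongSupplier nanoClock; // injectable so that tests can fake time
  private final Consumer<String> warningSink; // e.g. a logger's warn method

  SlowTaskMonitor(Duration threshold, LongSupplier nanoClock, Consumer<String> warningSink) {
    this.threshold = threshold;
    this.nanoClock = nanoClock;
    this.warningSink = warningSink;
  }

  /** Wraps a task so that its run time is measured on every execution. */
  Runnable monitored(String taskName, Runnable task) {
    return () -> {
      long start = nanoClock.getAsLong();
      try {
        task.run();
      } finally {
        // Report even if the task threw, since the time was spent either way.
        long elapsedNanos = nanoClock.getAsLong() - start;
        if (elapsedNanos > threshold.toNanos()) {
          warningSink.accept(
              String.format(
                  "Task %s took %d ms (threshold: %d ms)",
                  taskName, elapsedNanos / 1_000_000, threshold.toMillis()));
        }
      }
    };
  }
}
```

With real time this could be wired up as `new SlowTaskMonitor(Duration.ofMillis(500), System::nanoTime, System.err::println)`. Note that this only measures wall-clock time per task; it would not by itself explain a gap in which no task runs at all.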

Zelldon · 2022-05-04T08:36:12Z

During an investigation I once added some metrics for job rate and execution time, for example see here #8551 (comment)

Would this help you? I still have a branch that contains the code: https://github.com/camunda/zeebe/commits/zell-execution-metrics

Furthermore, regarding execution time, I think flamegraphs would be useful to you. There is a short description at https://github.com/camunda/zeebe/blob/main/benchmarks/docs/debug/README.md#profiling (it might be a bit outdated, not sure). The idea is to use async-profiler to profile the Broker; you can then see in a nice way where CPU time is spent, and it is possible to break this down by thread. Is this something you would like to look at?

pihme · 2022-05-04T09:35:26Z

The metrics not so much. I am not looking for details on jobs, I am more looking for details on tasks submitted to the actor scheduler.

Yes, the flamegraphs would be helpful. I wish we had something more interactive though. I worked with JProfiler in the past, which also had its flaws but was a little more user-friendly.

oleschoenburg · 2022-05-04T09:43:30Z

Maybe the IntelliJ profiling tools would be a good option for you? It's at least a little bit more interactive than running the async profiler manually.

pihme · 2022-05-04T10:40:59Z

I was asked to elaborate a little bit more on the solution I would like. This is also to be seen as input for the requirements of the actor scheduler of the future (#9142).

I don't want to make it too prescriptive, but mainly want to flesh out my ideas:

I am assuming that we will have entities like actors and tasks, that both have names and maybe a place to add annotations.
Then I envision an annotation like MaxDuration on task level. Whenever a task is scheduled with such an annotation, the execution is monitored. If the task takes more time than MaxDuration, a warning is logged. This way we could assign time budgets to tasks and tighten the screws over time.
I also envision metrics for task executions. Whenever a task gets executed, we measure the time it takes and publish it to Grafana. Then we could hopefully display aggregations like the following:

Actor	Task	Invocations	Min	Avg	Max	95%	Total
AdminApiRequestHandler	handleRequest	3	300 ms	400 ms	500 ms	430 ms	2 s
StreamProcessor	replayNextEvent	500	5 ms	10 ms	80 ms	25 ms	200 s
StreamProcessor	readNextRecord	1500	12 ms	23 ms	145 ms	50 ms	400 s
...

This way we would see which actors and tasks consume how much time.
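The two ideas above (a MaxDuration budget per task and per-task timing aggregation) can be sketched together in plain Java. This is only an illustration under stated assumptions: `MaxDuration` with a `millis` attribute, `MeasuredTaskRunner`, and the reflective way of invoking tasks are all hypothetical and do not mirror the current or future actor scheduler API. The sketch aggregates in memory and uses a nearest-rank percentile; a real implementation would more likely publish the measurements as metrics and let the dashboard compute the aggregations.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongSupplier;

/** Hypothetical annotation: time budget for a single task execution. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface MaxDuration {
  long millis();
}

/** Hypothetical runner for illustration; it does not mirror Zeebe's ActorScheduler API. */
final class MeasuredTaskRunner {
  private final LongSupplier nanoClock; // injectable so that tests can fake time
  private final Map<String, List<Long>> durationsMs = new TreeMap<>();
  private final List<String> warnings = new ArrayList<>();

  MeasuredTaskRunner(LongSupplier nanoClock) {
    this.nanoClock = nanoClock;
  }

  /** Invokes the no-arg method taskName on actor and records how long it took. */
  void run(Object actor, String taskName) {
    String actorName = actor.getClass().getSimpleName();
    try {
      Method task = actor.getClass().getDeclaredMethod(taskName);
      task.setAccessible(true);
      long start = nanoClock.getAsLong();
      task.invoke(actor);
      long elapsedMs = (nanoClock.getAsLong() - start) / 1_000_000;
      durationsMs.computeIfAbsent(actorName + "\t" + taskName, k -> new ArrayList<>()).add(elapsedMs);
      // Only tasks that carry the annotation have a budget to check.
      MaxDuration budget = task.getAnnotation(MaxDuration.class);
      if (budget != null && elapsedMs > budget.millis()) {
        warnings.add(
            String.format(
                "%s.%s took %d ms, budget is %d ms",
                actorName, taskName, elapsedMs, budget.millis()));
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot run task " + taskName, e);
    }
  }

  List<String> warnings() {
    return List.copyOf(warnings);
  }

  /** Nearest-rank percentile over an ascending list; percent is in 1..100. */
  static long percentile(List<Long> sorted, int percent) {
    int rank = (percent * sorted.size() + 99) / 100; // integer ceiling
    return sorted.get(Math.max(rank, 1) - 1);
  }

  /** Tab-separated report using the column layout of the table above. */
  String report() {
    StringBuilder out =
        new StringBuilder("Actor\tTask\tInvocations\tMin\tAvg\tMax\t95%\tTotal\n");
    for (Map.Entry<String, List<Long>> entry : durationsMs.entrySet()) {
      List<Long> sorted = new ArrayList<>(entry.getValue());
      Collections.sort(sorted);
      long total = sorted.stream().mapToLong(Long::longValue).sum();
      out.append(
          String.format(
              "%s\t%d\t%d ms\t%d ms\t%d ms\t%d ms\t%d ms\n",
              entry.getKey(), sorted.size(), sorted.get(0), total / sorted.size(),
              sorted.get(sorted.size() - 1), percentile(sorted, 95), total));
    }
    return out.toString();
  }
}
```

A task would opt in with something like `@MaxDuration(millis = 100) void readNextRecord() { ... }` and be executed via `runner.run(actor, "readNextRecord")`. Reflection keeps the sketch short; a scheduler that already knows its actors and tasks by name would not need it.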

9294: Add actor metrics r=Zelldon a=Zelldon

## Description

As discussed here https://camunda.slack.com/archives/C037RS2JHB8/p1651668160788749 add new actor metrics but no new panels for now.

Details:

- Add counter for actorTask execution
- Add histogram to observe actorTask execution

Currently starting a benchmark to verify whether metrics are exported as expected. I will create a separate PR for the atomix executors.

`@npepinpe` I'm not sure whether it fulfills all requirements for #9282 I will remove my assignment then.

## Related issues

related #9282

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>

Zelldon · 2022-05-06T04:52:57Z

Actor metrics have been added via #9294, as discussed here: https://camunda.slack.com/archives/C037RS2JHB8/p1651668608010019?thread_ts=1651668160.788749&cid=C037RS2JHB8 Whoever wants to use them can add the dashboard panels according to their needs.

I will remove my assignment for now.

Zelldon · 2022-05-24T19:27:25Z

We could invest a bit in open tracing and hide that behind a feature flag. We could then also enable it in our benchmarks, to learn a bit more about the system.

Related slack thread https://camunda.slack.com/archives/C032560A9GE/p1653025715815249

Jon mentioned that there is some good support from GKE, which might be worth checking: https://cloud.google.com/trace/docs/setup/java-ot

aivinog1 · 2022-06-29T06:05:21Z

Hey @Zelldon! We are thrilled to see Open Telemetry support (since open tracing is deprecated) because our business processes strongly depend on the latency of Zeebe.

But since we are running on bare metal, we would like the Open Telemetry support to be cloud- and environment-agnostic. If this is okay with you, I can start experimenting with this :) But I think that it is worth creating a separate issue for it.

aivinog1 · 2022-07-07T05:36:55Z

Hey @Zelldon! This is a kind reminder about the previous message ⬆️ :) Thanks :)

Zelldon · 2022-07-07T19:03:25Z

Hey @aivinog1, sorry, I was not sure whether this was a question 😅 Sure, go ahead and experiment 🤷 I know that we also want to experiment with it. I also started a bit with Google Cloud and Open Telemetry, but it was not that fruitful.

korthout · 2023-08-11T12:13:23Z

@rodrigo-lourenco-lopes The Actor metrics dashboard panel added in #12548 doesn't seem to show any data for our own cluster (Zeebe Team Engineering Automation).

Is there anything we still need to do to have this available for SaaS?

pihme added the kind/feature and team/distributed labels May 4, 2022

pihme mentioned this issue May 4, 2022

Partitions gets unhealthy if many timers are scheduled #8991

Open

Zelldon referenced this issue May 4, 2022

add actor metrics

82d618e

npepinpe assigned Zelldon May 4, 2022

npepinpe added the area/observability label May 4, 2022

npepinpe added this to the 8.1 milestone May 4, 2022

Zelldon mentioned this issue May 5, 2022

Add actor metrics #9294

Merged

10 tasks

oleschoenburg mentioned this issue May 5, 2022

Overview of ActorScheduler limitations #9183

Closed

Zelldon removed their assignment May 6, 2022

remcowesterhoud added the version:8.1.0-alpha2 label Jun 7, 2022

aivinog1 mentioned this issue Jul 8, 2022

As a Zeebe user, I want to see Open Telemetry support for Zeebe #9742

Open

1 task

menski removed the team/distributed label Jul 11, 2022

Zelldon added the version:8.1.0 label Oct 4, 2022

Zelldon added component/engine component/stream-platform component/raft labels Jan 5, 2023

romansmirnov added the component/zeebe label Mar 5, 2024
