Implement some means to figure out where Zeebe's time is spent #9282

Open
pihme opened this issue May 4, 2022 · 10 comments
Labels
area/observability, component/engine, component/raft, component/stream-platform, component/zeebe, kind/feature, version:8.1.0-alpha2, version:8.1.0
Comments

@pihme
Contributor

pihme commented May 4, 2022

Is your feature request related to a problem? Please describe.
While investigating #8991 I wanted to figure out where Zeebe spends its time. I didn't really find an efficient way to do it. I added some system outs to get at least some insights. From those I could see that Zeebe was busy for some time, then not processing anything for about 20 seconds. I didn't figure out what it did.

I tried using VisualVM and its sampler, but the information I got from it was not helpful:
[image: VisualVM sampler output]

Describe the solution you'd like

  • Either some sort of monitoring built into the ActorScheduler that would log tasks that take longer than half a second to run,
  • Or a histogram of which tasks are run how often and take how much time,
  • Or a documented way to profile the application and how to read the profiling result
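
The first option above (logging tasks that exceed a time budget) could look roughly like the following. This is only a minimal sketch and not the ActorScheduler's actual API: the class `SlowTaskMonitor`, its method `monitored`, and the injected clock and callback are hypothetical names introduced for illustration.

```java
import java.time.Duration;
import java.util.function.BiConsumer;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of Zeebe): wraps a task and reports it if it runs too long. */
final class SlowTaskMonitor {
  private final Duration threshold;
  private final LongSupplier nanoClock;
  private final BiConsumer<String, Duration> onSlowTask;

  SlowTaskMonitor(
      final Duration threshold,
      final LongSupplier nanoClock,
      final BiConsumer<String, Duration> onSlowTask) {
    this.threshold = threshold;
    this.nanoClock = nanoClock;
    this.onSlowTask = onSlowTask;
  }

  /** Default wiring: System.nanoTime as the clock, warnings printed to stderr. */
  static SlowTaskMonitor withDefaults(final Duration threshold) {
    return new SlowTaskMonitor(
        threshold,
        System::nanoTime,
        (name, elapsed) ->
            System.err.printf(
                "WARN task '%s' took %d ms (threshold %d ms)%n",
                name, elapsed.toMillis(), threshold.toMillis()));
  }

  /** Returns a Runnable that runs the task and invokes the callback if it exceeded the threshold. */
  Runnable monitored(final String taskName, final Runnable task) {
    return () -> {
      final long start = nanoClock.getAsLong();
      try {
        task.run();
      } finally {
        // Measure even when the task throws, so failing slow tasks are reported too.
        final Duration elapsed = Duration.ofNanos(nanoClock.getAsLong() - start);
        if (elapsed.compareTo(threshold) > 0) {
          onSlowTask.accept(taskName, elapsed);
        }
      }
    };
  }
}
```

With a threshold of half a second, `SlowTaskMonitor.withDefaults(Duration.ofMillis(500)).monitored("someTask", task)` would wrap a task before it is handed to whatever executes it. Injecting the clock and the callback keeps the sketch testable without sleeping.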
@pihme pihme added the kind/feature and team/distributed labels May 4, 2022
@Zelldon
Member

Zelldon commented May 4, 2022

During an earlier investigation I added some metrics for job rate and execution time; for an example, see #8551 (comment).

Would this help you? I still have a branch that contains the code: https://github.com/camunda/zeebe/commits/zell-execution-metrics

Furthermore, regarding execution time, I think flamegraphs would be useful to you. There is a short description here: https://github.com/camunda/zeebe/blob/main/benchmarks/docs/debug/README.md#profiling (it might be a bit outdated, not sure). The idea is to use async-profiler to profile the Broker; you can then see where CPU time is spent, and it is possible to break this down by thread. Is this something you would like to look at?

@pihme
Contributor Author

pihme commented May 4, 2022

The metrics, not so much. I am not looking for details on jobs; I am looking for details on tasks submitted to the actor scheduler.

Yes, the flamegraphs would be helpful. I wish we had something more interactive, though. I worked with JProfiler in the past, which also had its flaws but was a little more user-friendly.

@oleschoenburg
Member

Maybe the IntelliJ profiling tools would be a good option for you? It's at least a little bit more interactive than running the async profiler manually.

@pihme
Contributor Author

pihme commented May 4, 2022

I was asked to elaborate a little more on the solution I would like. This should also be seen as input for the requirements of the actor scheduler of the future (#9142).

I don't want to be too prescriptive; I mainly want to flesh out my ideas:

  • I am assuming that we will have entities like actors and tasks, both of which have names and maybe a place to add annotations
  • I then envision an annotation like MaxDuration at the task level. Whenever a task is scheduled with such an annotation, its execution is monitored. If the task takes longer than MaxDuration, a warning is logged. This way we could assign time budgets to tasks and tighten the screws over time
  • I also envision metrics for task executions. Whenever a task is executed, we measure the time it takes and publish it to Grafana. Then we could hopefully display aggregations like the following:

| Actor | Task | Invocations | Min | Avg | Max | 95% | Total |
|---|---|---|---|---|---|---|---|
| AdminApiRequestHandler | handleRequest | 3 | 300 ms | 400 ms | 500 ms | 430 ms | 2 s |
| StreamProcessor | replayNextEvent | 500 | 5 ms | 10 ms | 80 ms | 25 ms | 200 s |
| StreamProcessor | readNextRecord | 1500 | 12 ms | 23 ms | 145 ms | 50 ms | 400 s |
...

This way we would see which actors and tasks consume how much time.
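
To make the aggregation idea concrete, here is a small sketch of how per-actor, per-task timings could be collected into the columns above. It uses only the JDK; the class `ActorTaskStats` and its methods are hypothetical names for illustration and say nothing about how Zeebe or Grafana would actually do this. The 95% column is left out because `LongSummaryStatistics` does not track percentiles, and the class is not thread-safe as written.

```java
import java.util.Locale;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregator (not part of Zeebe): collects task durations per actor and task. */
final class ActorTaskStats {
  // Keyed by "actor.task"; a TreeMap keeps the report sorted by name.
  private final Map<String, LongSummaryStatistics> statsByKey = new TreeMap<>();

  /** Records one execution of the given task, with its duration in milliseconds. */
  void record(final String actor, final String task, final long durationMillis) {
    statsByKey
        .computeIfAbsent(actor + "." + task, key -> new LongSummaryStatistics())
        .accept(durationMillis);
  }

  /** Returns the statistics for a task, or empty statistics if it never ran. */
  LongSummaryStatistics statsFor(final String actor, final String task) {
    return statsByKey.getOrDefault(actor + "." + task, new LongSummaryStatistics());
  }

  /** Renders invocations, min, avg, max and total per task as a plain-text table. */
  String report() {
    final StringBuilder sb = new StringBuilder();
    sb.append(
        String.format(
            Locale.ROOT,
            "%-40s %11s %8s %8s %8s %10s%n",
            "Actor.Task", "Invocations", "Min ms", "Avg ms", "Max ms", "Total ms"));
    statsByKey.forEach(
        (key, s) ->
            sb.append(
                String.format(
                    Locale.ROOT,
                    "%-40s %11d %8d %8.1f %8d %10d%n",
                    key, s.getCount(), s.getMin(), s.getAverage(), s.getMax(), s.getSum())));
    return sb.toString();
  }
}
```

Calling `record("StreamProcessor", "readNextRecord", elapsedMillis)` after each execution and printing `report()` periodically would give a rough view of which actors and tasks consume the most time. A real implementation would more likely publish a histogram to the metrics system, which also covers percentiles.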

@npepinpe npepinpe added the area/observability label May 4, 2022
@npepinpe npepinpe added this to the 8.1 milestone May 4, 2022
@Zelldon Zelldon mentioned this issue May 5, 2022
10 tasks
zeebe-bors-camunda bot added a commit that referenced this issue May 5, 2022
9294: Add actor metrics r=Zelldon a=Zelldon

## Description

As discussed here https://camunda.slack.com/archives/C037RS2JHB8/p1651668160788749, this adds new actor metrics but no new panels for now.

Details:

- Add counter for actorTask execution
- Add histogram to observe actorTask execution

Currently starting a benchmark to verify whether metrics are exported as expected.

I will create a separate PR for the atomix executors.

`@npepinpe` I'm not sure whether it fulfills all requirements for #9282. I will remove my assignment then.

## Related issues

related #9282 



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@Zelldon
Member

Zelldon commented May 6, 2022

Actor metrics have been added via #9294, as discussed here: https://camunda.slack.com/archives/C037RS2JHB8/p1651668608010019?thread_ts=1651668160.788749&cid=C037RS2JHB8 Whoever wants to use them will add the dashboard panels according to their needs.

I will remove my assignment for now.

@Zelldon Zelldon removed their assignment May 6, 2022
@Zelldon
Member

Zelldon commented May 24, 2022

We could invest a bit in open tracing and hide that behind a feature flag. We could then also enable it in our benchmarks, to learn a bit more about the system.

Related slack thread https://camunda.slack.com/archives/C032560A9GE/p1653025715815249

Jon mentioned that there is some good support from GKE; it might be worth checking https://cloud.google.com/trace/docs/setup/java-ot

@aivinog1
Contributor

Hey @Zelldon! We would be thrilled to see OpenTelemetry support (since OpenTracing is deprecated), because our business processes depend strongly on Zeebe's latency.

However, since we run on bare metal, we would like the OpenTelemetry support to be cloud- and environment-agnostic. If this is okay with you, I can start experimenting with this :) I think it is worth creating a separate issue for it, though.

@aivinog1
Contributor

aivinog1 commented Jul 7, 2022

Hey @Zelldon! This is a kind reminder about the previous message ⬆️ :) Thanks :)

@Zelldon
Member

Zelldon commented Jul 7, 2022

Hey @aivinog1, sorry, I was not sure whether this was a question 😅 Sure, go ahead and experiment 🤷 I know that we also want to experiment with it. I also started experimenting a bit with Google Cloud and OpenTelemetry, but it was not that fruitful.

@korthout
Member

@rodrigo-lourenco-lopes The Actor metrics dashboard panel added in #12548 doesn't seem to show any data for our own cluster (Zeebe Team Engineering Automation).

Is there anything we still need to do to have this available for SaaS?

Screenshot 2023-08-11 at 14 12 35

@romansmirnov romansmirnov added the component/zeebe label Mar 5, 2024
9 participants