
As a Zeebe user, I want to see Open Telemetry support for Zeebe #9742

Open
aivinog1 opened this issue Jul 8, 2022 · 8 comments
Labels: component/zeebe, kind/feature

Comments

aivinog1 (Contributor) commented Jul 8, 2022

Is your feature request related to a problem? Please describe.
This issue originates from my comment. The main idea is to add OpenTelemetry support to Zeebe so we can figure out where time is spent. This task is about investigating and providing an MVP, not a comprehensive solution.

Describe the solution you'd like
I think it's best to stick with the OpenTelemetry SDK Autoconfigure module and keep it disabled by default via environment variables or system properties.
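
A minimal sketch of what that could look like, assuming the opentelemetry-sdk-extension-autoconfigure module; the otel.sdk.disabled property is part of the autoconfigure spec, while the bootstrap class itself is hypothetical:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public final class TracingBootstrap {
  public static OpenTelemetry init() {
    // Disabled by default; operators opt in with OTEL_SDK_DISABLED=false
    // (or -Dotel.sdk.disabled=false). If disabled, a no-op SDK is returned.
    if (System.getProperty("otel.sdk.disabled") == null
        && System.getenv("OTEL_SDK_DISABLED") == null) {
      System.setProperty("otel.sdk.disabled", "true");
    }
    return AutoConfiguredOpenTelemetrySdk.initialize().getOpenTelemetrySdk();
  }
}
```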

Tasks:

Describe alternatives you've considered
We could stick to manual configuration, but we should keep in mind that this would require some sort of autoconfiguration of its own.
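
For comparison, a hedged sketch of the manual route with the plain SDK builders; the OTLP endpoint is an assumption, and every choice made here is exactly the configuration surface the autoconfigure module would otherwise handle:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class ManualTracingBootstrap {
  public static OpenTelemetrySdk init() {
    // Exporter and processor are hard-coded; a real setup would need to
    // expose all of this through Zeebe's own configuration.
    final OtlpGrpcSpanExporter exporter =
        OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317") // assumption: local collector
            .build();
    return OpenTelemetrySdk.builder()
        .setTracerProvider(
            SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build())
        .buildAndRegisterGlobal();
  }
}
```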

Additional context
I also want to say that it's best to stick with vendor-agnostic implementations and see what the OpenTelemetry standard itself provides.

@aivinog1 added the kind/feature label Jul 8, 2022
@menski changed the title from "I, as a Zeebe user, want to see Open Telemetry support for Zeebe" to "As a Zeebe user, I want to see Open Telemetry support for Zeebe" Jul 8, 2022
npepinpe (Member) commented:

Hey @aivinog1! I think the client side is mostly done, since we can just leverage the existing gRPC interceptors for it. Let me know if you think these are insufficient.
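
For example, something along these lines should already be possible, assuming the opentelemetry-grpc-1.6 instrumentation library and a Zeebe Java client version whose builder exposes withInterceptors (the gateway address is an assumption):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.grpc.ClientInterceptor;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.instrumentation.grpc.v1_6.GrpcTelemetry;

public final class TracedClientExample {
  public static void main(final String[] args) {
    // Interceptor that creates a client span per gRPC call and injects
    // the trace context into the outgoing request headers.
    final ClientInterceptor tracing =
        GrpcTelemetry.create(GlobalOpenTelemetry.get()).newClientInterceptor();

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumption: local gateway
            .usePlaintext()
            .withInterceptors(tracing)
            .build()) {
      client.newTopologyRequest().send().join();
    }
  }
}
```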

For now we could focus on the gateway side. We do plan to switch the internal cluster communication from the home-grown, Netty-based transport to gRPC. I can't really say when, unfortunately, but that would greatly simplify the OpenTelemetry integration.

At any rate, what is your plan here? We currently output Prometheus metrics, which is admittedly vendor-specific. We did think about switching generally to Micrometer, and I think this might be a more worthwhile endeavor long term. It means increased integration capabilities with various metrics backends, and it's still compatible with OpenTelemetry (you enable the OTLP registry and you're good to go).
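
A rough sketch of what enabling the OTLP registry looks like with micrometer-registry-otlp; the metric name is illustrative:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.registry.otlp.OtlpConfig;
import io.micrometer.registry.otlp.OtlpMeterRegistry;

public final class MetricsExample {
  public static void main(final String[] args) {
    // OtlpConfig.DEFAULT pushes to http://localhost:4318/v1/metrics;
    // any OTLP-capable backend can receive these.
    final MeterRegistry registry =
        new OtlpMeterRegistry(OtlpConfig.DEFAULT, Clock.SYSTEM);
    registry.counter("zeebe.example.commands").increment(); // illustrative metric
  }
}
```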

Regarding tracing, this is a big missing piece, but again it shouldn't be too hard to just stick to the gRPC defaults for now. Then it's about defining how fine-grained we would like the traces to be. But I would propose sticking to the gateway/client communication for now, as it allows us to focus on the new feature and not worry too much about having to implement, say, tracing support in our custom transport.

Hope that makes sense

aivinog1 (Contributor, Author) commented:

Hey @npepinpe!
Thanks for such a comprehensive response 👍 I plan to focus on tracing only, but, if I may, I would propose adding the broker to the tracing chain. Sometimes we have a problem with only a couple of process instances in the load-test environment, and it would be a relief if we could understand why a given process instance took so long to execute.

npepinpe (Member) commented:

Feel free to investigate it, but it's definitely the more complex portion. In particular, how do you deal with aggregated traces/spans? We batch multiple log stream entries (i.e. commands/events) together in a single Raft entry. So if we want to trace the Raft part, an entry may contain multiple commands which could be part of different traces (note: this isn't true right now, as they're always implicitly part of the same process instance, for example, but I wouldn't rely on this since it's very much an implementation detail and not by design). One option is to push the trace ID/span ID/context down to the record level in the record metadata. That's probably OK-ish for the IDs, but once users start putting in context it might become quite heavy. We could alternatively only serialize it when we know the trace will be sampled (this used to be part of OpenTracing; hopefully it was kept in OpenTelemetry).
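
To make the "push it down to the record metadata" option concrete, a hedged sketch using the W3C trace context propagator from the OpenTelemetry API; the per-record metadata map is hypothetical:

```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import java.util.HashMap;
import java.util.Map;

public final class RecordContextExample {
  public static Map<String, String> captureTraceContext() {
    // Hypothetical stand-in for per-record metadata; in practice this
    // would live in the record's serialized metadata.
    final Map<String, String> recordMetadata = new HashMap<>();
    // Writes the "traceparent" (and possibly "tracestate") entries,
    // i.e. just the IDs plus flags, not arbitrary user baggage.
    W3CTraceContextPropagator.getInstance()
        .inject(Context.current(), recordMetadata, Map::put);
    return recordMetadata;
  }
}
```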

Anyway, don't hesitate to investigate; I look forward to what you find. Just having an idea of what exactly we want to trace (i.e. when to start/close a trace, when to start/close a span) would be a big step.

aivinog1 (Contributor, Author) commented:

Hey @npepinpe.
I've done a little digging and have something to share on this topic :) I got stuck on the Actor framework you are using. It seems that the main problem is trace context propagation when calling Actor#run: when you submit a new job, the current thread is released and the job is executed later.
https://github.com/camunda/zeebe/blob/14e98f8f81e5003085c4efe42945e907d154d754/scheduler/src/main/java/io/camunda/zeebe/scheduler/ActorTask.java#L507-L509
So, the first thing that came to my mind was to add some metadata map to the ActorJob to store the current trace context, and restore it when the ActorJob is executed. If you are okay with this metadata idea, I will add it and continue digging :)
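In OpenTelemetry terms, the idea boils down to capturing the Context at submission time and making it current on the actor thread. A minimal sketch, with the submission wrapper being hypothetical:

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

public final class ContextPropagationSketch {
  public static Runnable traced(final Runnable job) {
    final Context captured = Context.current(); // at submission time
    return () -> {
      // On the actor thread: restore the submitter's trace context
      // for the duration of the job.
      try (Scope ignored = captured.makeCurrent()) {
        job.run();
      }
    };
  }
  // Note: Context.current().wrap(job) is the built-in shorthand for this.
}
```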
By the way, autoconfiguration works fine with the gRPC API, so the Zeebe Gateway gets its own context, but I think we can do more here since the OpenTelemetry Java agent already patches Netty's Channel. Also, I have some concerns about performance (especially startup time), but it's easy to keep the Java agent disabled by default and keep performance unchanged.

aivinog1 (Contributor, Author) commented:

Hey @npepinpe!
I've played a little with the OpenTelemetry framework and I want to share the results:

  1. I've achieved these results (see the attached screenshot):
    1.1. So, you can see interactions between Zeebe Gateway and Broker (it isn't perfect but you can get the point).
    1.2. Also, there is no single span linking io.atomix.cluster.messaging.impl.RemoteClientConnection#sendAndReceive and io.atomix.cluster.messaging.impl.RemoteClientConnection#dispatch, but it is still pretty readable.
    1.3. What I have done to see this picture:
    1.3.1. Build the project:
mvn clean install -DskipTests -DskipChecks
docker build --no-cache --load --build-arg DISTBALL=dist/target/camunda-zeebe-*.tar.gz --build-arg APP_ENV=dev -t camunda/zeebe:current-test .

1.3.2. Start the environment: docker-compose -f docker/compose/opentelemetry/docker-compose.yaml up -d
1.3.3. Deploy the simple diagram: test-process-simple.bpmn.zip
1.3.4. Start the process: zbctl create instance Process_0n10xef --insecure --withResult --requestTimeout 30s
1.3.5. And immediately start the test worker in the other tab: zbctl create worker test --insecure --handler 'echo {}' --pollInterval 2m
1.3.6. You should get approximately this output:

2022/08/30 20:32:10 Activated job 2251799813685421 with variables {}
2022/08/30 20:32:10 Handler completed job 2251799813685421 with variables {}

1.3.7. Open the Jaeger: http://localhost:16686/
1.3.8. Filter by the gateway_protocol.Gateway/ActivateJobs operation.
2. What I have done:
2.1. I used automatic and manual instrumentation to propagate the trace context and start my spans.
2.2. Also, I used the OpenTelemetry Collector to collect the data and Jaeger to build the traces.
3. Notes:
3.1. I couldn't find any guides on using automatic and manual instrumentation simultaneously, so it is all questionable, but it works :) (a minimal sketch of combining the two follows at the end of this comment)
3.2. The code is full of garbage and isn't good at all, but I wanted to create a draft to see how it works :)
4. Next steps:
4.1. Because there are a lot of changes, I want to separate them into several tasks:
4.1.1. Set up the automatic instrumentation.
4.1.2. Make the Actor framework propagate the tracing context.
4.1.3. Create the third version of the protocol between the Zeebe Gateway and the Broker.
4.1.4. Manually instrument the interactions between the Zeebe Gateway and the Broker.
4.1.5. Create some documentation.
4.1.6. Propagate significant IDs as attributes (for example, process instance IDs, job IDs, etc.) to ease the search for relevant traces where applicable.
4.1.7. I can't say what the performance impact will be, but we should instantiate OpenTelemetry once and then do some performance testing.
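
As a rough illustration of 3.1 and 4.1.6: a hedged sketch of a manual span that joins agent-created traces (when the Java agent is attached, GlobalOpenTelemetry resolves to the agent's SDK); the span and attribute names are illustrative, not an agreed convention:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class ManualSpanExample {
  public static void process(final long processInstanceKey) {
    final Tracer tracer = GlobalOpenTelemetry.getTracer("io.camunda.zeebe");
    // With the agent attached, this manual span becomes a child of the
    // automatically created server span.
    final Span span = tracer.spanBuilder("process-record").startSpan();
    try (Scope ignored = span.makeCurrent()) {
      // 4.1.6: significant IDs as searchable attributes.
      span.setAttribute("zeebe.process_instance.key", processInstanceKey);
      // ... actual record processing ...
    } finally {
      span.end();
    }
  }
}
```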

So, it would be cool to see what you think about it :)

tgdfool2 commented May 8, 2023

Hi @felix-mueller and @aivinog1. Any news on this feature? We would love to be able to send traces from our Brokers and Gateways to our Grafana Tempo instances.

Is this something that we could expect to see available in the near future?

(BTW, I took over this task from @darox who was giving some information here: #10241 (comment))

aivinog1 (Contributor, Author) commented May 8, 2023

Hi @tgdfool2 👋
Right now this is low priority for me, so I don't think I'll find time to work on it within the 8.3 release.
You could use the info from #10597 (as you can see, it was declined) to enable the automatic instrumentation (basically HTTP/2 gRPC calls).

tgdfool2 commented May 9, 2023

Thanks for your feedback @aivinog1. In that case, we will wait until OTEL is officially added to the product.

@romansmirnov added the component/zeebe label Mar 5, 2024