
As a Zeebe user, I want to see Open Telemetry support for Zeebe #9742

Open
aivinog1 opened this issue Jul 8, 2022 · 8 comments
Labels: component/zeebe, kind/feature

Comments

aivinog1 (Contributor) commented Jul 8, 2022

Is your feature request related to a problem? Please describe.
This issue originates from my comment. The main idea is to add OpenTelemetry support to Zeebe so we can figure out where time is spent. This task is about investigating and providing an MVP, not a comprehensive solution.

Describe the solution you'd like
I think it's best to stick with the OpenTelemetry SDK Autoconfigure module and keep it disabled by default via environment variables or system properties.
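
A minimal sketch of what that could look like, assuming the opentelemetry-sdk-extension-autoconfigure module; the otel.sdk.disabled property is part of the autoconfigure spec, while the bootstrap class itself is hypothetical:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public final class TracingBootstrap {
  public static OpenTelemetry init() {
    // Disabled by default; operators opt in with OTEL_SDK_DISABLED=false
    // (or -Dotel.sdk.disabled=false). If disabled, a no-op SDK is returned.
    if (System.getProperty("otel.sdk.disabled") == null
        && System.getenv("OTEL_SDK_DISABLED") == null) {
      System.setProperty("otel.sdk.disabled", "true");
    }
    return AutoConfiguredOpenTelemetrySdk.initialize().getOpenTelemetrySdk();
  }
}
```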

Tasks:

Describe alternatives you've considered
We could stick to manual configuration, but we should keep in mind that this would require some sort of autoconfiguration of its own.
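
For comparison, a hedged sketch of the manual route with the plain SDK builders; the OTLP endpoint is an assumption, and every choice made here is exactly the configuration surface the autoconfigure module would otherwise handle:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class ManualTracingBootstrap {
  public static OpenTelemetrySdk init() {
    // Exporter and processor are hard-coded; a real setup would need to
    // expose all of this through Zeebe's own configuration.
    final OtlpGrpcSpanExporter exporter =
        OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317") // assumption: local collector
            .build();
    return OpenTelemetrySdk.builder()
        .setTracerProvider(
            SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build())
        .buildAndRegisterGlobal();
  }
}
```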

Additional context
I also want to say that it's best to stick with vendor-agnostic implementations and see what the OpenTelemetry standard itself provides.

@aivinog1 added the kind/feature label Jul 8, 2022
@menski changed the title from "I, as a Zeebe user, want to see Open Telemetry support for Zeebe" to "As a Zeebe user, I want to see Open Telemetry support for Zeebe" Jul 8, 2022
npepinpe (Member) commented:

Hey @aivinog1! I think the client side is mostly done, since we can just leverage the existing gRPC interceptors for it. Let me know if you think these are insufficient.
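
For example, something along these lines should already be possible, assuming the opentelemetry-grpc-1.6 instrumentation library and a Zeebe Java client version whose builder exposes withInterceptors (the gateway address is an assumption):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.grpc.ClientInterceptor;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.instrumentation.grpc.v1_6.GrpcTelemetry;

public final class TracedClientExample {
  public static void main(final String[] args) {
    // Interceptor that creates a client span per gRPC call and injects
    // the trace context into the outgoing request headers.
    final ClientInterceptor tracing =
        GrpcTelemetry.create(GlobalOpenTelemetry.get()).newClientInterceptor();

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumption: local gateway
            .usePlaintext()
            .withInterceptors(tracing)
            .build()) {
      client.newTopologyRequest().send().join();
    }
  }
}
```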

For now we could focus on the gateway side. We do plan to switch the internal cluster communication from the home-grown, Netty-based transport to gRPC. I can't really say when, unfortunately, but that would greatly simplify the OpenTelemetry integration.

At any rate, what is your plan here? We currently output Prometheus metrics, which is admittedly vendor-specific. We did think about switching generally to Micrometer, and I think this might be a more worthwhile endeavor long term. It means increased integration capabilities with various metrics backends, and it's still compatible with OpenTelemetry (you enable the OTLP registry and you're good to go).
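
A rough sketch of what enabling the OTLP registry looks like with micrometer-registry-otlp; the metric name is illustrative:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.registry.otlp.OtlpConfig;
import io.micrometer.registry.otlp.OtlpMeterRegistry;

public final class MetricsExample {
  public static void main(final String[] args) {
    // OtlpConfig.DEFAULT pushes to http://localhost:4318/v1/metrics;
    // any OTLP-capable backend can receive these.
    final MeterRegistry registry =
        new OtlpMeterRegistry(OtlpConfig.DEFAULT, Clock.SYSTEM);
    registry.counter("zeebe.example.commands").increment(); // illustrative metric
  }
}
```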

Regarding tracing, this is a big missing piece, but again it shouldn't be too hard to just stick to the gRPC defaults for now. Then it's about defining how fine-grained we would like the traces to be. But I would propose sticking to the gateway/client communication for now, as it allows us to focus on the new feature and not worry too much about having to implement, say, tracing support in our custom transport.

Hope that makes sense

aivinog1 (Contributor, Author) commented:

Hey @npepinpe!
Thanks for such a comprehensive response 👍 I plan to focus on tracing only, but, if I may, I would propose adding the broker to the tracing chain. Sometimes we have a problem with only a couple of process instances in the load-test environment, and it would be a relief if we could understand why a given process instance took so long to execute.

npepinpe (Member) commented:

Feel free to investigate it, but it's definitely the more complex portion. In particular, how do you deal with aggregated traces/spans? We batch multiple log stream entries (i.e. commands/events) together in a single Raft entry. So if we want to trace the Raft part, an entry may contain multiple commands which could be part of different traces (note: this isn't true right now, as they're always implicitly part of the same process instance, for example, but I wouldn't rely on this since it's very much an implementation detail and not by design). One option is to push the trace ID/span ID/context down to the record level in the record metadata. That's probably OK-ish for the IDs, but once users start putting in context it might become quite heavy. We could alternatively only serialize it when we know the trace will be sampled (this used to be part of OpenTracing; hopefully it was kept in OpenTelemetry).
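
To make the "push it down to the record metadata" option concrete, a hedged sketch using the W3C trace context propagator from the OpenTelemetry API; the per-record metadata map is hypothetical:

```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import java.util.HashMap;
import java.util.Map;

public final class RecordContextExample {
  public static Map<String, String> captureTraceContext() {
    // Hypothetical stand-in for per-record metadata; in practice this
    // would live in the record's serialized metadata.
    final Map<String, String> recordMetadata = new HashMap<>();
    // Writes the "traceparent" (and possibly "tracestate") entries,
    // i.e. just the IDs plus flags, not arbitrary user baggage.
    W3CTraceContextPropagator.getInstance()
        .inject(Context.current(), recordMetadata, Map::put);
    return recordMetadata;
  }
}
```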

Anyway, don't hesitate to investigate; I look forward to what you find. Just having an idea of what exactly we want to trace (i.e. when to start/close a trace, when to start/close a span) would be a big step.

aivinog1 (Contributor, Author) commented:

Hey @npepinpe.
I've done a little digging and have something to share on this topic :) I got stuck on the Actor framework you are using. It seems that the main problem is trace context propagation when calling Actor#run: when you submit a new job, the current thread is released and the job is executed later.
https://github.com/camunda/zeebe/blob/14e98f8f81e5003085c4efe42945e907d154d754/scheduler/src/main/java/io/camunda/zeebe/scheduler/ActorTask.java#L507-L509
So, the first thing that came to my mind was to add some metadata map to the ActorJob to store the current trace context, and restore it when the ActorJob is executed. If you are okay with this metadata idea, I will add it and continue digging :)
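In OpenTelemetry terms, the idea boils down to capturing the Context at submission time and making it current on the actor thread. A minimal sketch, with the submission wrapper being hypothetical:

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

public final class ContextPropagationSketch {
  public static Runnable traced(final Runnable job) {
    final Context captured = Context.current(); // at submission time
    return () -> {
      // On the actor thread: restore the submitter's trace context
      // for the duration of the job.
      try (Scope ignored = captured.makeCurrent()) {
        job.run();
      }
    };
  }
  // Note: Context.current().wrap(job) is the built-in shorthand for this.
}
```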
By the way, autoconfiguration works fine with the gRPC API, so the Zeebe Gateway gets its own context, but I think we can do more here since the OpenTelemetry Java agent already patches Netty's Channel. Also, I have some concerns about performance (especially startup time), but it's easy to keep the Java agent disabled by default and keep performance unchanged.

aivinog1 (Contributor, Author) commented:

Hey @npepinpe!
I've played a little with the OpenTelemetry framework and I want to share the results:

  1. I've achieved these results (see the attached screenshot):
    1.1. So, you can see interactions between Zeebe Gateway and Broker (it isn't perfect but you can get the point).
    1.2. Also, there is no single span linking io.atomix.cluster.messaging.impl.RemoteClientConnection#sendAndReceive and io.atomix.cluster.messaging.impl.RemoteClientConnection#dispatch, but it is still pretty readable.
    1.3. What I have done to see this picture:
    1.3.1. Build the project:
mvn clean install -DskipTests -DskipChecks
docker build --no-cache --load --build-arg DISTBALL=dist/target/camunda-zeebe-*.tar.gz --build-arg APP_ENV=dev -t camunda/zeebe:current-test .

1.3.2. Start the environment: docker-compose -f docker/compose/opentelemetry/docker-compose.yaml up -d
1.3.3. Deploy the simple diagram: test-process-simple.bpmn.zip
1.3.4. Start the process: zbctl create instance Process_0n10xef --insecure --withResult --requestTimeout 30s
1.3.5. And immediately start the test worker in the other tab: zbctl create worker test --insecure --handler 'echo {}' --pollInterval 2m
1.3.6. You should get approximately this output:

2022/08/30 20:32:10 Activated job 2251799813685421 with variables {}
2022/08/30 20:32:10 Handler completed job 2251799813685421 with variables {}

1.3.7. Open the Jaeger: http://localhost:16686/
1.3.8. Filter by the gateway_protocol.Gateway/ActivateJobs operation.
2. What I have done:
2.1. I used automatic and manual instrumentation to propagate the trace context and start my spans.
2.2. Also, I used the OpenTelemetry Collector to collect the data and Jaeger to build the traces.
3. Notes:
3.1. I couldn't find any guides on using automatic and manual instrumentation simultaneously, so it is all questionable, but it works :) (a minimal sketch of combining the two follows at the end of this comment)
3.2. The code is full of garbage and isn't good at all, but I wanted to create a draft to see how it works :)
4. Next steps:
4.1. Because there are a lot of changes, I want to separate them into several tasks:
4.1.1. Set up the automatic instrumentation.
4.1.2. Make the Actor framework propagate the tracing context.
4.1.3. Create the third version of the protocol between the Zeebe Gateway and the Broker.
4.1.4. Manually instrument the interactions between the Zeebe Gateway and the Broker.
4.1.5. Create some documentation.
4.1.6. Propagate significant IDs as attributes (for example, process instance IDs, job IDs, etc.) to ease the search for relevant traces where applicable.
4.1.7. I can't say what the performance impact will be, but we should instantiate OpenTelemetry once and then do some performance testing.
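
As a rough illustration of 3.1 and 4.1.6: a hedged sketch of a manual span that joins agent-created traces (when the Java agent is attached, GlobalOpenTelemetry resolves to the agent's SDK); the span and attribute names are illustrative, not an agreed convention:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class ManualSpanExample {
  public static void process(final long processInstanceKey) {
    final Tracer tracer = GlobalOpenTelemetry.getTracer("io.camunda.zeebe");
    // With the agent attached, this manual span becomes a child of the
    // automatically created server span.
    final Span span = tracer.spanBuilder("process-record").startSpan();
    try (Scope ignored = span.makeCurrent()) {
      // 4.1.6: significant IDs as searchable attributes.
      span.setAttribute("zeebe.process_instance.key", processInstanceKey);
      // ... actual record processing ...
    } finally {
      span.end();
    }
  }
}
```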

So, it would be cool to see what you think about it :)

tgdfool2 commented May 8, 2023

Hi @felix-mueller and @aivinog1. Any news on this feature? We would love to be able to send traces from our Brokers and Gateways to our Grafana Tempo instances.

Is this something that we could expect to see available in the near future?

(BTW, I took over this task from @darox who was giving some information here: #10241 (comment))

aivinog1 (Contributor, Author) commented May 8, 2023

Hi @tgdfool2 👋
Right now this is low priority for me, so I don't think I'll find time to work on it within the 8.3 release.
You could use the info from #10597 (as you can see, it was declined) to enable the automatic instrumentation (basically HTTP/2 gRPC calls).

tgdfool2 commented May 9, 2023

Thanks for your feedback @aivinog1. In that case, we will wait until OTEL is officially added to the product.

@romansmirnov added the component/zeebe label Mar 5, 2024