Nicolas Pepin-Perreault edited this page Feb 26, 2020 · 9 revisions

The goal here is to have a central page where we can collect ideas that would be interesting to explore for Zeebe: topics relevant to Zeebe and its development, ranging from feature ideas to development process proposals and tooling changes.

For each idea, there is a list of people interested in exploring it. This does not mean they have to work on it; it is simply a way to show interest. For example, if you want to start a proof of concept for one of these ideas, you can contact those people, as they may already have some preliminary ideas to share with you.

Distributed Tracing

Interested in: @npepinpe

Distributed applications are notoriously more difficult to observe than their centralized counterparts. Tracing the lifetime of an entity (e.g. a request) across 1, 5, or even 100 hops makes it much harder for the developer to diagnose an issue: what caused this particular request to fail? To be slower? To be faster? To go through this service instead of that one?

By capturing context at various boundaries (e.g. network calls between components) and propagating it across those same boundaries, we can capture the causal chain of events that make up the life of a single request through our distributed application.
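
To make the inject/extract step concrete, here is a minimal Java sketch. It does not use OpenCensus: the `TraceContext` class and its methods are hypothetical, and the header format is only loosely modeled on the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). The caller writes its context into outgoing headers; the callee reads it back and starts a child span in the same trace, which is what lets a tracer stitch the hops of one request back together.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Hypothetical trace context, loosely modeled on the W3C "traceparent" header.
// This is NOT the OpenCensus API; it only illustrates inject/extract.
final class TraceContext {
  static final String HEADER = "traceparent";
  private static final SecureRandom RANDOM = new SecureRandom();

  final String traceId; // 32 hex chars, shared by every span of one request
  final String spanId;  // 16 hex chars, identifies a single hop/operation

  TraceContext(String traceId, String spanId) {
    this.traceId = traceId;
    this.spanId = spanId;
  }

  // Starts a new trace, e.g. when a client request first enters the system.
  static TraceContext newRoot() {
    return new TraceContext(randomHex(16), randomHex(8));
  }

  // Continues the same trace with a fresh span id for the next operation.
  TraceContext child() {
    return new TraceContext(traceId, randomHex(8));
  }

  // Caller side: write the current context into outgoing request headers.
  void inject(Map<String, String> headers) {
    headers.put(HEADER, "00-" + traceId + "-" + spanId + "-01");
  }

  // Callee side: read the caller's context; null means "start a new root".
  static TraceContext extract(Map<String, String> headers) {
    String value = headers.get(HEADER);
    if (value == null) {
      return null;
    }
    String[] parts = value.split("-");
    if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
      return null; // malformed header: ignore it rather than fail the request
    }
    return new TraceContext(parts[1], parts[2]);
  }

  // Random lowercase hex string of the given number of bytes (2 chars per byte).
  private static String randomHex(int bytes) {
    byte[] buf = new byte[bytes];
    RANDOM.nextBytes(buf);
    StringBuilder sb = new StringBuilder(bytes * 2);
    for (byte b : buf) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Gateway (example): no incoming context, so start a new trace and pass it on.
    TraceContext gateway = TraceContext.newRoot();
    Map<String, String> headers = new HashMap<>();
    gateway.inject(headers);

    // Broker (example): extract the caller's context and continue the same trace.
    TraceContext broker = TraceContext.extract(headers).child();
    System.out.println(headers.get(HEADER));
    System.out.println("same trace: " + broker.traceId.equals(gateway.traceId));
  }
}
```

A real tracing library would also record timing and attributes for each span and export them to a backend; this sketch only covers the propagation step.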

While the main use case is helping developers diagnose issues (especially outliers), we can leverage the same tools for distributed profiling, inter-service dependency analysis, capacity planning, etc.

Proof of concept: https://github.com/zeebe-io/zeebe/pull/3341

Integration with Stackdriver

OpenCensus originated at Google, so it has much tighter integration with Stackdriver (and with many Google projects, e.g. gRPC), which is a major plus for our development workflow: there is no need for us to operate an external tracer such as Jaeger. Since the next official standard, OpenTelemetry, will provide bridge adapters for both OpenCensus and OpenTracing, choosing either one now carries little risk, which makes the Stackdriver integration a strong point in favor of OpenCensus.

A common complaint about OpenTracing is the poor UI/UX of its implementations (at least the FOSS ones). This is where Stackdriver can help, with its Stackdriver Trace offering.

Screenshots: Trace Overview, Trace Details, Trace Analysis Report

Links

Structured Logging

Interested in: @npepinpe

Integration with Stackdriver

Interested in: @npepinpe

Integration with Stackdriver

Interested in: @npepinpe

Interested in: @npepinpe

Interested in: @npepinpe

Data integrity checks

Zeebe currently performs only very basic integrity checks on startup: it scans the log, recomputes the checksum of each entry, and compares it with the stored checksum. This detects data corruption, but not bugs such as inconsistent log ordering. It would be interesting to add more integrity checks, ideally without increasing the runtime cost (a higher startup cost is probably acceptable).
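
As an illustration of a check that goes beyond checksums, here is a minimal Java sketch. It assumes a hypothetical entry layout (position, payload, stored CRC32 checksum) rather than Zeebe's actual journal format, and the class and method names are made up. In the reordered log every entry still carries a valid checksum, so only the additional ordering check catches the problem.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical startup integrity check; the entry layout is an assumption,
// not Zeebe's actual journal format.
final class LogIntegrityCheck {

  static final class Entry {
    final long position;
    final byte[] payload;
    final long checksum; // checksum stored with the entry when it was written

    Entry(long position, byte[] payload, long checksum) {
      this.position = position;
      this.payload = payload;
      this.checksum = checksum;
    }
  }

  static long crc32(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  // Helper for building well-formed entries (correct checksum).
  static Entry entry(long position, String payload) {
    byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
    return new Entry(position, bytes, crc32(bytes));
  }

  // Returns null if the log looks consistent, else a description of the first problem.
  static String verify(List<Entry> log) {
    long previous = Long.MIN_VALUE;
    for (Entry e : log) {
      // Check 1 (the existing kind): recomputed checksum must match the stored one.
      if (crc32(e.payload) != e.checksum) {
        return "checksum mismatch at position " + e.position;
      }
      // Check 2 (additional): positions must strictly increase. Every entry can
      // have a valid checksum and still be in the wrong place.
      if (e.position <= previous) {
        return "out-of-order position " + e.position + " after " + previous;
      }
      previous = e.position;
    }
    return null;
  }

  public static void main(String[] args) {
    List<Entry> ok = List.of(entry(1, "a"), entry(2, "b"), entry(3, "c"));
    List<Entry> reordered = List.of(entry(1, "a"), entry(3, "c"), entry(2, "b"));
    System.out.println(verify(ok));        // prints "null"
    System.out.println(verify(reordered)); // prints "out-of-order position 2 after 3"
  }
}
```

Both checks run in a single sequential pass, which fits the idea of paying the cost at startup rather than at runtime.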

Interested in: @npepinpe , @zelldon

Akka

Interested in: @npepinpe , @zelldon

Links

Interested in: @zelldon

Quarkus tailors your application for GraalVM and HotSpot. Amazingly fast boot time, incredibly low RSS memory (not just heap size!) offering near instant scale up and high density memory utilization in container orchestration platforms like Kubernetes. We use a technique we call compile time boot.

A modern, JVM-based, full-stack framework for building modular, easily testable microservice and serverless applications.

Interested in: @zelldon, @npepinpe

Different VMs and GCs

We would like to test out different VMs (e.g. GraalVM, or JDK distributions such as Azul Zulu) and different garbage collectors (e.g. ZGC, Shenandoah).
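
When comparing VMs and GCs, it helps to record which ones a benchmark actually ran with. Below is a small sketch using the standard `java.lang.management` API; the class name `JvmInfo` is hypothetical. The collector itself is chosen with JVM flags such as `-XX:+UseG1GC` or `-XX:+UseZGC`; which collectors are available, and whether they need extra unlock flags, depends on the JDK version and distribution.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports the running VM and its active collectors,
// so benchmark results can be labeled with the exact VM/GC combination.
final class JvmInfo {

  static String vmDescription() {
    return System.getProperty("java.vm.name") + " "
        + System.getProperty("java.vm.version")
        + " (" + System.getProperty("java.vm.vendor") + ")";
  }

  // One name per collector MXBean, e.g. separate young/old collectors.
  static List<String> collectorNames() {
    List<String> names = new ArrayList<>();
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      names.add(gc.getName());
    }
    return names;
  }

  // Accumulated collection time across collectors; beans reporting -1
  // (time undefined for that collector) are skipped.
  static long totalCollectionTimeMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long time = gc.getCollectionTime();
      if (time > 0) {
        total += time;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println("VM: " + vmDescription());
    System.out.println("GCs: " + collectorNames());
    System.out.println("GC time so far (ms): " + totalCollectionTimeMillis());
  }
}
```

Running the same workload once per flag combination and logging this output alongside the results makes the runs easier to compare afterwards.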

Interested in: @zelldon

Build an extension/plugin to make chaos testing Zeebe easier.

Interested in: @zelldon

RocksDB as Log

Interested in: @npepinpe

The motivation here is that many existing Raft implementations delegate management of their persistent log storage to RocksDB or LMDB. I suspect both would perform much better than our current journal and would most likely be more stable, so I would like to test them out.
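
One detail worth getting right when storing a Raft log in RocksDB is key encoding. RocksDB's default comparator orders keys bytewise, so encoding the log index as a big-endian long keeps iteration order equal to log order. This is a sketch under assumptions, not a RocksDB integration: a `TreeMap` compared with `Arrays.compareUnsigned` stands in for the store, and `KvRaftLog` and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap ordered bytewise stands in for RocksDB, whose
// default comparator also compares keys as unsigned byte strings.
final class KvRaftLog {
  private static final Comparator<byte[]> BYTEWISE = Arrays::compareUnsigned;
  private final TreeMap<byte[], byte[]> store = new TreeMap<>(BYTEWISE);

  // Big-endian encoding: for non-negative indexes, byte order equals numeric order.
  static byte[] key(long index) {
    return ByteBuffer.allocate(Long.BYTES).putLong(index).array();
  }

  void append(long index, byte[] entry) {
    store.put(key(index), entry);
  }

  byte[] get(long index) {
    return store.get(key(index));
  }

  // Raft followers drop a conflicting suffix: remove every entry with index >= from.
  void truncateFrom(long from) {
    store.tailMap(key(from), true).clear();
  }

  // Highest stored index, or 0 for an empty log.
  long lastIndex() {
    return store.isEmpty() ? 0 : ByteBuffer.wrap(store.lastKey()).getLong();
  }

  public static void main(String[] args) {
    KvRaftLog log = new KvRaftLog();
    for (long i = 1; i <= 300; i++) {
      log.append(i, ("entry-" + i).getBytes(StandardCharsets.UTF_8));
    }
    log.truncateFrom(257); // e.g. a conflicting suffix is replaced by the leader's entries
    System.out.println(log.lastIndex());                                  // 256
    System.out.println(new String(log.get(256), StandardCharsets.UTF_8)); // entry-256
  }
}
```

With a little-endian encoding, index 256 would sort before index 1, so range scans and suffix truncation would silently operate on the wrong entries.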

Interested in: @npepinpe

It seems possible to profile distributed applications (to what extent?) using Google Cloud Profiler. It would be interesting to learn more about this area.