Productionize Streaming Jobs for Service Dependencies #4590

Open
yurishkuro opened this issue Jul 18, 2023 · 17 comments
Labels
enhancement, help wanted (features that maintainers are willing to accept but do not have cycles to implement)

Comments

@yurishkuro
Member

yurishkuro commented Jul 18, 2023

Currently we have two analytics solutions for generating service maps:

  • Jaeger Analytics Flink
    • Real time streaming, requires Kafka.
    • More feature-rich; includes code for both 1-hop and transitive dependency graphs -- https://www.jaegertracing.io/docs/1.47/features/#topology-graphs
    • Aggregates data for a given time window (originally at Uber - 15 min) and writes a summary snapshot to storage
    • No easy deployment solution is provided in the repository.
  • Spark Dependencies
    • Batch job that reads all data for a period of time, aggregates, and writes a summary snapshot to storage.
    • Does not require Kafka.
    • Theoretically it can be run as frequently as every 15 min to produce results similar to the Flink jobs above, but the Cassandra implementation may need to be tweaked for that.
    • Does not support transitive dependency graphs.

Objectives:

  • Ideally we want a single code base that supports both types of service dependencies
  • The solution needs to be documented, packaged (e.g. published containers) and easy to deploy (e.g. with docker compose or k8s operator)
  • Supporting both batch (goes directly against span storage) and streaming (reads from Kafka) is nice to have
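For illustration, the 1-hop vs. transitive distinction above can be sketched in plain Java. This is a sketch only: the `Edge` type and method names are hypothetical and are not the actual Flink or Spark job APIs.

```java
import java.util.*;

// Sketch: deriving 1-hop vs. transitive service dependencies from
// parent->child call edges extracted from spans. The Edge type and
// method names are illustrative, not the actual job APIs.
public class DependencySketch {
    record Edge(String parent, String child) {}

    // 1-hop graph: count direct parent->child calls (what Spark Dependencies produces).
    static Map<String, Long> oneHop(List<Edge> edges) {
        Map<String, Long> counts = new HashMap<>();
        for (Edge e : edges) {
            counts.merge(e.parent() + "->" + e.child(), 1L, Long::sum);
        }
        return counts;
    }

    // Transitive graph: every service reachable downstream of each service,
    // found by breadth-first search over the 1-hop adjacency.
    static Map<String, Set<String>> transitive(List<Edge> edges) {
        Map<String, Set<String>> adj = new HashMap<>();
        for (Edge e : edges) {
            adj.computeIfAbsent(e.parent(), k -> new HashSet<>()).add(e.child());
        }
        Map<String, Set<String>> reachable = new HashMap<>();
        for (String src : adj.keySet()) {
            Set<String> seen = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>(adj.get(src));
            while (!queue.isEmpty()) {
                String next = queue.poll();
                if (seen.add(next)) {
                    queue.addAll(adj.getOrDefault(next, Set.of()));
                }
            }
            reachable.put(src, seen);
        }
        return reachable;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
            new Edge("frontend", "driver"),
            new Edge("driver", "redis"),
            new Edge("frontend", "driver"));
        System.out.println(oneHop(edges));     // frontend->driver counted twice
        System.out.println(transitive(edges)); // frontend also reaches redis transitively
    }
}
```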
@yurishkuro added the mentorship and help wanted labels Jul 18, 2023
@tronda

tronda commented Jul 19, 2023

Would one possible implementation be to use the ServiceGraphConnector to create service dependencies, or is it not suitable for Jaeger's dependency diagrams?

@mohamedawnallah

mohamedawnallah commented Jul 22, 2023

Expression of Interest in this Mentorship Project - Productionize Streaming Jobs for Service Dependencies

Hello everyone,

I am genuinely interested in participating in this project as part of the LFX mentorship program in Q3. I have a strong understanding of the Distributed Tracing domain, having read the entire book Mastering Distributed Tracing.

Additionally, I have relevant experience that I believe aligns well with the objectives of the project.

Specifically, I have installed Jaeger in a production environment using the Kubernetes Operator. Additionally, I have configured Spark jobs to detect one-hop service dependencies in a simple instrumented application, comprising two services. One of these services fetches the IP address from a remote API endpoint, while the other formats the data.

My Instrumented Application Tracing Architecture

[Image: My Instrumented Application Architecture]

My Instrumented Application in Production Cluster

[Image: jaeger-production-cluster]

Questions

  1. I'm interested in the possibility of customizing this project to support batch and streaming processing for handling various service dependencies. However, I have a question: Will this new solution replace the existing Jaeger Analytics Flink and Spark Dependencies, or will it work alongside them?
  2. During my exploration, I came across the jaeger-analytics-java repository. I'm curious to know how it fits into the overall project idea and if it brings additional value to the initiative.

Follow-up research resources

As I prepare to contribute, I would greatly appreciate it if you could recommend any additional resources or documentation to help me better understand this project and its specific requirements.

Looking forward to participating in this exciting endeavor!

@yurishkuro
Member Author

@mohamedawnallah I think it's worth looking into jaeger-analytics-java as well and deciding how it fits or overlaps with the rest. Ideally I would like to see a single repo / single trace analytics library that can be used with different streaming solutions.

@mohamedawnallah

mohamedawnallah commented Jul 26, 2023

I wanted to share my progress so far on this issue. I have gained a clear understanding of the jaeger-analytics-java repository and its role in this project context. It serves as a Trace DSL (Domain-Specific Language) for metrics analytics, and I've executed it in Jupyter Notebooks to understand its functionality, referring to the helpful Jaeger tracing article on Medium.

Additionally, I have introduced two new analytics metrics for the example hotrod application:

  1. Calculated the Average Duration of Traces
  2. Identified the Most Common Error Types

These metrics have been essential in understanding the Trace DSL API and the implementation of the Gremlin Query/Traversal Language from the Apache Tinkerpop Project.

Furthermore, I have observed that jaeger-analytics-java provides a metric for a service's direct downstream dependencies. To obtain this metric, we need to run the corresponding job that specifically supports 1-hop service dependencies.
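For illustration only, the two metrics above could be computed like this in plain Java over an assumed `Trace` record. Note this is a hedged sketch: the actual jaeger-analytics-java implementation expresses these as Gremlin traversals over a trace graph, not as collection operations, and the `Trace` type here is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of the two example metrics; the real Trace DSL
// uses Gremlin traversals from Apache TinkerPop. The Trace record is made up.
public class TraceMetrics {
    record Trace(String traceId, long durationMicros, String errorType) {}

    // 1. Average duration across traces.
    static double averageDuration(List<Trace> traces) {
        return traces.stream()
                .mapToLong(Trace::durationMicros)
                .average()
                .orElse(0.0);
    }

    // 2. Most common error type (a null errorType means "no error").
    static String mostCommonErrorType(List<Trace> traces) {
        return traces.stream()
                .filter(t -> t.errorType() != null)
                .collect(Collectors.groupingBy(Trace::errorType, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("none");
    }

    public static void main(String[] args) {
        List<Trace> traces = List.of(
                new Trace("t1", 120, null),
                new Trace("t2", 300, "timeout"),
                new Trace("t3", 180, "timeout"));
        System.out.println(averageDuration(traces));      // 200.0
        System.out.println(mostCommonErrorType(traces));  // timeout
    }
}
```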

@mohamedawnallah

mohamedawnallah commented Jul 26, 2023

I'm currently considering an implementation approach for this project. One idea is to enhance the Jaeger Analytics Flink repository's deployability, making it easier to set up. By pursuing this approach, we can meet at least the bare minimum requirements for this project to support both service dependencies (1-hop and transitive dependencies) while ensuring a straightforward deployment process.

@yurishkuro, I'd love to hear your thoughts on this!

@yurishkuro
Member Author

What I am curious about is whether it's possible to consolidate streaming business logic into a library L so that the same library could be used with multiple streaming runtimes, e.g.

Flink source/runtime -> L -> Flink sink
Spark source/runtime -> L -> Spark sink

A few years ago this wasn't possible because Spark and Flink used different APIs to describe the transformation flows. But since Java Streams were introduced, I was under the impression that the UDFs could be expressed in Java Streams and work for both. This is just my assumption; it would be good to confirm.

The reason I think this reusability is useful is that supporting Spark allows offline batch processing, which may be a useful feature for some, not to mention that some organizations run only Spark and not Flink.
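A minimal sketch of what such a runtime-agnostic library could look like, assuming the core aggregation is expressible over `java.util.stream`. All names here are hypothetical; the point is only that the function below has no Flink or Spark types in its signature.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the "library L" idea: the aggregation logic is a plain function
// over java.util.stream, with no Flink or Spark dependencies. Each runtime
// adapter converts its partition/window into a Stream, applies the function,
// and hands the result to its own sink. All names are hypothetical.
public class DependencyAggregator {
    // Input: "parent child" call records for one time window.
    // Output: edge -> call count, ready to be written as a summary snapshot.
    public static Map<String, Long> aggregate(Stream<String> callRecords) {
        return callRecords.collect(
            Collectors.groupingBy(r -> r, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> window = List.of(
            "frontend driver", "driver redis", "frontend driver");
        // A Flink adapter could call this from a window function, and a Spark
        // adapter from a partition-mapping step -- the core stays identical.
        System.out.println(aggregate(window.stream()));
    }
}
```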

@mohamedawnallah

mohamedawnallah commented Jul 27, 2023

@yurishkuro I have recently explored the idea of consolidating streaming business logic into a library to make it compatible with multiple streaming runtimes, such as Apache Flink and Apache Spark.

While the Java Stream API might initially seem like a good fit, I discovered that it is designed for processing data within a single JVM, making it suitable for in-memory processing on a single machine. The distributed counterpart of the Java Stream API is DStream. For more in-depth information, there is a paper on DStream from the University of York that delves into its specifics and also discusses the issues with the Java Stream API: DStream Paper.

DStream (Discretized Stream) is employed as a low-level API in Apache Spark for its streaming capabilities. However, DStream is not supported in Apache Flink, and it works through micro-batching, which doesn't qualify it as a true streaming model. Though Spark is now experimenting with Continuous Processing, it is still not as mature as the streaming capabilities in Apache Flink.

In contrast, Apache Flink simplifies the process with a unified DataStream API, which can handle both batch and streaming processing modes without the need to rewrite code. This makes Flink a more flexible choice, especially for organizations that want to use a single data processing platform with the same API for both batch and stream processing.

In conclusion, while Java Streams may not be suitable for the desired cross-runtime compatibility, Apache Flink's DataStream API offers a promising foundation for building reusable streaming and batch business logic that can be deployed seamlessly.

@yurishkuro I also would love to hear your thoughts on this!

@mohamedawnallah

Hey @yurishkuro I'd still like to work on this issue outside the official LFX mentorship. Any thoughts?

@yurishkuro
Member Author

@mohamedawnallah most of our code is already written for Flink, so it's fine to keep it and package for prod deployment.

@mohamedawnallah

mohamedawnallah commented Aug 21, 2023

Great, so packaging Jaeger Analytics Flink for production means:

  1. Package both the 1-hop service dependencies (Dependencies Job) and the transitive dependencies (Deep Dependencies Job) into Docker containers using a Docker Compose file.
  2. Inject data sink dependencies, i.e. the Apache Cassandra configuration, as command-line arguments when running the docker-compose file.
  3. Document how to deploy Jaeger Analytics Flink in production along with the relevant containers.
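For illustration, steps (1) and (2) might start from a compose file along these lines. This is a hypothetical sketch only: the `dependencies-job` image name, entry point, and flags are placeholders, not published artifacts.

```yaml
# Hypothetical sketch -- the dependencies-job image name and its flags
# below are placeholders, not published artifacts.
version: "3"
services:
  kafka:
    image: bitnami/kafka:latest
  cassandra:
    image: cassandra:4
  dependencies-job:
    image: jaegertracing/jaeger-analytics-flink:latest  # placeholder image name
    command: >
      --kafka.brokers=kafka:9092
      --cassandra.contact-points=cassandra:9042
    depends_on:
      - kafka
      - cassandra
```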

@yurishkuro I'd love to hear your thoughts on this and if there is anything I'm missing

@yurishkuro
Member Author

Yes, plus (4) set up CI integration tests to validate that those packages are operational.

But on (1), the docker-compose is not the "production" packaging, usually it's just an example & integration test, while the actual packaging is just the runnable Docker images. Another option is to extend the K8S Operator to support deployment of these images too (when used with Kafka, of course).

@yurishkuro
Member Author

on (2), it would be good to support other backends too, not just Cassandra (at minimum ES / OS).

@mohamedawnallah

Thanks @yurishkuro for your additions. I know "ES" stands for Elasticsearch, but what does "OS" stand for in the storage context?

@yurishkuro
Member Author

OpenSearch

@mohamedawnallah

Okay, I'm going to start working on the issue, but I'd like to know if you have any suggestions about communication while working on it. Also, is the repository dedicated to this issue Jaeger Analytics Flink?

@yurishkuro
Member Author

yurishkuro commented Aug 25, 2023

suggestions about communication while working on this issue

I would recommend creating a proposal / plan of what you plan to do and how. This is not a 1-day project, so the plan should contain multiple milestones. We can copy them as a checklist into the ticket description and tick off as each milestone is reached. This would provide good visibility on the progress.

@mohamedawnallah

mohamedawnallah commented Aug 25, 2023

Sounds great! I'll send a proposal soon covering what I plan to do and how for the Productionize Streaming Jobs for Service Dependencies project. Should I send it via Slack private message?
