Determine where we are with network monitoring in the cluster #1785

Closed
7 tasks done
Jose-Matsuda opened this issue Jul 19, 2023 · 6 comments

@Jose-Matsuda
Contributor

Jose-Matsuda commented Jul 19, 2023

Purpose of this ticket is to see where we are in relation to #914

Steps (these should all be their own separate comments for easy linking)

@Jose-Matsuda Jose-Matsuda self-assigned this Jul 19, 2023
@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 19, 2023

Review All Tickets (RAT)

#915 - determined that Jaeger was a good path forward (at least when tested locally)
#916 - some information on what Jaeger requires
#1462 Jose and Souheil investigation. It looks like we got the jaeger-operator running (though searching for :jaeger in the cluster now shows a resource named simple-prod in a failed status).
#1481 has us discovering that Jaeger does not support Elastic 8.x, and it still does not.
#1483 Getting Istio traces from the sidecar.
#1484 was created from #1462 for networking, but no work has been done on it.
#1505 has us bargaining and trying to save the jaeger ship and then exploring different options.
#1521 an offshoot for instrumenting

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 19, 2023

Getting a sense of where we are

Jaeger

The biggest point of failure with Jaeger seems to have been Elastic compatibility (#1481 and #1505); note that it still does not support Elastic 8.x. In #1505 we also tried using k8ssandra, but that ended in failure.

  • Follow-up question on k8ssandra: why did we try it that way? It almost seems like overkill, especially if we are only using it for spans. Could we get away with just an image deployment / StatefulSet?

TLDR for why Jaeger didn't work out:
It seems we ran into problems storing the spans (permissions errors, among others, for k8ssandra, and compatibility issues for Elasticsearch).

OTEL

Reminder that it is not an observability platform; it is just a standard for creating the telemetry data (we still need to do something with it). As I watch this video about instrumenting, and consider having to do that for all our apps, even the ones we don't build ourselves, I abhor the idea of needing to do this, regardless of whether we go the OTEL or the Elastic way.
It looks like, in our time away, OTEL has gained auto-instrumentation for Go, which joins Apache HTTPD, .NET, Java, Node.js and Python.

TLDR for OTEL:
I think we only explored this as an option for creating our telemetry data and didn't go too far into it, as noted in this comment; there are no follow-up issues.

Elastic APM

Doesn't support using Istio traces, though it would accept OTEL data and, of course, whatever format the Elastic instrumentation produces.

@Jose-Matsuda Jose-Matsuda pinned this issue Jul 20, 2023
@Jose-Matsuda Jose-Matsuda unpinned this issue Jul 20, 2023
@Jose-Matsuda
Contributor Author

What are any remaining questions we have?

  • Looking at the Istio distributed tracing overview, I see the following: "Although Istio proxies can automatically send spans, extra information is needed to join those spans into a single trace. Applications must propagate this information in HTTP headers, so that when proxies send spans, the backend can join them together into a single trace." I do not know whether our applications already forward the required headers; maybe this is a try-and-see thing.
  • If this doesn't work out automatically, we will need to consider either changing our apps to forward the required headers or instrumenting our apps to produce, say, OTEL data (whether manually or automatically).
    • In this case, we should determine which apps are the most important / highest priority to modify and then work our way through the rest.
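
A minimal sketch of what "forwarding the required headers" could look like inside an application, assuming a Python service that receives request headers as a dict. The header list is an assumption based on the B3 and W3C trace context headers named in the Istio documentation; the exact set depends on the Istio version and tracer in use, and `headers_to_forward` is a hypothetical helper name.

```python
# Assumed header set: Envoy request id, B3 headers, and W3C trace context.
# Verify against the Istio version running in the cluster before relying on it.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def headers_to_forward(incoming):
    """Return the trace headers from an inbound request that should be
    copied onto every outbound request made while handling it."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "X-Request-Id": "abc-123",
        "X-B3-TraceId": "463ac35c9f6413ad",
        "Accept": "text/html",
    }
    # Only the trace-related headers are kept; "Accept" is dropped.
    print(headers_to_forward(inbound))
```

The returned dict would be merged into the headers of each outbound HTTP call; an app that never does this produces spans that the backend cannot join into a single trace.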

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 24, 2023

Next Steps

I kind of want to see how far we can take Jaeger with a very simple Cassandra deployment, and see what data we get.

Path A Jaeger and Cassandra

  • Set up a Jaeger instance using the allInOne strategy, which uses in-memory storage. This is to get a taste of what Jaeger would look like on our cluster as it is today, without any additional changes.
  • If we don't like it, then we need to evaluate how difficult it would be to add headers to our applications; if that is more complicated than, or the same amount of work as, Path B, we go to Path B at this point.
  • If we like it, then we go ahead and set up a Cassandra database to try to get that working.
    • Proceed with a Jaeger production deployment, destroying the allInOne instance in the process.
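
The two Jaeger stages of Path A could be sketched as manifests generated from Python (kubectl also accepts JSON). This assumes the Jaeger operator's `jaegertracing.io/v1` custom resource with `spec.strategy` and `spec.storage` fields; the instance names, the Cassandra service name and the keyspace are hypothetical, and the field names should be checked against the operator version installed in the cluster.

```python
import json


def jaeger_manifest(name, strategy="allInOne", cassandra_servers=None, keyspace=None):
    """Build a Jaeger custom resource as a plain dict.

    With no storage section the allInOne strategy keeps spans in memory,
    so everything is lost when the pod restarts.
    """
    manifest = {
        "apiVersion": "jaegertracing.io/v1",
        "kind": "Jaeger",
        "metadata": {"name": name},
        "spec": {"strategy": strategy},
    }
    if cassandra_servers:
        options = {"servers": cassandra_servers}
        if keyspace:
            options["keyspace"] = keyspace
        manifest["spec"]["storage"] = {
            "type": "cassandra",
            "options": {"cassandra": options},
        }
    return manifest


if __name__ == "__main__":
    # Step 1: throwaway in-memory instance (hypothetical name).
    trial = jaeger_manifest("jaeger-trial")
    # Step 2: production strategy backed by Cassandra (hypothetical names).
    prod = jaeger_manifest(
        "jaeger-prod",
        strategy="production",
        cassandra_servers="cassandra",
        keyspace="jaeger_v1",
    )
    print(json.dumps(trial, indent=2))
    print(json.dumps(prod, indent=2))
```

Each printed document could be saved to a file and applied with `kubectl apply -f`, once the operator is healthy.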

Path B Otel and Jaeger / Elastic APM

This path is chosen if we have no choice but to instrument our apps and add another sidecar container to them.

  • Make a list of our most important applications to instrument first; hopefully the apps chosen are written in languages that OTel can auto-instrument.
  • Instrument
  • Decide between Jaeger and Elastic APM. If we use Jaeger, we will still need to find a way to store the spans, so Elastic APM may be our choice here.
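
The first step of Path B could be sketched as a small planning helper. The language set is taken from the auto-instrumentation list mentioned earlier in this thread; the app names, languages and priorities below are made up purely for illustration, and `plan_instrumentation` is a hypothetical name.

```python
# Languages the thread lists as having OTel auto-instrumentation.
AUTO_INSTRUMENTABLE = {"go", "apache httpd", "dotnet", "java", "nodejs", "python"}


def plan_instrumentation(apps):
    """Order apps by priority (1 = most important) and mark whether each
    one can rely on auto-instrumentation or needs manual work.

    apps: iterable of (name, language, priority) tuples.
    """
    ordered = sorted(apps, key=lambda app: app[2])
    return [
        (name, "auto" if language.lower() in AUTO_INSTRUMENTABLE else "manual")
        for name, language, _priority in ordered
    ]


if __name__ == "__main__":
    # Hypothetical inventory, not our real application list.
    inventory = [
        ("example-ui", "NodeJS", 2),
        ("example-api", "Python", 1),
        ("example-legacy", "R", 3),
    ]
    for name, mode in plan_instrumentation(inventory):
        print(f"{name}: {mode}")
```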

@Jose-Matsuda
Contributor Author

Closing as reviewed with the team; we intend to focus on OTel with Elastic (we need to make sure we don't run into any licensing issues) or Jaeger.

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Jul 26, 2023

Given above, here are some possible tasks

  • Choose application(s) to instrument with OTel. Maybe we should select two connected apps to make sure that traces are happening: one app may produce a span, but the value comes from connecting two together into a trace.
  • Instrument the application(s) using OTel.
  • Investigate using Elastic APM as a backend, and ensure that we will not run into any licensing issues, as we do not have a license.
  • If the above is good, then integrate with Elastic by sending data from the OTel agent directly into Elastic. This avoids needing a collector to send data to the backend (Elastic/Jaeger); using one doesn't seem to differ significantly from sending the data directly.
  • Once we've confirmed that this works, we can go on to instrument our other applications with OTel and continue to configure them to feed into Elastic / Jaeger.
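
As a sketch of the "send directly to the backend" task, the OTel SDKs and agents read their exporter settings from standard environment variables, so the per-app configuration could be generated like this. The endpoint URL, service name and token are placeholders, and the `Authorization=Bearer ...` header format for Elastic APM is an assumption to confirm against the Elastic documentation.

```python
def otlp_env(service_name, endpoint, secret_token=None):
    """Return the environment variables an OTel-instrumented app would
    need in order to export traces over OTLP straight to a backend."""
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
    }
    if secret_token:
        # Assumed auth scheme for an Elastic APM endpoint; no header is
        # set when the backend (for example a local Jaeger) needs none.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Bearer {secret_token}"
    return env


if __name__ == "__main__":
    # Placeholder values, not real endpoints or credentials.
    env = otlp_env("example-api", "https://apm.example.internal:8200", "REPLACE_ME")
    for key, value in env.items():
        print(f"{key}={value}")
```

If a collector is introduced later, only `OTEL_EXPORTER_OTLP_ENDPOINT` would need to change to point at it, which keeps the apps themselves untouched.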
