Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add GraphQL query context based tracing #79

Conversation

toneymathews
Copy link

@toneymathews toneymathews commented Jul 5, 2022

opentelemetry-instrumentation-graphql provides a GraphQLTracer (source) based on GraphQL::Tracing::PlatformTracing (source). This includes several schema level configuration options that can enable/disable specific parts of tracing.

The current options are global, but we could benefit from also having per request configuration. For example, a HTTP header may request verbose traces be recorded, ideally enabling enable_platform_field for only this request’s GraphQL execution. This way a client can be permitted to opt into tracing, without having to enable it for all requests.

To make this possible, the GraphQL execution context seemed like an ideal place to have this request specific configuration. Access to context is now possible as of graphql-ruby 1.13.13 and 2.0.9 (see changelog, added in #4077), meaning we have this option going forward, but potentially need to remain backwards compatible.

In cached_platform_key, the newer versions of graphql-ruby have both context and trace_phase, so we can determine which setting to check and which of platform_field_key, platform_authorized_key, platform_resolve_type_key would be called on a cache miss.

If we were to release a new version of opentelemetry-instrumentation-graphql, how should we handle checking the version of graphql-ruby, either being backwards compatible or having a minimum version?

The CI failures have come up since cached_patform_key now expects 3 arguments (trace_phase being the new one) in newer versions of graphql-ruby.

cc @ravangen

@toneymathews toneymathews marked this pull request as draft July 5, 2022 22:58
@toneymathews toneymathews force-pushed the add-query-context-based-tracing branch from e2ada4d to 0d61e40 Compare July 6, 2022 14:50
@toneymathews toneymathews changed the title [wip] feat: Add GraphQL query context based tracing feat: Add GraphQL query context based tracing Jul 6, 2022
@toneymathews toneymathews marked this pull request as ready for review July 6, 2022 17:42
@plantfansam
Copy link
Contributor

Thanks so much for the contribution @toneymathews! I need to familiarize myself with the changes in graphql-ruby and how cached_platform_key fits into the overall upstream tracing strategy before commenting on that.

On the topic of version compatibility: generally speaking, each instrumentation gem's Instrumentation class, which subclasses Instrumentation::Base, calls a compatible block to determine if the version of the gem being instrumented is compatible with our instrumentation library. I think that our current GraphQL Instrumentation class does not call this block, but it probably should.

If we imagine a world where there is a compatible block in the our GraphQL Instrumentation class, then most likely we'd need to decide if the feature being proposed here is worth breaking compatibility with older versions.

A couple alternatives to that:

  • We could dynamically select different code paths depending on which version of the gem is installed (that sounds like a maintenance headache)
  • There might be some clever way to work around the problem you're facing in this specific instance, which means we might not have to solve the general problem.

The other @open-telemetry/ruby-contrib-maintainers may have some thoughts on a more holistic strategy for managing compatibility, here.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 14, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: toneymathews / name: Toney Mathews (0d61e40, d4fbc74)
  • ✅ login: ravangen / name: Rob Van Gennip (e789111)

@toneymathews
Copy link
Author

toneymathews commented Jul 15, 2022

For ease of explanation, I'll be referring to graphql-ruby versions in the below form.
Old versions: graphql-ruby < 1.13.13 and (> 2 and < 2.0.9)
New versions: graphql-ruby >= 1.13.13 and >= 2.0.9. These have query execution context based tracing.

We have added backward compatibility for old versions in this commit.

However, one test case we’ve added will have different outcomes based on the graphql-ruby version. This test is skipped in old versions.

When “enable_platform_field”, “enable_platform_authorized” and “enable_platform_resolve_type” are enabled(set to true) in configuration, but are disabled(set to false) within query execution context,

  • The corresponding keys would not be traced in new versions of graphql-ruby
  • They would be traced in old versions since we do not check their state in query execution context.

It is detailed in the below table for the tests added

Test scenario for configuration Test scenario for query execution context Whether they are traced or not for new versions Whether they are traced or not for old versions?
Keys are enabled Key are enabled Yes Yes
Keys are enabled Keys are disabled No Yes (test is skipped)
Keys are enabled Key are not set Yes (restores original behavior) Yes
Keys are disabled Keys are enabled No No
Keys are disabled Keys are disabled No No
Keys are disabled Keys are not set No No

At this point, this is supported across different versions in the below manner

Graphql-ruby Opentelemetry-ruby-contrib without PR changes Opentelemtery with the PR changes changes
Old versions Does not matter Supported
New versions Supported Supported

I’ve tested this locally with graphql-ruby 1.13.13 as well. This is not added in the “Appraisals” file. Does that need to be added in this PR?. This is now added within Appraisals as well.

@plantfansam
Copy link
Contributor

Thanks for adding to Appraisals @toneymathews! Will take a look.

@robertlaurin robertlaurin self-assigned this Jul 20, 2022
@robertlaurin
Copy link
Contributor

I've assigned this to myself, I'll work towards giving this a proper thorough review this week. It has a few implications, and unspecified behaviour so we need to be really detailed here.

@toneymathews toneymathews force-pushed the add-query-context-based-tracing branch from 7af15b8 to 0d51457 Compare July 22, 2022 02:40
@robertlaurin
Copy link
Contributor

Hey, so apologies for the delay on the review here.

To be honest I've been going over the high level intention of this PR more than the code itself.

So the simplified high level break down I see here consists of three configurable states per field, I will use platform field for the following example states.

  • Always create a span for the platform field
  • Never create a span for the platform field
  • Create a span for the platform field on a per request basis

So we have on/off/per request. Configuring on and off states is easy as we already have that functionality implemented.

So now to the meat of this PR, per request span creation.

The approach taken makes use of GraphQL specific implementation details to enable the tracing of said field, which I think is a very pragmatic approach.

For "fun" let's add a some distributed tracing flavoured scope creep to this new feature.

Consider the following hypothetical workflow.

//start winded example
I have a frontend client that allows users to place orders. My frontend client uses a GraphQL api to communicate directly with ServiceA. ServiceA is responsible for creating the order and surfacing choices to the frontend client during the order creation flow. Some of the choices provided to the frontend client are sourced from ServiceB which also relies on ServiceC.

I'm a developer debugging this workflow and am trying to understand why for some users the order creation process times out.

So I enabled all the GraphQL spans for ServiceA via the functionality this PR surfaces. I get all the spans, I read my trace, and discover the issue isn't in serviceA. I suspect something bad is happening in ServiceC. How do I proceed? When testing the workflow I'm making requests against ServiceA, making a direct request to ServiceC is sort of tricky and doesn't capture a real world scenario. How do I tell ServiceC to enable all the GraphQL spans?
//end winded example

I'm intentionally trying to lead us to consider using baggage here instead of the GraphQL specific context. It's a mechanism for propagating context across services boundaries using a W3C specced header created exactly for this use case. It would allow us as users to add a baggage header that says something like "baggage: ServiceC=enable_platform_field;" which could signal which application we want to enable high the verbosity request tracing.

Nothing in this PR or my wall of text above is formally specced by OpenTelemetry so we have to be very mindful of how we implement this. We have some latitude to experiment while the instrumentation is not 1.0'd but we have to very mindful of the lasting consequences of our choices here.

This needs further discussion from @open-telemetry/ruby-approvers @open-telemetry/ruby-maintainers

@ravangen
Copy link
Contributor

ravangen commented Jul 27, 2022

Thank you for taking the time to review the suggestion and give it some thought.

GraphQL is not necessarily invoked through an HTTP request, although is the most common scenario. GraphQL being HTTP agnostic means that something needs to tell GraphQLTracer how to read the active execution's configuration.

Practically speaking, I am unclear on an alternative to how the current request object or configuration is proposed to be passed down into GraphQLTracer. Note that context was not usable prior to graphql-ruby < 1.13.13 and >= 2 and < 2.0.9.

HTTP baggage header is an option that can inform what is set in the GraphQL execution context. The current proposed approach is unopinionated, leaving it to the application code to know how to set these optional execution level settings.

It would allow us as users to add a baggage header that says something like "baggage: ServiceC=enable_platform_field;" which could signal which application we want to enable high the verbosity request tracing.

A frontend client may not know the "service name" to put into a baggage header, and instead elect to just say "debug my request" through a basic query argument or request header for example. Some app logic could still transform such a flag into a baggage header for propagation internally.

Regardless, these changes can work with distributed tracing. Configuration can still be propagated across service boundaries using whatever tooling exists today I imagine. GraphQL wouldn't be concerned with this propagation as it is not the mechanism directly calling other services, instead it would be other clients (which are likely instrumented).

@github-actions
Copy link
Contributor

👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this pull request will be closed eventually by the stale bot

@github-actions github-actions bot added the stale Marks an issue/PR stale label Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Marks an issue/PR stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants