Optional instrumentation for recording GraphQL response field lengths in OTel #5199

Closed
wants to merge 17 commits into from

Conversation

tninesling
Contributor

Overview

Adds a new instrumentation config, graphql, which supports a single metric called field.length. When enabled, it publishes the lengths of array fields returned in primary supergraph responses. It is primarily meant to help debug unexpected cost values calculated by the demand control plugin, since any discrepancy in a field's cost is multiplied by the length of the lists in the response.
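To make the idea concrete, here is a minimal sketch of collecting list lengths from a response. It is not the code in this PR: the `Value` enum, `list_lengths`, and `example_response` are hypothetical stand-ins, and the real implementation walks the response together with the query and schema rather than the bare JSON.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON response value.
#[derive(Debug, Clone)]
enum Value {
    Null,
    Scalar(String),
    List(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Record a (field path, length) pair for every list found in the response.
/// List indices are not part of the path, so sibling items share one path.
fn list_lengths(path: &str, value: &Value, out: &mut Vec<(String, usize)>) {
    match value {
        Value::List(items) => {
            out.push((path.to_string(), items.len()));
            for item in items {
                list_lengths(path, item, out);
            }
        }
        Value::Object(fields) => {
            for (name, child) in fields {
                let child_path = if path.is_empty() {
                    name.clone()
                } else {
                    format!("{path}.{name}")
                };
                list_lengths(&child_path, child, out);
            }
        }
        Value::Null | Value::Scalar(_) => {}
    }
}

/// Made-up response: two users, one with two posts and one with none.
fn example_response() -> Value {
    let user = |name: &str, posts: Vec<Value>| {
        let mut fields = BTreeMap::new();
        fields.insert("name".to_string(), Value::Scalar(name.to_string()));
        fields.insert("nickname".to_string(), Value::Null);
        fields.insert("posts".to_string(), Value::List(posts));
        Value::Object(fields)
    };
    let posts = vec![Value::Scalar("p1".into()), Value::Scalar("p2".into())];
    let mut root = BTreeMap::new();
    root.insert(
        "users".to_string(),
        Value::List(vec![user("a", posts), user("b", vec![])]),
    );
    Value::Object(root)
}

fn main() {
    let mut out = Vec::new();
    list_lengths("", &example_response(), &mut out);
    for (path, len) in &out {
        println!("{path}: {len}");
    }
}
```

Each recorded pair would correspond to one published field.length measurement. In this made-up response, `users.posts` is recorded twice (lengths 2 and 0), which is the kind of per-list variation that makes cost estimates drift from actual cost.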

Primary responses only

Note that this implementation does not work for deferred responses. The primary blocker is that we don't currently have a way to zip a response with a query when that response doesn't start at the query root. To make this work, we would need to take the deferred response's JSON path and determine which subsection of the schema to use for the zip procedure.
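As a rough illustration of what that path resolution might look like, the sketch below follows a deferred response's path through a toy schema to find the type where zipping would have to start. Everything here is an assumption for illustration: `PathSegment`, the `Schema` map, and `type_at_path` are hypothetical, and a real version would also need to handle fragments, interfaces, and aliases.

```rust
use std::collections::HashMap;

/// One step in a response path: a field name or a list index.
enum PathSegment {
    Field(String),
    Index(usize),
}

/// Toy schema: type name -> (field name -> named type of that field,
/// with list and non-null wrappers stripped).
type Schema = HashMap<String, HashMap<String, String>>;

/// Follow `path` from `root_type` and return the type it lands on,
/// or None if any field along the way is unknown.
fn type_at_path(schema: &Schema, root_type: &str, path: &[PathSegment]) -> Option<String> {
    let mut current = root_type.to_string();
    for segment in path {
        match segment {
            // A list index selects an element; the named type is unchanged.
            PathSegment::Index(_) => {}
            PathSegment::Field(name) => {
                current = schema.get(&current)?.get(name)?.clone();
            }
        }
    }
    Some(current)
}

/// Made-up schema: Query.users -> User, User.posts -> Post.
fn example_schema() -> Schema {
    let mut schema = Schema::new();
    let fields = |pairs: &[(&str, &str)]| -> HashMap<String, String> {
        pairs.iter().map(|(f, t)| (f.to_string(), t.to_string())).collect()
    };
    schema.insert("Query".into(), fields(&[("users", "User")]));
    schema.insert("User".into(), fields(&[("name", "String"), ("posts", "Post")]));
    schema.insert("Post".into(), fields(&[("title", "String")]));
    schema
}

fn main() {
    let path = vec![
        PathSegment::Field("users".into()),
        PathSegment::Index(0),
        PathSegment::Field("posts".into()),
    ];
    let found = type_at_path(&example_schema(), "Query", &path);
    println!("{found:?}");
}
```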

No support for custom attributes

The other instrumentation configurations support custom metrics using predefined attributes. For example, you can create a custom router metric based on the HTTP response status code. This functionality comes from the custom histogram/attribute/selector framework we've implemented, but this GraphQL field-related code does not seem to fit cleanly into those existing abstractions. In the interest of time, I've settled on creating this one-off metric, which is not extensible and cannot be used in custom metrics.

No support for conditions

One change not included in this PR that we will need to add is support for filtering via conditions. When enabled, this metric is published for every list field across all responses, which can produce far more data than is useful or wanted. The existing conditions implementation is likely incompatible with this metric as-is, because we need to check a given condition for each field in the response when deciding whether to publish. The current conditions setup caches any evaluated condition: once a condition evaluates to true, it is rewritten to a static true condition that is never re-evaluated. We will need an uncached equivalent that can be evaluated several times within a single request pipeline for use with this field length metric. That will come in the next PR.
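The caching problem can be sketched as follows. This is a simplified model, not the router's actual condition types: `CachedCondition` and `UncachedCondition` are hypothetical names, and the "list length greater than 10" predicate is just an example filter.

```rust
/// Hypothetical model of the current behavior: once the check passes,
/// the condition becomes a static `true` and is never re-evaluated.
struct CachedCondition<F: Fn(usize) -> bool> {
    check: F,
    resolved_true: bool,
}

impl<F: Fn(usize) -> bool> CachedCondition<F> {
    fn new(check: F) -> Self {
        Self { check, resolved_true: false }
    }

    fn evaluate(&mut self, list_len: usize) -> bool {
        if !self.resolved_true && (self.check)(list_len) {
            self.resolved_true = true;
        }
        self.resolved_true
    }
}

/// Hypothetical uncached equivalent: evaluated fresh for every field.
struct UncachedCondition<F: Fn(usize) -> bool> {
    check: F,
}

impl<F: Fn(usize) -> bool> UncachedCondition<F> {
    fn new(check: F) -> Self {
        Self { check }
    }

    fn evaluate(&self, list_len: usize) -> bool {
        (self.check)(list_len)
    }
}

/// Evaluate the cached condition once per list field in a response.
fn run_cached(lengths: &[usize]) -> Vec<bool> {
    let mut cond = CachedCondition::new(|len| len > 10);
    lengths.iter().map(|&len| cond.evaluate(len)).collect()
}

/// Evaluate the uncached condition once per list field in a response.
fn run_uncached(lengths: &[usize]) -> Vec<bool> {
    let cond = UncachedCondition::new(|len| len > 10);
    lengths.iter().map(|&len| cond.evaluate(len)).collect()
}

fn main() {
    // Three list fields in one response, with lengths 3, 50, and 2.
    let lengths = [3, 50, 2];
    println!("cached:   {:?}", run_cached(&lengths));   // [false, true, true]
    println!("uncached: {:?}", run_uncached(&lengths)); // [false, true, false]
}
```

With the cached version, the third field (length 2) is published even though it fails the filter, because the second field already flipped the condition to true. The uncached version filters each field independently, which is what the field length metric needs.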


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing [3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@tninesling tninesling requested a review from BrynCooke May 17, 2024 21:15
Contributor

@tninesling, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf

router-perf bot commented May 17, 2024

CI performance tests

  • step - Basic stress test that steps up the number of users over time
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xlarge-request - Stress test with 10 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • no-graphos - Basic stress test, no GraphOS.
  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • const - Basic stress test that runs with a constant number of users

@tninesling
Contributor Author

This was redone in #5215

@tninesling tninesling closed this May 28, 2024
@tninesling tninesling deleted the tninesling/graphql-instruments branch May 28, 2024 15:16