Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AS4 contains code that makes triggering a crashing bug in graphql@16.6 (fixed in 16.7) more likely #7746

Closed
creativecreature opened this issue Oct 2, 2023 · 8 comments · Fixed by #7747

Comments

@creativecreature
Copy link

Issue Description

Hi,

We recently encountered significant production disruptions at my company where containers were crashing. By examination of the logs, we were able to attribute the disruptions to ApolloServerPluginUsageReporting. We swiftly deployed a new build without the plugin, which fixed the issue.

Today, I extracted logs from our ECS cluster to try and find root cause of the disruptions. Below is a snapshot of the acquired logs:

[
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at Function.from (<anonymous>)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at completeValue (/app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:584:23)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at /app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:696:25"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at completeValue (/app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:635:12)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at completeObjectValue (/app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:914:10)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at executeFields (/app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:413:20)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at executeField (/app/node_modules/.pnpm/graphql@16.6.0/node_modules/graphql/execution/execute.js:481:20)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at field.resolve (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/utils/schemaInstrumentation.ts:76:7)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at invokeWillResolveField (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/requestPipeline.ts:390:33)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at invokeSyncDidStartHook (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/utils/invokeHooks.ts:30:6)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at Array.map (<anonymous>)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at /app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/utils/invokeHooks.ts:30:22"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at /app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/requestPipeline.ts:391:33"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at Object.willResolveField (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/plugin/usageReporting/plugin.ts:565:36)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at TraceTreeBuilder.willResolveField (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/plugin/traceTreeBuilder.ts:92:13)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "    at internalError (/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/plugin/traceTreeBuilder.ts:14:10)"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "/app/node_modules/.pnpm/@apollo+server@4.7.3_graphql@16.6.0/node_modules/@apollo/server/src/plugin/traceTreeBuilder.ts:14"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "  return new Error(`[internal apollo-server error] ${message}`);"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "         ^"
  },
  {
    "@timestamp": "2023-09-29 18:18:19.365",
    "@message": "Error: [internal apollo-server error] willResolveField called after stopTiming!"
  }
]

I proceeded by cloning the source code of both graphql-js and apollographql in an effort to understand this issue. It appears to me that the problem arose due to promises resolving subsequent to the invocation of the treeTraceBuilder stop function. The graphql-js library does not seem to abort promises that are being executed with Promise.all.

While preparing to open an issue in the repository, I stumbled upon this existing issue. It seems that the problem was already identified and a solution was deployed. Nonetheless, upon reviewing the code for treeTraceBuilder in main, I’m unable to locate this fix. Furthermore, git blame implies that there hasn’t been any alteration to this line of code for the past three years.

Have there been alterations in version 4.0 that rendered this fix unnecessary, or is there another element I am overlooking?

Really appreciate any help, thanks!

Link to Reproduction

N/A

Reproduction Steps

N/A

@glasser
Copy link
Member

glasser commented Oct 2, 2023

Hi. You're right that the fix that landed in v3.8.2 doesn't appear to be in AS4.

I'm not sure if we intentionally didn't port this to AS4 (which was nearing its first alpha release on a different branch at the time) or if it was an oversight.

My memory (and understanding of my comment) is that the choice made by Apollo Server only actually caused a problem due to a bigger bug in graphql-js: graphql/graphql-js#3528

Basically, even though the "couldn't happen" error could in fact happen, the only case where it happens is a case where a thrown error should be swallowed and ignored due to the behavior of Promise.all anyway. However, the graphql-js bug meant that sometimes the error was not swallowed.

It looks like the graphql-js bug was fixed in v16.7.0 (https://github.com/graphql/graphql-js/releases/tag/v16.7.0) and AS4 requires graphql@16.6.0, which in fact is what you are using.

So you should be able to immediately fix your problem by upgrading graphql-js a tiny bit.

I'm not sure what the best approach should be for us here:

  • Just porting the v3 fix to v4
  • Bumping the graphql-js requirement in a new major version of AS (which I think we are going to do at some point anyway due to Node minimum version requirements)

@creativecreature
Copy link
Author

Hello,

Thank you for the prompt reply.

We integrated this plugin based on the guidance in the documentation. We followed the recommended implementation outlined on that page:

import { ApolloServer } from '@apollo/server';
import { ApolloServerPluginUsageReporting } from '@apollo/server/plugin/usageReporting';

const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    ApolloServerPluginUsageReporting({
      fieldLevelInstrumentation: 0.5,
    }),
  ],
});

If there's a potential for this plugin to throw exceptions that can bring down your server, it would be helpful for such information to be mentioned in the documentation. This would have allowed us to take precautions, such as wrapping the plugin with a try/catch block. Ideally, I think you should add an option that prevents the plugin from throwing exceptions, and I think you should set it to true by default. I can't speak for all of your customers, but we did not expect the usage reporting to be a potential single point of failure.

Thank you for considering our feedback

@glasser
Copy link
Member

glasser commented Oct 3, 2023

As I said, I do think the biggest bug here is the graphql-js bug that was fixed in v16.7. My belief is that upgrading your graphql-js to 16.7 will fix your problem immediately. It is true that AS4's dependencies allow you to use it with v16.6 (which is literally the only published version of graphql-js that is supported by AS4 but has this bug in it), and we should probably deal with this fact by reintroducing the AS3 patch — but if you want to fix your problem today, the real problem is that graphql-js's error handling is broken in v16.6.

We are throwing exceptions in a place that is supposed to be safe to throw exceptions, except that in precisely one supported version of graphql-js, it accidentally isn't.

The reason your server is crashing is that graphql-js is improperly hooking up its Promises, in such a way that Promises that should count as "handled" because they are passed to a Promise.all are not handled. This actually has nothing to do with our plugin at all: it can happen any time that an error is thrown from a "synchronous" resolver when there's also an asynchronous resolver for another field at the same level already running that throws an error. See the test from the PR that fixed the bug: if you run the query { x y } and y's resolver throws synchronously and x's resolver throws asynchronously afterwards (in their test there's no explicit throw but it's a non-null field returning null), you get the same unhandled promise.

The line in our plugin is merely something that makes it more likely that you'll hit this case, because it makes every resolver after a certain point throw. Since the bug is triggered by "at least two fields in parallel throw", it makes it more likely.

Again, we probably should apply the AS3 change to AS4. But even once we do that, you should still upgrade your graphql to at least v16.7, because the line in our plugin is not the only way to trigger this crashing bug!

As an analogy: graphql/graphql-js#3528 is a bomb sitting in the middle of your office. The throw in our plugin is a cute but annoying puppy running around and crashing into things at random. What you're seeing is the puppy unsurprisingly setting off the bomb. And yes, it probably would be a good idea to get rid of the annoying (but cute!) puppy no matter what, since this is a place of business, and it doesn't serve any super useful purpose to have a puppy there. But the most important thing is to get rid of the bomb, because anybody in your office could set it off, not just the high-energy puppy.

@glasser glasser changed the title willResolveField called after stopTiming AS4 contains code that makes triggering a crashing bug in graphql@16.6 (fixed in 16.7) more likely Oct 3, 2023
trevor-scheer added a commit that referenced this issue Oct 4, 2023
…ld` is called "late" (#7747)

Porting #6398 from AS3. This fix was unintentionally left out of AS4.
Fixes #7746

The minimum version of `graphql` officially supported by Apollo Server 4 as a peer dependency, v16.6.0, contains a [serious bug that can crash your Node server](graphql/graphql-js#3528). This bug is fixed in the immediate next version, `graphql@16.7.0`, and we **strongly encourage you to upgrade your installation of `graphql` to at least v16.7.0** to avoid this bug. (For backwards compatibility reasons, we cannot change Apollo Server 4's minimum peer dependency, but will change it when we release Apollo Server 5.)

Apollo Server 4 contained a particular line of code that makes triggering this crashing bug much more likely. This line was already removed in Apollo Server v3.8.2 (see #6398) but the fix was accidentally not included in Apollo Server 4.  We are now including this change in Apollo Server 4, which will **reduce** the likelihood of hitting this crashing bug for users of `graphql` v16.6.0.  That said, taking this `@apollo/server` upgrade **does not prevent** this bug from being triggered in other ways, and the real fix to this crashing bug is to upgrade `graphql`.
---------

Co-authored-by: David Glasser <glasser@apollographql.com>
@trevor-scheer
Copy link
Member

Fixed via #7747

@glasser
Copy link
Member

glasser commented Oct 4, 2023

(but just to be explicit (as you will see from the CHANGELOG in v4.9.4), the actual fix is still to upgrade graphql — upgrading Apollo Server will only reduce the likelihood of triggering this bug.)

@creativecreature
Copy link
Author

I appreciate the thorough explanation. As I mentioned earlier, I understand the underlying issue and how to fix it. I was merely offering my perspective in response to your, perhaps rhetorical, question: "I'm not sure what the best approach should be for us here".

While the occurrence of this "impossible" state might not be attributable to you, I think many of your customers, including us, would prefer the usage reporting to fail silently without bringing down the server.

I mean, there is a possibility that the graphql package could regress. They could, for instance, be working on a new major version in a feature branch and forget to cherry-pick this fix 😉

Thank you for considering our feedback

@glasser
Copy link
Member

glasser commented Oct 5, 2023

Appreciated — I'm glad we got this change in, and I'm also happy that we're now strongly recommending graphql v16.7 in our "migrating to AS4" guide and we'll make sure it's required by AS5. Thanks for bringing this up!

Copy link
Contributor

github-actions bot commented Nov 5, 2023

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
For general questions, we recommend using StackOverflow or our discord server.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants