Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel resolver tasks if execution of an operation is terminated #217

Open
kkom opened this issue Mar 22, 2024 · 7 comments
Open

Cancel resolver tasks if execution of an operation is terminated #217

kkom opened this issue Mar 22, 2024 · 7 comments
Labels
investigate Needs investigaton or experimentation

Comments

@kkom
Copy link

kkom commented Mar 22, 2024

TL;DR

Would it be possible to cancel outstanding resolver tasks as soon as before execution of a GraphQL operation is terminated? Outstanding / unused tasks can happen as a result of an incoercible field resolver error.

These outstanding tasks can run well after lifecycle hooks responsible for cleaning up their context have completed. As a result, they often fail and cause a lot of noisy errors for us.

More details

Hello!

I'm coming from the strawberry-graphql project. We're running into a problem with abandoned resolver tasks not being cancelled, and thus failing in unpredictable ways.

Strawberry has an extensions framework which allows you to wrap the execution phase of your operation processing in hooks – running code before and after execution of resolvers. We use these hooks to manage (set up and tear down) some state needed by the resolvers.

However, when a resolver of a non-nullable field fails, it may lead to the results of some other resolvers being no longer necessary. It appears that graphql-core will then short-circuit returning a result, while the no-longer-necessary resolvers are still running. We see that they can continue running well after the HTTP response was returned.

The problem is that Strawberry's lifecycle hooks (and I also imagine - the webserver's) would have completed by that time. As a result, we do things like terminate the DB session – making it very likely that these outstanding tasks will fail with various esoteric errors. This causes a lot of noise in our observability suite – which we'd love to avoid.

See strawberry-graphql/strawberry#3414 for more details, and strawberry-graphql/strawberry#3414 (comment) for the comment which recommends I report the issue here.

@Cito
Copy link
Member

Cito commented Mar 22, 2024

Yes, that's something I also wanted to look into, but not sure when I will find the time.

Meanwhile, if anybody has suggestions or ideas how to tackle this, please comment.

@kkom
Copy link
Author

kkom commented Mar 23, 2024

Thanks so much for the quick response @Cito!

Are there any thoughts / pointers you can share yourself?

Don't want to make any promises, but I might be able to give fixing this a try.

@kkom
Copy link
Author

kkom commented Mar 23, 2024

Specifically I would appreciate it if you could point me to the parts of the code which do the collection of results and skipping of execution when an incoercible error happens -- maybe I'll have some ideas how to embed task cancellation in there.

@Cito
Copy link
Member

Cito commented Mar 23, 2024

Hi @kkom. One idea was to make use of task groups. I also wanted to explore the idea of using any.io/trio as suggested in #166.

The parts that you need to look into are mostly inside the execute.py module, particularly where handle_field_error is called.

A good strategy would be to first create the most minimalist example you can think of where this problem occurs, and some way to cleanly test for the desired behavior in a unit test.

Then you can make experiments, and check the code flow in a debugger.

Please note the constraints in this project: It is meant to be a more or less 1:1 port of GraphQL.js to Python. There should not be any groundbreaking changes where the code deviates from how it works and is structured in GraphQL.js. Keeping the code base very close to GraphQL.js is the only way to continually port the current development there back to GraphQL-core.

@kkom
Copy link
Author

kkom commented Mar 24, 2024

Thanks @Cito !

I have a few more questions:

  1. Task groups are a Python 3.11 feature, and Python 3.10 won't be deprecated until December 2026.

  2. Since I saw a mention of sorting the error list to make it deterministic – FYI, I did find some non-determinism in errors.

    • The speed with which coercible and incoercible errors occur affects the contents of errors.
    • Say you have two fields that fail – one nullable and one not.
    • If the nullable one fails first – you'll see two error entries.
    • But if the non-nullable one fails first – we won't include the error for the nullable one.
    • It may not be a huge deal – just sharing for your awareness.
  3. I assume that the short-circuiting behaviour of abandoning resolvers as soon as we know their results are not needed is intended and highly desired. Is this correct?

    • I can imagine a world where the server would simply wait for all resolvers to return their results. Increasing latency, but making things simpler (or maybe not ¯\_(ツ)_/¯...)
    • In the project where my company is using GraphQL we'd (probably) not mind the server working like this - if it meant not having to deal with the abandoned tasks problems.
    • Doing this would remove the non-determinism from point 2 too!

@kkom
Copy link
Author

kkom commented Mar 24, 2024

Ah, I think this thing you said answers 3:

Please note the constraints in this project: It is meant to be a more or less 1:1 port of GraphQL.js to Python. There should not be any groundbreaking changes where the code deviates from how it works and is structured in GraphQL.js. Keeping the code base very close to GraphQL.js is the only way to continually port the current development there back to GraphQL-core.

Still curious about your perspective on points 2 & 3, and if (to your knowledge) the same non-determinism in returned errors is present in GraphQL.js

@Cito
Copy link
Member

Cito commented Mar 25, 2024

I think the short-circuit behavior is the one wanted by GraphQL.js.

Yes, we should stay backward compatible, but I think it would be ok if certain features such as task cancellation work only in newer Python versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
investigate Needs investigaton or experimentation
Projects
None yet
Development

No branches or pull requests

2 participants