Feature to cancel unresolved tasks in task group as soon as one fails #123

Open
petemoore opened this issue Jun 13, 2018 · 7 comments

@petemoore
Member

We should have some way to specify that a given task should be cancelled/aborted if any of a given set of other tasks are resolved as failure/exception.

I haven't thought this through completely, but the use case is this: you push a commit that triggers a CI job, and as soon as one task fails you want to abandon the rest of the task graph, because you will be fixing the issue and pushing a new commit anyway, which will trigger all the tasks again.

This has the potential to reduce compute resource usage massively, maybe even by orders of magnitude.

What I'm not sure about is:

  1. how/where to specify this (since we don't submit a task group, we don't have task group settings, only task settings)
  2. what to do about intermittent tasks (maybe we need to exhaust retries before cancelling all tasks in task group)

But let's start discussing this!

petemoore self-assigned this Jun 13, 2018
@djmitche
Contributor

I think Travis-CI implements this with some kind of active process which reacts to specific events (e.g., a new push to a PR) and cancels other, related tasks.

Taskcluster-Github could potentially do the same sort of thing, perhaps based on settings in .taskcluster.yml in the master branch.

@petemoore
Member Author

I like this too, but I think this could be a parallel strategy, as this would only take effect once the author had resolved the issue and pushed again. It wouldn't help so much in the case they make a push and then go to bed. I would like it to be possible for them to decide when they push (or even at some point afterwards) that unresolved tasks in the group can be cancelled if one of them fails, so this can happen in their absence.

@petemoore
Member Author

Maybe this should be a standalone service listening to pulse, with an endpoint to register a task group id together with a cancellation policy. This way our core services are unaffected so we have a pretty nice separation of concerns. It also means the decision to apply a cancellation policy to a task group can be made after all tasks have been submitted, and guarded by scopes such that users other than the author can choose to apply a cancellation policy to a given task group (such as sheriffs).

The cancellation policy would define the criteria under which the task group's unresolved tasks get cancelled.
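
One way to picture such a policy is as a small record plus a pure function that the service would evaluate each time a task in the group resolves. This is only a sketch: `CancellationPolicy`, its fields, and `should_cancel_group` are hypothetical names invented for illustration, not part of any Taskcluster API. The only thing taken from the queue is the pair of unsuccessful resolutions, `failed` and `exception`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CancellationPolicy:
    """Hypothetical policy registered for one task group."""

    task_group_id: str
    # Resolutions that count as a trigger (assumed default: failed or exception).
    trigger_states: frozenset = frozenset({"failed", "exception"})
    # Number of triggering resolutions required before the group is cancelled.
    threshold: int = 1


def should_cancel_group(policy, task_group_id, resolved_states):
    """Return True once enough tasks in the group have resolved badly.

    resolved_states is the list of final task states seen so far for the group.
    """
    if task_group_id != policy.task_group_id:
        return False
    bad = sum(1 for state in resolved_states if state in policy.trigger_states)
    return bad >= policy.threshold
```

A `threshold` above 1 is one possible way to tolerate an occasional intermittent failure before giving up on the whole group; whether that is the right knob is an open question.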

@owlishDeveloper
Contributor

I think this is a very good idea.

As for taskcluster-github, I can implement a flag in .taskcluster.yml. When it's set, in the result status handlers I can get the list of all the tasks in the taskGroup and call queue.cancelTask for each of them, and then report failure in Statuses (in Checks, cancellation of the task will result in Neutral status, I believe, except for the one that failed).
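
The handler logic described here could look roughly like the sketch below. It assumes a `queue` object exposing `listTaskGroup(taskGroupId, query=...)` and `cancelTask(taskId)` in the shape of the Taskcluster queue client, and that each listed entry carries `status.taskId` and `status.state`. The function name and the `skip_task_id` parameter are made up for this example, and the flag in .taskcluster.yml is not shown.

```python
# Task states that have not yet been resolved.
UNRESOLVED_STATES = {"unscheduled", "pending", "running"}


def cancel_unresolved_tasks(queue, task_group_id, skip_task_id=None):
    """Cancel every unresolved task in the group and return the cancelled ids.

    `queue` is assumed to provide listTaskGroup and cancelTask; any object
    with those two methods works, which keeps the sketch easy to test.
    """
    cancelled = []
    query = {}
    while True:
        page = queue.listTaskGroup(task_group_id, query=query)
        for entry in page.get("tasks", []):
            status = entry["status"]
            task_id = status["taskId"]
            if task_id == skip_task_id:
                continue  # leave the task that triggered the cancellation alone
            if status["state"] in UNRESOLVED_STATES:
                queue.cancelTask(task_id)
                cancelled.append(task_id)
        # Follow pagination until the listing reports no further pages.
        token = page.get("continuationToken")
        if not token:
            return cancelled
        query = {"continuationToken": token}
```

Tasks that are already resolved are skipped, so only work that is still queued or running gets cancelled.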

@owlishDeveloper
Contributor

For Checks, we can also implement a custom action button to cancel a task or the whole taskGroup (custom actions are defined on check-runs), so that it is possible to cancel the build manually if needed.

@djmitche
Contributor

I think we'd want to support this for tasks in general -- not specifically in tc-github.

We have a "cancel" action defined in the action spec. We could easily also define a cancel-all action (and have done for Firefox).

I like the idea of an external service. It could listen to some specific route that is added only to tasks which should cancel others. Then there's no need to pre-register a taskgroup -- if a message arrives, the service does its thing. Perhaps a slightly more general approach where the service can run a named action when the task is resolved with a specific reason.
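
To make the route-based idea concrete, here is a minimal sketch of the matching step such a service might perform when a resolved-task message arrives. Everything in it is an assumption for illustration: the function name, the shape of `rules`, and the example route and action names are invented and do not correspond to existing Taskcluster routes or actions.

```python
def actions_to_run(task_routes, resolution, rules):
    """Return the names of actions whose route and resolution both match.

    rules maps a route (hypothetical, e.g. "cancel-group.on-failure") to a
    pair: (set of resolutions that trigger it, name of the action to run).
    """
    selected = []
    for route in task_routes:
        rule = rules.get(route)
        if rule is None:
            continue  # this route is not one the service reacts to
        resolutions, action = rule
        if resolution in resolutions:
            selected.append(action)
    return selected


# Hypothetical rule table: tasks carrying this route trigger "cancel-all"
# when they resolve as failed.
EXAMPLE_RULES = {"cancel-group.on-failure": ({"failed"}, "cancel-all")}
```

Because the rule is carried by the task's own routes, nothing has to be registered ahead of time; a task without a matching route simply produces an empty list.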

There are security concerns here -- you're allowing anyone who can create a task in a taskgroup (which requires only queue:scheduler-id:<schedulerId> matching the schedulerId of the taskGroup, and we only have a few schedulerIds in practice) to run potentially any action. You'd need some way to limit the scopes available for those actions, and the actions runnable on a taskGroup.
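
A rough sketch of one such limit: the service only runs an action if it is on an allowlist for the task group and every scope the action needs is covered by the scopes granted to the service. The `satisfies` helper follows the usual Taskcluster rule that a scope ending in `*` covers any scope sharing that prefix. `action_allowed` and its parameters are hypothetical, and the scope strings in the usage note are illustrative only.

```python
def satisfies(held_scopes, required_scope):
    """Return True if any held scope covers required_scope.

    A held scope matches exactly, or by prefix when it ends with "*".
    """
    for scope in held_scopes:
        if scope == required_scope:
            return True
        if scope.endswith("*") and required_scope.startswith(scope[:-1]):
            return True
    return False


def action_allowed(action_name, allowed_actions, action_scopes, granted_scopes):
    """Allow an action only if it is allowlisted for the task group and all
    scopes it requires are covered by the scopes granted to the service."""
    if action_name not in allowed_actions:
        return False
    return all(satisfies(granted_scopes, scope) for scope in action_scopes)
```

For example, with `granted_scopes = ["queue:cancel-task:my-scheduler/*"]`, an allowlisted `cancel-all` action that needs a scope under that prefix would pass, while any action that is not allowlisted, or that needs a scope outside the prefix, would be refused.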

It would be interesting to see that prototyped outside of the Taskcluster services, to see if it's useful before incorporating it. For example, it's turned out that superseding isn't that useful, without ever building a superseding service for ourselves.

@owlishDeveloper
Contributor

owlishDeveloper commented Feb 18, 2019

> prototyped outside of the Taskcluster services

What would that look like?

> I think we'd want to support this for tasks in general -- not specifically in tc-github.

I didn't mean that at all. It's just that in tc-gh this seems to me a fairly simple thing to implement, so we could cut computation cost at least on the GitHub side fairly quickly. If we later come up with a service and the way tc-gh works interferes with it, tc-gh can be fixed.
