
Jetstream tasks should be able to remove tasks #119

Open
bryce-turner opened this issue May 21, 2020 · 3 comments

@bryce-turner
Member

This is a relatively uncommon scenario during production work, but a nice feature would be to be able to remove a task from a rendered workflow.

For instance, if a task is removed or renamed in a new version of the workflow, the old task will still exist in the project. Along the same lines, it might be better if the mash function used an intersection instead of a union. Alternatively, the old task could be labeled as deprecated, so we keep a record of how the previous data was generated but Jetstream never runs it.
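To make the distinction concrete, here is a rough sketch of the two mash behaviors being discussed. This is not the real Jetstream implementation; the function names, the task names, and the idea of representing a rendered workflow as a simple name-to-state dict are all hypothetical simplifications.

```python
# Hypothetical sketch: a rendered workflow as {task_name: state}.
# The real jetstream mash operates on full workflow objects.

def mash_union(existing, new):
    """Assumed current behavior: keep every task ever rendered."""
    merged = dict(existing)   # old tasks survive even if removed/renamed
    merged.update(new)        # new or updated tasks overwrite by name
    return merged

def mash_intersection(existing, new):
    """Proposed behavior: keep only tasks present in the new render,
    carrying over state for tasks that still exist."""
    return {name: existing.get(name, state) for name, state in new.items()}

old = {"align": "done", "old_qc": "done"}
new = {"align": "pending", "variant_call": "pending"}

mash_union(old, new)         # keeps "old_qc" although it was removed
mash_intersection(old, new)  # drops "old_qc"; "align" keeps "done"
```

The union keeps stale records around forever; the intersection cleans them up but, as discussed below, has its own problems.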

@ryanrichholt
Collaborator

I think we're at the point where we need to build some moderately complex test cases in order to understand the potential impacts of changes like this. There is another idea floating around to allow the current backend setting to be used in the template rendering context. There are some potential downsides with that, and maybe they're acceptable, but I don't have a good way of illustrating them right now.

One of the problems that I see here is that a project is not tied to a single workflow. If I have multiple pipelines, I can run them all on a single project:

Pipeline 1 - Runs tasks A, B, and C
Pipeline 2 - Runs tasks D and E
Pipeline 3 - Runs tasks F and G

For some projects I might run Pipeline 1, then Pipeline 2. For others, I might run Pipeline 1, then Pipeline 3. And others I might run all three.

If the mash process removed tasks that were not in the new workflow, it would remove the records of the other pipelines' tasks. It might seem like namespaces for the tasks would fix this, but there would still be problems. The workflow is intended to maintain an accurate record of the state of the project:

Pipeline 1 version 1 - Runs tasks A, B, and C

Pipeline 1 version 2 - Runs A, C, and D, but no longer includes task B for some reason.

If I run version 1 on my project and then run version 2, what is the state of the project? The outputs from task B will still be present. If task B modified the outputs of task A, those effects will still be present. I think it might still need to be accounted for in the project.
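The multi-pipeline problem above can be sketched in a few lines. This illustration uses the same hypothetical dict-based model and task names as before, not the real Jetstream data structures: if mash kept only the intersection with the new render, running Pipeline 2 would discard the records of Pipeline 1's tasks.

```python
# Why intersection-style mashing breaks multi-pipeline projects
# (hypothetical task names and data model).

def mash_intersection(existing, new):
    return {name: existing.get(name, state) for name, state in new.items()}

project_state = {"A": "done", "B": "done", "C": "done"}  # after Pipeline 1
pipeline2 = {"D": "pending", "E": "pending"}

project_state = mash_intersection(project_state, pipeline2)
# The history of tasks A, B, and C is gone, even though their
# outputs still exist in the project directory.
```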

@PedalheadPHX
Member

Does Jetstream support running multiple pipelines on the same project now? In the Phoenix workflow, the task configuration supports turning specific tasks on and off. I'm assuming different pipelines would never run the same tasks? But maybe that's not your intention.

In the example you provide, I would think things are okay. The issue is when the new render does not include a task that previously existed, and that task ends up being reset. In your example, assuming Pipelines 1, 2, and 3 are completely independent and Pipelines 2 and 3 do not depend on an output of Pipeline 1, you would not encounter an issue.

I agree that we need improved test cases with better documentation; the experience with the reset directives makes that very clear.

@ryanrichholt
Collaborator

Yes, it's always been designed to allow multiple pipelines to run on a single project. They cannot run in parallel, because only one runner process can access a project at any time. But you can use several pipelines together in a modular approach the way I described above.

Almost all of the complicated situations stem from a single feature: open access to the project files from any task.

This feature is extremely challenging to get right, because it allows tasks to modify or even delete files that other tasks have created. Still, it's useful to be able to clean up intermediate files as you go, especially when disk space is limited.

With a cloud-enabled backend, this would be nearly impossible to implement. You would need to somehow make the entire project folder available to every worker node executing a task. This is a luxury of an HPC cluster with a shared file system, but not a common setup on cloud platforms.

I hope we can preserve the open access idea, but get better about predicting which tasks need to be reset. I think there are a few ideas that will help:

  • Don't allow tasks to delete/modify any other task outputs. This would require a lot of changes, and it's not an attractive solution to me.

  • Reset directives: these allow you to link tasks together with "upstream dependencies". They're already in place with 1.6 and used in Phoenix, but sometimes they require mental acrobatics to understand.

  • Pipeline-task namespaces: I think there is a decent plan in place for how these would work. This only applies when running multiple pipelines on a single project, so not very relevant to Phoenix.

  • Improvements to the workflow mash function:

    • Update directives for all tasks, but only reset tasks if their identity has changed
    • Better handling of tasks that have been "removed" from new versions of a pipeline.
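The last two ideas could be combined. Below is a sketch, under assumed data structures, of a mash that updates directives for every task, resets only when a task's identity changes, and labels removed tasks as deprecated instead of deleting them. The `identity` hash, the `"state"` field, and the task dicts are all hypothetical; the real Jetstream task model is richer than this.

```python
# Hypothetical mash: update directives, reset only on identity change,
# and mark removed tasks as deprecated (never silently delete them).
import hashlib
import json

def identity(task):
    # Hash only identity-defining fields; other directives (e.g. retry
    # settings) could change without forcing a rerun.
    return hashlib.sha1(json.dumps(task["cmd"]).encode()).hexdigest()

def mash(existing, new):
    merged = {}
    for name, task in new.items():
        old = existing.get(name)
        if old is not None and identity(old) == identity(task):
            # Same identity: take the new directives, keep the old state.
            task = {**task, "state": old.get("state", "pending")}
        merged[name] = task
    # Tasks absent from the new render are retained but labeled.
    for name, task in existing.items():
        merged.setdefault(name, {**task, "state": "deprecated"})
    return merged

existing = {"A": {"cmd": "echo a", "state": "done"},
            "B": {"cmd": "echo b", "state": "done"}}
new = {"A": {"cmd": "echo a", "state": "pending"},
       "C": {"cmd": "echo c", "state": "pending"}}
merged = mash(existing, new)
# A keeps "done" (same identity), C is pending, B becomes "deprecated".
```

This keeps an accurate record of how previous data was generated while still preventing removed tasks from ever running again.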
