Allow tasks to depend on artifacts #121

Open
petemoore opened this issue Jun 13, 2018 · 1 comment
petemoore commented Jun 13, 2018

Tasks depending on artifacts

The predominant reason a task depends on another task is that it consumes one or more of that task's artifacts.

Currently the dependency relationship between tasks is defined at a task level; a task simply depends on other tasks. By augmenting the concept to include artifact dependency (i.e. a task can depend on specific artifacts of other tasks), we achieve some benefits:

  • The Queue won't schedule a task that depends on an artifact that didn't get produced. This saves resources: today an upstream task can resolve successfully without producing the required artifact (for example, because it declares directory artifacts rather than file artifacts), and the downstream task is scheduled anyway.
  • The Queue could potentially enforce that if a task depends on an artifact of another task, and that other task's definition doesn't declare that it produces the artifact, the dependent task is resolved as exception/malformed-payload. This isn't 100% watertight, since some tasks enable the taskcluster-proxy feature and publish artifacts explicitly at runtime, but we could consider introducing some declarative requirements here if a downstream task relies on your artifacts. The advantage is that you get an upfront task exception when submitting to the Queue, rather than waiting for your task to fail at runtime after it has been scheduled.
  • The Queue could schedule a downstream task as soon as all the artifacts it depends on have been published. If the source task later resolves as exception/failure, the downstream task could be resolved as exception/<something> (resolution suggestions welcome!). This is much like a buildbot optimisation we used to have, which allowed some downstream tasks to run proactively even if the parent job hadn't completed yet; it is typically referred to as speculative execution.
  • Relationships between tasks can be better understood by observers and tooling.
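To make the first three points concrete, here is a minimal sketch of the checks the Queue might perform. Everything in it is an assumption for illustration: the `Upstream` view, the `Status` values and both function names are hypothetical and are not part of the Queue today. It also assumes a plain task dependency requires the upstream task to complete successfully.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    READY = "ready"                  # downstream task may be scheduled
    WAITING = "waiting"              # keep waiting on the upstream task
    UNSATISFIABLE = "unsatisfiable"  # upstream resolved without what we need


@dataclass
class Upstream:
    """Hypothetical view of an upstream task as the Queue might see it."""
    resolved: bool = False                       # task reached a final state
    completed: bool = False                      # final state was success
    declared: set = field(default_factory=set)   # artifacts named in its definition
    published: set = field(default_factory=set)  # artifacts that exist so far


def undeclared_artifacts(required, upstream):
    """Submission-time check: required artifacts the upstream never declares.

    A non-empty result is where exception/malformed-payload could be raised.
    """
    return sorted(set(required) - upstream.declared)


def dependency_status(required, upstream):
    """Scheduling-time check for a single dependency entry."""
    if not required:
        # Plain task dependency: wait for the upstream task to resolve.
        if not upstream.resolved:
            return Status.WAITING
        return Status.READY if upstream.completed else Status.UNSATISFIABLE
    missing = set(required) - upstream.published
    if not missing:
        # Speculative execution: may be true before the upstream resolves.
        return Status.READY
    # Once the upstream has resolved, a missing artifact will never appear.
    return Status.UNSATISFIABLE if upstream.resolved else Status.WAITING
```

A downstream task would be schedulable once every one of its dependency entries reports `Status.READY`; what to do with a task that was scheduled speculatively and whose upstream later fails is left open, as in the third point above.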

Currently a task dependency is defined as follows:

        "dependencies": {
            "title": "Task Dependencies",
            "description": "List of dependent tasks. These must either be _completed_ or _resolved_\nbefore this task is scheduled. See `requires` for semantics.\n",
            "type": "array",
            "items": {
                "title": "Task Dependency",
                "description": "The `taskId` of a task that must be resolved before this task is\nscheduled.\n",
                "type": "string",
                "pattern": "^[A-Za-z0-9_-]{8}[Q-T][A-Za-z0-9_-][CGKOSWaeimquy26-][A-Za-z0-9_-]{10}[AQgw]$"
            },
            "maxItems": 100,
            "uniqueItems": true
        },

I propose this should be changed to something like:

        "dependencies": {
            "title": "Task Dependencies",
            "description": "List of dependent tasks (and optionally artifacts of those tasks). See `requires` for semantics.\n",
            "type": "array",
            "items": {
                "oneOf": [
                    {
                        "title": "Task Dependency",
                        "description": "The `taskId` of a task that must be resolved before this task is\nscheduled.\n",
                        "type": "string",
                        "pattern": "^[A-Za-z0-9_-]{8}[Q-T][A-Za-z0-9_-][CGKOSWaeimquy26-][A-Za-z0-9_-]{10}[AQgw]$"
                    },
                    {
                        "title": "Artifacts Dependency",
                        "description": "The `taskId` of a task that this task depends on. When no artifacts are specified, the task must be resolved before this task is\nscheduled. When artifacts are specified, only those artifacts need to exist before this task is scheduled.",
                        "type": "object",
                        "properties": {
                            "taskId": {
                                "type": "string",
                                "pattern": "^[A-Za-z0-9_-]{8}[Q-T][A-Za-z0-9_-][CGKOSWaeimquy26-][A-Za-z0-9_-]{10}[AQgw]$"
                            },
                            "artifacts": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            }
                        },
                        "required": [
                            "taskId"
                        ],
                        "additionalProperties": false
                    }
                ]
            },
            "maxItems": 100,
            "uniqueItems": true
        },
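Since the proposed `dependencies` list mixes plain `taskId` strings with objects, consumers would probably want to normalise it into one shape. The following is a rough sketch of how that might look; `normalize_dependencies` is a hypothetical helper, the two task IDs are made-up values that merely satisfy the pattern, and `public/build/target.tar.gz` is an illustrative artifact name.

```python
import re

# Pattern copied from the schema above.
TASK_ID = re.compile(
    r"^[A-Za-z0-9_-]{8}[Q-T][A-Za-z0-9_-][CGKOSWaeimquy26-][A-Za-z0-9_-]{10}[AQgw]$"
)


def normalize_dependencies(dependencies):
    """Turn a mixed list into [(taskId, [artifact, ...]), ...].

    A plain string becomes a task-level dependency with no artifacts.
    """
    normalized = []
    for entry in dependencies:
        if isinstance(entry, str):
            task_id, artifacts = entry, []
        elif (
            isinstance(entry, dict)
            and "taskId" in entry
            and set(entry) <= {"taskId", "artifacts"}  # additionalProperties: false
        ):
            task_id = entry["taskId"]
            artifacts = list(entry.get("artifacts", []))
        else:
            raise ValueError(f"malformed dependency entry: {entry!r}")
        if not isinstance(task_id, str) or not TASK_ID.fullmatch(task_id):
            raise ValueError(f"invalid taskId: {task_id!r}")
        normalized.append((task_id, artifacts))
    return normalized


# Made-up IDs, built piecewise so they visibly satisfy the pattern.
BUILD_TASK = "A" * 8 + "QAC" + "A" * 10 + "A"
SIGNING_TASK = "B" * 8 + "RBG" + "B" * 10 + "Q"

deps = normalize_dependencies([
    BUILD_TASK,  # old style: wait for the whole task
    {"taskId": SIGNING_TASK, "artifacts": ["public/build/target.tar.gz"]},
])
```

Existing task definitions that list only strings keep working unchanged under this reading, which is the main reason for accepting both forms.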

Example future use case (out of scope for this RFC) - tasks depending on "live" artifacts

Another thing we could do after implementing this is allow tasks to depend on live artifacts (such as the livelog). This would warrant a separate RFC, but I'd like to demonstrate the idea here to show how introducing artifact dependency opens the door to other platform features and optimisations. If we decide we would like live artifacts, we can create a separate RFC at that time.

In a shell it is very common to pipe the output of one process into the input of another. We currently have no way to chain tasks together like this so that they run concurrently. However, if tasks can depend on artifacts, one task could depend on a "live" artifact of another, so that one task streams data into the next while both run. This could also be useful for isolating security contexts; one task could be e.g. a locked down script-worker worker, and another task could be a "run-what-you-like" docker-worker worker. Proxy components (like taskcluster-proxy) could then simply be other locked down workers, and people could create their own proxies. There are a myriad of pretty cool things you could do once you can chain concurrent tasks together.

In the case where the consuming task begins when the producing task has already been running a while, there should be no problem, since live artifacts (such as livelog) would be buffered while running and persisted on task resolution; so at no point would the consuming task "lose" streamed data - the full history is available until the artifact expires (which should be long after the deadline of the consuming task).
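The buffering argument can be illustrated with a toy model. This is only a sketch under the assumptions stated in the paragraph above: `LiveArtifact` is a hypothetical in-memory stand-in, not an existing Taskcluster API, and it ignores networking, persistence and expiry entirely.

```python
class LiveArtifact:
    """Toy model of a buffered live artifact such as a livelog."""

    def __init__(self):
        self._chunks = []    # everything written so far is retained
        self.closed = False  # True once the producing task has resolved

    def append(self, chunk: bytes) -> None:
        """Called by the producing task while it runs."""
        if self.closed:
            raise ValueError("artifact is closed")
        self._chunks.append(chunk)

    def close(self) -> None:
        """Called on task resolution; the buffered data stays readable."""
        self.closed = True

    def read_from(self, offset: int):
        """Return (data, new_offset) for all bytes at or after `offset`."""
        data = b"".join(self._chunks)[offset:]
        return data, offset + len(data)


log = LiveArtifact()
log.append(b"step 1\n")
log.append(b"step 2\n")

# The consumer attaches late, yet starts at offset 0 and sees the full history.
history, offset = log.read_from(0)

log.append(b"step 3\n")
log.close()

# Resuming from the saved offset yields only the data written since.
tail, offset = log.read_from(offset)
```

Because the consumer tracks its own offset, it does not matter whether it attaches before, during or after the producer's run, as long as the artifact has not expired.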

petemoore self-assigned this Jun 13, 2018

@djmitche

This sounds related to #89
