
Worker Identity and the Worker Key #157

Open · escapewindow opened this issue Feb 5, 2020 · 10 comments

@escapewindow (Contributor)

(This is related to #156, but probably needs a few more questions answered.)

I can open an RFC once we have an initial consensus.


The goal is to provide an Artifact Integrity guarantee that a given artifact was generated by a worker under our control.

In this model, the worker manager will provide a key for each provisioned worker.

  1. Worker Manager provides a key to provisioned workers, using cloud provider instance identity
  2. Worker Manager allows for key generation for hardware workers, with documentation on how to protect this key
  3. Worker Manager provides an endpoint to query the public key for a given worker
  4. Worker Manager allows for keeping important worker history, until the artifacts uploaded by those workers expire
  5. Workers use this key to sign the sha256 of the artifact, and submit that signature along with the other artifact metadata.

Keypair

We've gone back and forth between PKI and no PKI. In the PKI model, we would hold an intermediate cert on the Worker Manager and use it to sign each worker cert; verifiers would trust the root cert and validate signatures through the chain of trust. This raises questions around key rotation and revocation that we would need to address if we go this route.

In the non-PKI model, we could generate a small unique keypair, possibly ed25519, per worker instance. As long as the public key is associated with the worker on the Worker Manager, we can verify its signatures. This means we'll need to keep the worker information in Worker Manager as long as we need to verify its artifacts. We also need to decide if we generate the keypair in the Worker Manager and send the private key to the worker, or if we generate the keypair on the worker and send the public key to the Worker Manager.

This is the Worker Key.

We're currently assuming we're going with the non-PKI model.

Cloud provisioned workers

As I understand it, cloud provisioned workers have an identity document from the cloud provider. Once the worker identity is verified, we can store the public key with the rest of the worker information. If the key generation happened on the Worker Manager, we can pass down the private key to the worker.

Hardware workers

The security here will be colo- and subnet-based security. We need some way to add a keypair to the hardware workers, and get the public key into Worker Manager.

Key rotation / reused workerIds

We can generate a new key for every cloud instance, especially if they're short-lived. If we reuse cloud workerIds we need to be able to either return a set of valid public keys, or perhaps add the datetime the artifact was created to the public key request. We may also want to be able to rotate keys on a hardware worker without changing its workerId.

Public Key query endpoint

For the non-PKI solution, the Worker Manager will keep track of each worker's public key(s), and either return the set of valid public keys for a given workerId, or the valid public key for a given datetime.

Preserve important worker history until artifact expiration

For the non-PKI solution, the Worker Manager will need to keep track of the important (read: level 3) workers until their artifacts expire. Likely we'll need to specify which worker pools are "important" in configs, and we'll need a join in postgres to find the latest expiring artifacts uploaded by this workerId.

Artifact content signature

The ContentSha256 of an artifact guarantees that the artifact has not been modified between artifact upload and artifact download. By signing this ContentSha256 with the Worker Key, we also show that the artifact was uploaded by a worker under our control.

@escapewindow (Contributor, Author)

@taskcluster/services-reviewers let me know if you have questions or comments?

@djmitche (Contributor)

I think the join-until-artifact-expiration could be better accomplished by just setting an "expiration" value per workerPool, and setting that to 1 year for level-3 workers (or whatever the maximum artifact lifetime we want is). That saves a join and simplifies the model a bit.

To allow revocation, we could store keys in a separate table and the static provider (for hardware) could allow revocation of keys during creation of new keys. We could allow revocation (but not regeneration) of keys for cloud providers, too. Then each key would have a time-span during which it is valid, and that could be compared to the timestamp of any artifacts it signed.

@escapewindow (Contributor, Author)

> I think the join-until-artifact-expiration could be better accomplished by just setting an "expiration" value per workerPool, and setting that to 1 year for level-3 workers (or whatever the maximum artifact lifetime we want is). That saves a join and simplifies the model a bit.

I think this works, as long as the 1-year expiry is counted from when the most recent task run on that worker has completed... otherwise there will be some window where the artifact exists but its worker's record doesn't. If the worker lives less than 1 day, this may not be a big deal. If a worker lives for months (e.g. hardware), we may have issues unless we refresh the key or rotate workerIds regularly. We may still want to pad this: 1y + max_expected_worker_lifetime should cover it.

> To allow revocation, we could store keys in a separate table and the static provider (for hardware) could allow revocation of keys during creation of new keys. We could allow revocation (but not regeneration) of keys for cloud providers, too. Then each key would have a time-span during which it is valid, and that could be compared to the timestamp of any artifacts it signed.

This made me realize there are two distinct periods: one where the key is valid to sign new artifacts (the lifespan of the worker), and one where the key is retrievable to verify signatures but shouldn't be able to sign anything new (from when the worker goes away until its final artifact expires). I'm not sure how much we should address this: maybe a key-expires or worker-id-expires datestring, similar to taken-until/claim-expires?

@jvehent commented Feb 19, 2020

> Worker Manager provides a key to provisioned workers, using cloud provider instance identity

Could the workers generate the key and pass only the public key to the manager? That would prevent the manager from having access to sensitive key material.

Optional: could we leverage cloud features and use KMSs to hold those keys? That would remove the need to store & operate keys in the workers themselves, and would move the security control to the cloud provider instead. (Caveat: KMSs may not support signing operations).

@escapewindow (Contributor, Author) commented Feb 19, 2020

>> Worker Manager provides a key to provisioned workers, using cloud provider instance identity
>
> Could the workers generate the key and pass only the public key to the manager? That would prevent the manager from having access to sensitive key material.

This is possible, yes. The upside is that the private key would never be transported over the wire or known by the manager, and we don't run the risk of running low on entropy when generating a large number of keypairs (though I'm not sure that is as large a concern with newer crypto as it was with, say, gpg). There is the potential for key reuse or a weaker algorithm on the workers, but we can address that with, say, worker-runner, which can guarantee a specific version of the worker is installed. So yes, let's go with the generate-key-on-worker model.

> Optional: could we leverage cloud features and use KMSs to hold those keys? That would remove the need to store & operate keys in the workers themselves, and would move the security control to the cloud provider instead. (Caveat: KMSs may not support signing operations).

Dustin pointed out that with the generate-key-on-worker model, this is an implementation detail. The cloud worker instance could potentially get the public key from the KMS, and submit that to the worker manager. We'd need to research the KMSs to a) make sure they support signing, and b) find out which signing algorithms they support, because that may influence our decision about what flavor of signing we use in general.

I'm under the impression that KMSs are only an option for cloud instances, and we'll still have to support key generation on the worker for hardware workers, so if we go this route, we'll need to support a hybrid approach.

@jvehent commented Feb 19, 2020

> I'm under the impression that KMSs are only an option for cloud instances, and we'll still have to support key generation on the worker for hardware workers, so if we go this route, we'll need to support a hybrid approach.

Do we build artifacts on hardware workers? I genuinely don't know.

@escapewindow (Contributor, Author)

Yes, we have PGO profiles we generate on hardware, which we download and use to build release builds. I suppose we could determine whether these are low-risk enough to not need worker keys.

@escapewindow (Contributor, Author) commented Feb 22, 2020

More points from discussion Wednesday:

  • Worker Manager worker expiration
    • always have a max lifetime for cloud workers. expiration could be set to 1y + max_lifetime for those workers.
  • Worker Manager key management
    • for hardware workers, new worker manager api endpoints:
      • register new hardware worker
      • existing hardware worker: rotate key
      • existing hardware worker: expire/revoke/delete key (multiple endpoints?)
    • we could have a keys table, that maps many-to-one against the workers table.
      • valid_from, valid_to or not_before, not_after. The latter can be null if the key is currently active.
      • download tool could either get all keys for a given worker, or get a key for a given datestring
      • do we allow for multiple valid keys per worker, or do we expire the old valid key when we create a new valid key?

@escapewindow (Contributor, Author)

Hm. This issue covers 1) taskcluster-provisioned cloud instances, and 2) hardware workers. We have a third type of worker we'll need to cover in the firefoxci cluster: scriptworkers.

The mac signers are hardware, so could follow the pattern for (2). All other scriptworkers are currently docker containers running in k8s. If we're able to handle that in the cloud-provisioned solution, great. Otherwise we may need to use the hardware solution for them, or think of a third way.

@djmitche (Contributor)

I suspect that the worker side of this functionality would be implemented in worker-runner, so it would "just work" for anything that uses the "static" provider. Depending on how dynamic that k8s deployment is, that might be easy or hard :)
