RFD 77 Hardware-backed per-zone crypto tokens

1. Introduction

While most existing access control for Triton and Manta focusses on authenticating users and authorizing their actions, there are quite a few contexts in which authenticating individual instances (zones) on the system is useful.

This particularly applies in the context of core service zones in Triton, where it would be desirable for each service to be able to authenticate itself to any other service, giving a path forward to eliminate the special "trusted" status of all hosts on the admin VLAN.

This kind of authentication is generally most easily done through cryptography, which is what this document will propose.

To store cryptographic key material securely with a machine, in a way that is highly resistant to compromise, the state of the art is to make use of a segregated hardware device (a "token"). The token stores the key material internally and will not reveal it to the host (and is generally a physically tamper-resistant device, making it extremely difficult to recover the key material from it without destroying it). The host machine may request the token to take particular actions with the key material, such as signing a given stream of data, encrypting a block of data, or computing a hash function.

This kind of secured credential storage would be useful not just for authentication, but also for the protection of data at rest. As a result, this proposal also includes provisions for supporting this use case (though implementing the on-disk data encryption is delegated to ZFS).

2. Definitions

CN

A "CN" here is any physical node in the Triton system.

Trusted CN

A "trusted" CN is one that is trusted to run components of the Triton system itself (e.g. VMAPI, NAPI, CloudAPI etc)

Self-booting CN

A "self-booting" CN is one that can boot entirely standalone, without any other machine in the datacenter or network access being available.

Headnode

A "headnode" is a term for a CN that is both trusted and self-booting. Any Triton datacenter needs to have at least one such node.

Core service

A "core service" is an internal Triton service which runs in a zone, such as VMAPI or CNAPI. These are often referred to as "headnode services" today, but we will avoid that term here to reduce confusion with the last definition.

ECDH

We will use the term "ECDH" to refer to the elliptic-curve Diffie-Hellman key agreement protocol, as specified by NIST SP 800-56A (specifically the EC cofactor/CDH variant of the protocol). It is normally implied that the output of the protocol is processed by a cryptographic hash function before further use.

DH box

The term "Diffie-Hellman box" or "DH box" is used here to refer to an encrypted container for data, encrypted and authenticated using a symmetric key derived by performing ECDH. One of the key pairs involved in the ECDH protocol for a box is ephemeral, while the other is static. The private half of the ephemeral key pair is securely discarded after performing ECDH. Such a box can be constructed by anyone in possession of the recipient’s public key, but can only be decrypted by the owner of the private key.
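
As a concrete illustration, the following sketch constructs and opens such a box using the Python cryptography package. P-256 ECDH, HKDF-SHA256 and ChaCha20-Poly1305 are stand-ins here, not necessarily the algorithms the implementation mandates.

[source,python]
----
# Illustrative sketch only: algorithm choices are stand-ins for this document.
import os
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def seal_box(recipient_pub: ec.EllipticCurvePublicKey, plaintext: bytes):
    """Anyone holding the recipient's public key can seal a box."""
    eph = ec.generate_private_key(ec.SECP256R1())        # ephemeral key pair
    shared = eph.exchange(ec.ECDH(), recipient_pub)      # ECDH agreement
    key = HKDF(hashes.SHA256(), 32, salt=None, info=b"dh-box").derive(shared)
    nonce = os.urandom(12)
    ct = ChaCha20Poly1305(key).encrypt(nonce, plaintext, None)
    eph_pub = eph.public_key().public_bytes(
        serialization.Encoding.X962, serialization.PublicFormat.UncompressedPoint)
    # The ephemeral private key now goes out of scope and is discarded.
    return eph_pub, nonce, ct

def open_box(recipient_priv: ec.EllipticCurvePrivateKey, eph_pub, nonce, ct):
    """Only the holder of the recipient private key (e.g. the hardware token)
    can perform this ECDH and recover the contents."""
    peer = ec.EllipticCurvePublicKey.from_encoded_point(ec.SECP256R1(), eph_pub)
    shared = recipient_priv.exchange(ec.ECDH(), peer)
    key = HKDF(hashes.SHA256(), 32, salt=None, info=b"dh-box").derive(shared)
    return ChaCha20Poly1305(key).decrypt(nonce, ct, None)
----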

3. Threat models and goals

3.1. Authentication

The core threat models (and containment goals) for the authentication scheme proposed here are as follows:

Threat 1

An intruder has escalated their network access into the admin VLAN, by compromise or poor configuration of network equipment or non-Triton resources.

Goal 1

The goal is to give this attacker no access to Triton resources. They may make a read-only request for a boot image for a new CN which will contain no special credentials, but no more. They may be able to carry out denial of service attacks on the admin VLAN, but these are out of scope for this design.

Threat 2

An intruder has escalated their network access into the admin VLAN, by compromise of an ordinary (not "trusted") compute node (privilege escalation and zone escape).

Goal 2

The goal is to give this attacker only the minimum access required for the normal operation of the CN. They will be able to control other zones on that CN, as well as the information reported about them back to the rest of Triton. They will under no circumstances be able to gain control of a trusted CN from this position. Their access to the system can be terminated by revoking the credentials of the CN, they cannot extract any long-lived key material, and cannot take any actions that would escalate or allow sideways movement into other CNs.

Threat 3

An intruder has taken control of a public-facing core service (e.g. CloudAPI), by making use of a vulnerability in that service.

Goal 3

The goal is to give this attacker only the minimum access required by the normal operation of that service. This means, for example, that CloudAPI would not be able to run arbitrary commands on CNs or directly interface with CN agents, or connect directly to the PostgreSQL database (since such access is not needed for its normal operation).

3.2. Encryption at rest

For the encryption of data at rest, the primary threat model is as follows:

Threat 1

An intruder gains physical possession of disks and/or hardware from a CN, either by post-disposal acquisition ("dumpster diving"), or outright physical theft.

Goal 1

The goal is to give the attacker no ability to read any customer data on the disks or (in the case of a disposed CN) any ability to use the credentials of the CN to gain access to Triton resources. If a stolen CN is powered up at the time of theft, it is possible that customer data can be read, but if powered down, no data access will be possible.

3.3. Customer-facing features

This design also seeks to provide 4 key customer-facing features:

Feature 1

The ability to use a provisioned instance/zone/VM in a customer account as an authentication principal to Triton (and other Triton-aware) services.

Goal 1

The credentials of this principal should not be able to be permanently compromised by an attacker who has full control of a customer zone (i.e. they must not be able to access key material).

Feature 2

The ability to have customer-provisioned instances authenticate to each other (both within a datacentre and between them) using credentials provided by Triton itself.

Goal 2

The credentials used for this authentication should not be able to be permanently compromised by an attacker who has full control of a customer zone.

Feature 3

The ability to implement a secure data store protected by hardware symmetric keys within a zone.

Goal 3

If an attacker compromises a customer zone storing N items of data protected by this mechanism, they should have no choice but to make N individual round trips through a (rate-limited) hardware module in order to decrypt them. If the attacker compromises an entire live Triton CN (including the contents of RAM) with M zones on it, they should have no choice but to make at least M round trips through a hardware module (or perform computation taking at least as long) in order to access customer data so protected.

NOTE

Goal 3 explicitly does not include absolute defense of this data against an attacker who has complete control of the OS kernel for an arbitrarily long period. It does, however, set a minimum amount of time an attacker must be present with such control in order to break the security of protected storage on the machine: the attacker must spend at least as long there as it would take to make N trips through the hardware module.

Feature 4

The ability to provision instances onto encrypted datastores.

Goal 4

It is not required that every CN in a Triton install have encrypted local storage (though having all CNs use encrypted zpools is certainly an allowed configuration). If a customer decides that an instance will contain information that must be protected while at rest (i.e. encrypted), they should be able to guarantee that the instance is either provisioned on a CN with an encrypted zpool, or that the provision request fails when the encryption requirement cannot be met. It is essential that an instance requesting encryption is never allowed to provision onto an unencrypted CN.

4. Design

The central component of the design is the credential storage device. Since many components of our threat model and goals are on a per-CN basis, we want a device that can be deployed with (or ideally, inside) every CN. This implies that:

  • The device must be inexpensive (at least, relative to expected cost of CN hardware);

  • The device must be capable of storing credentials both for at-rest encryption and for authentication; and

  • The device must not require invasive modification to current-generation x86 server hardware.

Most commonly, cryptographic token devices obey an API similar to PKCS#11, which is primarily focussed on public/private asymmetric cryptography. Devices that only implement asymmetric cryptography are suitable for storing authentication credentials, but do not always fit as well in a design that wants to store credentials for at-rest encryption. A notable exception is devices that support a key agreement scheme like Diffie-Hellman using their private key material, which can be used with an ephemeral keypair to form a Diffie-Hellman "box".

In hardware there are always difficult trade-offs between price, features, and performance. Implicit in the above list of goals is that the cryptographic performance of the device is likely to be low (as it is both cheap and well-featured). As a result, the rate at which hardware operations need to take place must be limited in the system design.

One device that is suited for these goals is the Yubikey (manufactured by Yubico). It implements a number of features aimed at the 2-factor authentication market (based on hash chains and HMAC) which are also ideal for securely deriving encryption keys. Alongside these features, it supports RSA and ECDSA asymmetric cryptography, both for signature operations and key agreement.

The Yubikey is relatively inexpensive (at $40 US it is a very small line item in the typical cost of a new CN), and since it uses the ubiquitous USB interface it can easily be added to existing server hardware (in fact, many servers include USB connectors that are located inside the server casing which are ideal locations for this use).

Alternatives to the Yubikey that are also well suited include a few models of USB JavaCard tokens, such as the Feitian eJava token (also sold as the PIVKey T800). These tokens can be written with appropriate JavaCard Applets to become a drop-in replacement for the Yubikey (exposing the same commands to the server).

The hardware details of these devices and the interfaces they expose are discussed further in the section Hardware implementation.

4.1. Overview: at-rest encryption

The concept for at-rest encryption is to use a randomly-generated key, and then to protect it cryptographically such that 3 pieces of information are needed to recover it:

  • A private key generated on the hardware token (which it will not reveal);

  • A randomly generated secret PIN stored on a trusted node service in the datacenter; and

  • The encrypted copy of some random data, stored as a ZFS pool property.

In this way, a node’s disks cannot be decrypted unless an attacker has all three of:

  • The disks belonging to the node;

  • The cryptographic token belonging to the node; and

  • Access to the PIN stored in the core service.

The primitive used to create these properties is the elliptic curve Diffie-Hellman key agreement protocol (ECDH). Setting up the pool proceeds as follows:

  1. Generate a random byte string.

  2. Create a DH box (see Definitions) that can only be decrypted using the hardware token’s private key. Place the random byte string in it.

  3. Place the encrypted data from the DH box in a ZFS pool property along with the public key of the ephemeral keypair.

  4. Use the byte string as the ZFS encryption master key (these steps are sketched below).
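
A minimal sketch of these four steps, reusing the seal_box function from the DH box sketch in Definitions (the module name and the zpool property names shown are purely illustrative):

[source,python]
----
# Sketch of the pool setup steps above; property names are illustrative only.
import os
from dh_box import seal_box   # the DH-box sketch from the Definitions section

def wrap_pool_key(token_pubkey):
    master_key = os.urandom(32)                               # step 1: random byte string
    eph_pub, nonce, ct = seal_box(token_pubkey, master_key)   # step 2: DH box
    pool_props = {                                            # step 3: stored on the pool
        "rfd77:ephemeral-pubkey": eph_pub.hex(),
        "rfd77:box": (nonce + ct).hex(),
    }
    return master_key, pool_props                             # step 4: feed to ZFS crypto
----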

The private key in the hardware token is protected by a PIN — a 10 digit numeric code that must be provided to the token before any activity involving the key is permitted. After 5 failed attempts at the PIN (and an additional 3 attempts at a PUK), the hardware token erases its keys. This PIN code is stored in a Triton core service and is unique for each hardware token.

In order to re-derive the ZFS encryption master key for this node on a subsequent boot, we send this PIN back to the hardware token, perform ECDH with the hardware token’s private key, and then decrypt the DH box to recover the byte string from step 1, which unlocks the ZFS pool.

A single master key will be used for the whole pool, rather than a key per zone or per customer. The current ZFS encryption design does not allow clones of ZFS datasets to cross a key boundary, and since Triton relies heavily on zones being clones of their image datasets, separate key spaces are impractical. Additionally, in the current Triton design, CNs are the source of truth about what zones run on them (and changing that here is out of scope), so there is little benefit in using a finer-grained scheme.

This approach has two major issues, however. The first is the case of a headnode. A Triton headnode, as defined earlier, must be able to boot from its own media, without requiring the rest of the surrounding DC to be running (as it may be hosting the PXE DHCP server that allows other non-self-booting CNs to boot).

As a result, self-booting nodes will not use a remotely stored PIN. They will have the PIN code for their tokens either stored in USB flash media, or provided at every boot on the console (for environments where cold-theft security is more important than unattended reboot). This means that self-booting nodes do not meet the full goal discussed above — the theft of an entire working headnode will allow that headnode’s disks to be read.

This is a difficult compromise between fault tolerance, ability to boot the whole DC up after power loss, and security. It may be worthwhile to examine the possibility of special physical security measures to protect headnodes beyond those used for ordinary non-headnode CNs. As there is normally a small number of headnodes, this is at least more feasible than such protections for the entire server population.

The second major issue is durability, or the ability to recover from the failure of a node’s hardware crypto token. Clearly it would be undesirable to create a single point of hardware failure that results in all data on the node being irretrievable. As a result, an additional step is added: as well as creating an encrypted DH box keyed to the hardware token for that CN, we create a second box keyed to a set of offline "recovery keys" for the datacenter, in a threshold scheme. The public halves of the recovery keys are distributed to all CNs for this purpose, but the private halves are generated on hardware tokens elsewhere, held by specific trusted persons within the organisation, with a "2-person" or "N-person" rule applied. This is explored further in Provisioning and backups.

4.2. Overview: authentication

Authentication of a CN to a core service (e.g. to join the cluster, and then to report data about running zones etc) is done by signing existing protocol units (e.g. HTTP requests) using the asymmetric keys stored in the CN’s Yubikey. This is relatively straightforward.

Authentication of one core service zone to another is also done by signing existing protocol units using asymmetric keys. Existing protocols in use between core services are mostly variants of HTTP REST, and these will use the same HTTP signature method used by public Triton APIs. Non-HTTP core services will be expected to use TLS client certificates (the details of which will be explained shortly).

Unfortunately, hardware tokens are generally only capable of storing a small number of asymmetric keys, and the number of zones on a CN or headnode may be quite large by comparison. The performance limitations of hardware tokens (given the "inexpensive" price constraint we’ve already accepted) also mean that scaling their usage up with the number of customer zones on a machine is likely to be infeasible. So the keys used for zone-to-zone authentication cannot reside directly on the hardware tokens.

Instead, a "soft token" design will be used. A randomly generated symmetric key will be used to encrypt a keystore for that zone, and the key will be placed in a DH box openable by the hardware token’s private key. This keystore encryption is always used, so that the same code path is taken on machines with and without ZFS level storage encryption available.

The encrypted key store is managed by the global zone on behalf of the zones, and exposed to them via a socket that processes in the zone can connect to. The non-global zone cannot add or remove keys from the key store; it only holds a fixed set of keys that the global zone has pre-generated and assigned to it.

The socket is designed to make use of the OpenSSH agent protocol. This protocol is designed to be simple and straightforward to parse in a secure manner, and since the SSH agent is more or less a "soft token" itself, it is an almost perfect match for this use case.
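
For illustration, a client inside a zone can list the identities its soft token exposes using nothing more than the standard agent protocol messages (the socket path shown matches the one described later in Zone soft token details; the Triton-specific extensions are not shown):

[source,python]
----
# Sketch: list the keys exposed by the zone's soft token over the agent
# protocol, using only standard ssh-agent messages.
import socket, struct

SSH_AGENTC_REQUEST_IDENTITIES = 11
SSH_AGENT_IDENTITIES_ANSWER = 12

def _read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise IOError("agent connection closed")
        buf += chunk
    return buf

def list_identities(path="/.zonecontrol/token.sock"):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    sock.sendall(struct.pack(">IB", 1, SSH_AGENTC_REQUEST_IDENTITIES))
    length, = struct.unpack(">I", _read_exact(sock, 4))
    reply = _read_exact(sock, length)
    if reply[0] != SSH_AGENT_IDENTITIES_ANSWER:
        raise IOError("unexpected agent reply")
    nkeys, = struct.unpack(">I", reply[1:5])
    off, keys = 5, []
    for _ in range(nkeys):
        bloblen, = struct.unpack(">I", reply[off:off + 4]); off += 4
        blob = reply[off:off + bloblen]; off += bloblen
        commentlen, = struct.unpack(">I", reply[off:off + 4]); off += 4
        comment = reply[off:off + commentlen].decode(); off += commentlen
        keys.append((comment, blob))     # certificates appear as cert-type blobs
    return keys
----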

The SSH agent also features support for SSH certificates, which can be used to attest about an identity associated with a given key. The CN’s global zone will generate one such certificate for each zone and sign it using the same key it uses for HTTP signature authentication. In this way, zones each have access to a signed statement from their host CN about their identity, which they can use as part of authentication.

A signed statement or certificate and a matching key is not enough on its own, however, to validate the identity of one zone to another arbitrary zone on the system — the other zone needs to also be able to validate the key of the host CN. To achieve this requires a chain of trust.

Agents running in the global zone of a CN are also expected to make use of a soft-token instance for their routine work of signing core service requests. The hardware tokens' workload will largely be limited to re-signing certificates for each soft token periodically, and deriving keys for encryption at rest.

4.3. Trusted CNs and chain of trust

As is typical with any chain of trust, we must begin with a set of keys known as "root keys", which are ultimately trusted. What we propose here is to use a single root key which is only ever stored offline, broken into pieces.

It is a key part of this design that the root key is not ever kept "on-line" in the datacenter. If trusted CNs were ever given access to a secret like a root key, and we ever needed to dispose of that trusted CN, we would be forced to change the root key — not just on that CN but on all CNs in the cluster. This creates severe administrative burden which we seek here to avoid: disposing of a trusted CN should not require revoking any credentials on other CNs.

This root key will sign an initial statement stating that certain nodes in the cluster are to be Trusted CNs, detailing their public keys, as well as a timestamp and serial number. It will then (barring exceptional circumstances) never be used again.

To this statement, the Trusted CNs of the datacenter may append additional statements, with certain restrictions:

  • Any appended statement must include a signature both over the new statement and all previous statements in the chain; and

  • The appended statement must be signed by the keys of all Trusted CNs in the datacenter at the time of appending, except one (N-1 out of N, unless there is only one Trusted CN at the time, in which case its signature is required [1]).

The statement may declare that a new node (with corresponding key etc) is now a Trusted CN, or it may declare that an existing Trusted CN is no longer such.

All CNs in the system (both regular and trusted) periodically gossip their current version of the Trusted CN chain out over the network, to a multicast address on the admin VLAN.

If a CN receives a new chain, it will accept it as the new canonical version of the chain if and only if:

  • All signatures on the chain validate, including validation of the N-1/N restriction; and either

  • The chain is a strict extension of the current canonical chain known to the CN; or

  • The chain is an unrelated, brand new chain, with a higher serial number and a newer timestamp on its very first statement.

In this way, in an emergency situation, the chain can be restarted by using the offline master key to sign a new statement about the Trusted CNs for the installation.

This design allows Trusted CNs to be added and removed from the installation at a later date without requiring that the root of the chain of trust be available in online storage for signing.
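
The acceptance rules above can be summarised as a small decision function; the statement layout and the verify callback below are illustrative placeholders, not a wire format:

[source,python]
----
# Illustrative only: structures and signature verification are placeholders.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Statement:
    body: bytes              # the statement itself (add/remove trusted CN, etc.)
    signed_by: List[str]     # key identifiers of the signing trusted CNs
    serial: int = 0          # meaningful on the first (root-signed) statement
    timestamp: float = 0.0

@dataclass
class Chain:
    statements: List[Statement] = field(default_factory=list)

def accept(current: Chain, proposed: Chain,
           verify: Callable[[Chain], bool]) -> bool:
    """Apply the acceptance rules for a gossiped trusted-CN chain."""
    if not verify(proposed):          # all signatures valid, incl. the N-1/N rule
        return False
    cur, new = current.statements, proposed.statements
    # Strict extension of the chain we already know about...
    if len(new) > len(cur) and new[:len(cur)] == cur:
        return True
    # ...or a brand new chain, restarted with the offline root key.
    return (new[0].serial > cur[0].serial and
            new[0].timestamp > cur[0].timestamp)
----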

Once the gossip process has stabilized, all CNs in the system are aware of the identities and keys of nodes that are authorized to act as Trusted CNs (hosting core Triton services). This means that zone certificates presented by zones on these CNs can be validated, authenticating core services to each other.

It is important to note that changes to the set of Trusted CNs are expected to be infrequent, so it is not important to use a distributed system here that offers fast convergence. The simplicity of implementation of a gossip design is also an advantage.

4.4. TLS keys and Soft HSM

Aside from the main zone authentication key and its matching certificate, the soft token stores two more keys on behalf of the non-global zone: a TLS certificate signing key, and a symmetric key.

The TLS certificate signing key can only be used to sign X.509 certificates about keys generated locally within the zone. A Triton-specific extension to the SSH agent protocol allows for this, as well as the ability to request a certificate chain.

The certificate chain consists of a set of X.509 certificates describing, in order:

  1. A trusted head node in the datacentre (self-signed)

  2. The host CN of the zone (its hardware key, signed by the head node)

  3. The soft-token TLS signing key for the zone (signed by the host CN)

These certificates (both the TLS signing key for the zone and the chain certificates, other than the head node) are limited to a very short window of validity (60 seconds). The intention is that this chain can be obtained and used only during an authentication process, with a fresh certificate obtained regularly to repeat the operation as needed. There is no need to manage or check a separate revocation list, as the short lifetime ensures that the key in question is vouched for by the system: all that clients are required to do is to keep their list of head node CA certificates up to date with the state of the gossip engine.
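
A rough sketch of a client-side check of this chain, using the Python cryptography package and assuming ECDSA keys throughout; real validation must also check the head node certificate against the gossiped Trusted CN list and match the presented identity:

[source,python]
----
# Sketch: validate the 3-certificate chain (headnode CA -> CN -> zone TLS key)
# and its short validity window. ECDSA keys are assumed.
import datetime
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec

MAX_LIFETIME = datetime.timedelta(seconds=60)

def validate_chain(headnode: x509.Certificate,
                   cn_cert: x509.Certificate,
                   zone_cert: x509.Certificate) -> None:
    now = datetime.datetime.utcnow()
    # Each link is signed by the public key of the certificate above it.
    pairs = [(headnode, headnode), (headnode, cn_cert), (cn_cert, zone_cert)]
    for issuer, cert in pairs:
        issuer.public_key().verify(
            cert.signature,
            cert.tbs_certificate_bytes,
            ec.ECDSA(cert.signature_hash_algorithm))
    # The CN and zone certificates must be fresh: 60 second validity window.
    for cert in (cn_cert, zone_cert):
        if not (cert.not_valid_before <= now <= cert.not_valid_after):
            raise ValueError("certificate outside its validity window")
        if cert.not_valid_after - cert.not_valid_before > MAX_LIFETIME:
            raise ValueError("certificate lifetime too long")
----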

The symmetric key stored in the soft token is treated differently to other keys in token storage. It is not kept decrypted in memory in the soft token when not in use; instead, a round trip through the system’s hardware module must be made for every use of this key. This also implies that access to this key is rate-limited by the system to avoid users overburdening the hardware module.

Rather than encrypting material directly with this key, a data key scheme is used. This means that each "encrypt" or "decrypt" request made to use this key must be accompanied by an encrypted subkey. Inside the soft token, the subkey is decrypted using the master key, and the decrypted subkey is then used to encrypt or decrypt the actual data. This further limits the burden users may impose directly upon the system’s hardware module (by limiting the maximum amount of data that must be transferred through the token itself).

An encrypted subkey ready for use may be obtained using a third operation through the token interface. All 3 of these operations (encrypt, decrypt, and generate subkey) are Triton-specific extensions to the SSH agent protocol.
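
A sketch of the data key scheme follows, with AES-GCM as a stand-in cipher; in the real soft token the master key is not held as a plain in-memory value, but is exercised via the hardware token on every request:

[source,python]
----
# Sketch of the data-key ("envelope") scheme behind the soft-HSM symmetric key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def generate_subkey(master_key: bytes) -> bytes:
    """Return a fresh subkey, wrapped under the master key (nonce || ciphertext)."""
    subkey = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    return nonce + AESGCM(master_key).encrypt(nonce, subkey, b"subkey")

def _unwrap(master_key: bytes, wrapped: bytes) -> bytes:
    nonce, ct = wrapped[:12], wrapped[12:]
    return AESGCM(master_key).decrypt(nonce, ct, b"subkey")

def encrypt(master_key: bytes, wrapped_subkey: bytes, plaintext: bytes) -> bytes:
    """One 'encrypt' request: caller supplies the wrapped subkey and the data."""
    subkey = _unwrap(master_key, wrapped_subkey)       # the only master-key use
    nonce = os.urandom(12)
    return nonce + AESGCM(subkey).encrypt(nonce, plaintext, None)

def decrypt(master_key: bytes, wrapped_subkey: bytes, blob: bytes) -> bytes:
    subkey = _unwrap(master_key, wrapped_subkey)
    return AESGCM(subkey).decrypt(blob[:12], blob[12:], None)
----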

The intention of the symmetric key capability is to enable the implementation of systems that achieve the 3rd customer goal in Customer-facing features.

4.5. Binder and service registration

Having to make use of and validate full certificate chains for all traffic is somewhat difficult to work into some existing systems within Triton. A simpler proposition is to include only some form of key signature in these types of traffic (e.g. by embedding it in a legacy username and password field) rather than a full certificate.

To this end, binder (the Triton service discovery mechanism) will be altered, such that clients can establish a trusted relationship with binder, and binder can then take over the role of validating certificates on clients' behalf.

As the client half of this relationship can be maintained from within a library such as cueball, this will ease integration for core services: they will merely need to use the cueball library to manage their connections and will then get identity validation on their outgoing connections "for free".

On the registration side of binder, registrants will be required to supply their SSH certificate and public key along with the information they supply to binder today (which will be signed with the key).

Binder will validate the signature and certificate provided, and then serve DNS records about the registrant. These records will include public key records containing the registered public key they supplied.

Traffic between binder and clients will be secured using the public-key modification of DNS Transaction Signatures (TSIG) known as SIG(0) (RFC2931), signed using the binder instance’s zone key. The client must validate the binder instance’s key against its certificate and the gossiped list of Trusted CNs, but thereafter it can trust signed responses from that binder about other services in lieu of performing full validation itself.

The SIG(0) mechanism provides authentication of data in the DNS packet using a cryptographic signature, but not confidentiality (the traffic is not encrypted). As binder is not serving information that needs to be kept secret, this is a suitable trade-off. It is transaction-oriented (signs the transactional message, not just the data inside), relatively simple, requires minimal modification of existing DNS software, is backwards-compatible and is also algorithm-agile (allowing us to change the precise algorithm in use over time). For these reasons, it is the proposed choice here over other alternative mechanisms like DNSCurve or full DNSSEC.

Binder will also have to transition away from the raw, direct ZooKeeper access it uses for registration today, as the authentication schemes available there will not be sufficient to ensure separation of clients.

4.6. Provisioning and backups

When crypto tokens like the Yubikey are manufactured, they generally do not ship with credentials pre-loaded on them (Yubikeys do in fact ship with some basic credentials for the Yubico official 2FA, but this is not very useful for our use case). They have to be commanded to generate or write credentials by an administrator who configures them before use.

Where possible, it is best for credentials to be generated on the token itself (so that they never leave it and thus cannot be directly compromised). Keys used for authentication or certificate signing can be replaced after a loss by creating trust for a new set of keys instead, so there is no real need to back them up.

Loss of at-rest encryption keys, on the other hand, leads to the loss of any data protected by them (meaning loss of customer data). To guard against this for the ZFS on-disk encryption keys, as explained earlier, we make use of a scheme similar to key escrow, where a second DH box is created that enables the retrieval of the ZFS encryption key using either the node’s own key, or an offline recovery key (or keys).

This recovery key, as well as the root key used to bootstrap the headnode chain of trust, must be stored offline in a way that is both very secure and very durable.

Keys may be split up into "pieces" for backup purposes, using secret-sharing arrangements like Shamir’s secret sharing. These enable schemes such as N out of M piece secret recovery (while revealing no information in the case of fewer pieces being held).

If the pieces are stored in separate geographic locations with separate access controls, this can enable a form of the "2-person rule" (or "N-person rule") to be enforced, where these valuable "master" keys can only be used with the co-operation of multiple trusted members of the organization.

While the "root" key can truly be treated as an offline master key that is only for serious (and rare) emergencies, hardware failures in a large datacentre are a regular, expected event. As a result, the recovery keys must receive different treatment for storage to enable efficient operation.

Our proposal is to have hardware tokens assigned personally to trusted staff, have these tokens generate a public-private EC key pair, and write the set of N public keys for all of them into the Trusted CN chain as a separate kind of chain entry that has to be signed by all current Trusted CNs.

Then, on each CN we take the symmetric disk encryption key and split it into N Shamir pieces. Each of these pieces is then placed in an ECDH box targeting one of the public keys registered in the latest backup instruction entry in the chain.

During recovery, we perform a challenge-response procedure (detailed later, designed to resist replay attacks and not reveal the key if exposed) using these ECDH boxes with the remote hardware tokens to reconstruct the original symmetric key from the decrypted pieces in memory.

As individuals come and go from this set, a new recovery key chain entry will be written and signed by the Trusted CNs. Then, all other CNs will regenerate their Shamir pieces and ECDH boxes from scratch with the new set of public keys.

The chain entry can also specify the number of the N pieces that will be required for recovery, so that it can be changed if the group shrinks or expands.
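
The splitting and boxing step might look roughly like the following; xor_split is a trivial all-pieces-required split standing in for a real k-of-N Shamir scheme, and seal_box is the DH box sketch from Definitions (the module name is hypothetical):

[source,python]
----
# Sketch: split a pool's disk encryption key into pieces and box each piece
# to one recovery public key from the latest recovery entry in the chain.
import os
from functools import reduce
from dh_box import seal_box   # hypothetical module: the DH-box sketch above

def xor_split(secret: bytes, n: int):
    """All n pieces are required to recover the secret (stand-in for Shamir)."""
    pieces = [os.urandom(len(secret)) for _ in range(n - 1)]
    last = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  pieces, secret)
    return pieces + [last]

def box_recovery_pieces(disk_key: bytes, recovery_pubkeys):
    boxes = []
    for piece, pub in zip(xor_split(disk_key, len(recovery_pubkeys)),
                          recovery_pubkeys):
        boxes.append(seal_box(pub, piece))   # (ephemeral pub, nonce, ciphertext)
    return boxes
----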

In summary:

  • Generation and preparation of the root key will take place in an environment away from the data center, and will be done in advance by administrators.

  • The root key will be split into 3 pieces, in a Shamir arrangement requiring 2 pieces for recovery. Each of the pieces will be written to separate backup media.

  • The media may then be stored in a secure location (e.g. a safe).

  • The recovery keys will be generated on dedicated devices held by trusted individuals.

  • CNs will split their symmetric disk encryption keys into pieces and ECDH box them to each of the recovery public keys.

  • During recovery, a challenge-response procedure will be used to contact the trusted individuals and their hardware tokens and collect N/M responses to reconstruct the key.

  • The root public key and initial headnode trust chain (including the first recovery configuration entry) can be written to the boot USB flash media for the initial headnodes, and transported to the datacenter as part of the deployment process.

    • As an alternative, the headnode setup process will accept the public key and trust chain root on the console.

This scheme will be implemented using a set of tools that can run on at least OSX, Linux or SmartOS, to correctly generate the root and recovery keys and back them up, and then also to perform restoration operations in an emergency. Backing up credentials as part of generating them will not be optional, and the tools will require backup media to be present to perform any operations, to prevent administrator error.

A recommended outline of the full deployment procedure is included in the sections Green-field deployment step-by-step and Brown-field deployment, which include examples for both a "small setup" deployment not using a pre-flight environment, and a larger deployment using one.

The following table highlights the recommended options for long-term key backup, as well as a recommended verification and refresh interval for each.

The verification interval indicates how often (at a minimum) an administrator should inspect and verify the data on the backup media to check its integrity. The refresh interval indicates a minimum interval at which administrators should expect to have to copy the data to fresh media. Even if the current media passes inspection, it is recommended that media older than this still be replaced.

Table 1. Backup media recommendations

Media type                  Verification interval   Refresh interval
Magnetic tape (LTO, DAT)    5 years                 10 years
Printed archival paper      3 years                 10 years
Optical (CD, DVD, BD)       1 year                  5 years
Flash (SD, CF)              1 year                  3 years

4.7. Use of zone keys and certificates by customers

Quite aside from the internal use of zone keys and certificates within Triton’s components, they are also expected to be used by customers.

In conjunction with the RBACv2 work (RFD 48), signing requests to Triton services (such as CloudAPI) using a zone authentication key will grant authentication as a "machine principal". This principal may be added to roles by a customer, in order to grant it authorization to manage resources under the account.

The keyId string used is expected to include the full UUID of the zone in question, and the UUID of the CN which hosts it. This mechanism will not require the use of the zone certificate.

Since the existing triton tools and libraries already support the use of the SSH agent for key storage, it is expected that they can be used with the zone soft token without significant modification (they may require some in order to generate the keyId correctly, but this is as yet unclear).

The existing support for account-key-signed certificates for Docker and CMON will be extended to support the use of those interfaces as a machine principal, as well. This mechanism is preferred for customer end-use here rather than the TLS certificate signing key, as it matches the interface already used elsewhere, reducing the amount of code needed to be specific to machine authentication.

Though it is somewhat out of scope here, it is expected that mechanisms for grouping machines as access control targets (e.g. RFD 48 style projects) may also be useful for grouping machines as principals. In this way it should be possible to grant some group of machines access to account resources and have this apply to newly provisioned members of that group automatically.

While zone SSH certificates and certificates signed by the TLS certificate signing key are not used for Triton authentication, endpoints on CloudAPI will be added to assist in the validation of zone certificates by customer code or services. These include fetching the current full set of headnode CA certificates for the X.509 chain. This should allow zone keys and certificates to be used for other purposes as well (such as bootstrapping a chain of trust for customer systems).

In particular, it is expected that full support for this mechanism will be developed to assist with the bringup of the Hashicorp Vault product. Vault should hopefully also be able to take advantage of the Soft HSM key system.

5. Relationship with TLS

To fully protect the Triton admin VLAN against IP and MAC spoofing attacks from rogue network hardware, it will be necessary to begin protecting all connections with TLS. Part of establishing a TLS connection is verifying the identity of at least one party to the connection, using X.509 certificates.

Note that while TLS server authentication is expected to always be in use, providing and verifying client certificates will be limited to those cases where HTTP signature authentication cannot reasonably be used.

The zone TLS certificate signing key is set aside for the purpose of producing TLS credentials. Core services will generate local keys (which may be rotated) for use by TLS servers, protected at rest by the Soft HSM key. A signed certificate and chain will be obtained through the soft token interface to allow these to be validated to others.

It is the responsibility of any Triton service to ensure that it obtains a new certificate chain for its TLS server endpoints before the expiry of a previous chain.

As these certificates have an enforced short lifetime of 60 seconds, no specific provision for certificate revocation is needed: only a requirement that the list of valid CA certificates be kept up to date by clients to match the output of the headnode gossip system.

6. Service startup step-by-step

6.1. CloudAPI

  1. The Trusted CN hosting the CloudAPI instance boots up (see CN boot for more details)

    1. It starts up the zone soft token manager daemon, which will LoFS mount sockets into all zones (see Zone soft token details). The daemon does not unlock the keystores at startup.

  2. The CloudAPI zone begins to start up

    1. Soft token socket is mounted into the zone.

  3. SMF service cloudapi starts — it execs node

  4. CloudAPI calls into the triton-registrar library to set up its service registration

    1. Registrar opens the soft token socket and retrieves the public key and certificate signed by the GZ.

      1. Soft token manager daemon accepts the connection on the socket in the zone and forks off a dedicated privilege-separated child for this zone. The child then decrypts the keystore and loads it into memory.

    2. Registrar connects to binder zones and begins registration by writing a signed statement about the CloudAPI zone’s IP address and keys, including the SSH certificate signed by its CN.

    3. Binder receives and validates the registration

      1. First, binder retrieves the list of valid Trusted CNs from the gossip service on its host CN (via the soft token socket)

      2. Then, it compares the signature on the certificate given by the registrant to this list and finds it was signed by a valid Trusted CN

      3. The certificate presented includes metadata about the zone, including any values of sdc_role or manta_role tags. Binder validates that such values should be allowed to register under the given DNS name.

      4. After validating the signature on the statement from the registrant, binder begins serving DNS records about it.

  5. CloudAPI opens its cueball pool to connect to VMAPI

    1. Cueball is running in bootstrap mode, and first establishes a bootstrap resolver to connect to binder

      1. The bootstrap requests each binder’s certificate by looking up the binder service hostname with rrtype CERT (see RFC4398)

      2. The bootstrap resolver then retrieves the list of valid Trusted CNs from the gossip service on its host CN, and uses this list to validate the binder instances' certificates. It also checks that the sdc_role/manta_role value matches up.

      3. The TSIG information on the response is also validated.

      4. The bootstrap emits only the binders that pass validation (along with their keys) to be used as resolvers.

    2. Cueball begins service resolution for VMAPI

      1. It uses the resolvers from the bootstrap stage to contact binder and request SRV records for VMAPI (and validates the response’s TSIG using the keys from the bootstrap).

      2. Validated records are emitted as backends

    3. Cueball connects to VMAPI

      1. TLS is established, and the VMAPI’s certificate and chain is validated against the known CA certificates (obtained by querying the soft token).

  6. Now CloudAPI is registered and connected to VMAPI. It repeats these steps (without bootstrap, since that’s already done) for other services.

  7. When CloudAPI wants to make a request to VMAPI, it takes a pre-validated TLS connection from the pool and makes an HTTP request on it.

    1. The outgoing HTTP request is signed with the zone key of CloudAPI, and includes CloudAPI’s registered binder hostname (the service name) as part of the keyId (a sketch of this signing step follows the list).

    2. VMAPI requests the CERT records associated with the name connecting to it from binder and validates that a key there matches the one signing the incoming request.

    3. Then, VMAPI validates the connecting service name against its own policy of which services are allowed to talk to it, and decides whether to accept or reject the request.
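
The signing of the outgoing request above can be sketched as follows; the keyId layout is an assumption, and in a real zone the signing operation is a sign request to the soft token over the agent socket rather than a local private key:

[source,python]
----
# Sketch of signing an outbound request in the HTTP Signature style used by
# Triton APIs. The keyId layout is illustrative; in a zone the sign step goes
# to the soft token over the agent socket, not a local key.
import base64
from email.utils import formatdate
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def signed_headers(service_name: str, private_key: ec.EllipticCurvePrivateKey):
    date = formatdate(usegmt=True)
    signing_string = "date: " + date
    sig = private_key.sign(signing_string.encode(), ec.ECDSA(hashes.SHA256()))
    key_id = "/services/{}/keys/zone".format(service_name)   # assumed layout
    auth = ('Signature keyId="{}",algorithm="ecdsa-sha256",'
            'headers="date",signature="{}"').format(
                key_id, base64.b64encode(sig).decode())
    return {"Date": date, "Authorization": auth}
----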

7. CN boot

Unlike headnodes, ordinary Triton CNs boot over the network. Today, this is designed to happen by launching the iPXE binary from flash media within each server. The iPXE binary then makes a DHCP request, and receives a response containing an HTTP URI from which to fetch the kernel and boot_archive.

iPXE supports HTTPS with certificate validation, and this will be used to secure the CN boot process. It is currently considered unreasonable to add a full software stack needed to produce signatures from the Yubikey’s asymmetric keys in iPXE, however, so it is proposed that anonymous access to the kernel image and boot_archive be maintained as it is today (i.e., the authentication at this stage will be one-way: the CN verifying the boot server’s identity, guarding against rogue DHCP and HTTP servers).

Since iPXE’s certificate validation mechanism is limited to a set of CA certificates, which have to reside on the same flash media as iPXE itself, we treat boot-up here slightly differently to regular service-to-service (or CN-to-service) authentication.

On the flash media with iPXE will be a set of self-signed X.509 certificates describing the keys of each of the headnodes in the datacenter at the time when the flash media is prepared.

The booter zones in the installation will generate a local TLS private key each, and have it cross-signed by the signing keys of all the headnodes in the data center. They will serve the full set of cross-signed certs in their TLS handshake, as alternative chains [2], so that the flash media need only contain one headnode in common with the real current set for the boot to be successful.

Once a CN has been set up and is operating normally, it will periodically mount its boot flash media and update the set of headnode CA certificates stored there.

Some Triton installations do not boot iPXE from flash media, and instead use the built-in PXE ROM in their system. Unfortunately, the only known way to build an authenticated system around the firmware PXE is to leverage the EFI Secure Boot and TPM features of a modern system, and support for using these with PXE is difficult (due to lack of general EFI support) and somewhat inconsistent between server vendors. It would also require the ability to modify at runtime the certificates stored in firmware for boot signing, which currently is not a well-supported procedure, regularly subject to vendor firmware bugs and exclusion.

For this reason, installations which depend on system PXE firmware will not have a fully secured boot procedure, and will not meet all of the stated goals of the system. This may be revisited at a later date.

7.1. EFI Secure Boot

No provision is made in this document for the implementation or management of EFI Secure Boot in Triton. Several unresolved problems remain before a design can be proposed here.

This will likely be the subject of a future RFD.

8. Green-field deployment step-by-step

8.1. Installation

This setup process will need to provision a KBMAPI instance and set up the head node token (probably more as well). If the head node is to be encrypted, then it must be set up (token setup, encryption enabled) at the time of zpool creation.

Note
This section needs updating after the change to personal recovery keys and updates on the trust chain.

This section will run through the full set of steps needed to deploy Triton with full RFD 77 security enabled.

We begin the process by setting up the root key on an administrator workstation. On this workstation, we will begin by burning 3 DVD-Rs on which to store key backups.

After inserting the first blank DVD-R:

alex@mbp:~$ triton-keymaster init-media dvd (1)
Found blank DVD media in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0) (2)
Initialize? [Y/n]
Generating media key... done
Writing session... 10% 25% 50% 75% 100% done
Short name to refer to this media? [214cc7d2] sfo-001 (3)
  1. We want to initialize a new DVD type backup media. The name we give here refers to the storage plugin to be used.

  2. The plugin detects that we have a blank unused DVD-R in one of our drives.

  3. This name will be used with later triton-keymaster commands. If we want to use this same media from a different machine, we can copy the file ~/.triton/keymaster.json or use triton-keymaster add-media and the full media identity string.

We perform these same steps for the subsequent 2 DVD-Rs, naming them ord-001 and nyc-001.

alex@mbp:~$ triton-keymaster init-media dvd -y -n ord-001 (1)
Found blank DVD media in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Generating media key... done
Writing session... 10% 25% 50% 75% 100% done
alex@mbp:~$ triton-keymaster init-media dvd -y -n nyc-001
Found blank DVD media in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Generating media key... done
Writing session... 10% 25% 50% 75% 100% done
  1. -y means "don’t prompt me for confirmation", and -n is used to give the media short name.

Now we generate the root keys for the datacenter:

alex@mbp:~$ triton-keymaster init-dc us-west-1 -m sfo-001,ord-001,nyc-001 (1)
Number of backup media required to recover root key? [2] (2)
Generating root key... done
Generating ZFS recovery keys... done
Ready to write piece for backup media sfo-001.
Attach where? [LOCAL/remote/file] (3)
Found sfo-001 in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Writing session... 10% 25% 50% 75% 100% done
Ready to write piece for backup media ord-001.
Attach where? [LOCAL/remote/file]
Found ord-001 in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Writing session... 10% 25% 50% 75% 100% done
Ready to write piece for backup media nyc-001.
Attach where? [LOCAL/remote/file]
Found nyc-001 in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Writing session... 10% 25% 50% 75% 100% done
  1. The -m option allows you to supply the names of the backup media keys to use for this datacenter. If not supplied, you will be prompted.

  2. These answers can also be supplied as commandline arguments.

  3. After the initial media setup, backup media can be accessed in multiple different ways by the keymaster tool. They can be attached locally to the machine it is being run on (as shown here), or attached to a remote machine (with keymaster also installed), or written to a file to be transferred later. The key backups are encrypted in transit and cannot be read without the backup media itself.

At this point, we can also write the recovery keys to some hardware tokens to place in storage with the backup media. This is optional, but recommended for production deployments: if an administrator has to step in to recover a CN from a broken hardware token late at night (with possibly impaired judgement), it is better to handle the keys on a secured device like a USB token where it is harder to make mistakes that may compromise the key itself.

alex@mbp:~$ triton-keymaster write-token us-west-1 (1)
Which ZFS recovery key to write? [A/b/c] a (2)
Need to read key pieces from 2 more backup media.
Attach where? [LOCAL/remote/file]
Found sfo-001 in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Reading data... done
Need to read key pieces from 2 more backup media.
Attach where? [LOCAL/remote/file] remote (3)
Generating ephemeral key for remote challenge-response... done
Challenge: AavNCXVzLXdlc3QtMRAHb3JkLTAwMQdueWMtMDAxBWVjZHNhQQRKMlDjH/3I/x5JZzh3RqtoendWyr9Aj2hz4vV9lETQWdrxkmnbDeoMjRi9ll3mDALaP5tmkh4QIClvjjIJv0pOcS6Agg==
Enter this challenge at the prompt presented by `triton-keymaster respond' on the remote machine.
Then enter the response from the remote machine here.
Response: gavNBWVjZHNhEWNoYWNoYTIwLXBvbHkxMzA1DOsc+I31pxTqOL75flqSq5Cuz9hqfvKaRZHe8aEYkaMUBQZLbKyqunZRqiSHWsA0Dxo1HsVfBbIetNOqP2e5+JUnk9wS72B4sWmaojxC2nTUm6BiC+zAzW9px6uzwow5Y5KUFsYUHlSLB+mB
Found response from backup media ord-001.
All key pieces found.
Ready for Yubikey or Token for writing recovery key... ok
Found Yubikey (Yubikey 4 OTP+CCID), serial 4a701a, v4.3.1
Writing keys to Yubikey... done
  1. We have to specify the datacenter in order to fetch the backup media and key configuration.

  2. We can choose which of the 3 recovery keys to write out, so that we still enforce the same 2/3 rule for access.

  3. Here we choose to get a piece of the key from a remote system. This prints out a base64-encoded "challenge" value, which an administrator at the remote site can copy-paste into their "triton-keymaster" tool to generate a response.

The challenge-response cycle here is secure (encrypted) and unreplayable. The use of the respond command on the remote administrator’s machine looks like this:

john@mbp2:~$ triton-keymaster respond
Enter challenge: AavNCXVzLXdlc3QtMRAHb3JkLTAwMQdueWMtMDAxBWVjZHNhQQRKMlDjH/3I/x5JZzh3RqtoendWyr9Aj2hz4vV9lETQWdrxkmnbDeoMjRi9ll3mDALaP5tmkh4QIClvjjIJv0pOcS6Agg==

Challenge purpose: for master key recovery from backup media.

This is NOT a challenge used to recover a compute node with a broken Yubikey.

Datacenter: us-west-1
Key being recovered: ZFS recovery key A
Backup media they have: sfo-001
Backup media they want from you: ord-001, nyc-001

Challenge was generated 3 minutes ago by user "alex" on host "mbp"

WARNING: Responding to this challenge will give the remote party an entire ZFS
         recovery key. If they possess 2 of the set of 3, they will have enough
         information to decrypt the disk of ANY node in datacenter "us-west-1".
Respond to challenge? [y/N] y

Need to read key pieces from backup media: ord-001, nyc-001.
Attach where? [LOCAL/remote/file]
Found ord-001 in HL-DT-ST DVDRW GX30N RP09 (scsi 1,0,0)
Reading data... done
Response: gavNBWVjZHNhEWNoYWNoYTIwLXBvbHkxMzA1DOsc+I31pxTqOL75flqSq5Cuz9hqfvKaRZHe8aEYkaMUBQZLbKyqunZRqiSHWsA0Dxo1HsVfBbIetNOqP2e5+JUnk9wS72B4sWmaojxC2nTUm6BiC+zAzW9px6uzwow5Y5KUFsYUHlSLB+mB

9. Brown-field deployment

Deployment to an existing DC will require at least one empty compute node. The operator starts by installing kbmapi on the head node by running sdcadm post-setup kbmapi. This will install and start the KBMAPI service on the head node. Once complete, the operator proceeds to run the setup process on the empty nodes. The empty nodes must have a PIV token inserted at the time of setup. The setup program will provision the PIV token (using piv-tool), create the encrypted zpool, and store the pin information in KBMAPI.

Once set up, instances can be provisioned on, as well as migrated to, the encrypted CN.

Conversion of the head node to an encrypted zpool will be covered by a separate RFD (TBD).

10. Implementation and intermediate states

So far, we have described the eventual state of affairs that Triton will be in after a full implementation of this document. However, the process of implementation will necessarily involve some intermediate states of development, which will likely also be deployed to some installations along the way.

Additionally, not all administrators of Triton installations will see fit to deploy with hardware tokens — and it may be prohibitively difficult to do so in some cases — e.g. deployments within virtual machines for development.

  • Do the USB key and token support stuff first

  • Then soft-token (well, at the same time really)

  • The road to validating everything in the admin vlan, what intermediate states will look like while upgrading.

  • What things will look like if you never add any Yubikeys (TLS with just self-signed certs, open trust).

11. PostgreSQL and Moray

  • Auth and TLS. Using LDAP to validate signatures as passwords?

  • In current version of PostgreSQL, the main limitation for using mTLS for AuthN/AuthZ is that PG has not supported reloading of certificates without a server restart. PostgreSQL now has certificate reloading on master, not yet in PG9.6. Reload is triggered by SIGHUP and/or "pg_ctl reload." Backporting a patch to PG9.2 would not be difficult (postgres change on master.)

12. Zone soft token details

The soft token consists of a number of key components:

  • The ECDH private key, stored in the CN’s hardware token

  • The soft token key data files, stored encrypted on ZFS within the zone’s dataset

  • The SSH agent protocol socket, placed as a UNIX socket within the zone’s filesystem

  • The soft token daemon itself, running within the global zone, and listening on the UNIX socket

12.1. Soft token key data

Soft token key data will be stored in the /zones/$uuid/softhsm directory. Each key stored on behalf of the zone will be stored in a separate file, encrypted (and authenticated) using ChaCha20-Poly1305.

The file format will consist of an nvlist with the public key of the hardware token, a DH box containing the symmetric key to decrypt the rest of the data, as well as the MAC and details of the algorithms in use. The MAC will be constructed to cover the algorithm metadata fields.
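
A sketch of writing one such key file follows, with a JSON dict standing in for the nvlist; note that the algorithm metadata is passed as AEAD associated data so that the MAC covers it, and seal_box is the DH box sketch from Definitions:

[source,python]
----
# Sketch of one soft-token key file. A dict stands in for the nvlist; the
# metadata is fed to the AEAD as associated data so the MAC covers it.
import os, json
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
from dh_box import seal_box   # hypothetical module: the DH-box sketch above

def write_key_file(path, zone_key_material: bytes, token_pubkey):
    file_key = ChaCha20Poly1305.generate_key()
    meta = {"cipher": "chacha20-poly1305", "box-curve": "nistp256"}
    aad = json.dumps(meta, sort_keys=True).encode()     # MAC covers metadata
    nonce = os.urandom(12)
    record = {
        "metadata": meta,
        "nonce": nonce.hex(),
        "data": ChaCha20Poly1305(file_key).encrypt(
            nonce, zone_key_material, aad).hex(),
        # file_key itself is only recoverable via the CN's hardware token
        "box": [part.hex() for part in seal_box(token_pubkey, file_key)],
    }
    with open(path, "w") as f:
        json.dump(record, f)
----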

12.2. SSH agent socket

The SSH agent socket for communicating with the soft token will be placed in the /.zonecontrol directory.

The existing metadata.sock inside the zonecontrol directory currently relies on the permissions of the enclosing directory to manage access to the metadata socket. These permissions will be moved to the socket itself, and the /.zonecontrol directory will be world-readable and world-traversable. The agent socket will use privileges, not filesystem permissions, to manage access.

The socket file itself within /.zonecontrol will be named token.sock (i.e. its full path will be /.zonecontrol/token.sock). The socket file will be world-writable and world-readable.

Upon a connection being made by a client process, the soft token daemon will examine the cred_t of the connecting process. Either a new system-wide privilege bit, PRIV_ZONE_TOKEN will be added, or a parametrized privilege will be implemented, and any connecting process in possession of this privilege will be allowed to use the soft token.

This privilege will be part of the default zone-wide limit set, but not part of basic or the ordinary user privilege sets. This means that by default, only root will be able to use the soft token, but end-users can configure their zones to give this privilege to ordinary users or single processes, and processes can give up the ability to use the soft-token if they no longer require it (enabling privilege separation models to be used).

12.3. Soft token daemon

The soft token daemon is started in the global zone as a child of the soft token manager process. The manager itself is started by SMF.

The top-level manager process' role is to manage the lifecycle of socket files and lofs-mount them into zones. Each time it creates a new socket for a given zone, it forks into a child which handles that zone.

The zone child of the manager is a privileged process whose role centers around management of key material. It maps dedicated areas of memory (with MAP_SHARED supplied to mmap()) for the placement of keys, fills them with the encrypted key data, and then forks.

This final child is the process which is responsible for speaking the SSH agent protocol and performing cryptographic operations. It drops all privileges (including those in the basic set) before accepting any connections. To unlock keys, it sends a fixed-size request on a pipe back to the key manager process, which decrypts the keys in-place in the shared memory segment.
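
A schematic of this process split follows; privilege manipulation and the real token-assisted decryption are elided, and the helper functions are stand-ins:

[source,python]
----
# Schematic of the per-zone process split: the key-manager parent holds the
# shared mapping, the unprivileged child serves the agent socket and asks the
# parent (over a pipe) to decrypt keys in place.
import mmap, os, struct

KEYSTORE_SIZE = 4096

def load_encrypted_keystore() -> bytes:        # stand-in for reading key files
    return b"\x00" * KEYSTORE_SIZE

def decrypt_in_place(mapping) -> None:         # stand-in for the DH-box unlock
    pass

def serve_agent_socket(mapping) -> None:       # stand-in for the agent loop
    pass

def run_zone_child():
    keystore = mmap.mmap(-1, KEYSTORE_SIZE)    # anonymous, MAP_SHARED across fork
    keystore.write(load_encrypted_keystore())
    req_r, req_w = os.pipe()
    ack_r, ack_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Agent-serving child: would drop all privileges here, then accept
        # connections on the zone's token socket.
        os.write(req_w, struct.pack("B", 1))   # fixed-size unlock request
        os.read(ack_r, 1)                      # keys now decrypted in place
        serve_agent_socket(keystore)
        os._exit(0)

    # Key-manager parent: decrypts in the shared mapping on request, so
    # plaintext keys never traverse the pipe.
    os.read(req_r, 1)
    decrypt_in_place(keystore)
    os.write(ack_w, b"\x01")
    os.waitpid(pid, 0)
----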

12.3.1. Performance and accounting

Unlike a regular SSH agent, the soft token daemon final process (serving the real workload of the zone) will be multi-threaded. Operations will be carried out by worker threads in a thread pool of limited size. This enables both pipelining of operations within a single agent connection, and also concurrency across multiple connections.

Eventually, a mechanism will be used to place the final child process into the non-global zone for CPU accounting purposes, without making it able to be traced or debugged by the zone (this will be analogous to a system process in the global zone).

12.3.2. Hardware memory protection

Pending hardware and operating system support, the soft token will support the use of Intel SGX enclaves (and the analogous features on AMD platforms) to protect the key data and operating state of the soft token in memory.

This will defend against a variety of attacks on the soft token from other parts of the system, as well as cold-boot attacks on system memory. Noting that, as the soft token is a signing oracle in regular operation anyway, the goal here is to prevent bulk fast access by an attacker to all the keys on a machine (a kind of "class break"), not absolute inviolability.

SGX has been the subject of much industry discussion in recent months, and the results achieved by others with it have been mixed. However, as our goal here is not to achieve an impregnable enclave within a totally untrusted operating system, but instead to simply make sure that there is no method of obtaining keys faster than to ask the hardware to decrypt all the key files on disk, we should be well-placed to make use of it.

12.3.3. Cache side-channel mitigation

On modern Intel CPUs, the soft token will (pending OS support) make use of the Intel CAT feature to mitigate CPU cache timing side-channel attacks. This will be done along the lines of the "CATalyst" paper, where a special subset of the system's L3 cache capacity is set aside for transient use in cryptography, and dedicated pages for this purpose are pinned into the cache so they cannot be flushed out (containing both the code and data used in the sensitive operation).

At the same time, we plan to make use of improvements in hyperthread scheduling to avoid sharing any L1 cache between soft-tokens or between soft-tokens and customer workloads.

Soft-token processes will also not share any memory pages (including code pages) with each other or any other part of the system — this is aided by an operating system facility to mark binaries and shared objects as "unshared" so that they are always duplicated into each process that maps them. KSM (kernel same-page merging) and other similar mechanisms are not (and will not be) supported by illumos.

As well as this direct mitigation, the algorithms selected for soft-token use (see the Cryptographic algorithms section) were chosen with side-channel leak prevention in mind.

The chosen algorithms combined with these mitigation techniques should prevent most known mechanisms of memory timing side-channel leakage from the cryptographic algorithms run in the soft-token, including Flush+Reload and other related attacks.

13. Hardware implementation

Both the Yubikey and JavaCard USB tokens present a common interface — the USB CCID (Chip Card Interface Device) device class. Unlike the HID interfaces on Yubikeys and other devices, CCID is an open interface with readily available specifications, so it is the interface used for the purposes of this design.

The CCID interface was originally intended for communication between hosts and smartcards that speak the ISO 7816-4 protocol stack. Even though the USB devices discussed here are not smartcards in card readers, they present themselves to the host as if they were. This means that the ISO 7816-4 protocol must be used to communicate with them, just as for a real smartcard.

While the ISO 7816 family of specifications specifies the commands and protocol used for this communication, as well as some aspects of the data model on compliant cards, it does not fully specify the structure and organisation of key material storage.

As a result, additional specifications have arisen to describe the "directory structure" and missing details of data model for particular applications using cryptographic smartcards. One of the most commonly known and implemented of these is the NIST Personal Identity Verification (PIV) standard. This standard is implemented by both Yubikeys and other JavaCard token manufacturers.

As a result, for asymmetric crypto operations, the interface that the RFD77 implementation uses is PIV over ISO 7816-4 over CCID over USB. We also use this interface for performing ECDH to derive disk and soft-token storage keys.
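
For concreteness, the first ISO 7816-4 command typically sent over this stack is a SELECT of the PIV application by its AID (as defined in NIST SP 800-73). A minimal sketch of constructing that APDU; the helper name and buffer handling are illustrative only.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* AID of the NIST PIV card application (SP 800-73). */
static const uint8_t piv_aid[] = {
	0xA0, 0x00, 0x00, 0x03, 0x08, 0x00, 0x00, 0x10, 0x00, 0x01, 0x00
};

/*
 * Build an ISO 7816-4 SELECT (CLA=00, INS=A4, P1=04 "select by AID",
 * P2=00) APDU for the PIV applet into `buf`, returning its length.
 */
static size_t
piv_build_select(uint8_t *buf, size_t buflen)
{
	const uint8_t hdr[] = {
		0x00, 0xA4, 0x04, 0x00, (uint8_t)sizeof (piv_aid)
	};

	if (buflen < sizeof (hdr) + sizeof (piv_aid))
		return (0);
	memcpy(buf, hdr, sizeof (hdr));
	memcpy(buf + sizeof (hdr), piv_aid, sizeof (piv_aid));
	return (sizeof (hdr) + sizeof (piv_aid));
}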

PIV specifies a fixed number of key "slots" on the token, and rules about whether PIN or biometric authentication, or a secure channel, is required for each. As we are not identifying human cardholders or using a non-contact interface like NFC, we will mostly avoid using these features, with the exception of the PIN, which we will use to require a network connection to a core service before a compute node can boot.

Yubico have implemented a number of extensions to the PIV specification which include support for importing a key generated off-card, setting management keys, changing PIN usage policies and performing attestation. We will not have a hard dependency on these extensions in the implementation of this RFD, but we may implement optional support for using them.

13.1. Operating system infrastructure

Most other open-source operating systems (e.g. GNU/Linux distributions) use a userland-only suite of software for interacting with CCID smartcards. These are usually backed by libusb or similar (the leading example of such a suite would probably be OpenSC and pcsclite).

Proprietary operating systems such as Microsoft Windows and the Apple Mac OS have instead opted to implement fairly deeply integrated smartcard suites in the operating system base, in order to fully support integration with other operating system features (e.g. using smartcards seamlessly for user login, or Windows domain machine authentication etc).

For SmartOS, we propose to implement a hybrid approach similar to the Apple Mac OS. There will be a deeply integrated operating system component for card identification and operational use, but card administration and deployment operations will be handled by software running entirely in userland.

This will allow us to integrate deeply with operating system features such as the fine-grained privilege model and RBAC, as well as zones. We will provide a public interface specific to SmartOS (working title libchipcard), as well as implementations of the PCSC API (compatible with pcsclite and Mac OS) and a subset of PKCS#11.

Components built as part of this design (e.g. the soft token, and key provider for ZFS) are expected to exclusively use the libchipcard interface, with the exception of the deployment and administration tools, which will be largely based on the PCSC interface (which will also make them largely cross-platform).
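
To illustrate the PCSC path that the cross-platform administration tools are expected to use, here is a minimal sketch against pcsclite. The reader name, the assumption of the T=1 protocol, and the omitted error handling are all illustrative; the APDU is the PIV SELECT shown earlier.

#include <stdio.h>
#include <winscard.h>	/* <PCSC/winscard.h> on some platforms */

int
main(void)
{
	SCARDCONTEXT ctx;
	SCARDHANDLE card;
	DWORD proto, rlen;
	/* SELECT of the PIV applet, as constructed in the earlier sketch. */
	BYTE select_piv[] = {
		0x00, 0xA4, 0x04, 0x00, 0x0B,
		0xA0, 0x00, 0x00, 0x03, 0x08, 0x00, 0x00, 0x10, 0x00, 0x01, 0x00
	};
	BYTE resp[258];

	if (SCardEstablishContext(SCARD_SCOPE_SYSTEM, NULL, NULL, &ctx) !=
	    SCARD_S_SUCCESS)
		return (1);

	/* Example reader name only; real code would use SCardListReaders. */
	if (SCardConnect(ctx, "Yubico Yubikey 4 CCID 00 00",
	    SCARD_SHARE_SHARED, SCARD_PROTOCOL_T1, &card, &proto) !=
	    SCARD_S_SUCCESS)
		return (1);

	rlen = sizeof (resp);
	if (SCardTransmit(card, SCARD_PCI_T1, select_piv,
	    sizeof (select_piv), NULL, resp, &rlen) == SCARD_S_SUCCESS &&
	    rlen >= 2)
		(void) printf("SW = %02X%02X\n", resp[rlen - 2], resp[rlen - 1]);

	(void) SCardDisconnect(card, SCARD_LEAVE_CARD);
	(void) SCardReleaseContext(ctx);
	return (0);
}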

The OS infrastructure to be built out here, including the libchipcard interface, will be the subject of a forthcoming RFD specific to its implementation.

13.1.1. In depth: first-time setup

When an operator sets up a compute node using sdc-server setup, the UR instance running on that compute node runs the joysetup.sh script. Among other things, this script passes a JSON description of the pool layout to create on the compute node (raidz2, spares, etc.) to the mkzpool command (which resides on the PI), which then runs zpool create with the command line parameters derived from the given JSON topology.

The mkzpool command will be extended to accept an -e flag to create an encrypted zpool. When the flag is present, mkzpool will execute kbmadm create-zpool instead of zpool create, with the same parameters.

kbmd will then search for a pivtoken that has not been set up. If no such token is present, the command will fail. If a pivtoken is found, kbmd will perform the following steps:

  1. Initialize the token. The token will generate a random GUID.

  2. Create the 9a, 9c, 9d, and 9e keys on the token, as well as the attestation certificates (if supported by the token).

  3. Set a random 6-digit PIN for the token.

  4. Generate a random encryption key for the pool.

  5. Create a DH box and store the pool encryption key in it.

  6. Save the pivtoken information (including the PIN) with KBMAPI and obtain the recovery key.

  7. Create a recovery box using the current set of recovery keys reported by the gossip service and the recovery key obtained from KBMAPI.

  8. Create an ebox with the encryption key and recovery keys.

  9. Return the additional -O arguments for the zpool create command as well as the pool encryption key to the invoking kbmadm process.

The kbmadm command combines the returned arguments with the zpool create arguments passed to it to build the actual zpool create command that is executed. The command will look similar to:

zpool create \
    -O encryption=on \
    -O keyformat=raw \
    -O keylocation=prompt \
    -O rfd77:config=<base64 encoded ebox> \
    zones <arguments to kbmadm create-zpool>

The kbmadm command will write the pool encryption key to the stdin of the exec'd zpool create command.
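
A sketch of how kbmadm might feed the key to the child zpool create process so that the key never touches the command line or disk. The function name, argument handling, and 32-byte key length (what keyformat=raw expects) are illustrative assumptions, not the actual kbmadm code.

#include <sys/wait.h>
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Run "zpool create <argv...>" and write `key` (raw key material for
 * keyformat=raw) to its stdin, which keylocation=prompt reads.
 */
static int
run_zpool_create(char *const argv[], const uint8_t *key, size_t keylen)
{
	int kp[2], status;
	pid_t pid;

	if (pipe(kp) != 0)
		return (-1);

	if ((pid = fork()) == 0) {
		(void) dup2(kp[0], STDIN_FILENO);	/* key arrives on stdin */
		(void) close(kp[0]);
		(void) close(kp[1]);
		(void) execvp("zpool", argv);
		_exit(127);
	}

	(void) close(kp[0]);
	(void) write(kp[1], key, keylen);
	(void) close(kp[1]);

	if (pid < 0 || waitpid(pid, &status, 0) != pid)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}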

13.1.2. In depth: events

The kbmd door also provides a mechanism for clients to receive events. The principal events clients are interested in are when the "primary token" of the system is about to change and when it has changed. All persistent users of kbmd are expected to handle these events.
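
The wire format of the kbmd door protocol is not specified in this document; purely as a hypothetical illustration of the shape of such a subscription, assuming a door at /var/run/kbmd_door and a made-up request structure, a client call could look roughly like this.

#include <door.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical request layout -- NOT the real kbmd protocol. */
typedef struct kbm_req {
	uint32_t kr_cmd;
#define	KBM_CMD_SUBSCRIBE_EVENTS	1	/* invented for this sketch */
	uint32_t kr_flags;
} kbm_req_t;

int
main(void)
{
	int fd = open("/var/run/kbmd_door", O_RDONLY);	/* assumed path */
	kbm_req_t req = { KBM_CMD_SUBSCRIBE_EVENTS, 0 };
	char rbuf[1024];
	door_arg_t da = {
		.data_ptr = (char *)&req,
		.data_size = sizeof (req),
		.desc_ptr = NULL,
		.desc_num = 0,
		.rbuf = rbuf,
		.rsize = sizeof (rbuf),
	};

	if (fd < 0 || door_call(fd, &da) != 0)
		return (1);

	/* da.data_ptr/da.data_size now describe kbmd's reply. */
	return (0);
}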

13.1.3. In depth: managed boxes

A client of the kbmd door may send a command to create a managed box (like the soft token key boxes). The command takes a "path pattern", which looks like: "/zones/abcd123/keys/auth.%s"

If the current primary token has GUID 995E171383029CDA0D9CDBDBAD580813, the client must have already created "/zones/abcd123/keys/auth.995E171383029CDA0D9CDBDBAD580813" as a PIV-box format file, a single box keyed to that current primary token.

kbmd will open that box and set up entries in the backup registry before returning from the door call.

Thereafter, the application may not delete or modify the "/zones/abcd123/keys/auth.<GUID>" files, but it may open them for reading in order to retrieve the data held within. It must be subscribed to notifications about primary token changes so that it always opens the auth.<GUID> file matching the current primary token at the moment it opens it. If it talks to piv-agent and discovers that the key it needs to open a box is missing because of a primary token change, it should wait for the change notification and try again.

If the system goes through recovery and has a new primary token, kbmd will create a new primary token box file with the new GUID based on the backup registry and the application will find it by asking for the new GUID.
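
A small sketch of how a client might resolve the path pattern against the current primary token GUID and open the box read-only; the function name is illustrative, and the retry-on-change behaviour described above is reduced to a comment.

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>

/*
 * Resolve a managed-box path pattern such as
 * "/zones/abcd123/keys/auth.%s" against the GUID of the current
 * primary token and open the resulting file read-only.
 */
static int
open_managed_box(const char *pattern, const char *primary_guid)
{
	char path[PATH_MAX];

	if (snprintf(path, sizeof (path), pattern, primary_guid) >=
	    (int)sizeof (path))
		return (-1);

	/*
	 * If this fails with ENOENT, the primary token has likely just
	 * changed: wait for the change notification from kbmd, fetch
	 * the new GUID, and try again.
	 */
	return (open(path, O_RDONLY));
}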

13.1.4. Interaction: piv-agent

The piv-agent is the intermediary that most other processes on the system will go through to make use of the Yubikey. At startup, piv-agent will connect to the kbmd door, ask for the primary token and PIN, and set up a subscription for primary token change events.

If it receives a notification about the primary token changing, it changes its own configuration to use that new token (and new PIN) for all subsequent requests that it handles.

13.1.5. Interaction: soft token

Like the piv-agent, the soft token daemon is interested in the primary token changing — both for signing certificates and for making use of managed boxes.

13.1.6. Interaction: ZFS

After setup has completed, kbmd will manage the rfd77:config ZFS property on the pool itself directly. This will be read during boot and written during recovery or rekeying. kbmd itself will also make the libzfs call to provide key material to the pool during the unlock process (since it needs to do this and mount at least some of the pool’s filesystems before it can check on the state of any managed boxes on disk).
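
For illustration only, and assuming the OpenZFS libzfs_core interface is what ends up being used for the unlock step, loading the recovered wrapping key into the "zones" pool created above could look roughly like this (error handling elided).

#include <sys/types.h>
#include <stdint.h>
#include <libzfs_core.h>

/*
 * Provide the raw wrapping key (recovered from the ebox stored in
 * rfd77:config) to the encryption root of the "zones" pool.
 */
static int
unlock_pool(uint8_t *key, unsigned int keylen)
{
	int err;

	if (libzfs_core_init() != 0)
		return (-1);

	/* noop=B_FALSE: actually load the key rather than just check it. */
	err = lzc_load_key("zones", B_FALSE, key, keylen);

	libzfs_core_fini();
	return (err);
}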

13.1.7. Interaction: Triton

kbmd will depend on an SMF service that brings up the "admin" network early in boot, but only when the system was booted with a networking.json (the SMF dependency remains in non-Triton setups, but the service is a no-op there). This enables kbmd to retrieve the PIN for the primary token from CNAPI after authenticating with its 9E key. This is currently tracked as OS-7183.

We might also want to perform some sanity checks during boot to ensure the token attached to the booting CN is also assigned correctly in KBMAPI.

Additionally, to support provisioning an instance onto an encrypted CN, sysinfo will be updated to report whether the zpool is encrypted. This is currently tracked as OS-7633. In short, a new property, Zpool Encrypted, will be added to the sysinfo JSON that is produced, indicating the encryption status of the main zpool.

Given that a VM's requirement for encryption is a constraint for DAPI, we should store it as an affinity rule to be used at creation time to pick the right CN. Since the requirement to persist affinity rules into VMs was added as part of RFD 34 (see TRITON-779, already implemented), no further modifications are needed; for example, there is no need to alter VMAPI’s schema for Virtual Machines.

The required changes would be:

  • Decide on the affinity rule for a Virtual Machine that requires encryption, for example server==encrypted (which will also allow the explicit opposite rule server!=encrypted).

  • Any machine created or migrated through CloudAPI can use such a rule, as we do today for other affinity rules. This needs to be clearly documented, given the public nature of the API.

  • Update DAPI to take this rule into account when picking CNs, either for provisioning or for migration of machines.

This assumes that DAPI is able to use the Zpool Encrypted property from sysinfo, as described in OS-7633.

13.1.8. Interaction: ccid driver

kbmd will utilize a new OS driver (ccid) to communicate with the PIV tokens. This will be via the APIs provided by the libchipcard library detailed in Operating system infrastructure.

14. Triton infrastructure

14.1. KBMAPI

14.2. Chassis Swaps

It is expected that once a token has been inserted into a server, it will remain present in the server until the server is removed from service (due to failure or lifetime concerns) or until the token itself has failed. If the token fails, a recovery process must be initiated, as the key material contained in the token is now gone. However, if a server is replaced while the storage is retained (i.e. a 'chassis swap', where the drives are removed from one server and installed into another), a recovery is not necessary. Instead, the token can simply be moved along with the disks to the new server.

When this occurs, kbmd is expected (after suitable operator confirmation) to update the assigned cn_uuid for the token using PUT /pivtokens/:guid.

14.3. Triton install

Since our model requires setting up encryption at the time of zpool creation, we must add support for setting up encryption when installing head nodes and compute nodes. As stated in Customer-facing features, we should not require encryption to be an all-or-nothing deal — it should be permissible to have both encrypted and unencrypted CNs in a Triton install.

Once the CN authentication pieces are available, it may even be permissible for a user to have no encrypted CNs at all, and instead use the RFD77-delivered pieces (with tokens on CNs) for encrypted/authenticated control plane traffic between CNs and HNs. Either use will require the KBMAPI service to be installed and running.

15. Cryptographic algorithms

One important part of any design involving cryptographic primitives is the choice of algorithms in use. This section discusses the options considered and the trade-offs made in the algorithm choices above.

15.1. At-rest encryption

The algorithm to be used for at-rest encryption key derivation is ECDH on P-256 with a SHA2-512 KDF (a sketch of the derivation follows the list below). This is chosen because:

  • Using ECDH with an ephemeral key to derive symmetric keys for authenticated file encryption is very well-studied and specified (e.g. as ECIES in SEC-1).

  • The ability to stack the encryption "boxes" to allow multiple EC private keys to be used to decrypt the final key has many desirable operational properties over a scheme based on symmetric keys (e.g. no need for online backups).

  • The P-256 curve is believed to be 128-bit secure and ECDH with it is well supported on both Yubikeys and JavaCard hardware.

  • Ed25519 and Curve25519 ECDH were also considered, but lack of hardware support makes them impractical at the present time.
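
As a minimal sketch of this derivation using OpenSSL's EVP interface: our side holds the ephemeral private key, the recipient's public key is the static half, and the raw shared secret is hashed with SHA-512 before use (for P-256 the cofactor is 1, so plain ECDH and cofactor ECDH coincide). Key generation, the authenticated encryption of the box payload, and most error handling are elided.

#include <openssl/evp.h>
#include <openssl/sha.h>

/*
 * Derive the symmetric key for a DH box from our ephemeral key pair
 * and the recipient's static public key.
 */
static int
derive_box_key(EVP_PKEY *ephem, EVP_PKEY *recip_pub,
    unsigned char out[SHA512_DIGEST_LENGTH])
{
	EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new(ephem, NULL);
	unsigned char secret[64];
	size_t slen = 0;

	if (ctx == NULL ||
	    EVP_PKEY_derive_init(ctx) <= 0 ||
	    EVP_PKEY_derive_set_peer(ctx, recip_pub) <= 0 ||
	    EVP_PKEY_derive(ctx, NULL, &slen) <= 0 ||	/* query length */
	    slen > sizeof (secret) ||
	    EVP_PKEY_derive(ctx, secret, &slen) <= 0) {
		EVP_PKEY_CTX_free(ctx);
		return (-1);
	}

	/* KDF: hash the raw shared secret with SHA-512 before use. */
	SHA512(secret, slen, out);
	EVP_PKEY_CTX_free(ctx);
	return (0);
}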

15.2. Public-private encryption

The algorithm used for hardware authentication keys is RSA at 2048-bit key lengths. This is chosen because:

  • RSA is a widely used and well-studied cryptographic algorithm for signing and authentication.

  • The 2048-bit key length is chosen as a trade-off between security level and performance — Yubikeys and JavaCards are very slow at computing 4096-bit RSA signatures (on the order of hundreds of milliseconds).

  • Alternatives are not well-supported:

    • Ed25519 is not supported in either Yubikeys or JavaCard hardware.

    • ECDSA on NIST P-curves is supported by Yubikeys but not most JavaCard hardware options at this time.

RSA in Smartcard devices has a mixed history of side-channel attacks, but modern hardware has extensive mitigations to lower their impact. The lack of widespread support for alternatives at the present time is the main limiting factor here.

15.3. Soft token

Soft tokens will support Ed25519 and RSA-4096 for public/private cryptography. They will also support ChaCha20-Poly1305 for symmetric key operations (with the key protected by the same ECDH box scheme as above). A brief sketch of these primitives follows the rationale lists below.

Ed25519 and RSA-4096 are chosen because:

  • Ed25519’s reference implementation is of excellent code quality and readily usable for the soft token.

  • Ed25519 is highly side-channel resistant, particularly to CPU cache timing side-channels. The soft token must run on the same hardware as customer workload, and possibly the workloads of other customers, meaning that resistance to side-channel attacks is paramount.

  • RSA is available in addition to Ed25519, as Ed25519 is not yet widely supported in TLS and X.509 certificates. The RSA key can only be used for signing X.509 certificates as outlined above, and not for general authentication.

  • ECDSA has a questionable history with respect to side-channel attacks, with many more successful attacks documented than on the other algorithms considered, so it was eliminated.

ChaCha20-Poly1305 is chosen because:

  • It is a strong AEAD cipher + MAC combination that has been quite well-studied despite being younger than AES.

  • Compared with AES and other cipher families, its implementation is simpler and was built from the beginning to support authenticated operation.

  • It is explicitly designed for side-channel resistance. While AES could have been chosen, assuming that AES-NI or SSE3 are available, it is desirable to not have to require these CPU features for the system to operate safely.
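
Purely as an illustration of these primitives, here is a sketch using libsodium (which is a convenient reference for both algorithms; this is not a statement about which implementation the soft token will ship with). The message text and key handling are illustrative only.

#include <sodium.h>

int
main(void)
{
	unsigned char pk[crypto_sign_PUBLICKEYBYTES];
	unsigned char sk[crypto_sign_SECRETKEYBYTES];
	unsigned char sig[crypto_sign_BYTES];
	unsigned long long siglen, clen;
	unsigned char key[crypto_aead_chacha20poly1305_ietf_KEYBYTES];
	unsigned char nonce[crypto_aead_chacha20poly1305_ietf_NPUBBYTES];
	const unsigned char msg[] = "zone authentication challenge";
	unsigned char ct[sizeof (msg) +
	    crypto_aead_chacha20poly1305_ietf_ABYTES];

	if (sodium_init() < 0)
		return (1);

	/* Ed25519: sign a challenge with the zone's soft-token key. */
	crypto_sign_keypair(pk, sk);
	crypto_sign_detached(sig, &siglen, msg, sizeof (msg), sk);

	/* ChaCha20-Poly1305 (IETF variant): AEAD-protect a payload. */
	randombytes_buf(key, sizeof (key));
	randombytes_buf(nonce, sizeof (nonce));
	crypto_aead_chacha20poly1305_ietf_encrypt(ct, &clen, msg,
	    sizeof (msg), NULL, 0, NULL, nonce, key);

	return (0);
}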

16. Recovery Scenarios

This is currently a placeholder; the intention is to add recovery scenarios and the example steps that would be taken to recover. The design is such that, as long as the rfd77:config property is intact, one should be able to use either the token 9a key or N of M recovery secrets (assigned to individuals) to recover the encryption key and mount the zpool. We want to validate the sequence of steps (including potential mis-steps) against the actions of KBMAPI and kbmd, to minimize (or, hopefully, eliminate) any scenarios that lead to the loss of the zpool encryption key (and thus loss of data in the zpool).


1. It is also worth noting that with this rule, there is no real advantage to permanently having exactly 2 trusted CNs — it will cost in terms of overhead without increasing security, since a single signature is still all that is required to update the trusted set.
2. "Alternative chains" here refers to the TLS notion of providing a single entity certificate, signed by a single issuer DN, and then providing multiple certificates for that issuer DN that are signed by different upstream issuers themselves. This practice is already commonly used in the Internet today when introducing new CAs and is quite widely supported.