authors | state | discussion |
---|---|---|
Mike Gerdts <mike.gerdts@joyent.com> | predraft | |
This RFD describes containers on Linux compute nodes. This is part of a larger effort described in RFD 177.
The Linux Compute Node project intends to introduce Linux containers on Linux
Compute Nodes. The intent is for native Linux containers to fill the role
traditionally filled by lx on SmartOS. Since the SmartOS lx brand was created
to emulate a Linux kernel, lx images just work on Linux, except in those
places where the image was customized to take advantage of SmartOS features.
For example, most lx images will try to use the SmartOS `zfs` command as
`/native/usr/sbin/zfs`, a path that does not exist on Linux.
Linux containers and zones share many concepts, but the implementations are quite different. While zones have a bespoke set of utilities for configuration and administration, no equivalent exists for Linux containers. On the contrary, Linux containers lack a firm definition in practice or code, and the various container management tools vary in which containment features they use.
A Linux container can be nebulously defined as a collection of namespaces and control groups that provide isolation and resource controls. Unlike zones, containers have no unique in-kernel ID. Taken together, these properties make it rather easy to create a container that does a poor job of containing the things that run inside it. For example, some container managers do not virtualize the UID namespace. Without a distinguishing in-kernel container ID, this means that the root user in the container is the same as the root user outside of the container, which has been repeatedly leveraged in container escapes.
Since Linux containers are intended to be managed using Triton APIs, the focus of this effort is to provide the glue between the container features found in popular Linux distributions and the Triton APIs. Particular care must be taken to ensure that security best practices are used.
`machinectl` uses terminology that is generally consistent with that used in Triton. Terms
that are important to this document are:
- A virtual machine (VM) virtualizes hardware to run full operating system (OS) instances (including their kernels) in a virtualized environment on top of the host OS.
- A container shares the hardware and OS kernel with the host OS, in order to run OS userspace instances on top of the host OS.
- Machine is a generic term to refer to a virtual machine or a container. Instance has sometimes been used in place of machine.
- Image has multiple meanings, depending on the context. In Triton, an image is a machine image that may be cloned to create a machine. It is typically obtained through IMGAPI. `machinectl` expands on this definition by considering the on-disk bits used by a specific machine to be that machine's image. In contrast, Triton would normally consider a specific machine's image to be the storage that was cloned to build the virtual machine.
Linux CNs are intended to only support containers, but many of the management concepts apply equally well to containers and virtual machines. When no distinction is needed, machine will be used instead of container.
The implementation aims to be as distribution agnostic as possible to allow flexibility in choosing to run on a different distribution, as the market dictates. Partially for this reason, the implementation will leverage systemd-nspawn for most aspects of container management. Notable exceptions include:
- Images will be managed using node-imgadm.
- Instance installation will be performed by node-vmadm.
As much as possible, native tools will be usable to observe and control machines.
To allow for evolution of the platform, configuration will be maintained in a Triton-centric form and transformed into the form appropriate for the platform image at CN boot time and as machines are added/removed/changed through the course of normal operation.
The following files and directories are required for a machine.
- `/<pool>/<uuid>`: the mountpoint of the machine's dataset, `<pool>/<uuid>`.
  - `root`: a subdirectory containing the container's root file system.
  - `config`: a subdirectory containing instance metadata, typically as json files.
- `/run/systemd/nspawn/<uuid>.nspawn`: the machine's systemd.nspawn(5) configuration file.
- `/var/lib/machines/<uuid>`: a symbolic link to `/<pool>/<uuid>/root` that exists for compatibility with `machinectl` and `systemd-nspawn@.service`.
To support persistence of image and machine configuration across reboots,
`/var/triton` will contain the following:

- `/var/triton/`: mounted from `<systempool>/system/var/triton`
  - `imgadm/`: the same as `/var/imgadm` on SmartOS. A compatibility symbolic link will exist in the PI.
    - `imgadm.conf`
    - `images/`
      - `<uuid>.json`: Image manifest for a particular image
      - ...
  - `vmadm/`: similar structure to `imgadm/`, analogous to `/etc/zones` on SmartOS.
    - `vmadm.conf`: any required configuration. Perhaps only includes a configuration version initially.
    - `machines/`
      - `<uuid>.json`: Payload for a particular machine. The content that is authoritatively stored in `/<pool>/<uuid>/config/*.json` is not included in this file.
      - ...
The systemd-nspawn configuration is stored as
`/run/systemd/nspawn/<uuid>.nspawn`. There are alternative locations for this
configuration file and alternative means for configuring per-instance nspawn
parameters. This location was chosen for the following reasons:

- If the configuration is at `/var/lib/machines/<uuid>.nspawn`, the `systemd-nspawn@.service` start command would need to be customized to trust the configuration. If Linux CNs were to eventually support virtual machines, this file would need to be in a different location (next to the disk image).
- `/run/systemd/nspawn/<uuid>.nspawn` does not persist across reboots, but the native configuration is regenerated on each boot by a systemd generator (see the discussion of `vmadm.install()` below), so persistence is not required.
- A per-machine systemd unit file, stored at `/run/systemd/system/systemd-nspawn@<uuid>.service`, could be created with all the required command line options. To get systemd to recognize this file, `systemctl daemon-reload` would need to be invoked. This seems like a heavy-weight operation.
The typical `/run/systemd/nspawn/<uuid>.nspawn` file will look like:

```
[Exec]
Boot=on
PrivateUsers=pick
MachineID=371f18c0-9f73-6e86-94f6-c1cf71188d23

[Network]
Private=yes
MACVLAN=external0
```
Resource controls can be managed via dbus, allowing for live updates. For example:
```
uuid=371f18c0-9f73-6e86-94f6-c1cf71188d23
# dbus object paths escape characters outside [A-Za-z0-9] as _XX (hex):
# "-" becomes _2d, "@" becomes _40, and "." becomes _2e.
uuid_mangled=${uuid//-/_2d}
unit_path=/org/freedesktop/systemd1/unit/systemd_2dnspawn_40${uuid_mangled}_2eservice
prop=MemoryMax
newval=$(( 1024 * 1024 * 1024 ))
# Set to false to also update
# /etc/systemd/system.control/systemd-nspawn@$uuid.service.d/50-$prop.conf
runtime_only=true
busctl call org.freedesktop.systemd1 \
    "$unit_path" \
    org.freedesktop.systemd1.Unit \
    SetProperties 'ba(sv)' $runtime_only 1 $prop t $newval
```
There are a couple of node modules that provide an easy way to interact with dbus programmatically:
- dbus-next is a pure JavaScript implementation that is actively maintained. It is the successor to dbus-native, which is deprecated.
- node-dbus is a mixture of C++ and JavaScript. It seems to be less active both in maintenance and popularity on npm.
Prototyping will start with dbus-next.
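As a concrete illustration, the following sketch performs the same live resource-control update as the busctl example above, but from node. It is a prototype-level sketch that assumes dbus-next's promise-based proxy API and uses systemd's `Manager.SetUnitProperties` method; it is not the final node-vmadm code.

```
// Sketch: set MemoryMax on a machine's unit via dbus-next (assumed API).
const dbus = require('dbus-next');
const { Variant } = dbus;

async function setMemoryMax(uuid, bytes) {
    const bus = dbus.systemBus();
    const obj = await bus.getProxyObject('org.freedesktop.systemd1',
        '/org/freedesktop/systemd1');
    const manager = obj.getInterface('org.freedesktop.systemd1.Manager');

    // runtime=true changes only the running unit; false also persists the
    // change as a drop-in under /etc/systemd/system.control.
    await manager.SetUnitProperties(`systemd-nspawn@${uuid}.service`, true,
        [['MemoryMax', new Variant('t', BigInt(bytes))]]);
    bus.disconnect();
}

setMemoryMax('371f18c0-9f73-6e86-94f6-c1cf71188d23', 1024 * 1024 * 1024)
    .catch(console.error);
```

Using the manager's `SetUnitProperties` avoids hand-escaping the unit's dbus object path, which the busctl example must do.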
The Linux implementation of `node-vmadm` will have at least the following machine states:
- configured: The machine payload, `/var/triton/vmadm/machines/<uuid>.json`, exists and the machine's dataset exists.
- installed: The machine payload has been transformed into native configuration.
- running: The init process in the container is running.
The state transitions happen via `configure`, `unconfigure`, `install`,
`uninstall`, `start`, and `stop` primitive functions, some of which are not
exported.
```
          +------------+
          |  no state  |
          +------------+
              |      ^
  configure() |      | unconfigure()
              V      |
          +------------+
          | configured |
          +------------+
              |      ^
    install() |      | uninstall()
              V      |
          +------------+
          |  installed |
          +------------+
              |      ^
      start() |      | stop()
              V      |
          +------------+
          |  running   |
          +------------+
```
Higher level functions are composed of primitive functions. For example:

- `create()` invokes `configure()` and `install()`.
- `delete()` may invoke `stop()`, `uninstall()`, and `unconfigure()`.
- `reboot()` invokes `stop()` and `start()`.
Unlike with SmartOS, `vmadm.install()` will be called on each boot by a systemd
generator. Its job is to ensure that the Triton configuration has been
transformed into Linux native configuration.
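To make the transformation concrete, here is a rough sketch of what the install step might look like, given the file layout described earlier. It is illustrative only, not the actual node-vmadm backend; the network setting is a placeholder and the payload is assumed to carry the VMAPI `uuid` property.

```
// Illustrative sketch only: transform the Triton payload for one machine into
// the native configuration described above.
const fs = require('fs');
const path = require('path');

function install(uuid, pool) {
    const payload = JSON.parse(fs.readFileSync(
        `/var/triton/vmadm/machines/${uuid}.json`, 'utf8'));

    // Regenerate the nspawn configuration; it lives in /run and is rebuilt on
    // every boot by the generator. Network selection is elided here (see the
    // networking discussion below); MACVLAN=external0 is just a placeholder.
    const nspawn = [
        '[Exec]',
        'Boot=on',
        'PrivateUsers=pick',
        `MachineID=${payload.uuid}`,
        '',
        '[Network]',
        'Private=yes',
        'MACVLAN=external0',
        ''
    ].join('\n');
    fs.writeFileSync(`/run/systemd/nspawn/${uuid}.nspawn`, nspawn);

    // Compatibility symlink for machinectl and systemd-nspawn@.service.
    const link = `/var/lib/machines/${uuid}`;
    if (!fs.existsSync(link)) {
        fs.symlinkSync(path.join('/', pool, uuid, 'root'), link);
    }
}
```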
`node-vmadm` will gain a `vmadm` command that mimics the command of the same
name found on SmartOS.
With a few exceptions, the VMAPI properties
will be stored in `/var/triton/vmadm/machines/<uuid>.json`. The exceptions are:

- `customer_metadata` is stored in `/<pool>/<uuid>/config/metadata.json` in the `customer_metadata` key.
- `internal_metadata` is stored in `/<pool>/<uuid>/config/metadata.json` in the `internal_metadata` key.
- `last_modified` is derived from the timestamp of the last modification. If there is no in-memory state that tracks this, then it will be the modification time of the newest of `/var/triton/vmadm/machines/<uuid>.json` and `/<pool>/<uuid>/config/*.json`.
- `platform_buildstamp` comes from `TRITON_RELEASE` in `/etc/os-release`.
- `routes` is stored in `/<pool>/<uuid>/config/routes.json`.
- `server_uuid` comes from DMI (e.g. `/usr/sbin/dmidecode -s system-uuid`).
- `snapshots` comes from listing snapshots on the appropriate dataset(s).
- `state` is derived from the state of `systemd-nspawn@<uuid>.service`, or from on-disk state if there is no such service (see the sketch after this list).
- `tags` is stored in `/<pool>/<uuid>/config/tags.json`.
- `zfs_filesystem` is dynamically generated as `<zpool>/<uuid>`.
- `zone_state` does not exist.
- `zonepath` is `/<zfs_filesystem>`.
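A sketch of the `state` derivation mentioned above, using dbus-next (assumed API). The mapping from systemd's ActiveState values to VMAPI states, and the fallback value, are assumptions for illustration.

```
// Sketch: derive a machine's state from its systemd-nspawn@<uuid>.service unit.
const dbus = require('dbus-next');

async function machineState(uuid) {
    const bus = dbus.systemBus();
    const obj = await bus.getProxyObject('org.freedesktop.systemd1',
        '/org/freedesktop/systemd1');
    const manager = obj.getInterface('org.freedesktop.systemd1.Manager');

    // LoadUnit returns the unit's object path without hand-escaping the name.
    const unitPath = await manager.LoadUnit(`systemd-nspawn@${uuid}.service`);
    const unitObj = await bus.getProxyObject('org.freedesktop.systemd1', unitPath);
    const props = unitObj.getInterface('org.freedesktop.DBus.Properties');
    const active = await props.Get('org.freedesktop.systemd1.Unit', 'ActiveState');
    bus.disconnect();

    // 'active' maps to running; anything else falls back to the on-disk state
    // (assumed here to be 'installed').
    return active.value === 'active' ? 'running' : 'installed';
}
```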
Various VMAPI properties map to run time state, as described below:
Property | Maps To |
---|---|
cpu_cap | dbus org.freedesktop.systemd1.Unit CPUQuota |
cpu_shares | dbus org.freedesktop.systemd1.Unit CPUWeight |
hostname | nspawn: Exec.Hostname |
init_name | nspawn: Exec.Parameters |
max_locked_memory | Not supported |
max_lwps | dbus org.freedesktop.systemd1.Unit TasksMax |
max_physical_memory | dbus org.freedesktop.systemd1.Unit MemoryHigh |
max_swap | dbus org.freedesktop.systemd1.Unit MemorySwapMax, but keep in mind that this is swap space usage, not memory reservation |
nics | nspawn: Network.MACVLAN, plus scripting? |
pid | TBD |
quota | zfs property, same as SmartOS |
ram | TBD: how is this different from max_physical_memory? |
snapshots | Not implemented initially |
state | dbus org/freedesktop/machine1/machine/<uuid> org.freedesktop.DBus.Properties State |
zfs_data_compression | zfs property, same as SmartOS |
XXX Networking configuration is rather uncertain at this point. There are some options:
- Require cloud-init in each image
- Add another container that configures networking using tools that are under host control. After that container comes up, start the desired container with JoinsNamespaceOf so that it has the network namespace configured via the controlled environment.
- Use `ExecStartPost` to run `ip netns exec` commands to configure the network. This seems likely to race with things that are starting in the container, so it may require a customized init that waits for network configuration to be complete.
CN Agent has backends for SmartOS and dummy (mockcloud). The backends make use
of `imgadm`, `node-vmadm`, and other modules. This document is primarily
concerned with `node-vmadm`.
`node-vmadm` also has per-platform backends. A backend will be added that
implements the API by interacting with dbus as much as possible. There are
parts of `create`, `delete`, and `update` that will require manipulation of
datasets and files.
The create function will:
- Create the `/var/triton/vmadm/machines/<uuid>` link.
- Clone the image.
- Populate `/<pool>/<uuid>/config/*.json`.
- Set resource controls using dbus.
- Create `/run/systemd/nspawn/<uuid>.nspawn`, as described above.
The order should be arranged such that if a `create` operation is interrupted
it is possible to determine that the creation was not complete and to identify
all of the components that are related to the instance. The existence of the
`/var/lib/machines/<uuid>` link indicates that machine creation has begun. The
existence of `/run/systemd/nspawn/<uuid>.nspawn` indicates that the creation
has completed.
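A small sketch of the check this ordering enables; the status names are illustrative.

```
// Sketch: classify an instance's creation status from the two markers above.
const fs = require('fs');

function creationStatus(uuid) {
    const begun = fs.existsSync(`/var/lib/machines/${uuid}`);
    const complete = fs.existsSync(`/run/systemd/nspawn/${uuid}.nspawn`);

    if (!begun) {
        return 'absent';            // creation never started (or fully undone)
    }
    return complete ? 'complete' : 'partial';   // partial creations need cleanup
}
```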
The delete function will undo the operations performed by `create()`, in the
reverse order from create.
The kill function will send the specified signal to the init process.
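A hedged sketch of how that could be done through machined rather than by looking up the pid, assuming the machine is registered with machined under its uuid and using dbus-next:

```
// Sketch: send a signal to the container's init ("leader") via machined.
const dbus = require('dbus-next');

async function kill(uuid, signal) {
    const bus = dbus.systemBus();
    const obj = await bus.getProxyObject('org.freedesktop.machine1',
        '/org/freedesktop/machine1');
    const manager = obj.getInterface('org.freedesktop.machine1.Manager');
    // "leader" addresses PID 1 inside the container; "all" would signal
    // every process in the machine.
    await manager.KillMachine(uuid, 'leader', signal);
    bus.disconnect();
}
```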
The reboot function will reboot the instance, similar to `machinectl reboot`.
Not implemented initially.
The start function will start `systemd-nspawn@<uuid>.service`.
XXX It remains to be determined if that is sufficient: there may be additional networking setup or other actions required.
XXX How does this relate to `machinectl enable`?
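A minimal sketch of starting the unit over dbus (dbus-next, assumed API) rather than shelling out to systemctl:

```
// Sketch: start the machine's unit; returns the object path of the queued job.
const dbus = require('dbus-next');

async function start(uuid) {
    const bus = dbus.systemBus();
    const obj = await bus.getProxyObject('org.freedesktop.systemd1',
        '/org/freedesktop/systemd1');
    const manager = obj.getInterface('org.freedesktop.systemd1.Manager');
    return manager.StartUnit(`systemd-nspawn@${uuid}.service`, 'replace');
}
```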
The stop function will perform the equivalent of `machinectl stop` (without
force) or `machinectl terminate` (with force).
XXX How does this relate to `machinectl disable`?
The sysrq function will be a no-op.
The update function will make modifications to the machine using the same
mechanisms used during `create` and perhaps `delete`.
The load function will load the machine's properties from the authoritative sources described above in VMAPI mapping.
The `lookup` function will perform a `load` on every VM, removing those that do
not match the filter specified by `search`. If `opts.fields` is specified,
fields not listed are elided.
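The filtering and field elision could look roughly like the sketch below; it assumes `vms` is the array produced by calling `load` on every machine, and simplifies the filter to exact key/value matches.

```
// Sketch: the filter / field-elision half of lookup().
function filterVms(vms, search, opts = {}) {
    let result = vms.filter((vm) =>
        Object.entries(search).every(([key, want]) => vm[key] === want));

    if (opts.fields) {
        // Keep only the requested fields.
        result = result.map((vm) => Object.fromEntries(
            opts.fields.map((field) => [field, vm[field]])));
    }
    return result;
}

// Example: machines with alias "db0", returning only uuid and state.
// filterVms(vms, { alias: 'db0' }, { fields: ['uuid', 'state'] });
```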
Not implemented initially.
Not implemented initially.
Not implemented initially.
Not implemented initially.
Not implemented initially.
It is anticipated that this will be built on watching for relevant dbus events.
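For example, machined emits MachineNew and MachineRemoved signals that could drive this. A sketch with dbus-next (the callback shape is illustrative):

```
// Sketch: watch machined for machines coming and going.
const dbus = require('dbus-next');

async function watchMachines(onChange) {
    const bus = dbus.systemBus();
    const obj = await bus.getProxyObject('org.freedesktop.machine1',
        '/org/freedesktop/machine1');
    const manager = obj.getInterface('org.freedesktop.machine1.Manager');

    manager.on('MachineNew', (name, objpath) => onChange('running', name));
    manager.on('MachineRemoved', (name, objpath) => onChange('stopped', name));
}

watchMachines((state, name) => console.log(`${name} is now ${state}`))
    .catch(console.error);
```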
The responsibilities and theory of operation of VM agent are described in
`vm-agent.js`. The Linux port will follow the same general operation, but the
implementation will leverage dbus and inotify to get updates about machine and
file system state changes. It is likely that `node-vmadm` will be useful.
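A sketch of the file system half, using Node's fs.watch (inotify on Linux) on the payload directory described earlier; the handler is illustrative.

```
// Sketch: notice changes to machine payloads; a real VM agent would reload the
// affected machine and publish the update.
const fs = require('fs');

fs.watch('/var/triton/vmadm/machines', (eventType, filename) => {
    if (filename && filename.endsWith('.json')) {
        const uuid = filename.slice(0, -'.json'.length);
        console.log(`machine ${uuid} payload ${eventType}`);
    }
});
```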