authors | state |
---|---|
Robert Mustacchi <rm@joyent.com> | draft |
If there is one rule of firmware, it is that it will inevitably have bugs and that field upgrades will be required. It is our experience that this is certainly true for disk drives, where this problem is most egregious due to the sheer quantity of drives that need to be upgraded.
While the broader system should make managing and rolling out such upgrades easier for all types of FRUs (field replaceable units), the present focus of this RFD is on the interfaces that illumos provides for upgrading disks that leverage SCSI.
The history of firmware upgrades on systems is fraught with peril, DOS, and then more peril. While an individual FRU cannot remain in service while a firmware upgrade is being performed, having to switch to an alternative operating system and take down service for the entire box is not acceptable. To that end, illumos has the fwflash(1M) command, which can be used both to list the firmware versions of parts and to perform firmware upgrades.
Firmware upgrade of disks is plagued by two different but related problems.
- The fwflash(1M) tool hard codes a maximum firmware size of 1.4 MiB which is suspiciously close to the size of a floppy disk.
There is no basis for this value. SCSI itself supports a maximum transfer of around 16 MiB and provides facilities to query the maximum buffer size that a device advertises it will support.
- uscsi(7I) does not support partial DMA. This is compounded by the general lack of use of an IOMMU on x86.
It's worth talking about why this is a problem in more detail. An example of how it might manifest is discussed in illumos#5012. If a given operation cannot fit in one contiguous DMA buffer, then the operation must be spread out across multiple DMA windows. The underlying buffer is bound into the smaller DMA window. The window then proceeds to slide along the buffer, binding a different portion each time. The end result is that a single logical operation is broken up into multiple operations.
This works; however, it is complicated in the face of retries. If for some reason a device needs to retry an operation, then it may have to replay the entire transaction, meaning all of the various commands that caused DMA transactions. This is fine if and only if the various operations are idempotent during a given period. For example, reads and writes of a sector of a device are idempotent: even though you may opt to overwrite that disk sector some time in the future, the data is always the same for the duration of the single I/O.
uscsi(7I) doesn't have this same guarantee. Because uscsi(7I) allows arbitrary SCSI commands to be generated and sent, they may or may not be idempotent at all, as the underlying set of SCSI commands that can be sent is quite varied.
While an I/O MMU does address this problem by allowing the logical DMA to be broken up into multiple series of physically disjoint pages, many x86 systems either do not have an I/O MMU or do not enable it. Therefore a solution needs to be worked out that addresses this.
The key observation here is that we're trying to perform firmware upgrade specifically, not solve this problem for every possible uscsi(7I) ioctl that one might issue. This means that we specifically care about the WRITE BUFFER SCSI command (SPC-3 6.35), which has different modes for writing firmware to devices. There are two modes that we generally care about. Note, the phrase microcode is used below to match the SCSI specification; however, it can be used interchangeably with firmware.
- Mode 5 - Download Microcode Data (to the device) and Save
- Mode 7 - Download Microcode Data (to the device) with Offsets and Save
Mode 5 is used to perform a single download of the entire firmware image in a single SCSI command. This is the form that the fwflash(1M) utility uses today.
Mode 7 is designed to allow for multiple writes into the buffer. Specifically the specification allows for the firmware download to occur across multiple WRITE BUFFER requests. Using this mode, we can handle the partial writes as long as we can determine what the right size of the buffer is.
Ideally we would say that if we could do the entire write of the firmware image in one go then we'll use a single WRITE BUFFER command with mode 5, otherwise we'll issue a number of mode 7 requests. Mode 7 offsets are subject to a required alignment which we can determine via the READ BUFFER command (SPC-3 6.15) mode 3.
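The decision described above can be sketched in C. This is only an illustration, not the fwflash(1M) implementation: the `wb_*` names are hypothetical, the maximum transfer size and the alignment are taken as plain parameters, and the sketch assumes the 10-byte WRITE BUFFER CDB layout from SPC-3 (operation code 0x3B, mode in the low five bits of byte 1, a 24-bit buffer offset and a 24-bit parameter list length).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	WB_OPCODE	0x3B	/* WRITE BUFFER operation code */
#define	WB_MODE_DL_SAVE	0x05	/* download microcode and save */
#define	WB_MODE_DL_OFFS	0x07	/* download microcode with offsets and save */
#define	WB_CDB_LEN	10

/* Largest chunk <= max_xfer that is a multiple of align (a power of two). */
static size_t
wb_chunk_size(size_t max_xfer, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (max_xfer & ~(align - 1));
}

/* Fill in a WRITE BUFFER CDB; offset and length are 24-bit fields. */
static void
wb_build_cdb(uint8_t cdb[WB_CDB_LEN], uint8_t mode, uint32_t off, uint32_t len)
{
	assert(off <= 0xFFFFFF && len <= 0xFFFFFF);
	memset(cdb, 0, WB_CDB_LEN);
	cdb[0] = WB_OPCODE;
	cdb[1] = mode & 0x1F;
	/* cdb[2] is the buffer ID; this sketch leaves it at zero. */
	cdb[3] = (off >> 16) & 0xFF;
	cdb[4] = (off >> 8) & 0xFF;
	cdb[5] = off & 0xFF;
	cdb[6] = (len >> 16) & 0xFF;
	cdb[7] = (len >> 8) & 0xFF;
	cdb[8] = len & 0xFF;
}

/*
 * Pick the mode and return how many WRITE BUFFER commands are needed.
 * Returns 0 if the image is empty or no aligned chunk fits in max_xfer.
 */
static size_t
wb_plan(size_t image_len, size_t max_xfer, size_t align, uint8_t *modep)
{
	size_t chunk;

	if (image_len == 0)
		return (0);
	if (image_len <= max_xfer) {
		*modep = WB_MODE_DL_SAVE;
		return (1);
	}
	chunk = wb_chunk_size(max_xfer, align);
	if (chunk == 0)
		return (0);
	*modep = WB_MODE_DL_OFFS;
	return ((image_len + chunk - 1) / chunk);
}
```

For instance, with a 1 MiB maximum transfer and a 4 KiB alignment, a 512 KiB image would go out as one mode 5 command, while an image one byte larger than 3 MiB would need four mode 7 commands. A real implementation would also need to check the image against the device's advertised buffer capacity, which this sketch leaves out.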
For this to work though, we need to know what that maximum buffer size
is. While it may be tempting to try and determine it by doing a single
mode 5 write and then halving the buffer size until one works, it makes
more sense to instead plumb that through the stack as the kernel
actually knows. For example, in sd(7D), it's contained in the
`struct sd_lun`'s `un_max_xfer_size` member. The first step of this is
to allow a user to query this through uscsi(7I).
To allow consumers of uscsi(7I) to determine what the actual maximum
transfer size is, I propose to add a new ioctl that drivers may support
called `USCSIMAXXFER` along with a new type that is used with the ioctl.
The following is an excerpt from the updated uscsi(7I) manual page which
describes the ioctl.
USCSIMAXXFER
The argument is a pointer to a uscsi_xfer_t value. The
maximum transfer size that can be used with the USCSICMD
ioctl for the current device will be returned in the
uscsi_xfer_t. The actual transfer size may be limited
further based on the specific SCSI device and details
of the implemented command.
Not all devices which support the USCSICMD ioctl also
support the USCSIMAXXFER ioctl.
The definition of the uscsi_xfer_t is provided in the uscsi header files and is simply a uint64_t. It looks like:
typedef uint64_t uscsi_xfer_t;
This new ioctl has the exact same requirements for use as the existing
`USCSICMD` ioctl. A user that does not have the privilege to use the
`USCSICMD` ioctl will not be able to use the `USCSIMAXXFER` ioctl.
Note that at this time, only sd(7D) will be enhanced to support the
`USCSIMAXXFER` ioctl. This is part of the reason that a new ioctl was
chosen and that the reserved portion of a `uscsi_cmd_t` was not used: to
allow for different devices to opt into supporting this as the need
arises.
libscsi is a private library that was introduced by Eric Schrock in
PSARC 2008/196 libscsi and libses. It originally came out of the work
done by Fishworks. While we do not have the same firmware upgrade
tooling that the team there used (it is lost to the sands of time behind
closed doors at Oracle), it was built upon libscsi and friends.
Importantly, fwflash(1M) uses libscsi to do the heavy lifting in a rather useful way. To enable this, a new function will be added to libscsi to allow a user to determine the maximum number of bytes that can be transferred in a single command. The current function prototype looks like:
extern int libscsi_max_transfer(libscsi_target_t *, size_t *);
This function will leverage the appropriate libscsi engine (currently
only uscsi) based on the target to determine the maximum transfer size.
Note a `size_t` here is explicitly chosen for a few reasons as opposed
to using a 64-bit capable type similar to the `uscsi_xfer_t`. libscsi
already describes buffer sizes using a `size_t` (see
`libscsi_get_buffer()` or `libscsi_set_datalen()`). Part of this is
likely based on the fact that uscsi(7I) leverages values of `size_t`,
thus causing the ioctl to be different on ILP32 and LP64.
I made the conscious choice to try and give an accurate value via the
uscsi(7I) ioctl interface even if consumers could not do more.
Practically speaking, this isn't a realistic problem as SCSI itself
generally has a 16 MiB maximum transfer size. If such a case were to
occur, the uscsi engine of libscsi would silently truncate the value at
the maximum size of a `size_t` on the appropriate platform.
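The truncation described above might look like the following sketch. The `xfer_to_size()` helper is a hypothetical name, not part of libscsi, and the `typedef` is a local stand-in for the proposed `uscsi_xfer_t` so that the example builds without the uscsi header files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the proposed type, which is simply a uint64_t. */
typedef uint64_t uscsi_xfer_t;

/*
 * Clamp a 64-bit transfer size to what a size_t can describe. On LP64
 * this never clamps; on ILP32 anything above SIZE_MAX is truncated to
 * SIZE_MAX.
 */
static size_t
xfer_to_size(uscsi_xfer_t xfer)
{
	assert(sizeof (size_t) <= sizeof (uscsi_xfer_t));
	if (xfer > (uscsi_xfer_t)SIZE_MAX)
		return (SIZE_MAX);
	return ((size_t)xfer);
}
```

Given the 16 MiB practical limit noted above, the clamp should never trigger in practice; it only keeps the conversion well defined on ILP32.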
In addition, the libscsi engine API will need to be enhanced to provide
a new means of asking the engine this question. This private interface
will be amended to add another entry point to the
`libscsi_engine_ops_t` structure. It will add a new member that looks
like:
int (*lseo_max_transfer)(libscsi_hdl_t *, void *, size_t *);
As part of this the value of `LIBSCSI_VERSION` will be incremented to
version 2. Note as this is a private interface, older versions and out
of gate consumers should not be a problem; however, incrementing the
interface version should still be done.
Arguably, this is just a simple bug in fwflash(1M); however, as it ties into the previous section it's worth mentioning here again explicitly. To determine the maximum image size rather than assuming anything, we must actually ask the device. Specifically we can use the READ BUFFER command to determine the maximum size that will be accepted for the firmware image. The mode 3 option 'descriptor' returns both the maximum image size and the required offset alignment.
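As an illustration of what the descriptor contains, the following sketch decodes the four-byte READ BUFFER descriptor as laid out in SPC-3: byte 0 is the OFFSET BOUNDARY, a power-of-two exponent in which 0xFF means that only offset zero is supported, and bytes 1 through 3 are the big-endian BUFFER CAPACITY. The `rb_desc_t` type and `rb_parse_desc()` function are hypothetical names for this example, not existing fwflash(1M) or libscsi interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	RB_DESC_LEN		4
#define	RB_OFFSET_ZERO_ONLY	0xFF

typedef struct rb_desc {
	uint32_t rbd_capacity;	/* maximum image size in bytes */
	uint64_t rbd_align;	/* required offset alignment; 0 if only */
				/* offset zero is supported */
} rb_desc_t;

/* Decode a READ BUFFER descriptor. Returns 0 on success, -1 on bad input. */
static int
rb_parse_desc(const uint8_t *buf, size_t len, rb_desc_t *out)
{
	uint8_t exp;

	assert(buf != NULL && out != NULL);
	if (len < RB_DESC_LEN)
		return (-1);

	exp = buf[0];
	out->rbd_capacity = ((uint32_t)buf[1] << 16) |
	    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	if (exp == RB_OFFSET_ZERO_ONLY) {
		out->rbd_align = 0;
	} else if (exp >= 64) {
		/* An alignment of 2^exp would not fit in 64 bits. */
		return (-1);
	} else {
		out->rbd_align = (uint64_t)1 << exp;
	}
	return (0);
}
```

A descriptor of `09 20 00 00` would therefore decode to a 2 MiB capacity with a 512-byte offset alignment. When the alignment comes back as zero, mode 7 writes at non-zero offsets are not possible on that device.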
The generic SD verification module for fwflash(1M) will be updated to use this as the basis for the maximum size rather than its current assumption.
With these changes to the stack, it should be possible to address all of
the current issues around firmware upgrade of devices handled by sd.
This is particularly prominent as many users are using `mpt_sas`, which
has a maximum single transfer size of 1 MiB, and the firmware for many
drives exceeds that size.
It is planned that all of these changes will be integrated at the same time in illumos-joyent and after a period of additional production experience, integrated into illumos.
In the future, we should have RFDs that cover aggregating information about component firmware revisions across a fleet so that operators can understand which firmware versions are deployed. We should also look at providing the means for rolling upgrades of firmware across systems, in a similar fashion to how the Fishworks appliances rolled out firmware upgrades across all the disks in a chassis.
Today fwflash(1M) can already handle general disks, SES devices, and various IB and FC devices. We should evaluate having specific firmware update mechanisms for SPs, NIC EEPROMs, etc. We should also do work to make sure that all of the versioning information is shared with the topo snapshot so that we can better aggregate and report on this in fmtopo and an eventual DC-wide aggregation.