| authors | state |
| --- | --- |
| Jordan Hendricks <jordan.hendricks@joyent.com> | publish |
This document describes the proposed design and implementation of a multipart upload mechanism for Manta, similar to Amazon S3's multipart upload API.
In section 1, we discuss the proposed design for this feature in detail: first outlining desired operations for a multipart upload API, then reviewing design constraints and requirements for implementing such a feature in Manta, and finally, reasoning through each piece of the design. If you are interested in the motivation behind different designs for Manta multipart uploads, you should read this section.
In section 2, we provide a more concise summary of the design: the precise API calls and how they map to an HTTP request, as well as a summary of the required Manta metadata to achieve this design. If you only want to see a summary of the API, skip to this section.
Finally, in section 3, we revisit the desired operations and design constraints and requirements proposed in section 1 and summarize how these are addressed in the final design.
A multipart upload API allows a user of an object store to break an object into distinct parts, upload those parts over time, and create a single object in the object store from them. Such a feature may be useful for devices uploading large objects over unreliable network connections, or for uploading a large object in parallel from multiple devices.
At a minimum, there are a few things a consumer of a multipart upload API will need to do:
- Initiate a multipart upload for an object
- Upload the parts of that object
- Indicate that the upload is completed after all parts are uploaded
In addition to these operations, some others that are useful are:
- Canceling an upload in progress
- Querying an upload's state
- Listing parts that have been uploaded for an upload
- Seeing all multipart uploads in progress
These operations are all available in Amazon's S3 storage API for multipart uploads.
From a Manta implementation perspective, at a minimum, we need the following things to implement a multipart upload API:
- A place to store the parts as they are uploaded
- A mechanism to create the final object from the uploaded parts
Additionally, to make the full suite of desired multipart upload operations work well, we need to:
- Ensure that parts can be listed easily
- Ensure that uploads can be listed easily
- Account for multiple clients using the API and various component failures (especially for the case of completing an upload)
- Account for cleanup of upload metadata and part data
Finally, we want to reuse existing code in Manta components wherever possible.
In terms of API constraints, we will adopt many of Amazon S3's general constraints for multipart uploads:
- Parts may be uploaded in any order
- There must be a consecutive set of parts from `{0..N}` specified on the commit of the object
- An upload has a maximum of 10,000 parts
- Parts have a minimum size of 5 MB. This avoids creation of large numbers of tiny parts, which have an associated cost to store.
With these goals and constraints in mind, we can now begin to design the multipart upload implementation in Manta.
Similar to the Manta jobs API, which stores information about jobs at `/$ACCOUNT/jobs` in Manta, we can store multipart uploads under a special `uploads` directory at the root of the account.
One way to do this is to create a single directory for each upload, with the upload ID as the directory name, and store the parts for the upload in its directory. Having a well-defined directory structure, with uploads in a single directory and the parts of an upload also in a single directory, is desirable, as it allows both uploads and their parts to be listed easily. But keeping all upload directories in the same parent directory also has a drawback: the upper bound of ongoing multipart uploads for a given account is fixed at the number of allowed directory entries (a few million).
To circumvent this limitation while still limiting the number of shards upload metadata records live on, we can compromise a bit. We distribute the uploads into a fixed number of subdirectories within the special `uploads` directory, with each subdirectory named for one possible character of a uuid. An upload is stored in the subdirectory matching the first character of its uuid. This is referred to as the multipart upload's "prefix" directory.
More concretely, the path to a given parts directory is thus: `/$ACCOUNT/uploads/[0-f]+/<upload uuid>`. Parts are then stored as `/$ACCOUNT/uploads/[0-f]+/<upload uuid>/N`, where `N` is the part number of the part.
A prefix directory with a prefix length of 1 means a maximum of 16 shards need to be contacted to list all ongoing uploads for an account. The prefix length is a tunable, which allows operators to change the number of prefix directories if they find there is not enough capacity for ongoing multipart uploads. (See MANTA-3427 for more details.)
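To make the layout concrete, the upload and part paths can be derived as in the following sketch. The function names are illustrative, not from the Manta codebase:

```javascript
// Sketch: derive multipart upload paths under the account's special
// "uploads" directory. prefixLen is the tunable prefix length; with
// prefixLen = 1, uploads fan out across 16 prefix directories (0-f).
function uploadPath(account, uploadId, prefixLen) {
  // The prefix directory is the first prefixLen characters of the uuid.
  const prefix = uploadId.substring(0, prefixLen);
  return '/' + account + '/uploads/' + prefix + '/' + uploadId;
}

function partPath(account, uploadId, partNum, prefixLen) {
  return uploadPath(account, uploadId, prefixLen) + '/' + partNum;
}
```

For example, with a prefix length of 1, an upload whose uuid begins with `3` for account `jhendricks` lives under `/jhendricks/uploads/3/`.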
The completed object is a concatenation of all of the multipart upload's parts, which are finalized on commit. Because objects in Manta are immutable, this operation will require copying data at some point — we cannot simply append parts to an existing object as they are uploaded. Additionally, a multipart upload object could be very large, so we want to avoid copying parts over the network if possible.
As such, we will store all of the parts on all the storage nodes that the final object will live on. This way, the object can be constructed on the storage node itself, and copying data will only occur locally. This approach requires selecting the set of storage nodes the object will reside on when the upload is initiated, and ensuring that all parts are uploaded to this set.
When the upload is committed, we must do two things: idempotently create the object on disk on the storage node, as if it was placed there in a normal object PUT in Manta, and remove the parts to avoid taking up unnecessary space on the node. It is important that the construction of these two operations be idempotent, in case the storage node restarts and the operation has to be retried.
To do so, we propose a new operation: `mako-finalize`. In this mechanism, the storage node first concatenates the object parts to be committed to the object — these are specified as a parameter to the commit. Once the storage node has finished writing the object, it removes the parts, as it is now safe to do so. Finally, `mako-finalize` will return the md5sum of the object, so that it can be stored in the metadata record of the committed object.
The idempotency of `mako-finalize` is critical here. If the storage node crashes before or in the middle of writing the object, `mako-finalize` can still reconstruct it, as none of the parts have yet been removed. Importantly, this presumes a restriction on committing uploads: after a commit has begun, the set of parts input to the commit (identified by their etags from the client, and translated to object IDs by muskie) cannot change, even if the commit is retried. Otherwise, we could get into a state where an uploaded object with one set of parts is created, the parts are removed from the mako node, and the commit operation fails at a later step, leaving the client to retry the commit. At this point, the parts have been cleaned up.
We also need to protect against multiple clients attempting a commit for the same object ID. To do this, we must consider the case where two `mako-finalize` operations are running in parallel for the same object ID. If this occurs, there is a potential race between one operation beginning to remove parts after writing out the object file, while another is attempting to use them. We can solve this problem by having `mako-finalize` append parts to a temporary file first, then perform an atomic rename of the temporary file to the object ID. We then do an fsync on the file and directory, before beginning to remove parts. If `mako-finalize` finds that some parts are missing when it tries to append them to the temporary file, this means that the object already exists (because another `mako-finalize` operation has created it). We can then verify that the correct object exists, and return an error otherwise.
To verify the final object is correct, `mako-finalize` will at the least need the size in bytes of the final object as input, to differentiate between a partially written object and a completed one. To provide additional protection against data corruption, the client can also optionally provide an md5sum of the object and/or the individual parts. If any of these sums don't match, the operation returns an error.
The API for `mako-finalize` will consist of an HTTP POST to the path `/mpu/v<version number>/commit`, with a JSON blob in the body of the request that has the following structure:
```
{
    "version": <version number of MPU API (currently version 1)>,
    "nbytes": <integer size in bytes>,
    "account": <string uuid of account>,
    "objectId": <string uuid of output file>,
    "parts": <JSON array of string part uuids>
}
```
In pseudocode, `mako-finalize` does the following steps:
```
mako-finalize:
    If objectId does not exist, then:
        Create tempFile
        For part in parts[]:
            If part file exists:
                Append part to tempFile
            Else:
                If objectId does not exist:
                    // This case should never happen in normal execution.
                    Return an error
                Else:
                    // This case occurs when a concurrent mako-finalize
                    // operation has already created the object.
                    Validate object size (and optionally, md5sum)
                    Unlink tempFile
                    Goto FINISH
        Validate size of tempFile
        Fsync tempFile
        Atomic rename tempFile to objectId
    Else:
        Validate size of objectId

    FINISH:
        Fsync parent dir of objectId
        For each part associated with the upload ID:
            Remove part

    Return: md5sum(objectId file)
```
In an earlier draft of this RFD, this pseudocode validated the md5sum of the parts and the final object. The validation of the md5sum for the object is now completed in muskie. We may decide to extend this to parts later.
For most desired multipart upload operations, multiple clients attempting to perform the same operation do not present any issues:
- initiating an upload should return a unique upload ID for each client, even if the uploads are for the same path
- uploading a part will replace an existing part for a given upload ID if there is already one uploaded (analogous to a normal Manta PUT)
- listing uploads or parts is a metadata read (no state change occurs)
The more interesting operations here are committing and aborting an upload. It is important that both commit and abort be atomic operations, such that if multiple clients attempt either on the same upload ID, only one client will win (and the others receive a meaningful response). Relatedly, we also need to account for component failures: if the response to a commit or abort is lost, the client needs some way to observe that the operation has succeeded. One way to provide this is to make commit and abort idempotent.
In the case of a commit, there are several steps that need to happen:

- Insert an object record for the uploaded object.
- Create the object from its parts on the set of storage nodes (`mako-finalize`).
- Update some metadata to indicate the operation has been committed, such that if another client tries to commit or abort, we can handle the request correctly.
In order for a commit to be atomic, the state of the upload must be queryable on the same shard as the object record — it is not sufficient to store the state on the shard of the upload directory record, which could be a different shard, nor can the state be contained in the object record itself, as the object could be deleted immediately after the upload is finished.
A clear solution is to atomically write a commit or abort record on the same shard the object record is inserted on. If a commit or abort record already exists for the same upload ID, then the insertion of these record(s) will fail. This approach ensures only one client will win the commit/abort race.
This approach also has a minor setback: when querying the status of a given upload ID, we must always contact two different shards — the shard of `$ACCOUNT/uploads/[0-f]+`, to see if the upload exists, and the shard mapping to the object path, to check if a commit or abort record exists for the upload ID. Given that most of the lifetime of an upload will not be in a committed/aborted state, it would be nice if we didn't have to do this extra lookup for every query.
Instead, we will also add metadata to the upload record that records some state. As discussed, we cannot keep track of the state of the upload once a commit or abort has started (because the final state of the upload must be atomically recorded on the shard of the final object record), but we can at least differentiate between the upload being in progress, by virtue of it having been created, and a commit/abort operation having been started on the upload. Once a commit/abort has begun, the current state of the upload must be queried on the shard of the object record.
More succinctly, we can describe the states of the upload record as: `CREATED` and `FINALIZING`. If the state is `CREATED`, then the upload has been started and is in progress. In this case, there is no need to query any other shards when checking the state of a given upload ID. Once a client calls `commit` or `abort`, then the first step of the commit/abort operations is to update the state in the upload record to `FINALIZING`, before doing any other operations. A state of `FINALIZING` indicates that the true status of the multipart upload now lives on a different shard than that of the upload record.
Additionally, as detailed for `mako-finalize`, we need to enforce that the set of parts specified on the first commit of an upload ID is the only set of parts that can be committed. This information will also be stored as a summary of the parts and etags (e.g., a checksum) in the upload record when the state is updated to `FINALIZING`.
Thus, these are the final steps for a typical commit:

1. Set state metadata of the upload directory record for `$ACCOUNT/uploads/[0-f]+/<upload uuid>` to `FINALIZING`, with its type set to `commit`. In this same update, save a summary (e.g., a checksum) of the part numbers and their etags for the commit.
2. Invoke `mako-finalize` on the storage node set.
3. Atomically insert a commit record and the object record on the shard of `$OBJECT_PATH`, which is the input path for the object upon upload creation.
These are the steps for a typical abort:

1. Set upload record state to `FINALIZING`, and its type to `abort`.
2. Atomically insert an abort record on the shard of `$OBJECT_PATH`.
Now, we will augment these steps to make commits and aborts idempotent. Because the `mako-finalize` step is already idempotent, we only need to add a check of the upload record state and, if necessary, of an existing commit or abort record.
This leaves us with the final design of the `commit` operation:

1. Read the state metadata of the upload directory record for `$ACCOUNT/uploads/[0-f]+/<upload uuid>`.
   - If the state is `CREATED`, update this record to have state `FINALIZING`, type `commit`, and record the summary (e.g., a checksum) of part numbers and their etags.
   - If the state is `FINALIZING`, type `commit`, then check the part numbers and etags specified in the commit against the metadata. If they don't match, return an error.
2. Invoke `mako-finalize` on the storage node set.
3. If a commit record doesn't exist for the object, then atomically insert a commit record and the object record on the shard of `$OBJECT_PATH`.
And the final design for the `abort` operation:

1. Read the state metadata of the upload directory record for `$ACCOUNT/uploads/[0-f]+/<upload uuid>`.
   - If the state is `CREATED`, set it to `FINALIZING`, type `abort`.
   - If the state is `FINALIZING`, type `commit`, then return an error.
   - If the state is `FINALIZING`, type `abort`, then continue.
2. If an abort record doesn't exist for the object on the shard of `$OBJECT_PATH`, then atomically insert an abort record.
Commits and aborts look very similar, except that commits also invoke an operation on the storage node. Under this design, once a commit or abort has started, the finalization type of the upload cannot change. Put differently: a commit that has been started cannot be changed to an abort (for instance, after an internal error during the request), but it can be retried; likewise, an abort cannot be changed to a commit, but it can be retried. The ability to abort a commit that has returned a 500 may be desirable behavior, and it should be straightforward to add if we decide later that we want it.
To summarize the states of the multipart upload commit and abort operations, we have abstracted the metadata state recorded in the upload record into the following state machine for simplicity. Note that `done` is an implicit state (not recorded in the upload record), but guaranteed by the presence of an abort or commit record on the shard of `$OBJECT_PATH`.
```
         mpu-create
             +
             |
             |
             v          mpu-commit                                  write commit/
       +-----+-----+    mpu-abort     +-----------------------------+ abort record   +------------------------------+
       |  created  +----------------> |  finalizing                 +--------------> |  done                        |
       +-----------+                  |  type={aborting,committing} |                |  (committed, aborted, failed)|
                                      +-----------------------------+                +------------------------------+
```
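The transition rules above can also be sketched as a small function (illustrative, not actual muskie code). It captures the rule that a retry with the same finalization type is allowed, while a conflicting type is an error:

```javascript
// Sketch: upload-record state transitions for commit/abort. An upload in
// state FINALIZING may only proceed with the same type it started with;
// op is the requested operation, 'commit' or 'abort'.
function nextState(record, op) {
  if (record.state === 'CREATED') {
    // First finalization attempt: record the type alongside the state.
    return { state: 'FINALIZING', type: op };
  }
  if (record.state === 'FINALIZING') {
    if (record.type !== op) {
      throw new Error('upload already finalizing with type ' + record.type);
    }
    return record; // retrying the same operation is allowed
  }
  throw new Error('unexpected state: ' + record.state);
}
```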
We need a mechanism to ensure all unnecessary data associated with multipart uploads is properly removed when they are no longer needed. The items to be cleaned up include:
- Upload directory records
- Part object records
- Commit/abort records
- Part data
All of this data can be removed some period of time after a commit or abort has completed; the delay ensures that the status of a commit/abort can still be queried shortly after it completes. To do so, we can modify the existing Manta garbage collection process to also clean up these items.
Thus the garbage collection process will be modified to iterate over commit/abort records. Commits must have their upload directory records and part records removed. Aborted uploads need to have upload directory records, part records and part data removed. Once this cleanup is completed, the commit/abort records must also be removed.
A more comprehensive overview of how this process occurs can be found in documentation within the manta-mola repository.
This is a summary of the final API based on the discussion above. Each API call lists its general inputs and outputs, as well as the high-level steps the operation takes, including some error conditions.
`create-mpu`: Creates a new multipart upload.

- Inputs: `$OBJECT_PATH`
- Outputs: upload uuid
- Steps:
  1. Generate a new upload uuid, and insert a directory record for `/$ACCOUNT/uploads/[0-f]+/uuid` with its state set to `CREATED`.
`upload-part`: Uploads data to a given part number of an upload.

- Inputs: upload uuid, part number, part data
- Outputs: etag of part
- Steps:
  1. Read the upload state in the object record for `/$ACCOUNT/uploads/[0-f]+/uuid`.
     - If the state is `CREATED`, insert an object record for `/$ACCOUNT/uploads/[0-f]+/uuid/N`, where `N` is the part number.
     - If the state is `FINALIZING`, return an error.

Note: This API call has a race between reading the upload record's state and inserting the part record, which could be on a different shard. It is possible that another client starts a commit or abort for the same upload ID after the read of the upload record has occurred, so the client uploading a part will not learn that the upload is being finalized (likely with a different set of parts). This case shouldn't be common, as it indicates a strange (and possibly buggy) use of the API. As long as we document this possible case clearly, I think this is acceptable.
`commit-mpu`: Creates the final object from a set of uploaded parts.

- Inputs: upload uuid, set of part numbers and their associated etags
- Outputs: none
- Steps:
  1. Read the upload record and check its state.
     - If the state is `CREATED`, read the object record for each input part and verify the etag of the part matches the input etag.
       - If the parts and etags summary matches the input set, then update the upload record state to `FINALIZING`, type `commit`, and save the summary in the upload record.
       - If the parts and etags summary doesn't match the input set, then return an error.
     - If the state is `FINALIZING`, type `commit`, then check the part numbers and etags summary specified in the commit against the metadata. If they don't match, return an error.
     - If the state is `FINALIZING`, type `abort`, then return an error.
  2. Invoke `mako-finalize` on the storage node set.
  3. If a commit/abort record doesn't exist for the object, then atomically insert a commit record and the object record on the shard of `$OBJECT_PATH`.
`abort-mpu`: Aborts an upload.

- Inputs: upload uuid
- Outputs: none
- Steps:
  1. Read the upload record and check its state.
     - If the state is `CREATED`, set it to `FINALIZING`, type `abort`.
     - If the state is `FINALIZING`, type `commit`, then return an error.
     - If the state is `FINALIZING`, type `abort`, then continue.
  2. If a commit/abort record doesn't exist for the object, then atomically insert an abort record.
`list-parts`: Lists the parts uploaded to an upload.

- Inputs: upload uuid
- Outputs: set of part numbers, their associated etags, and other relevant metadata, such as part size
- Steps:
  1. Read all records in the directory `/$ACCOUNT/uploads/[0-f]+/uuid`.
`list-mpu`: Lists all uploads for an account.

- Inputs: none
- Outputs: list of upload uuids (and information about them, such as the `OBJECT_PATH` associated with the upload and the upload state)
- Steps:
  1. Read all records in the directory `/$ACCOUNT/uploads`.
`get-mpu`: Returns the state of an upload.

- Inputs: upload uuid
- Outputs: one of {`CREATED`; `FINALIZING`, type `commit`; `FINALIZING`, type `abort`; `DONE`}
- Steps:
  1. Read the upload record and return its state.
The following table demonstrates how to access these API calls over HTTP. The DELETE method is not supported for any operation. An "X" represents an explicitly disallowed action, and a "-" represents an allowed action that doesn't have a corresponding API name.
| | GET | HEAD | PUT | POST |
| --- | --- | --- | --- | --- |
| `/$ACCOUNT/uploads/[0-f]+/uuid` | list-parts | - | X | X |
| `/$ACCOUNT/uploads/[0-f]+/uuid/state` | get-mpu | X | X | X |
| `/$ACCOUNT/uploads/[0-f]+/uuid/:partNum` | X | - | upload-part | X |
| `/$ACCOUNT/uploads/[0-f]+/uuid/commit` | X | X | X | commit-mpu |
| `/$ACCOUNT/uploads/[0-f]+/uuid/abort` | X | X | X | abort-mpu |
Note that GETs of parts are disallowed. This is important because allowing GETs for part data would allow abuse of the API to store objects on the same storage node, circumventing our mechanism of selecting storage nodes.
Some API methods require a JSON object in the body of the request that is used as input for the request.
For creating an upload, you must provide a JSON object with the key `objectPath` and a string value representing the Manta path to store the object in on commit. For example:
```
POST /jhendricks/uploads

Request body:

{
    "objectPath": "/jhendricks/stor/mpu-object.txt"
}
```
You may also optionally provide a `headers` object in the input JSON object. All headers for the committed object must be supplied when you create the upload.
The following headers are supported for multipart uploads:
- `x-durability-level`: This header determines the number of copies of your object that will be created (defaults to 2).
- `content-length`: This header indicates the size of your object, if known on create. This value is validated when the object is committed; if the size of the object does not match the header specified here, the commit will fail.
- `content-md5`: This header indicates the MD5 sum of your object, if known on create. This value is validated when the object is committed; if the md5 sum of the object does not match the header specified here, the commit will fail.
For example, to create a multipart upload of a 13-byte object with 3 copies, you would send:
```
POST /jhendricks/uploads

Request body:

{
    "objectPath": "/jhendricks/stor/mpu-object.txt",
    "headers": {
        "x-durability-level": 3,
        "content-length": 13
    }
}
```
Committing an upload requires specifying the etags of a consecutive set of parts, beginning with part 0. To indicate the etags of each part, you must supply a JSON object with the field "parts": an array of strings representing the etags of each part, such that the index in the array of a given etag indicates its part number. (The multipart upload API supports part numbers in the range [0-9999], for a maximum of 10,000 parts per upload.)
For example, to commit an upload that has upload ID `3db78672-4ac8-4277-865d-c75fed2f690a` and 2 parts, in which part 0 has etag `c504b2fb-929b-4179-a81b-2568352e7bb5` and part 1 has etag `bbdbacd4-cd3d-47ec-9b0c-0f932d8c44e6`, you would send:
```
POST /jhendricks/uploads/3/3db78672-4ac8-4277-865d-c75fed2f690a

Request body:

{
    "parts": [
        "c504b2fb-929b-4179-a81b-2568352e7bb5",
        "bbdbacd4-cd3d-47ec-9b0c-0f932d8c44e6"
    ]
}
```
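As a sketch, a client could check the shape of the `parts` array before sending a commit. The helper name and the exact checks here are assumptions; the server enforces the real constraints:

```javascript
// Sketch: client-side validation of a commit body's "parts" array. The
// array index is the part number, so "consecutive parts starting at 0"
// is implied by the array shape; entries must be non-empty etag strings,
// and there can be at most 10,000 parts. (Server-side minimums, such as
// the 5 MB part size, are not checked here.)
function validateCommitParts(parts) {
  if (!Array.isArray(parts) || parts.length > 10000) {
    return false;
  }
  return parts.every(function (etag) {
    return typeof etag === 'string' && etag.length > 0;
  });
}
```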
This table summarizes the metadata discussed in this design. The "shard" field represents the corresponding Manta shard for the given path. Here, `UPLOAD_PATH` refers to the path to the upload directory: `/$ACCOUNT/uploads/[0-f]+/<upload uuid>`.
| | Upload Record | Part Record | Commit/Abort Record | Object Record |
| --- | --- | --- | --- | --- |
| Type | Directory | Object | Finalizing record (new type) | Object |
| Shard | `dirname($UPLOAD_PATH)` | `$UPLOAD_PATH` | `dirname($OBJECT_PATH)` | `dirname($OBJECT_PATH)` |
| Bucket | MANTA | MANTA | UPLOADS (new bucket) | MANTA |
| Additional Metadata | state={`CREATED`, `FINALIZING`}; if state=`FINALIZING`, then type={`commit`, `abort`}; if type=`commit`, a summary (checksum) of part numbers and etags | N/A | N/A | N/A |
In the above design discussion, we addressed all of the desired operations for the end user and implementation requirements outlined in section 1. To reiterate, here is an abridged version connecting these initial goals back to the final design:
Desired Operations:
- Initiate a multipart upload for an object: `create-mpu`
- Upload the parts of that object: `upload-part`
- Indicate that the upload is completed after all parts are uploaded: `commit`
- Canceling an upload in progress: `abort`
- Querying an upload's state: `get-mpu`
- Listing parts that have been uploaded for an upload: `list-parts`
- Seeing all multipart uploads in progress: `list-mpu`
Implementation Requirements and Constraints:
- A place to store the parts as they are uploaded: `/$ACCOUNT/uploads/[0-f]+/uuid`
- A mechanism to create the final object from the uploaded parts: `mako-finalize`
- Ensure that parts may be listed easily: all parts are stored in the same directory (and thus their metadata is on the same shard)
- Ensure that uploads may be listed easily: all uploads live in a small number of directories (few shards to contact)
- Account for multiple clients and component failures, particularly when finalizing an upload: atomic and idempotent `commit`/`abort`
- Account for cleanup of upload metadata and part data: garbage collection
Multipart uploads will not impact jobs, as jobs are not allowed to operate on parts. Once an upload has been committed, the resulting object is indistinguishable from any other Manta object.
Users of multipart uploads will be metered and billed for parts just like any other object. We will also not time out uploads that have been inactive -- it is the responsibility of the user to commit or abort uploads.
The node-manta repository contains a node SDK and CLI to use the multipart upload API. Multipart upload operations can be exercised as an end user using the command line tool `mmpu`.
At a minimum, this feature will require changing the following repositories:
- manta-muskie
- manta-mako
- node-libmanta
- manta-mola
- node-manta
We will be adding new API endpoints through muskie. API calls for multipart upload will be authenticated as other Manta API calls to muskie are. We will also need to consider the potential security impact in the implementation of the `mako-finalize` mechanism, since this is an additional interface added to mako.