Add "smart" upload function #90

Open

cdhowie opened this issue Jan 3, 2020 · 9 comments

cdhowie commented Jan 3, 2020

The AWS S3 SDK for JavaScript has an upload function that does not correspond to any particular API request. You can give it a buffer or a stream, and it will automatically perform either a single PutObject call or a multi-part upload.
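
For reference, the pattern being described looks roughly like this in the AWS SDK v2 for JavaScript (bucket and key names here are made up):

```js
// AWS SDK v2: s3.upload() accepts a Buffer or a stream and internally
// chooses between a single PutObject and a multi-part upload.
const AWS = require('aws-sdk');
const fs = require('fs');

const s3 = new AWS.S3();

s3.upload({
  Bucket: 'example-bucket',                    // made-up bucket name
  Key: 'backups/db.tar.gz',                    // made-up object key
  Body: fs.createReadStream('/tmp/db.tar.gz'), // Buffer or stream both work
})
  .promise()
  .then(() => console.log('upload complete'))
  .catch(console.error);
```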

It would be a great benefit to this library to provide something similar. Right now, large file uploads are unnecessarily cumbersome, especially when the input is a stream. Authorization token management is a giant pain.

I am working on such a function right now for our own internal use. I'm writing it as a module that exposes a single function that can be attached to the prototype of the B2 class provided by this library (B2.prototype.uploadAny = require('backblaze-b2-upload-any');).

This issue is intended to convey my intent to integrate this function into this library and submit a PR. Therefore, I would very much appreciate any feedback on my proposal so that I can accommodate any necessary design changes as early as possible.

The current planned features of this function (many of which are already done) are listed below, followed by a usage sketch:

  • Performs the upload using a single upload_file call or switches to a large-file upload as appropriate.
  • In large-file mode, uploads multiple parts with configurable concurrency.
  • Automatic management of upload tokens. An upload token (URL + authorization token) can be reused by future part uploads. Expired tokens (where the server returns 503 or 400) are discarded.
  • Automatic re-authorization if the server returns 401 in the middle of an upload.
  • Retry with exponential backoff.
  • Support for uploading:
    • Buffers
    • Streams
    • Local files (specified as a string path)
  • If the operation is aborted for whatever reason, any outstanding large-file upload is canceled with cancel_large_file.
  • The caller need not (and cannot) supply a hash. When uploading in large-file mode, a hash of the entire content is not provided to B2 -- a hash is provided for each part. A caller-supplied hash of the content is therefore useless in large-file mode anyway.
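
A rough sketch of how this might look from the caller's side. The option names (bucketId, fileName, data, partSize, concurrency) are illustrative only; the final interface may differ:

```js
const B2 = require('backblaze-b2');
B2.prototype.uploadAny = require('backblaze-b2-upload-any');

const b2 = new B2({ applicationKeyId: 'yourKeyId', applicationKey: 'yourAppKey' });

async function main() {
  await b2.authorize();

  // Small inputs go through a single upload_file call; larger ones are
  // switched to a large-file (multi-part) upload automatically.
  await b2.uploadAny({
    bucketId: 'exampleBucketId',
    fileName: 'backups/db.tar.gz',
    partSize: 100 * 1024 * 1024, // part size / switch-over threshold, illustrative
    concurrency: 4,              // parts uploaded in parallel, illustrative
    data: '/tmp/db.tar.gz',      // Buffer, stream, or local file path
  });
}

main().catch(console.error);
```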

There is a difference between the local file and stream cases. When uploading a local file, no content is buffered in memory. Rather, multiple read streams are created (and re-created as necessary if a part upload must be retried).

Stream support necessarily requires some buffering in memory to facilitate retries since node streams cannot be seeked (and not all stream types would be seekable, anyway).
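
To illustrate the local-file case, a minimal sketch of the idea (names are illustrative): each part is just a byte range, and a fresh read stream is created for every attempt, so retries re-read from disk instead of holding data in memory.

```js
const fs = require('fs');

// Sketch: describe each part as a byte range and create a new read stream
// per attempt. Nothing is buffered; a retried part simply re-reads the range.
function filePartSource(path, partSize) {
  const { size } = fs.statSync(path);
  const parts = [];

  for (let start = 0, partNumber = 1; start < size; start += partSize, partNumber++) {
    const end = Math.min(start + partSize, size) - 1; // end offset is inclusive
    parts.push({
      partNumber,
      size: end - start + 1,
      makeStream: () => fs.createReadStream(path, { start, end }),
    });
  }

  return parts;
}
```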

Note that I currently introduce two new dependencies:

  • @hapi/joi which is used to validate the options object.
  • memoizee which is used during re-authorization. If multiple part uploads are trying to re-authorize at the same time, this prevents multiple authorize calls to B2 (a sketch of this pattern follows the list).
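
The re-authorization guard is roughly this shape (a sketch, not the exact code):

```js
const memoize = require('memoizee');

// If several concurrent part uploads hit a 401 at the same time, they all
// share one authorize() call instead of each issuing their own.
function makeReauthorizer(b2) {
  return memoize(() => b2.authorize(), {
    promise: true,     // callers arriving while the call is pending share its promise
    maxAge: 10 * 1000, // drop the cached result after a short window (illustrative)
  });
}

// Usage inside a part upload, roughly:
//   catch a 401 from the upload call, then:
//     await reauthorize();
//     retry the part with a fresh upload token
```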

cbess (Collaborator) commented Jan 5, 2020

I'm definitely in favor of this change. I personally use joi in some of my projects, and memoizee seems pretty solid as well. But adding these would definitely grow the dependency tree. I think your design changes seem fine, but I'd like to see the final implementation (code review).

cdhowie commented Feb 7, 2020

I had to take a hiatus from this development but will be back working on it this month.

I am considering introducing another dependency maintained by me as it will simplify much of the logic: it is a function that accepts an iterator or async iterator and returns a promise for an array collecting the results of invoking a mapper function over each element produced by the iterator, in the order the iterator produced the elements. The amount of concurrency can be specified in the same way as Bluebird's Promise.map(). It does this without converting the iterator to an array first, so it can be used on very large sequences and/or sequences of an indeterminate size.

This would be used in the piecewise upload routine. Each upload source (buffer, file, stream) would have an iterator that produces pieces to upload.
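
The shape of that helper would be roughly the following (a sketch of the idea, not the actual package):

```js
// Map over a sync or async iterator with at most `limit` mapper calls in
// flight, collecting results in the order the iterator produced the items,
// without materializing the whole sequence first.
async function mapConcurrent(iterable, limit, mapper) {
  const iterator = iterable[Symbol.asyncIterator]
    ? iterable[Symbol.asyncIterator]()
    : iterable[Symbol.iterator]();

  const results = [];
  const inFlight = new Set();
  let index = 0;

  for (;;) {
    const item = await iterator.next(); // items are pulled one at a time
    if (item.done) break;

    const i = index++;
    const task = Promise.resolve(mapper(item.value, i)).then((result) => {
      results[i] = result;
      inFlight.delete(task);
    });
    inFlight.add(task);

    // Once `limit` mappers are running, wait for one to settle before
    // pulling the next item from the iterator.
    if (inFlight.size >= limit) {
      await Promise.race(inFlight);
    }
  }

  await Promise.all(inFlight);
  return results;
}
```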

odensc (Collaborator) commented Feb 13, 2020

Sounds fine if it's something that would be useful on its own as a module, or is large enough to warrant being separate. But if it's just a small helper function, it'd be better to throw it in this module.

cdhowie commented Feb 17, 2020

I decided not to bring in this other dependency. I might later, but what I have today works (I've tested it against large files/buffers). I'm also not sure what environments are supported by this module (backblaze-b2) and whether they all support ECMAScript's asynchronous iteration feature (Symbol.asyncIterator), so I decided not to use that feature after all, at least not yet.

I did bring in one more dependency (p-retry) to handle exponential back-off following a temporary upload failure.
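
Roughly, each part attempt gets wrapped like this (option values are illustrative):

```js
const pRetry = require('p-retry');

// Wrap a single part upload so transient failures are retried with
// exponential back-off. uploadPart is a stand-in for the real attempt.
function uploadPartWithRetry(uploadPart, partNumber) {
  return pRetry(() => uploadPart(partNumber), {
    retries: 5,       // give up after this many failed attempts
    factor: 2,        // double the delay between attempts
    minTimeout: 1000, // first delay in milliseconds
  });
}
```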

Sometime this week I hope to publish this function as a module on npm, since I will be using it in multiple projects and this will simplify distribution. My goal is to get the functionality merged into backblaze-b2 itself. Once I publish this new module, I'd like to request a code review of it. After working through any concerns there, I can submit a PR to add the function to backblaze-b2.

cdhowie commented Feb 18, 2020

The module is published. Feel free to leave code review comments here, or as issues on the module's repository.

https://www.npmjs.com/package/@gideo-llc/backblaze-b2-upload-any

cdhowie commented Feb 20, 2020

We have been using the module in production for the last few days and have migrated several TB of data successfully, streaming it from an HTTP origin to B2. I have not found any instances of corruption yet. There has been a pretty even mixture of standard and large-file uploads.

odensc (Collaborator) commented Mar 6, 2020

@cdhowie Sorry for the delay. The API looks good; hopefully I can take a look at the code over the weekend.

cdhowie commented Mar 6, 2020

@odensc There are a few things that I may want to rewrite, but I haven't had time, and what's there works pretty well.

The main large-file-upload routine is a bit complex. Rewriting it around async iterables (async function*) would significantly improve the readability of the source interface code, and, assuming we can add async as a dependency, its mapLimit function does exactly what the primary large-file-upload loop does and works with async iterables. The result would be substantially more readable and maintainable code, at the cost of one more dependency.
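
Roughly what that refactor would look like (a sketch; names other than async.mapLimit are illustrative):

```js
const async = require('async');

// An async generator yields the parts to upload...
async function* partsOf(source) {
  let partNumber = 1;
  for await (const body of source) {
    yield { partNumber: partNumber++, body };
  }
}

// ...and async v3's mapLimit consumes the async iterable with a fixed
// concurrency, returning a promise for the per-part results.
function uploadAllParts(source, uploadPart, concurrency) {
  return async.mapLimit(partsOf(source), concurrency, async (part) => uploadPart(part));
}
```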

@jlarmstrongiv

I just added @gideo-llc/backblaze-b2-upload-any to my project and it helped simplify my code a lot. I would love to see this helpful method added in core 😄
