
lib-storage Upload does not retry if one part fails. #2311

Closed
alvaromat opened this issue Apr 28, 2021 · 21 comments
Labels
guidance General information and guidance, answers to FAQs, or recommended best practices/resources. p2 This is a standard priority issue

Comments

@alvaromat

alvaromat commented Apr 28, 2021

Describe the bug

I'm using Upload from lib-storage to take advantage of multipart uploads for large files. I would expect it to retry if uploading any part fails. Imagine we have a 5 TB file that is split into 5 MB parts. That would result in many, many parts, so the chance of an individual part failing is high.

There is the leavePartsOnError option. Setting it to true interrupts the whole upload if any part fails.

Your environment

SDK version number

@aws-sdk/lib-storage@3.13.1

Is the issue in the browser/Node.js/ReactNative?

Browser

Details of the browser/Node.js/ReactNative version

  Browsers:
    Chrome: 90.0.4430.85
    Firefox: 78.0.2
    Safari: 14.0.3

Steps to reproduce

Upload a file using Upload, exactly like in the example:

    try {
      const upload = new Upload({
        client: s3,
        params: uploadParams,
        leavePartsOnError: false
      })

      upload.on('httpUploadProgress', progress => {
        console.log(progress)
      })

      const uploadResult = await upload.done()
      console.log('done uploading', uploadResult)
    } catch (err) {
      showErrors('There was an error uploading your photo: ', err)
    }

Observed behavior

If there are no errors with any part, the upload completes correctly.
If there is an error in one or more parts, that part is skipped and the final file is assembled without those parts. This results in a corrupt file. This can be reproduced by stopping the connection (turning WiFi off) or by using the browser DevTools to block the connection.

Expected behavior

If a single part fails to upload, which is likely to happen with big files in mobile environments, just retry sending that part.
This could be done, for example, here:

    } catch (e) {
      // on leavePartsOnError throw an error so users can deal with it themselves,
      // otherwise swallow the error.
      if (this.leavePartsOnError) {
        throw e;
      }
    }

And do not send the final CompleteMultipartUpload command if any part is missing:

    this.uploadedParts.sort((a, b) => a.PartNumber! - b.PartNumber!);
    const completeMultipartUpload = await this.client.send(
      new CompleteMultipartUploadCommand({
        ...this.params,
        UploadId: this.uploadId,
        MultipartUpload: {
          Parts: this.uploadedParts,
        },
      })
    );
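To make the proposal concrete, here is a minimal sketch of a bounded per-part retry, written as a standalone helper around UploadPartCommand rather than as a patch to Upload's internals (the helper name and default retry count are illustrative, not part of the library):

    import { S3Client, UploadPartCommand, UploadPartCommandInput } from "@aws-sdk/client-s3";

    // Illustrative helper: retry a single part upload a bounded number of times
    // before surfacing the error, so the part is never silently skipped.
    async function uploadPartWithRetry(
      client: S3Client,
      partParams: UploadPartCommandInput,
      maxRetries = 3
    ) {
      let lastError: unknown;
      for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
          return await client.send(new UploadPartCommand(partParams));
        } catch (err) {
          lastError = err;
        }
      }
      // After exhausting retries, throw instead of letting
      // CompleteMultipartUpload be sent with a missing part.
      throw lastError;
    }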

@alvaromat alvaromat added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 28, 2021
@alexforsyth
Contributor

@alvaromat you bring up a really good point. I agree that this is a much better experience. However - this isn't how upload is designed today so I'll change this to a feature request. As you point out, this isn't a difficult change, and it could also be a good first PR for new contributors!

In the meantime I'd love to open it up for public discussion. I'm looking to get a sense of what the retry would look like. No right or wrong answers, just looking to get a sense of where people are at in their thinking.

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we gate the total number of retries?
  3. How do we balance fast failing vs adequate retrying? What's best for the customer? What if I'm in an area of bad cell reception? What if something is up with the file/config/whatever and retrying a million times is just going to delay the customer?

@alexforsyth alexforsyth added feature-request New feature or enhancement. May require GitHub community feedback. needs-discussion and removed bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 10, 2021
@alexforsyth alexforsyth removed their assignment May 10, 2021
@alexforsyth
Contributor

Assigning @ajredniwja to support the discussion

@alexforsyth alexforsyth added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 10, 2021
@alvaromat
Author

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we gate the total number of retries?
  3. How do we balance fast failing vs adequate retrying? What's best for the customer? What if I'm in an area of bad cell reception? What if something is up with the file/config/whatever and retrying a million times is just going to delay the customer?

The simplest version would just retry multiple times with a backoff. I'd set a configurable maximum number of retries per part, since multiple parallel parts can fail at the same time.
The backoff could grow following a Fibonacci scheme to avoid failing too fast: the first failure waits 3 seconds, the next waits 5, the next 8...
With a default of 5 retries, it would wait roughly a minute in total before failing.
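As a rough sketch of that backoff (the delays and the helper name are illustrative):

    // Illustrative Fibonacci-style backoff for a single part:
    // delays grow as 3s, 5s, 8s, 13s, 21s for the default of 5 retries.
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function retryWithFibonacciBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
      let previousDelay = 2000;
      let delay = 3000;
      for (let attempt = 0; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          if (attempt >= maxRetries) throw err;
          await sleep(delay);
          [previousDelay, delay] = [delay, previousDelay + delay];
        }
      }
    }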

It's also possible to take advantage of the offline/online events. This could be useful for the long offline periods that can happen in low-connectivity environments, like entering the subway.
For example, if the offline event is fired, we don't start new requests until the online event is triggered.
There is also the navigator.onLine property, which is supported even by older browsers.
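A minimal sketch of that gating in the browser, using the standard events (the helper name is illustrative):

    // Resolve immediately if the browser reports being online, otherwise wait
    // for the standard 'online' event before starting (or retrying) a part upload.
    function waitForOnline(): Promise<void> {
      if (navigator.onLine) return Promise.resolve();
      return new Promise((resolve) =>
        window.addEventListener('online', () => resolve(), { once: true })
      );
    }

    // Usage before each part request:
    //   await waitForOnline();
    //   ...send the UploadPart request...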

I'd apply this retry logic only to part uploads, not to the initial request, where configuration issues are more likely to happen.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 18, 2021
@mkrn

mkrn commented Sep 14, 2021

Retrying failed parts of a multipart upload was apparently a feature of AWS SDK v2. lib-storage doesn't have it, so this is a regression.

Also, here's what AWS recommends:

Enable a retry mechanism in the application making requests

Because of the distributed nature of Amazon S3, requests that return 500 or 503 errors can be retried. It's a best practice to build retry logic into applications that make requests to Amazon S3. We recommend that you set the retry count for your applications to 10.

All AWS SDKs have a built-in retry mechanism with an algorithm that uses exponential backoff. This algorithm implements increasingly longer wait times between retries for consecutive error responses. Most exponential backoff algorithms use jitter (randomized delay) to prevent successive collisions. For more information, see Error Retries and Exponential Backoff in AWS.

There are 500 errors from time to time when uploading multipart files, and currently none of the AWS-maintained JS libraries handle them properly. A quick search shows how common and frustrating these are, with no solution offered: https://www.google.com/search?q=s3+multipart+upload+error+500

@alvaromat I would argue against using navigator.onLine, so that this code stays usable in different environments, not just the browser (Next.js, React Native (web), Node, etc.).

All that's needed is to retry failed parts up to maxRetryAttempts times with a delay of a second or so. Since it's retrying individual parts to recover from an occasional 500 error, or to survive a short drop in connectivity, exponential backoff is really not required.
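For reference, the "exponential backoff with jitter" described in the quoted guidance is commonly implemented along these lines (the "full jitter" variant; values are illustrative):

    // Delay before retry attempt N: a random value between 0 and
    // min(maxDelayMs, baseDelayMs * 2^N).
    function backoffDelayMs(attempt: number, baseDelayMs = 100, maxDelayMs = 20_000): number {
      return Math.random() * Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    }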

@mifi

mifi commented Oct 10, 2021

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.
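In code, that setup looks roughly like this (the maxAttempts value is illustrative; by default the client already makes up to 3 attempts):

    import { S3Client } from '@aws-sdk/client-s3';
    import { Upload } from '@aws-sdk/lib-storage';

    const client = new S3Client({ maxAttempts: 5 }); // raise from the default of 3 if desired

    const upload = new Upload({
      client,
      params: uploadParams,    // same PutObject-style params as in the example above
      leavePartsOnError: true, // surface part failures instead of silently skipping parts
    });

    await upload.done();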

@ThompsonGitHub

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

@mifi This has also been my experience of using the @aws-sdk/lib-storage upload method with leavePartsOnError=false. It can result in corrupt binaries with missing parts.

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

I agree with @mifi that the default should be leavePartsOnError=true because, as @mifi says, letting Upload throw the error means the S3 client will perform the retry. Even without the client retry, leavePartsOnError=true is a better default because at least the exception gets thrown, so you know that parts are failing and can retry the whole file upload, rather than getting corrupt files from what appear to be successful uploads.

If this is going to be the behaviour going forward, a rename of the leavePartsOnError property and/or a rewording of the example on https://github.com/aws/aws-sdk-js-v3/tree/main/lib/lib-storage (leavePartsOnError: false, // optional manually handle dropped parts) would be welcome, as I find the current documented example confusing/misleading.

@danielvouch

Has there been any update on this - is there a way to actually manually handle the dropped parts?

@lukerlv

lukerlv commented Feb 7, 2023

Hello, is there any progress on this feature?

@mifi

mifi commented Feb 7, 2023

created a PR #4414

@lukerlv

lukerlv commented Feb 7, 2023

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

@mifi hey, can you show me the code? I'm not sure how to write it, thank you very much!

@mifi

mifi commented Feb 7, 2023

not sure which code? just set leavePartsOnError: true

@lukerlv

lukerlv commented Feb 16, 2023

not sure which code? just set leavePartsOnError: true

ok, thank you @mifi

@macdja38

macdja38 commented Mar 27, 2023

@RanVaknin Is there any way to get this assigned a higher priority?

While there is a workaround available, it's not obvious that it needs to be used until something fails in a potentially catastrophic way, and teams spend potentially hours investigating.

The documentation pointed to above suggests to a user reading it that the default behaviour is the safe one with retries.

VerteDinde pushed a commit to electron/forge that referenced this issue Apr 4, 2023
@0zcl

0zcl commented Sep 26, 2023

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

After setting "leavePartsOnError: true" in @aws-sdk/lib-storage, how do I configure retries? I want a part to be retried 3 times when its upload fails. Hope you can answer, @mifi

@mifi

mifi commented Sep 26, 2023

I think it will auto retry, see #2311 (comment)

@0zcl

0zcl commented Sep 26, 2023

I think it will auto retry, see #2311 (comment)

When I upload a video using @aws-sdk/lib-storage with "leavePartsOnError: true" and disconnect the network during the upload, only one error ("partNumber=7 upload fail") shows up in the network panel, as you can see in the image below. If client-s3 retried, I would expect to see more than one failed request in the network panel. Hope you can answer, @mifi

(screenshot: browser network panel showing a single failed part upload request)

@lukerlv

lukerlv commented Nov 13, 2023

@0zcl hey man, have you solved the problem?

@0zcl

0zcl commented Nov 13, 2023

@0zcl hey man, have you solved the problem?

No, I can't configure the number of retries, although it doesn't affect my usage.

@glebbars

glebbars commented Jan 29, 2024

I also set Offline in the DevTools throttling drop-down to test the multipart upload retries.

With leavePartsOnError: true, uploading the file fails and doesn't retry any part. If set to false (the default), the error only shows up later while processing the file, since it is corrupted (one of the parts is absent).

@RanVaknin Is it possible to either retry uploading the failed part or the whole upload?
I am using @aws-sdk/client-s3: ^3.418.0, @aws-sdk/lib-storage: 3.501.0

@RanVaknin RanVaknin assigned RanVaknin and unassigned ajredniwja Feb 28, 2024
@RanVaknin RanVaknin added investigating Issue is being investigated and/or work is in progress to resolve the issue. and removed needs-discussion labels Feb 28, 2024
@aBurmeseDev aBurmeseDev added the queued This issues is on the AWS team's backlog label Mar 20, 2024
@RanVaknin RanVaknin added investigating Issue is being investigated and/or work is in progress to resolve the issue. and removed investigating Issue is being investigated and/or work is in progress to resolve the issue. labels May 9, 2024
@kuhe kuhe added p2 This is a standard priority issue and removed p3 This is a minor priority issue labels May 20, 2024
@kuhe kuhe self-assigned this May 20, 2024
@kuhe
Contributor

kuhe commented May 21, 2024

The Upload class uses an S3 client as one of its inputs. This client has a configured retry strategy by default that includes up to 3 attempts.

Because Upload splits the input into buffered chunks, even if the original input is a stream, these UploadPart requests will be retried according to the retryStrategy of the client.

You do not need to modify the retry configuration because the default will retry, but here is some information:
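A sketch of the relevant client configuration (values here are illustrative; ConfiguredRetryStrategy is assumed to be available from @smithy/util-retry in recent SDK versions):

    import { S3Client } from '@aws-sdk/client-s3';
    import { ConfiguredRetryStrategy } from '@smithy/util-retry';

    // Option 1: raise the number of attempts (the default is 3).
    const client = new S3Client({ maxAttempts: 5 });

    // Option 2: supply a retry strategy with a custom backoff.
    const clientWithBackoff = new S3Client({
      retryStrategy: new ConfiguredRetryStrategy(
        5,                                           // maximum total attempts
        (attempt: number) => 100 + attempt * 1000,   // delay before each retry, in ms
      ),
    });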

Once #6112 is merged, the only reason to set leavePartsOnError = true will be if you don't want a failed upload to mark the leftover parts as aborted on S3 to save on storage.

@kuhe kuhe added guidance General information and guidance, answers to FAQs, or recommended best practices/resources. and removed investigating Issue is being investigated and/or work is in progress to resolve the issue. queued This issues is on the AWS team's backlog feature-request New feature or enhancement. May require GitHub community feedback. labels May 21, 2024
@RanVaknin
Contributor

Since this is addressed in the PR, I'm going to close this issue. Because closed issues are not as closely monitored, please create a new issue if you have any additional concerns.

Thanks,
Ran~
