
lib-storage Upload does not retry if one part fails. #2311

Closed
alvaromat opened this issue Apr 28, 2021 · 21 comments
Labels
guidance General information and guidance, answers to FAQs, or recommended best practices/resources. p2 This is a standard priority issue

Comments

@alvaromat

alvaromat commented Apr 28, 2021

Describe the bug

I'm using Upload from lib-storage to take advantage of multipart uploads for large files. I would expect it to retry if uploading any part fails. Imagine we have a 5 TB file that is split into 5 MB parts. That would result in many, many parts, so the chance of an individual part failing is high.

There is the leavePartsOnError option. Setting it to true interrupts the whole upload if any part fails.

Your environment

SDK version number

@aws-sdk/lib-storage@3.13.1

Is the issue in the browser/Node.js/ReactNative?

Browser

Details of the browser/Node.js/ReactNative version

  Browsers:
    Chrome: 90.0.4430.85
    Firefox: 78.0.2
    Safari: 14.0.3

Steps to reproduce

Upload a file using Upload, exactly like in the example:

    try {
      const upload = new Upload({
        client: s3,
        params: uploadParams,
        leavePartsOnError: false
      })

      upload.on('httpUploadProgress', progress => {
        console.log(progress)
      })

      const uploadResult = await upload.done()
      console.log('done uploading', uploadResult)
    } catch (err) {
      showErrors('There was an error uploading your photo: ', err)
    }

Observed behavior

If there are no errors with any part, the upload completes correctly.
If there is an error in one or more parts, that part is skipped and the final file is assembled without those parts. This results in a corrupt file. This can be reproduced by stopping the connection (turning WiFi off) or by using the browser DevTools to block the connection.

Expected behavior

If a single part fails to upload, which is likely to happen with big files in mobile environments, just retry sending that part.
This could be done, for example, here:

    } catch (e) {
      // on leavePartsOnError throw an error so users can deal with it themselves,
      // otherwise swallow the error.
      if (this.leavePartsOnError) {
        throw e;
      }
    }

And do not send the final CompleteMultipartUpload command if any part is missing:

    this.uploadedParts.sort((a, b) => a.PartNumber! - b.PartNumber!);
    const completeMultipartUpload = await this.client.send(
      new CompleteMultipartUploadCommand({
        ...this.params,
        UploadId: this.uploadId,
        MultipartUpload: {
          Parts: this.uploadedParts,
        },
      })
    );
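To make the proposal concrete, here is a minimal sketch of a bounded per-part retry, written as a standalone helper around UploadPartCommand rather than as a patch to Upload's internals (the helper name and default retry count are illustrative, not part of the library):

    import { S3Client, UploadPartCommand, UploadPartCommandInput } from "@aws-sdk/client-s3";

    // Illustrative helper: retry a single part upload a bounded number of times
    // before surfacing the error, so the part is never silently skipped.
    async function uploadPartWithRetry(
      client: S3Client,
      partParams: UploadPartCommandInput,
      maxRetries = 3
    ) {
      let lastError: unknown;
      for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
          return await client.send(new UploadPartCommand(partParams));
        } catch (err) {
          lastError = err;
        }
      }
      // After exhausting retries, throw instead of letting
      // CompleteMultipartUpload be sent with a missing part.
      throw lastError;
    }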

@alvaromat alvaromat added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 28, 2021
@alexforsyth
Contributor

@alvaromat you bring up a really good point. I agree that this is a much better experience. However - this isn't how upload is designed today so I'll change this to a feature request. As you point out, this isn't a difficult change, and it could also be a good first PR for new contributors!

In the meantime I'd love to open it up for public discussion. I'm looking to get a sense of what the retry would look like. No right or wrong answers, just looking to get a sense of where people are at in their thinking.

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we gate the total number of retries?
  3. How do we balance fast failing vs adequate retrying? What's best for the customer? What if I'm in an area of bad cell reception? What if something is up with the file/config/whatever and retrying a million times is just going to delay the customer?

@alexforsyth alexforsyth added feature-request New feature or enhancement. May require GitHub community feedback. needs-discussion and removed bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 10, 2021
@alexforsyth alexforsyth removed their assignment May 10, 2021
@alexforsyth
Contributor

Assigning @ajredniwja to support the discussion

@alexforsyth alexforsyth added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 10, 2021
@alvaromat
Author

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we gate the total number of retries?
  3. How do we balance fast failing vs adequate retrying? What's best for the customer? What if I'm in an area of bad cell reception? What if something is up with the file/config/whatever and retrying a million times is just going to delay the customer?

The simplest version would just retry multiple times with a backoff. I'd set a configurable maximum number of retries per part, since multiple parallel parts can fail at the same time.
The backoff could grow following a Fibonacci scheme to avoid failing too fast: the first failure waits 3 seconds, the next waits 5, the next 8...
With a default of 5 retries, it would wait roughly a minute in total before failing.
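As a rough sketch of that backoff (the delays and the helper name are illustrative):

    // Illustrative Fibonacci-style backoff for a single part:
    // delays grow as 3s, 5s, 8s, 13s, 21s for the default of 5 retries.
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function retryWithFibonacciBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
      let previousDelay = 2000;
      let delay = 3000;
      for (let attempt = 0; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          if (attempt >= maxRetries) throw err;
          await sleep(delay);
          [previousDelay, delay] = [delay, previousDelay + delay];
        }
      }
    }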

It's also possible to take advantage of the offline/online events. This could be useful for the long offline periods that can happen in low-connectivity environments, like entering the subway.
For example, if the offline event is fired, we don't start new requests until the online event is triggered.
There is also the navigator.onLine property, which is supported even by older browsers.
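A minimal sketch of that gating in the browser, using the standard events (the helper name is illustrative):

    // Resolve immediately if the browser reports being online, otherwise wait
    // for the standard 'online' event before starting (or retrying) a part upload.
    function waitForOnline(): Promise<void> {
      if (navigator.onLine) return Promise.resolve();
      return new Promise((resolve) =>
        window.addEventListener('online', () => resolve(), { once: true })
      );
    }

    // Usage before each part request:
    //   await waitForOnline();
    //   ...send the UploadPart request...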

I'd apply this retry logic only to part uploads, not to the initial request, where configuration issues are more likely to happen.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 18, 2021
@mkrn

mkrn commented Sep 14, 2021

Retrying failed parts of a multipart upload was apparently a feature of AWS SDK v2. lib-storage doesn't have it, so this is a regression.

Also, here's what AWS recommends:

Enable a retry mechanism in the application making requests

Because of the distributed nature of Amazon S3, requests that return 500 or 503 errors can be retried. It's a best practice to build retry logic into applications that make requests to Amazon S3. We recommend that you set the retry count for your applications to 10.

All AWS SDKs have a built-in retry mechanism with an algorithm that uses exponential backoff. This algorithm implements increasingly longer wait times between retries for consecutive error responses. Most exponential backoff algorithms use jitter (randomized delay) to prevent successive collisions. For more information, see Error Retries and Exponential Backoff in AWS.

There are 500 errors from time to time when uploading multipart files, and currently none of the AWS-maintained JS libraries handle them properly. A quick search shows how common and frustrating these are, with no solution offered: https://www.google.com/search?q=s3+multipart+upload+error+500

@alvaromat I would argue against using navigator.onLine, so that this code stays usable in different environments, not just the browser (Next.js, React Native (web), Node, etc.).

All that's needed is to retry failed parts up to maxRetryAttempts times with a delay of a second or so. Since it's retrying individual parts to recover from an occasional 500 error, or to survive a short drop in connectivity, exponential backoff is really not required.
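For reference, the "exponential backoff with jitter" described in the quoted guidance is commonly implemented along these lines (the "full jitter" variant; values are illustrative):

    // Delay before retry attempt N: a random value between 0 and
    // min(maxDelayMs, baseDelayMs * 2^N).
    function backoffDelayMs(attempt: number, baseDelayMs = 100, maxDelayMs = 20_000): number {
      return Math.random() * Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    }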

@mifi

mifi commented Oct 10, 2021

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.
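In code, that setup looks roughly like this (the maxAttempts value is illustrative; by default the client already makes up to 3 attempts):

    import { S3Client } from '@aws-sdk/client-s3';
    import { Upload } from '@aws-sdk/lib-storage';

    const client = new S3Client({ maxAttempts: 5 }); // raise from the default of 3 if desired

    const upload = new Upload({
      client,
      params: uploadParams,    // same PutObject-style params as in the example above
      leavePartsOnError: true, // surface part failures instead of silently skipping parts
    });

    await upload.done();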

@ThompsonGitHub

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

@mifi This has also been my experience of using the @aws-sdk/lib-storage upload method with leavePartsOnError=false. It can result in corrupt binaries with missing parts.

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

I agree with @mifi that the default should be leavePartsOnError=true because, as @mifi says, letting Upload throw the error means the S3 client will perform the retry. Even without the client retry, leavePartsOnError=true is a better default because at least the exception gets thrown, so you know that parts are failing and can retry the whole file upload, rather than getting corrupt files from what appear to be successful uploads.

If this is going to be the behaviour going forward, a rename of the leavePartsOnError property and/or a rewording of the example on https://github.com/aws/aws-sdk-js-v3/tree/main/lib/lib-storage (leavePartsOnError: false, // optional manually handle dropped parts) would be welcome, as I find the current documented example confusing/misleading.

@danielvouch

Has there been any update on this - is there a way to actually manually handle the dropped parts?

@lukerlv

lukerlv commented Feb 7, 2023

Hello, is there any progress on this feature?

@mifi

mifi commented Feb 7, 2023

created a PR #4414

@lukerlv

lukerlv commented Feb 7, 2023

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

@mifi hey, can you show me the code? I'm not sure how to write it, thank you very much!

@mifi

mifi commented Feb 7, 2023

not sure which code? just set leavePartsOnError: true

@lukerlv

lukerlv commented Feb 16, 2023

not sure which code? just set leavePartsOnError: true

ok, thank you @mifi

@macdja38

macdja38 commented Mar 27, 2023

@RanVaknin Is there any way to get this assigned a higher priority?

While there is a workaround available, it's not obvious that it needs to be used until something fails in a potentially catastrophic way, and teams spend potentially hours investigating.

The documentation pointed to above suggests to a user reading it that the default behaviour is the safe one with retries.

VerteDinde pushed a commit to electron/forge that referenced this issue Apr 4, 2023
@0zcl

0zcl commented Sep 26, 2023

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

After setting "leavePartsOnError: true" in @aws-sdk/lib-storage, how do I configure retries? I want a part to be retried 3 times when its upload fails. Hope you can answer, @mifi

@mifi

mifi commented Sep 26, 2023

I think it will auto retry, see #2311 (comment)

@0zcl

0zcl commented Sep 26, 2023

I think it will auto retry, see #2311 (comment)

When I upload a video using @aws-sdk/lib-storage with "leavePartsOnError: true" and disconnect the network during the upload, only one error ("partNumber=7 upload fail") shows up in the network panel, as you can see in the image below. If client-s3 retried, I would expect to see more than one failed request in the network panel. Hope you can answer, @mifi

(screenshot: browser network panel showing a single failed part upload request)

@lukerlv

lukerlv commented Nov 13, 2023

@0zcl hey man, have you solved the problem?

@0zcl

0zcl commented Nov 13, 2023

@0zcl hey man, have you solved the problem?

No, I can't configure the number of retries, although it doesn't affect my usage.

@glebbars

glebbars commented Jan 29, 2024

I also set Offline in the DevTools throttling drop-down to test the multipart upload retries.

With leavePartsOnError: true, uploading the file fails and doesn't retry any part. If set to false (the default), the error only shows up later while processing the file, since it is corrupted (one of the parts is absent).

@RanVaknin Is it possible to either retry uploading the failed part or the whole upload?
I am using @aws-sdk/client-s3: ^3.418.0, @aws-sdk/lib-storage: 3.501.0

@RanVaknin RanVaknin assigned RanVaknin and unassigned ajredniwja Feb 28, 2024
@RanVaknin RanVaknin added investigating Issue is being investigated and/or work is in progress to resolve the issue. and removed needs-discussion labels Feb 28, 2024
@aBurmeseDev aBurmeseDev added the queued This issues is on the AWS team's backlog label Mar 20, 2024
@RanVaknin RanVaknin added investigating Issue is being investigated and/or work is in progress to resolve the issue. and removed investigating Issue is being investigated and/or work is in progress to resolve the issue. labels May 9, 2024
@kuhe kuhe added p2 This is a standard priority issue and removed p3 This is a minor priority issue labels May 20, 2024
@kuhe kuhe self-assigned this May 20, 2024
@kuhe
Contributor

kuhe commented May 21, 2024

The Upload class uses an S3 client as one of its inputs. This client has a configured retry strategy by default that includes up to 3 attempts.

Because Upload splits the input into buffered chunks, even if the original input is a stream, these UploadPart requests will be retried according to the retryStrategy of the client.

You do not need to modify the retry configuration because the default will retry, but here is some information:
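A sketch of the relevant client configuration (values here are illustrative; ConfiguredRetryStrategy is assumed to be available from @smithy/util-retry in recent SDK versions):

    import { S3Client } from '@aws-sdk/client-s3';
    import { ConfiguredRetryStrategy } from '@smithy/util-retry';

    // Option 1: raise the number of attempts (the default is 3).
    const client = new S3Client({ maxAttempts: 5 });

    // Option 2: supply a retry strategy with a custom backoff.
    const clientWithBackoff = new S3Client({
      retryStrategy: new ConfiguredRetryStrategy(
        5,                                           // maximum total attempts
        (attempt: number) => 100 + attempt * 1000,   // delay before each retry, in ms
      ),
    });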

Once #6112 is merged, the only reason to set leavePartsOnError = true will be if you don't want a failed upload to mark the leftover parts as aborted on S3 to save on storage.

@kuhe kuhe added guidance General information and guidance, answers to FAQs, or recommended best practices/resources. and removed investigating Issue is being investigated and/or work is in progress to resolve the issue. queued This issues is on the AWS team's backlog feature-request New feature or enhancement. May require GitHub community feedback. labels May 21, 2024
@RanVaknin
Contributor

Since this is addressed in the PR, I'm going to close this issue. Because closed issues are not as closely monitored, please create a new issue if you have any additional concerns.

Thanks,
Ran~
