Random MultipartUpload -> RequestError -> "use of closed network connection" when uploading a lot of data to S3 #3406

Open
3 tasks done
segevfiner opened this issue Jul 2, 2020 · 11 comments · Fixed by #3476 or #3479
Labels
bug This issue is a bug. p3 This is a minor priority issue

Comments

@segevfiner

Describe the bug
We have code that downloads and uploads S3 objects across different buckets. (We can't use CopyObject/UploadPartCopy because the source buckets are public buckets outside our control.)

On larger buckets, after some time passes, we randomly get an error that looks something like this:

MultipartUpload: upload multipart failed
    upload id: <snip>
caused by: RequestError: send request failed
caused by: Put https://<snip>.s3.amazonaws.com/some_object?partNumber=4&uploadId=<snip>: write tcp 172.20.0.57:54788->52.217.14.236:443: use of closed network connection

The SDK is configured, as per default, to retry requests (I think the default for S3 is 3 retries), but it might not be retrying for this specific error. At least, I wasn't able to find a reference to it in the SDK code. (I think this is one of those errors that Go hasn't exported for unknown reasons; see golang/go#4373.)

I'm not sure if this is an error that can/should occur sporadically and should be retried by the SDK or it arises from some race condition/bug somewhere in the SDK or Go.

Version of AWS SDK for Go?
v1.31.15

Version of Go (go version)?
go version go1.13.12 linux/amd64

To Reproduce (observed behavior)
https://github.com/segevfiner/s3-download-upload-stress

There is one MWE there to copy an object from one bucket over and over to another, and one that copies a bucket recursively to another.

Expected behavior
The entire copy process should run to completion and not crash midway with a random error; the SDK should retry internally on transient errors.


@segevfiner segevfiner added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 2, 2020
@diehlaws diehlaws self-assigned this Jul 16, 2020
@diehlaws diehlaws removed the needs-triage This issue or PR still needs to be triaged. label Jul 16, 2020
@diehlaws
Contributor

Hi @segevfiner, thanks for reaching out to us about this. It sounds like the SDK is attempting to re-use a connection that has been closed by S3. This should be solvable by implementing a custom HTTP client in your session's config with a Dialer.KeepAlive value lower than the default 30 seconds (or a negative value to disable keep-alives). Please do let us know if you continue to see this behavior when using a custom keep-alive value on the HTTP client used by the SDK for your S3 calls.

@diehlaws diehlaws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jul 16, 2020
@github-actions

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Jul 24, 2020
@segevfiner
Author

segevfiner commented Jul 24, 2020

(Commenting to stop auto close)

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jul 25, 2020
@praneetloke

@diehlaws I followed your recommendation for setting the keep-alive interval value to less than 30s, but I still see the use of closed network connection error from the SDK during uploads. I followed the recommendation from the link you provided in your comment for creating custom HTTP clients.

Here's the relevant snippet. Note that the one difference is that I also set the Timeout property of the http.Client{} type (ClientTimeout setting in the below snippet) in addition to the settings on the http.Transport type.

// The values specified here are the default values defined in the net/http
// package's DefaultTransport instance, except where noted.
httpClient, err := NewHTTPClientWithSettings(HTTPClientSettings{
	// The HTTP Client total timeout value for the request.
	ClientTimeout: 30 * time.Second,

	Connect: 30 * time.Second,
	// Set the keep-alive interval to less than 30s.
	ConnKeepAlive:  10 * time.Second,
	ExpectContinue: 1 * time.Second,
	IdleConn:       30 * time.Second,

	MaxAllIdleConns:  100,
	MaxHostIdleConns: 2,

	ResponseHeader: 5 * time.Second,
	TLSHandshake:   10 * time.Second,
})

@matthewswain

I've been looking at this as I've experienced the same behaviour.

It looks to me like:

  • When the "use of closed network connection" error is encountered, it's wrapped with awserr.New using ErrCodeRequestError (see here)
  • This bubbles up to be evaluated by shouldRetryError in the retryer
    (see here)
  • Which in turn passes it to isNestedErrorRetryable (see here)
  • isNestedErrorRetryable doesn't seem to have provision for this particular error being considered retryable

The original error is defined here. Perhaps isNestedErrorRetryable should be returning true in this instance?

@matthewswain

Scratch that. Adding the following, passing test case to TestIsErrorRetryable in aws/request/retryer_test.go seems to invalidate the theory that shouldRetryError returns false.

{
	Err:       awserr.New(ErrCodeRequestError, "send request failed", errors.New("use of closed network connection")),
	Retryable: true,
},

@jschaf

jschaf commented Aug 7, 2020

We see this error multiple times per day when uploading Postgres backups totaling many TBs to S3; about 5% of our backups fail. The error manifests through wal-g, a Go library that uploads a Postgres backup to S3. The AWS SDK used in our version is v1.26.1.

Here's the wal-g error output:

ERROR: 2020/08/06 08:39:52.198782 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET':
    MultipartUpload: upload multipart failed
caused by: RequestError: send request failed
caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2:
    write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198805 upload: could not upload 'base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br'
ERROR: 2020/08/06 08:39:52.198818 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET':
    MultipartUpload: upload multipart failed
caused by: RequestError: send request failed
caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2
    write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198833 Unable to complete uploads

jschaf added a commit to jschaf/wal-g that referenced this issue Aug 7, 2020
When uploading large amounts of data to S3, we occasionally see failures where the AWS sdk tries to use a closed network connection. The upstream bug appears to be aws/aws-sdk-go#3406.  I'm not sure why the error manifests but it's causing us significant pain. Rather than retry the entire base backup, we'll retry the WAL segment upload.

aws-sdk-go-automation added a commit that referenced this issue Aug 12, 2020
Release v1.34.3 (2020-08-12)
===

### Service Client Updates
* `service/cloud9`: Updates service API and documentation
  * Add ConnectionType input parameter to CreateEnvironmentEC2 endpoint. New parameter enables creation of environments with SSM connection.
* `service/comprehend`: Updates service documentation
* `service/ec2`: Updates service API and documentation
  * Introduces support for IPv6-in-IPv4 IPsec tunnels. A user can now send traffic from their on-premise IPv6 network to AWS VPCs that have IPv6 support enabled.
* `service/fsx`: Updates service API and documentation
* `service/iot`: Updates service API, documentation, and paginators
  * Audit finding suppressions: Device Defender enables customers to turn off non-compliant findings for specific resources on a per check basis.
* `service/lambda`: Updates service API and examples
  * Support for creating Lambda Functions using 'java8.al2' and 'provided.al2'
* `service/transfer`: Updates service API, documentation, and paginators
  * Adds security policies to control cryptographic algorithms advertised by your server, additional characters in usernames and length increase, and FIPS compliant endpoints in the US and Canada regions.
* `service/workspaces`: Updates service API and documentation
  * Adds optional EnableWorkDocs property to WorkspaceCreationProperties in the ModifyWorkspaceCreationProperties API

### SDK Enhancements
* `codegen`: Add XXX_Values functions for getting slice of API enums by type.
  * Fixes [#3441](#3441) by adding a new XXX_Values function for each API enum type that returns a slice of enum values, e.g `DomainStatus_Values`.
* `aws/request`: Update default retry to retry "use of closed network connection" errors ([#3476](#3476))
  * Fixes [#3406](#3406)

### SDK Bugs
* `private/protocol/json/jsonutil`: Fixes a bug that truncated millisecond precision time in API response to seconds. ([#3474](#3474))
  * Fixes [#3464](#3464)
  * Fixes [#3410](#3410)
* `codegen`: Export event stream constructor for easier mocking ([#3473](#3473))
  * Fixes [#3412](#3412) by exporting the operation's EventStream type's constructor function so it can be used to fully initialize fully when mocking out behavior for API operations with event streams.
* `service/ec2`: Fix max retries with client customizations ([#3465](#3465))
  * Fixes [#3374](#3374) by correcting the EC2 API client's customization for ModifyNetworkInterfaceAttribute and AssignPrivateIpAddresses operations to use the aws.Config.MaxRetries value if set. Previously the API client's customizations would ignore MaxRetries specified in the SDK's aws.Config.MaxRetries field.
@diehlaws diehlaws removed their assignment Aug 26, 2020
@rowandh

rowandh commented Oct 19, 2020

I think this issue may still exist on Windows machines. I'm using https://github.com/peak/s5cmd with aws-sdk-go v1.34.12 built for Windows. Everything is working fine but I receive this error very often: wsarecv: an existing connection was forcibly closed by the remote host.

It seems likely that this error message is unique to the Windows TCP/IP stack's WSAECONNRESET error code - see https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2. Because it's Windows-specific, it's being missed by the error string matching which only checks for a *nix message.

I'm not familiar with how Go handles error messages on different platforms, so it's possible I'm going down the wrong path with this. However, adding handling for this error message to the retry logic seemed to fix the issue for me.

@larytet

larytet commented Mar 13, 2021

Happens in a Linux container
write tcp my_ip:51294->s3_ip:443: use of closed network connection
Should I upgrade the AWS SDK?

brili added a commit to prometheusresearch/rex_deliver_dataset that referenced this issue Mar 10, 2022
ajinkyagadewar added a commit to prometheusresearch/rex_deliver_dataset that referenced this issue Mar 11, 2022
@lucix-aws lucix-aws reopened this Mar 14, 2023
@RanVaknin RanVaknin added the p3 This is a minor priority issue label Mar 27, 2023
@bartdenotte

We have recently been seeing more occurrences of this issue. Is there a reason why this ticket is open again?

@AMZN-hgoffin

AMZN-hgoffin commented Feb 28, 2024

I believe the problem here is that the AWS SDK for Go is
1/ not forcing a new, non-reused connection for non-idempotent requests, as the HTTP spec strongly suggests (and as most browsers implement), and
2/ not treating connection closure as a retriable error for idempotent requests.

The workaround is to disable HTTP transport keep-alive (which is NOT the same thing as the Dialer keep-alive; that setting is unrelated to the problem), or to set the client-side idle connection timeout very short in the hope that you never get close to the actual server timeout.
