Archiving to S3 stalls on "use of closed network connection" #656

Closed
dkent opened this issue May 22, 2020 · 9 comments

dkent commented May 22, 2020

wal-g version 0.2.15
postgres version 12.3

Every few weeks, WAL archiving to S3 stalls after a failed archive attempt and does not resume on its own. Restarting the postgresql service resumes archiving. I'm unsure whether this is a wal-g issue or a postgres archiver issue. At the time, I was performing some large updates, so the rate of WAL file generation was high, with archiving occurring multiple times per second rather than the usual 1-2 times per minute.

Using standard archive command:
wal-g wal-push $1
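
For reference, archive_command points at a small wrapper script (the same wal_backup.sh that shows up in the archiver log below), so the full setup looks roughly like this:

# postgresql.conf
archive_mode = on
archive_command = '/var/lib/postgresql/wal_backup.sh %p'

# /var/lib/postgresql/wal_backup.sh
#!/bin/bash
# WALG_S3_PREFIX and the S3 credentials are assumed to be set in the environment.
wal-g wal-push "$1"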

You'll notice in the logs that WAL archiving to S3 was working fine and then failed at 2020/05/21 17:22:46. Nothing more appeared in the logs until I restarted the service at 2020-05-22 07:29:52. At that point, the pg_wal directory had about 43GB of files ready to push to S3, all of which were pushed immediately after the restart. Thanks in advance!

INFO: 2020/05/21 17:22:40.735503 FILE PATH: 0000000100000865000000B8.br
INFO: 2020/05/21 17:22:40.906175 FILE PATH: 0000000100000865000000B7.br
INFO: 2020/05/21 17:22:40.955926 FILE PATH: 0000000100000865000000BA.br
INFO: 2020/05/21 17:22:41.110995 FILE PATH: 0000000100000865000000B9.br
ERROR: 2020/05/21 17:22:46.534246 failed to upload 'postgresql-backups/wal_005/00000001000008650000009A.br' to bucket 'c38-ops': RequestError: send request failed
caused by: Put https://c38-ops.s3.us-west-2.amazonaws.com/postgresql-backups/wal_005/00000001000008650000009A.br: write tcp 10.0.0.92:13268->52.218.248.17:443: use of closed network connection
INFO: 2020/05/21 17:22:46.534267 FILE PATH: 00000001000008650000009A.br
ERROR: 2020/05/21 17:22:46.534281 Error of background uploader: upload: could not Upload 'pg_wal/00000001000008650000009A'
: failed to upload 'postgresql-backups/wal_005/00000001000008650000009A.br' to bucket 'c38-ops': RequestError: send request failed
caused by: Put https://c38-ops.s3.us-west-2.amazonaws.com/postgresql-backups/wal_005/00000001000008650000009A.br: write tcp 10.0.0.92:13268->52.218.248.17:443: use of closed network connection
2020-05-22 07:29:52.972 PDT [9933] LOG:  received fast shutdown request
2020-05-22 07:29:52.975 PDT [9933] LOG:  aborting any active transactions
2020-05-22 07:29:52.976 PDT [10118] postgres@postgres FATAL:  terminating connection due to administrator command
2020-05-22 07:29:52.978 PDT [30478] postgres@circle_prod FATAL:  terminating connection due to administrator command
2020-05-22 07:29:52.979 PDT [9933] LOG:  background worker "logical replication launcher" (PID 9941) exited with exit code 1
2020-05-22 07:29:54.630 PDT [9935] LOG:  shutting down
2020-05-22 07:30:53.061 PDT [9939] FATAL:  archive command was terminated by signal 15: Terminated
2020-05-22 07:30:53.061 PDT [9939] DETAIL:  The failed archive command was: /var/lib/postgresql/wal_backup.sh pg_wal/00000001000008650000008D
2020-05-22 07:30:53.062 PDT [9933] LOG:  archiver process (PID 9939) exited with exit code 1
2020-05-22 07:30:53.682 PDT [9933] LOG:  database system is shut down
2020-05-22 07:30:53.779 PDT [13276] LOG:  starting PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
2020-05-22 07:30:53.779 PDT [13276] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-05-22 07:30:53.779 PDT [13276] LOG:  listening on IPv6 address "::", port 5432
2020-05-22 07:30:53.780 PDT [13276] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-22 07:30:53.899 PDT [13277] LOG:  database system was shut down at 2020-05-22 07:29:54 PDT
2020-05-22 07:30:53.905 PDT [13276] LOG:  database system is ready to accept connections
INFO: 2020/05/22 07:30:56.255094 FILE PATH: 0000000100000865000000BE.br
INFO: 2020/05/22 07:30:56.255172 FILE PATH: 0000000100000865000000BC.br
INFO: 2020/05/22 07:30:56.255218 FILE PATH: 0000000100000865000000C5.br
INFO: 2020/05/22 07:30:56.255379 FILE PATH: 0000000100000865000000BF.br
INFO: 2020/05/22 07:30:56.255386 FILE PATH: 00000001000008650000008D.br
bunsenmcdubbs (Contributor) commented:

I've noticed the same S3 error in my backup-push commands... use of closed network connection


jschaf commented Aug 7, 2020

The root cause might be aws/aws-sdk-go#3406. In my case, it manifests for about 5% of our nightly backups. Here's the wal-g stack trace:

ERROR: 2020/08/06 08:39:52.198782 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET':
    MultipartUpload: upload multipart failed
caused by: RequestError: send request failed
caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2:
    write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198805 upload: could not upload 'base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br'
ERROR: 2020/08/06 08:39:52.198818 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET':
    MultipartUpload: upload multipart failed
caused by: RequestError: send request failed
caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2
    write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198833 Unable to complete uploads

Tinsane (Collaborator) commented Aug 13, 2020

Regardless of the root cause, I strongly recommend setting a timeout on wal-g commands. It doesn't break wal-g or the database state, and it resolves any problem with stalling:
timeout 1337 wal-g wal-push $1
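
For example, the archive wrapper script can be as simple as the following (1337 seconds is an arbitrary value; pick whatever comfortably covers a normal upload):

#!/bin/bash
# Called by PostgreSQL via: archive_command = '/path/to/wal_backup.sh %p'
# If the upload stalls, timeout kills wal-g after 1337 seconds; the non-zero exit
# status makes PostgreSQL keep the segment and retry it on the next archive cycle.
exec timeout 1337 wal-g wal-push "$1"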


alexmv commented Nov 20, 2020

We've run into this a couple of times. Any signal on whether the aws-sdk-go version bump addresses this, and any thoughts on when we might see a full release that includes it?

alexmv added a commit to alexmv/zulip that referenced this issue Nov 20, 2020
`wal-g wal-push` has a known bug with hanging after file upload to
S3[1]; set a rather long timeout on the upload process, so that we
don't simply stall forever when archiving WAL segments.

[1] wal-g/wal-g#656
alexmv added a commit to alexmv/zulip that referenced this issue Nov 20, 2020
`wal-g wal-push` has a known bug with occasionally hanging after file
upload to S3[1]; set a rather long timeout on the upload process, so
that we don't simply stall forever when archiving WAL segments.

[1] wal-g/wal-g#656
timabbott pushed a commit to zulip/zulip that referenced this issue Nov 20, 2020
`wal-g wal-push` has a known bug with occasionally hanging after file
upload to S3[1]; set a rather long timeout on the upload process, so
that we don't simply stall forever when archiving WAL segments.

[1] wal-g/wal-g#656
x4m (Collaborator) commented Nov 30, 2020

@alexmv yeah, I think it's time for the release. Anyway, the current behavior of S3 storage is much better now.

bunsenmcdubbs (Contributor) commented:

I can confirm that our AWS S3 issues (I work with @jschaf) largely went away after we started using the pre-release that includes the aws-sdk-go version bump. Apologies for not reporting back earlier. @x4m, I'd strongly recommend cutting a new release with this update; it dramatically reduced our overall error rate.

x4m (Collaborator) commented Nov 30, 2020

@bunsenmcdubbs and others, let's compose the release notes here:
http://collabedit.com/4erab
A lot of cool new stuff was added, but I only remember some of the bullet points.

timabbott commented:

It looks like a release happened: https://github.com/wal-g/wal-g/releases/tag/v1.0

Does that mean this is fixed?

x4m (Collaborator) commented Jul 15, 2021

Yeah, thanks for the reminder!

If anyone is observing something similar, please reopen this or create a new issue.

x4m closed this as completed Jul 15, 2021