
gcp: make per-chunk retry upload timeout configurable #80474

Merged (1 commit) on Sep 9, 2022

Conversation

adityamaru
Contributor

This change adds a cluster setting cloudstorage.gs.chunking.retry_timeout
that can be used to change the default per-chunk retry timeout that GCS
imposes when chunked file uploads are enabled. The default value is set
to 60 seconds, which is double the default Google SDK value of 30s.

This change was motivated by sporadic occurrences of a 503 Service Unavailable
error during backups. On its own, this change is not expected to solve the
resiliency issues of backup when the upload service is unavailable, but it
is nice to have a configurable setting nonetheless.

Release note (sql change): cloudstorage.gs.chunking.retry_timeout
is a cluster setting that can be used to configure the per-chunk retry
timeout when uploading files to Google Cloud Storage. The default value is
60 seconds.
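As a usage sketch, the setting could be adjusted with CockroachDB's standard cluster-setting syntax (the `'90s'` value below is an arbitrary example, not a recommendation):

```sql
-- Inspect the current per-chunk retry timeout for GCS uploads.
SHOW CLUSTER SETTING cloudstorage.gs.chunking.retry_timeout;

-- Raise it, e.g. if chunk uploads keep hitting retryable 503s.
SET CLUSTER SETTING cloudstorage.gs.chunking.retry_timeout = '90s';
```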

@adityamaru adityamaru requested review from dt, stevendanna and a team April 25, 2022 13:55

@adityamaru
Contributor Author

adityamaru commented Apr 25, 2022

Open to discussing this PR more, since the 60-second value was chosen unscientifically. Our backups target writing 128MB files that are then chunked into the default 16MB chunks by GCS. This comment, googleapis/google-api-go-client#685 (comment), was interesting, especially in light of the throttling we are adding to external storage write paths.

If the write is destined to fail, this will cause the backup to take longer before it reaches a failed state.

@adityamaru
Contributor Author

I want to revisit checking this in, in light of increased 503 errors (#87480). We've now dropped our chunk size to 8<<20 as of #80668, but maybe 32 seconds is still too short in the face of 503 errors. I suggest we try bumping it to 60s and see if it reduces the frequency of such errors.

@adityamaru
Contributor Author

TFTR!

bors r=dt

@adityamaru adityamaru added the backport-22.2.x Flags PRs that need to be backported to 22.2. label Sep 9, 2022
@craig
Contributor

craig bot commented Sep 9, 2022

Build succeeded:
