Fix retry #32
Conversation
I checked: there is no way to change the retryDeadline or backoff strategy of the storage.Writer - it is hardcoded in an internal package :-( https://github.com/googleapis/google-api-go-client/blob/master/internal/gensupport/resumable.go#L27
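For reference, the relevant part of that internal package looks roughly like this (a paraphrased sketch of the linked file, not a verbatim copy; names may differ slightly):

```go
// Paraphrased sketch of internal/gensupport/resumable.go: the retry deadline
// for a failed chunk is a package-level value, and the exported storage API
// offers no way to override it.
package gensupport

import "time"

// retryDeadline bounds how long a failed chunk of a resumable upload is
// retried before the error is surfaced to the caller.
var retryDeadline = 32 * time.Second
```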
I opened an issue about the hardcoded retryDeadline.
Thinking about it - even if we got the option to set a higher retryDeadline …
I created a PR to override …
That's great @agneum, thanks!
@hphilipps With the approach implemented in this PR, and keeping in mind that Postgres stores data in files not exceeding 1 GiB, we are going to use N * 1 GiB of RAM (where N is the concurrency level from the WAL-G configuration). As I understand, you're going to think about the next steps to reduce memory consumption. What could they be?
Right? How difficult do you think this would be to implement? Wouldn't it require too many changes in WAL-G (and potentially affect storage types other than GCS)?
Dunno, we do not observe memory consumption of more than 1 GiB - why would GCS storage require more?
Actually, we usually observe ~440 MiB, because the network is throttled to the same bandwidth as the disk, but we send compressed data, so there aren't that many buffers ahead of the network.
@x4m @NikolayS To explain the issue again: the wal-g storages PutObject() interface takes an io.Reader, and changing this in the wal-g storages package seems impossible - io.Reader is used all over the place through the call stack, which makes sense as we want to stream the data instead of buffering it. A retry would need to take place in wal-g at the point where we access the file on disk for reading, to get around buffering in memory; see the sketch below.
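A rough sketch of what that could look like (a hypothetical helper, not something wal-g has today): re-open the file for every attempt so nothing needs to be buffered in memory, with uploadFn standing in for the storage-specific PutObject implementation.

```go
package upload

import (
	"io"
	"os"
	"time"
)

// uploadWithFileRetry is a hypothetical sketch: it retries an upload by
// re-opening the source file for each attempt, so the data never has to be
// buffered in memory. uploadFn stands in for the storage-specific
// PutObject implementation that consumes an io.Reader.
func uploadWithFileRetry(path, name string, attempts int,
	uploadFn func(name string, content io.Reader) error) error {

	var err error
	for i := 0; i < attempts; i++ {
		f, openErr := os.Open(path) // fresh reader for every attempt
		if openErr != nil {
			return openErr
		}
		err = uploadFn(name, f)
		f.Close()
		if err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * time.Second) // naive linear backoff
	}
	return err
}
```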
So, if upload errors persist for longer than 32 seconds, we need to store the whole file?
@x4m Exactly, the current implementation of the google-go-api-client has the retryDeadline hardcoded to 32 seconds. So storage.Writer fails after retrying the failed chunk for 32s and can't be re-used. And if we want to retry sending this file via a new storage.Writer, we need to give it the same data again, which we can only do if we buffered the data within wal-g. Of course we don't want to resend the whole file if possible. This is exactly why storage.Writer is doing a chunked upload internally and only retrying the failed chunks. But if it gets a 503 for longer than 32s, then possibly the GCS backend server handling the resumable upload session (all chunks for a session always go to the same backend, I was told by a GCS engineer) died, the session is lost anyway, and we need to start uploading the whole file again.
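To make that concrete, here is a minimal sketch of the buffering approach against the public cloud.google.com/go/storage API (the helper name and retry loop are made up): the whole object is kept in memory so it can be replayed through a brand-new storage.Writer on every attempt.

```go
package gcsupload

import (
	"bytes"
	"context"
	"io"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithRetry buffers the whole object so it can be replayed through a
// fresh storage.Writer on every attempt: once a Writer has failed, neither
// it nor its resumable-upload session can be reused.
func uploadWithRetry(ctx context.Context, obj *storage.ObjectHandle, content io.Reader, attempts int) error {
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, content); err != nil {
		return err
	}
	data := buf.Bytes()

	var err error
	for i := 0; i < attempts; i++ {
		w := obj.NewWriter(ctx) // fresh writer, fresh upload session
		if _, err = w.Write(data); err == nil {
			err = w.Close()
		} else {
			w.Close() // best effort; the write already failed
		}
		if err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * time.Second) // naive backoff between attempts
	}
	return err
}
```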
In AWS we send multiparts, which are persistent and do not depend on the liveness of a server backend.
@hphilipps @NikolayS @x4m
In my point of view, the most preferable looks to be the last method, with multipart uploading and object composition. I'm going to implement it.
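If I understand the idea correctly, it could look roughly like this with the GCS Go client (a sketch with made-up names, not the actual implementation): each part is uploaded as its own object with its own retries, and the parts are then stitched together with ComposerFrom, so only one part at a time has to sit in memory.

```go
package gcsupload

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
)

// composeParts joins previously uploaded part objects into the final object.
// Each part can be uploaded and retried independently, so only one part at a
// time needs to be buffered.
func composeParts(ctx context.Context, bkt *storage.BucketHandle, dst string, partNames []string) error {
	parts := make([]*storage.ObjectHandle, 0, len(partNames))
	for _, name := range partNames {
		parts = append(parts, bkt.Object(name))
	}
	// GCS compose accepts up to 32 source objects per call; very large
	// uploads would need several compose rounds.
	if _, err := bkt.Object(dst).ComposerFrom(parts...).Run(ctx); err != nil {
		return fmt.Errorf("composing %s: %v", dst, err)
	}
	// Clean up the intermediate part objects once the final object exists.
	for _, part := range parts {
		_ = part.Delete(ctx)
	}
	return nil
}
```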
@agneum That's a great finding - composite objects sound like the perfect solution to avoid buffering too much in memory for retrying! Let me know if I can help in any way - I'm happy to review your solution. And I totally agree with your evaluation of the other possible solutions - they all seem either incomplete or very hard to accomplish.
Special thanks to @x4m for guiding us today. |
Here is a PR #33 that implements multipart uploading. |
Closing in favour of #33. |
@agneum @x4m @NikolayS
This builds on #31 and fixes the GCS retry logic.
I simplified the retry logic by completely removing the Uploader. The Uploader was chunking files into 20 MiB pieces and retrying each of the pieces when the writer failed. But this cannot work:
When a writer.Write() finally fails (after doing its own retries for around 30s), it does not seem possible to reuse the same writer for the same upload anymore, so we need to create a new writer and try again. But when we create a new writer, the SessionID of the resumable upload is lost and the new writer will always write the file from scratch, with offset 0 (in 16 MiB chunks), to GCS, so we also need to start giving it the first 20 MiB part again instead of trying to continue with the current part.
I simplified this by reading the whole file into a buffer and simply retrying the whole file. The downside, of course, is that we will use a lot of memory for big files, but there is no other way as long as PutObject() takes an io.Reader - it can only be read once and thus needs to be cached for retrying. If we want to avoid that, we would need to do retries higher up the stack in wal-g, or change the signature of PutObject() to take a file handle that can be re-opened.
If there were a way to extend the GCS client retry parameters to retry for longer than around 30s, we maybe wouldn't need all of this. But so far I couldn't find a way to configure retries for a storage.Writer.
I will work on making the chunk size a config option next. This is getting a bit complicated, because we will want different chunk sizes for wal-push (small files) and backup-push (big files).
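For what it's worth, a small sketch of that direction (the environment variable name is made up, not an existing wal-g setting): read the chunk size from configuration and apply it to the writer, with a different default passed in for wal-push and backup-push.

```go
package gcsupload

import (
	"context"
	"os"
	"strconv"

	"cloud.google.com/go/storage"
)

// newWriterWithChunkSize creates a writer whose chunk size comes from a
// hypothetical WALG_GCS_CHUNK_SIZE environment variable, falling back to the
// given default: a small default suits wal-push (many small files), a large
// one suits backup-push (few big files).
func newWriterWithChunkSize(ctx context.Context, obj *storage.ObjectHandle, defaultChunkSize int) *storage.Writer {
	chunkSize := defaultChunkSize
	if v := os.Getenv("WALG_GCS_CHUNK_SIZE"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			chunkSize = n
		}
	}
	w := obj.NewWriter(ctx)
	w.ChunkSize = chunkSize // in bytes; 0 disables chunking and uploads in a single request
	return w
}
```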