add storage.writer mock and error-injection tests #31

Closed
wants to merge 1 commit

Conversation

hphilipps

@agneum @x4m
This adds a storage.Writer mock which can be used to inject errors, plus some tests that use it.
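
For readers skimming the diff, the mock boils down to an io.WriteCloser whose Write can be made to fail on demand. A minimal sketch, with illustrative names rather than the PR's actual code:

```go
package gcstest

import "bytes"

// errWriter is an io.WriteCloser that fails the first failN writes with the
// injected error and then behaves like a normal in-memory writer.
type errWriter struct {
	failN int          // number of writes to fail before succeeding
	calls int          // writes attempted so far
	err   error        // error to inject, e.g. one representing a 503
	buf   bytes.Buffer // collects successfully written data
}

func (w *errWriter) Write(p []byte) (int, error) {
	w.calls++
	if w.calls <= w.failN {
		return 0, w.err
	}
	return w.buf.Write(p)
}

func (w *errWriter) Close() error { return nil }
```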

The tests show that the GCS storage client retries are working fine but give up after around 30s. After that, the uploader retries should kick in, but they always return 503 errors, which probably comes from reusing the same writer that already failed. This also reflects what we saw in our real-world tests.

I think the uploader code needs to be refactored to inject some dependencies, both to make it easier to test and to allow creating a new writer before retrying within uploader.uploadChunk.

I pushed this PR up to share the error injection tests already. I will try to work on fixing the retry code tomorrow.

@NikolayS

Thanks, @hphilipps!

@agneum

agneum commented Sep 30, 2020

@hphilipps
I suppose we might need to update the writer after several failed retries, like this: master...agneum:gcs-optimize-uploader

What kind of refactoring do you mean? I believe using the io.WriteCloser interface instead of *storage.Writer could simplify testing. I can prepare a new PR with these improvements. Am I missing something else?
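
A hedged sketch of that idea, assuming a hypothetical uploadChunk signature rather than the actual wal-g code: once the function depends only on io.WriteCloser, a test can pass an error-injecting mock while production code passes a real *storage.Writer.

```go
package gcs

import "io"

// uploadChunk is a hypothetical signature used for illustration: depending on
// io.WriteCloser instead of *storage.Writer lets tests inject a mock writer.
func uploadChunk(w io.WriteCloser, chunk []byte) error {
	if _, err := w.Write(chunk); err != nil {
		// Real code might recreate the writer and retry here instead of failing.
		return err
	}
	return nil
}
```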

hphilipps mentioned this pull request Sep 30, 2020
@hphilipps
Author

@agneum
I think the approach of chunking the file into 20MiB parts and only retrying the current part cannot work when we create a new writer, because a new writer will always start a new resumable upload from scratch. I created #32 with a simpler approach - the tests show retries working there. I'd be glad if you could review.

@agneum

agneum commented Oct 1, 2020

@hphilipps
I agree with that. It seems we cannot retrieve upload_id and reuse it for a new writer.
I traced the full call stack (wal-g/wal-g -> wal-g/storage -> cloud.google.com/go/storage -> google.golang.org/api) and found some interesting facts about resumable uploads and the retry policy.

As the docs (https://cloud.google.com/storage/docs/resumable-uploads) describe, we can control the threshold for performing resumable uploads with Writer.ChunkSize. By default, the GCS client library performs chunked resumable uploads, so we can expect it to split the data automatically and retry requests.
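
For reference, ChunkSize has to be set on the writer before the first Write; a minimal sketch with placeholder bucket and object names:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// ChunkSize controls resumable-upload chunking and must be set before the
	// first Write; 0 disables chunking and uploads the object in one request.
	w := client.Bucket("my-bucket").Object("my-object").NewWriter(ctx)
	w.ChunkSize = 16 * 1024 * 1024 // 16MiB, the library default

	if _, err := w.Write([]byte("payload")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```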

I replaced the response with a 503 code and an empty body for upload requests (https://github.com/googleapis/google-api-go-client/blob/master/internal/gensupport/send.go#L138) and found that the library retries resumable requests according to the backoff policy. It's a bit odd that these retries don't happen in real life.
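
One way to reproduce that kind of injection without patching the library is at the HTTP transport level. A sketch (not the mock used here or in #32) of a RoundTripper that returns 503 with an empty body for the first few requests; it could be handed to the GCS client via option.WithHTTPClient:

```go
package gcstest

import (
	"io"
	"net/http"
	"strings"
	"sync"
)

// flakyTransport returns a 503 for the first failN requests it sees and then
// delegates to the wrapped transport.
type flakyTransport struct {
	mu    sync.Mutex
	failN int
	seen  int
	next  http.RoundTripper
}

func (t *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	t.seen++
	fail := t.seen <= t.failN
	t.mu.Unlock()

	if fail {
		return &http.Response{
			StatusCode: http.StatusServiceUnavailable,
			Status:     "503 Service Unavailable",
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("")),
			Request:    req,
		}, nil
	}
	return t.next.RoundTrip(req)
}
```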

I guess this could happen if GCS returns a slightly different response from the one I tested with. I'm afraid the synthetic tests and mocks could also give a distorted picture.

Of course, we could build our own client that sends resumable upload requests directly, but I don't see any obvious mistakes in the client library.

@hphilipps
Author

@agneum
We have been doing chunked resumable uploads with retries via the storage.Writer all along, and they are working as expected.

The default for the storage.Writer, if you don't set a ChunkSize, is to do chunked uploads with 16MiB chunks.
My tests with injecting 503 errors show nicely that this is happening; it is transparent to the user of a storage.Writer, of course.
But because of the hardcoded retryDeadline of 32s, the storage.Writer retries for up to 32s only, then gives up and cannot be reused - and this is what is happening to us in production.

Have a look at the debug output of running upload_test.go in #32. I added the mock writer specifically to test with injected errors - this was really helpful for debugging the issues, and I encourage you to take over those tests.

This is the output of a test where I inject a 503 for the first chunk; you can see the first Content-Range header repeated because the first 16MiB chunk is retried, and then the upload continues until all 40MiB have been uploaded:

[...]
cRange:  0-16777215/*
cRange:  0-16777215/*
cRange:  16777216-33554431/*
cRange:  33554432-41943039/41943040
[...]
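
Those ranges follow directly from 16MiB chunking of a 40MiB object. A tiny sketch that reproduces the non-retried ranges (the total size is only sent with the final chunk, hence the earlier /*):

```go
package main

import "fmt"

func main() {
	const chunk = 16 * 1024 * 1024 // 16MiB default chunk size
	const total = 40 * 1024 * 1024 // 40MiB test object

	for off := 0; off < total; off += chunk {
		end, size := off+chunk, "*" // size stays unknown until the final chunk
		if end >= total {
			end, size = total, fmt.Sprint(total)
		}
		fmt.Printf("cRange:  %d-%d/%s\n", off, end-1, size)
	}
}
```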

If we could get an option to configure a retryDeadline longer than 32s, we wouldn't need to work on our own retries here.
Have a look at my comment: #32 (comment)

@hphilipps
Author

Closing, as the retry issues have mostly been fixed by #33 and the following PRs.

hphilipps closed this Dec 14, 2020