
storage: object written to bucket even if Writer.Close fails because of context expiration #4453

Closed
nicholas-mainardi opened this issue Jul 16, 2021 · 4 comments
Labels
api: storage (Issues related to the Cloud Storage API.) · type: question (Request for information or clarification. Not an issue.)

Comments

@nicholas-mainardi

I am trying to implement a client application that uploads big files to GCS buckets using parallel composite uploads. To this end, I split the big file into chunks and upload each chunk in parallel as a separate object, then merge all the chunks with a Composer. For each chunk, I create a Writer instance, each with its own context and timeout.

The problem in this setting is that sometimes, even when Writer.Close fails with a context deadline exceeded error, the object is still created in the bucket, which looks like unexpected behavior to me. The issue seems to be triggered occasionally when some of the simultaneously uploaded chunks are successfully written to the bucket before their context expires, while others are written even though the context expired before Writer.Close completed.
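For reference, the merge step looks roughly like the sketch below. This is a minimal sketch, not my actual code: composeChunks and chunkKeys are hypothetical names, and the imports (context, fmt, cloud.google.com/go/storage) match the reproduction code further down.

func composeChunks(ctx context.Context, client *storage.Client,
	bucketName, dstKey string, chunkKeys []string) error {
	bucket := client.Bucket(bucketName)
	srcs := make([]*storage.ObjectHandle, 0, len(chunkKeys))
	for _, key := range chunkKeys {
		srcs = append(srcs, bucket.Object(key))
	}
	// Compose accepts at most 32 source objects per call.
	if _, err := bucket.Object(dstKey).ComposerFrom(srcs...).Run(ctx); err != nil {
		return fmt.Errorf("compose %s: %w", dstKey, err)
	}
	return nil
}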

I tried to reproduce the issue with a simpler piece of code in my local environment:

Client

Storage

Environment

Linux 5.8.0-59-generic #66~20.04.1-Ubuntu - native installation

Go Environment

$ go version
go version go1.16.3

$ go env
GO111MODULE="auto"
GOHOSTARCH="amd64"
CGO_ENABLED="1"

Code

package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"fmt"
	"io"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

const uploadTimeout = 33 * time.Second

// putCopy uploads objContent as a single object, using a per-upload
// context with a timeout, and signals completion on doneCh.
func putCopy(ctx context.Context, client *storage.Client, bucketName, objectKey string,
	objContent []byte, doneCh chan struct{}) {
	defer func() {
		doneCh <- struct{}{}
	}()
	obj := client.Bucket(bucketName).Object(objectKey)
	ctxWithTimeout, cancelFn := context.WithTimeout(ctx, uploadTimeout)
	defer cancelFn()
	writer := obj.NewWriter(ctxWithTimeout)
	reader := bytes.NewReader(objContent)
	if _, err := io.Copy(writer, reader); err != nil {
		log.Printf("%s copy: %v", objectKey, err)
		return
	}
	// Close flushes the remaining data and finalizes the upload; this is
	// the call that sometimes fails with context deadline exceeded even
	// though the object ends up in the bucket.
	if err := writer.Close(); err != nil {
		log.Printf("%s close: %v", objectKey, err)
		return
	}
}

// putObjectCopies concurrently uploads numCopies objects, each holding
// the same 5 MB random payload, and waits for all uploads to finish.
func putObjectCopies(credentialFile string, bucketName, objectKey string, numCopies int) {
	ctx := context.Background()

	objContent := make([]byte, 5000000)
	if _, err := rand.Read(objContent); err != nil {
		log.Fatalln("error generating random buffer:", err)
	}
	content, err := os.ReadFile(credentialFile)
	if err != nil {
		log.Fatalln("Error reading credentials file:", err)
	}

	client, err := storage.NewClient(ctx, option.WithCredentialsJSON(content))
	if err != nil {
		log.Fatalln("Error creating client:", err)
	}
	doneCh := make(chan struct{}, numCopies)
	for i := 0; i < numCopies; i++ {
		go putCopy(ctx, client, bucketName, fmt.Sprintf("%s-%d", objectKey, i),
			objContent, doneCh)
	}
	for i := 0; i < numCopies; i++ {
		<-doneCh
	}
}

The putObjectCopies function simultaneously uploads numCopies copies of an object with a random payload (the same for each object for simplicity).

Expected behavior

Bucket bucketName should contain only the copies for which writer.Close returns without error.

Actual behavior

Sometimes the bucket bucketName also contains copies for which writer.Close failed with a context deadline exceeded error. The issue seems to be triggered only when the context timeout (uploadTimeout) is set to a value that allows several copies to finish uploading before the context expires, while other uploads fail due to context expiration during writer.Close. For instance, in my local environment, the issue appears with uploadTimeout set to 33 seconds while uploading 10 copies of an object with 5M random bytes. Note that the issue does not happen in every run of putObjectCopies in which some uploads fail and others succeed: I had to run the function multiple times before actually observing it, presumably because the context has to expire in the short window between the moment the object finishes being written and the moment writer.Close actually returns.
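To determine which copies actually landed, I list the bucket after each run and compare against the logged writer.Close failures. A minimal sketch; listObjects is just a hypothetical helper, and it needs one extra import, google.golang.org/api/iterator:

func listObjects(ctx context.Context, client *storage.Client, bucketName string) {
	// Iterate over every object currently in the bucket and print its name.
	it := client.Bucket(bucketName).Objects(ctx, nil)
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalln("Error listing bucket:", err)
		}
		fmt.Println(attrs.Name)
	}
}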

@nicholas-mainardi nicholas-mainardi added the triage me I really want to be triaged. label Jul 16, 2021
@nicholas-mainardi nicholas-mainardi changed the title packagename: short description of bug storage: object written to bucket even if Writer.Close fails because of context expiration Jul 16, 2021
@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Jul 17, 2021
@cojenco cojenco added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Jul 19, 2021
@cojenco
Contributor

cojenco commented Jul 19, 2021

Thank you @nicholas-mainardi for the detailed log. As you suspected, this is likely occurring due to the timeout happening between when the upload is finalized on the server side and when the response is received by the client.

There's a background goroutine that monitors whether a context is done, either from a timeout or an explicit cancellation, and it will always honor uploadTimeout. Therefore, it's difficult to prevent this from happening in marginal cases where the timeout falls within the short window mentioned above.

That said, it's surprising to see this happen as often as your report suggests. We're looking into the library with your code to make sure cancellation is honored consistently, and checking with the team whether there are ways to better mitigate this problem when using parallel composite uploads. Thanks!

@tritone
Contributor

tritone commented Jul 21, 2021

I'll add that based on your use case, a reasonable workaround might be:

  • Try to upload a chunk using ifGenerationMatch=0.
  • If the upload fails with a context timeout, try it again, still using ifGenerationMatch=0.
  • If the retry fails with a 412 error, you know that the first upload actually succeeded and the chunk is already present in the bucket (see the sketch after this list).
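A minimal sketch of that workaround, assuming the client, bucket, and per-attempt timeout setup from the reproduction code above. uploadChunkOnce and uploadChunkWithRetry are hypothetical helper names, and the extra imports are errors, net/http, and google.golang.org/api/googleapi:

// DoesNotExist: true maps to the ifGenerationMatch=0 precondition.
func uploadChunkOnce(ctx context.Context, client *storage.Client,
	bucketName, objectKey string, data []byte) error {
	// Fresh per-attempt timeout, as in putCopy above.
	attemptCtx, cancel := context.WithTimeout(ctx, uploadTimeout)
	defer cancel()
	obj := client.Bucket(bucketName).Object(objectKey).
		If(storage.Conditions{DoesNotExist: true})
	w := obj.NewWriter(attemptCtx)
	if _, err := w.Write(data); err != nil {
		return err
	}
	return w.Close()
}

func uploadChunkWithRetry(ctx context.Context, client *storage.Client,
	bucketName, objectKey string, data []byte) error {
	err := uploadChunkOnce(ctx, client, bucketName, objectKey, data)
	// Depending on how the library wraps the error, inspecting the error
	// string may be needed instead of errors.Is.
	if err == nil || !errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	// The first attempt timed out; retry with the same precondition.
	retryErr := uploadChunkOnce(ctx, client, bucketName, objectKey, data)
	var apiErr *googleapi.Error
	if errors.As(retryErr, &apiErr) && apiErr.Code == http.StatusPreconditionFailed {
		// 412 Precondition Failed: the first upload actually succeeded
		// and the chunk is already in the bucket.
		return nil
	}
	return retryErr
}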

@nicholas-mainardi
Author

Thanks @tritone for the suggestion, that is definitely an interesting workaround for the context timeout issue.

Nonetheless, there is another case I ran into where the reported behavior around context cancellation may be troublesome: in my implementation, the context may also be canceled to abort the upload operation, and in that case I need to be sure that all the chunks are erased from the bucket. To this end, I tried adding delete operations for all chunks in case of context cancellation. However, there seems to be some slack time between when writer.Close returns and when the written object becomes visible in the bucket; as a result, an uploaded chunk sometimes appears in the bucket only after the delete has already run, and so it is never erased. The cleanup I tried looks roughly like the sketch below (a minimal sketch, not my actual code; cleanupChunks and the grace-period retry are hypothetical, and the retry is only a mitigation, not a guaranteed fix):

// cleanupCtx must be a fresh context, not the canceled upload context.
func cleanupChunks(cleanupCtx context.Context, client *storage.Client,
	bucketName string, chunkKeys []string) {
	bucket := client.Bucket(bucketName)
	for _, key := range chunkKeys {
		for attempt := 0; attempt < 3; attempt++ {
			err := bucket.Object(key).Delete(cleanupCtx)
			if err == nil {
				break // chunk deleted
			}
			// storage.ErrObjectNotExist may mean the chunk was never
			// written, or that it is not visible yet; wait and retry
			// so that late-appearing chunks are still cleaned up.
			time.Sleep(2 * time.Second)
		}
	}
}
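(The sketch above folds the delete retries into a single helper; in my real code the cleanup is interleaved with the per-chunk upload goroutines.)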

That being said, I understand that it is difficult to prevent such a race condition, but it seems to happen more often than expected. Thank you very much for the prompt replies and for checking with the team whether this issue can be made much less likely to occur.

@tritone
Contributor

tritone commented Feb 2, 2022

@nicholas-mainardi FYI, we found and fixed another bug in the generated library which may have been exacerbating this issue. See googleapis/google-api-go-client#1364 and googleapis/google-api-go-client#1359. Basically, there was a race condition in the check for cancellation that may have allowed requests to be sent after the client should have known that the context was canceled; this should now be fixed in the most recent storage release.

I'm going to close this issue for now, but feel free to re-open or open a new issue if you have further questions.

@tritone tritone closed this as completed Feb 2, 2022