Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(storage): fix read-write race in Writer.Write #6817

Merged
merged 2 commits into from Oct 13, 2022

Conversation

dragonsinth
Copy link
Contributor

@dragonsinth dragonsinth commented Oct 6, 2022

#6816

w.monitorCancel() -> w.CloseWithError() reads the values of w.opened and w.pw concurrently with the calling routine initializing them; if the incoming context is cancelled before or during initialization.

The fix is simply to deploy starting the monitor until all fields are initialized. The older open() code from previous releases did this in the right order:

https://github.com/googleapis/google-cloud-go/blob/storage/v1.24.0/storage/writer.go#L130-L134

@dragonsinth dragonsinth requested review from a team as code owners October 6, 2022 18:56
@product-auto-label product-auto-label bot added size: xs Pull request size is extra small. api: storage Issues related to the Cloud Storage API. labels Oct 6, 2022
@dragonsinth
Copy link
Contributor Author

Note that the w.mu.Lock()/w.mu.Unlock() in monitorCancel() doesn't help because
a) it only guards w.err
b) openWriter() isn't holding the lock anyway

Copy link
Contributor

@tritone tritone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this and for the proposed fix!

I think the spot you have put go monitorCancel() is too late. In the old code you linked, we launch monitorCancel after opening the pipe but before launching the go routine that actually sends a request. This reduces the likelihood that we would send a request after the context has already been cancelled. In this PR that logic happens inside the OpenWriter call on L193.

I think the answer is to pass monitorCancel in with the openWriterParams and then launch it from within the transport-specific clients and then have a callback to set Writer.pw. I can provide more direction on how to do this if you like, or implement it myself. I'd also like to add your repro as a test -- we have another test that checks for a similar condition here but this one passes because the cancellation happens after the first write I believe.

@dragonsinth
Copy link
Contributor Author

dragonsinth commented Oct 10, 2022

I think the answer is to pass monitorCancel in with the openWriterParams and then launch it from within the transport-specific clients and then have a callback to set Writer.pw. I can provide more direction on how to do this if you like, or implement it myself. I'd also like to add your repro as a test -- we have another test that checks for a similar condition here but this one passes because the cancellation happens after the first write I believe.

Huh, I'm unfamiliar with the code underneath tc.OpenWriter as that jumps to an interface, but on the surface I wonder if it's really worth breaking through all those layers of abstraction?

I can think of at least two less-invasive ideas that have the effect of avoiding initiating the call:

  1. Explicitly check context immediately before calling OpenWriter, e.g.:
	if err := w.ctx.Err(); err != nil {
		return err // short-circuit
	}
	w.pw, err = w.o.c.tc.OpenWriter(params, opts...)
	if err != nil {
		return err
	}
	w.opened = true
	go w.monitorCancel()

This will prevent the extra call in almost all cases.

  1. Pass w.ctx down to OpenWriter and have it respect context internally; this would be the more standard way of doing things, but a more invasive change.

@tritone
Copy link
Contributor

tritone commented Oct 13, 2022

@dragonsinth Thanks for the suggestions, agreed that those would both be less complicated. I think your option (1) seems like a good idea; option (2) seems like a little more complexity if the pipe has already been opened and we have to do cleanup that currently occurs in monitorCancel. If you want to make that change to this PR I would be happy to approve.

@dragonsinth
Copy link
Contributor Author

@dragonsinth Thanks for the suggestions, agreed that those would both be less complicated. I think your option (1) seems like a good idea; option (2) seems like a little more complexity if the pipe has already been opened and we have to do cleanup that currently occurs in monitorCancel. If you want to make that change to this PR I would be happy to approve.

Sweet, updated this PR.

@dragonsinth dragonsinth changed the title storage: storage.(*Writer).Write() with cancelled context causes write race storage: (*Writer).Write() with cancelled context causes read-write race Oct 13, 2022
@tritone tritone added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 13, 2022
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 13, 2022
@tritone tritone changed the title storage: (*Writer).Write() with cancelled context causes read-write race fix(storage): fix read-write race in Writer.Write Oct 13, 2022
Copy link
Contributor

@tritone tritone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'll add a test based on your example in the issue in a separate PR.

@tritone tritone merged commit 4766d3e into googleapis:main Oct 13, 2022
@dragonsinth
Copy link
Contributor Author

Awesome, thanks @tritone

@dragonsinth dragonsinth deleted the storagerace branch October 13, 2022 19:15
tritone added a commit to tritone/google-cloud-go that referenced this pull request Oct 13, 2022
Adds a test to detect a potential race condition that can arise
if the context is cancelled before writing to GCS. We already
have a test that detects a similar condition after we have
already started writing the object, but this test looks at the context
being cancelled before any writes occur.

When run with -race, this test fails in storage/v1.27.0, but it
passes when run with the fix in googleapis#6817.

Updates googleapis#6816
tritone added a commit that referenced this pull request Oct 17, 2022
Adds a test to detect a potential race condition that can arise
if the context is cancelled before writing to GCS. We already
have a test that detects a similar condition after we have
already started writing the object, but this test looks at the context
being cancelled before any writes occur.

When run with -race, this test fails in storage/v1.27.0, but it
passes when run with the fix in #6817.

Updates #6816
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. size: xs Pull request size is extra small.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants