-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
census: Fix retry stats data race #8422
Closed
Closed
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Why do we need another atomic variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the first atomic variable
is doing something after the atomic check, and that thing can not be done atomically with the decrement. If another thread is using the same atomic variable, it will be broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it need to be? There's no critical section, so it seems "any time after decrementAndGet() == 0 must be safe."
What atomic variable? How would another thread be using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not safe when
callEnded()
reports RETRY_DELAY_PER_CALL, to be exaggerated adding a long sleep as follows:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A race on
retryDelayNanos
wouldn't have been detected by a race detector. The race detector noticed a race onstopwatch
. But why would there be a race onstopwatch
, sincestopwatch.elapsed()
is read-only? (After all, we could have just saved aSystem.nanoTime()
ourselves and that wouldn't change over time.)It looks like the
stopwatch.stop()
incallEnded()
shouldn't be there, as it adds no value and introduces a write that could interfere withattemptEnded()
. I agree that you've found an additional race though.Concerning updating
retryDelayNanos
inattemptEnded()
, I think we should just "not do that." Instead of taking the full RPC duration and subtracting out the time in each attempt, can we instead just add the individual time periods when there has been no attempt? That is, we update the atomic when a delay period ends instead of when it starts? It looks like that'd be as simple as:That has the advantage that the stopwatch won't be racy and will be exactly
0
if there were no retries.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The read/write operations inside the if-block
can happen concurrently with the write/read operations callEnded(). So the race detector noticed
stopwatch.elapsed()
first.I agree.
It works for retry but not for hedging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was basically getting to a part that "there has to be a bug." And that bug was we were doing a write with
stopwatch.stop()
.I just came here to mention that, as it just dawned on me. There are ways to fix that, but they wouldn't be trivial. I think the answer is "use a lock."
I'd be willing to go further moving "anything that is used for control flow" under the lock. So
attemptsPerCall
andcallEnded
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. The original atomic counter
activeStream
was already too complex. I was not using a lock in the origin implementation just because I had feeling that CensusStatsModule is performance sensitive (I saw it's usingAtomicReferenceFieldUpdater
s). Even it's viable to fix without a lock, I'd prefer using lock.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that too, for fixing the
isReady()
-halfClose race. Using a lock seems only possible solution in that case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplest possible solution is to make
MessageFramer.closed
volatile. Better though I think is to just stop usingframer().isClosed()
inisReady()
check.AbstractStream
(or the subclasses) can just keep their own volatile variable that is set whenhalfClose()
/close()
is called. In fact,AbstractClientStream
already has such a variable:AbstractClientStream.TransportState.outboundClosed
.