-
Notifications
You must be signed in to change notification settings - Fork 484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deque steal race condition #726
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
bors r+
Build succeeded: |
.compare_exchange(f, f.wrapping_add(1), Ordering::SeqCst, Ordering::Relaxed) | ||
.is_err() | ||
// If the buffer has been swapped or the increment fails, we retry. | ||
if self.inner.buffer.load(Ordering::Acquire, guard) != buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why this can cause troubles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply! I see it says that a task may be popped twice. But I still does not understand why this is the case given the CAS below https://github.com/crossbeam-rs/crossbeam/pull/726/files#diff-43da9e89d19dacafed5e7cc0d1ac11867f059a529d82daa4fdf2c5e39243a849R640-R644.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taiki-e is it possible to make the discussion visible?
Sorry, I don't know how to make the discussion in the GitHub advisory public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @taiki-e, so I'm just quoting here:
The problem in the stealing functions is that we first load the pointer to the buffer, and then read a task from the buffer. Between loading the buffer and reading the task from the buffer, the buffer may have been swapped and we find ourselves reading stale data. But, if we verify that the buffer hasn't been swapped, the task is valid and we can increment/decrement the relevant atomic counter to steal the task. From this point onward we don't care if the buffer is swapped, because we already read a task from it and we're not planning to touch it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between loading the buffer and reading the task from the buffer, the buffer may have been swapped and we find ourselves reading stale data.
Even the buffer may changed during this gap, old buffer and new buffer should have the the same value at position f
. Em.. do we encounter a concrete case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was able to consistently recreate memory corruption with the old version of the code. Below is a more detailed explanation:
To exploit the bug, two things must happen:
- The buffer must be swapped during a steal.
- The data read from the old buffer must be different from the data at the same index in the new buffer.
Note that the buffer is both growable and shrinkable (shrinks when size <= capacity / 4). Therefore lots of pushing, popping and stealing will satisfy condition 1. For me it worked best with one pusher/popper thread and one stealer thread (as can be seen in the script above).
To satisfy condition 2 we need to empty the queue after the buffer had been swapped and then push (at least) one task. If that happens before the stealer finished reading the task, the stealer will "declare" it has stolen the newly pushed task, while it actually read an old task. Now the two threads had popped the same task (the newly pushed task is lost) and the program has entered an undefined state.
Note that both conditions are things that must happen between loading the buffer and reading from it in the stealer thread. For that reason the bug is very likely to occur when using the
steal_batch
andsteal_batch_and_pop
functions with a lifo queue, as these functions load the buffer once and repetitively read tasks from it. In the other cases, the likelihood might depend more heavily on the environment. For example, if many threads are running on the target system, the chances the OS will preempt a thread at the "right" time for an exploit grows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will read this tomorrow. Thank you very much for this.
No description provided.