Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deque steal race condition #726

Merged
merged 1 commit into from Jul 30, 2021
Merged

Conversation

kmaork
Copy link
Contributor

@kmaork kmaork commented Jul 30, 2021

No description provided.

Copy link
Member

@taiki-e taiki-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

@bors
Copy link
Contributor

bors bot commented Jul 30, 2021

Build succeeded:

@bors bors bot merged commit 3e72cde into crossbeam-rs:master Jul 30, 2021
@kmaork kmaork deleted the deque_race_fix branch July 30, 2021 22:50
.compare_exchange(f, f.wrapping_add(1), Ordering::SeqCst, Ordering::Relaxed)
.is_err()
// If the buffer has been swapped or the increment fails, we retry.
if self.inner.buffer.load(Ordering::Acquire, guard) != buffer
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why this can cause troubles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The details are in the discussion here, but it seems that it's not public currently. @taiki-e is it possible to make the discussion visible? I could also copy here the explanation that I wrote there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply! I see it says that a task may be popped twice. But I still does not understand why this is the case given the CAS below https://github.com/crossbeam-rs/crossbeam/pull/726/files#diff-43da9e89d19dacafed5e7cc0d1ac11867f059a529d82daa4fdf2c5e39243a849R640-R644.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taiki-e is it possible to make the discussion visible?

Sorry, I don't know how to make the discussion in the GitHub advisory public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @taiki-e, so I'm just quoting here:

The problem in the stealing functions is that we first load the pointer to the buffer, and then read a task from the buffer. Between loading the buffer and reading the task from the buffer, the buffer may have been swapped and we find ourselves reading stale data. But, if we verify that the buffer hasn't been swapped, the task is valid and we can increment/decrement the relevant atomic counter to steal the task. From this point onward we don't care if the buffer is swapped, because we already read a task from it and we're not planning to touch it again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between loading the buffer and reading the task from the buffer, the buffer may have been swapped and we find ourselves reading stale data.

Even the buffer may changed during this gap, old buffer and new buffer should have the the same value at position f. Em.. do we encounter a concrete case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was able to consistently recreate memory corruption with the old version of the code. Below is a more detailed explanation:

To exploit the bug, two things must happen:

  1. The buffer must be swapped during a steal.
  2. The data read from the old buffer must be different from the data at the same index in the new buffer.

Note that the buffer is both growable and shrinkable (shrinks when size <= capacity / 4). Therefore lots of pushing, popping and stealing will satisfy condition 1. For me it worked best with one pusher/popper thread and one stealer thread (as can be seen in the script above).

To satisfy condition 2 we need to empty the queue after the buffer had been swapped and then push (at least) one task. If that happens before the stealer finished reading the task, the stealer will "declare" it has stolen the newly pushed task, while it actually read an old task. Now the two threads had popped the same task (the newly pushed task is lost) and the program has entered an undefined state.

Note that both conditions are things that must happen between loading the buffer and reading from it in the stealer thread. For that reason the bug is very likely to occur when using the steal_batch and steal_batch_and_pop functions with a lifo queue, as these functions load the buffer once and repetitively read tasks from it. In the other cases, the likelihood might depend more heavily on the environment. For example, if many threads are running on the target system, the chances the OS will preempt a thread at the "right" time for an exploit grows.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will read this tomorrow. Thank you very much for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

3 participants