sync: fix racy `UnsafeCell` access on a closed oneshot #4226

hawkw · 2021-11-12T18:36:07Z

Motivation

Issue #4225 describes a potential race condition in the oneshot
channel in tokio-sync. The race occurs when the oneshot is closed, and
then a thread calls oneshot::Sender::send while another thread is
calling oneshot::Receiver::try_recv or awaiting the reciever. The call
to send puts the sent value in the cell. Then, it sets the
VALUE_SENT (completed) bit, returning a snapshot of the state. If the
state is closed, the sender takes the sent value back out of the cell.
Receiving from the channel also loads the state. The reciever first
checks if the channel is completed, and takes the value out of the cell
if the completed bit is set. This is racy, because if a call to recv
loads the state after the completed bit has been set, it will no longer
treat the channel as closed, and will try to take the value out of the
cell. This results in both threads accessing the cell concurrently.

Solution

This branch adds two loom tests reproducing the race, for try_recv
and for awaiting a Receiver. Both of these tests fail with a causality
violation when accessing the UnsafeCell on master.

I fixed the race condition by changing State::set_completed to not set
the VALUE_SENT bit if the CLOSED bit has already been set. This way,
if the channel is closed before send is called, we won't set the bit
that tells the receiver that it's okay to access the UnsafeCell, and
we the sender can take the value back out without the risk of the
receiver also concurrently accessing the UnsafeCell.

This required changing State::set_completed from a simple fetch-or to
a compare-and-swap loop, which is slightly a bummer, as it's less
efficient than a single fetch-or. However, as far as I can tell,
changing this to a CAS loop is the only solution to the race that
doesn't also change the oneshot's behavior. We currently have tests
asserting that if the channel is closed after a value is sent, the
value is still received. A simpler solution to this problem would be
changing poll_recv and try_recv to check if the channel is closed
first, and only check if the channel is completed and take the value
out if it is not closed. This fixes the race, since the receiver will no
longer try to access the UnsafeCell if the channel has been closed.
But, this would break the existing tests asserting that calling close
after a value has been sent still results in the value being received.
Changing that behavior seems undesirable, so I thought this was worth
the CAS loop.

Fixes: #4225

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

This fixes the race. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

hawkw · 2021-11-12T18:38:45Z

tokio/src/sync/oneshot.rs

+        loop {
+            if State(state).is_closed() {
+                break;
+            }


I also considered writing this as

Suggested change

loop {

if State(state).is_closed() {

break;

}

while !State(state).is_closed() {

but i had a vague memory of @carllerche saying that he doesn't like the use of while loops for compare-and-swap loops... :)

I think the way it's written here with a loop is perfectly readable, so let's just keep it.

Darksonn

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code. What you are doing here sounds reasonable, but I'd need to read the entire implementation to be confident that it solves the problem.

Darksonn · 2021-11-12T18:54:13Z

tokio/src/sync/oneshot.rs

+        loop {
+            if State(state).is_closed() {
+                break;
+            }


I think the way it's written here with a loop is perfectly readable, so let's just keep it.

hawkw · 2021-11-12T19:05:02Z

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code.

Yeah, I was also thinking about trying to clean up more of the existing implementation (maybe in a follow-up...). But, I'm pretty confident that this change fixes the specific crash in question.

Darksonn · 2021-11-12T19:27:06Z

tokio/src/sync/oneshot.rs

+        // `poll_recv` or `try_recv` call is occurring concurrently, both
+        // threads may try to access the `UnsafeCell` if we were to set the
+        // `VALUE_SENT` bit on a closed channel.
+        let mut state = cell.load(Ordering::Relaxed);


This seems safer.

Suggested change

let mut state = cell.load(Ordering::Relaxed);

let mut state = cell.load(Ordering::Acquire);

I don't think it's necessary; the compare_exchange has an Acquire failure ordering, so when we actually go to set the value, we will always perform an Acquire operation, and if the initial Relaxed load is stale, the CAS will fail, and we'll loop again with a value that was accessed with an Acquire ordering. So, making the initial load Relaxed really just serves to avoid an unnecessary Acquire in the case that we're the only thread writing to the state; if it's contended and the Relaxed load is stale, we will always perform an Acquire on subsequent iterations.

Other CAS loops elsewhere in Tokio follow a similar pattern:

tokio/tokio/src/time/driver/entry.rs

Lines 167 to 192 in 03969cd

// Quick initial debug check to see if the timer is already fired. Since

// firing the timer can only happen with the driver lock held, we know

// we shouldn't be able to "miss" a transition to a fired state, even

// with relaxed ordering.

let mut cur_state = self.state.load(Ordering::Relaxed);

loop {

debug_assert!(cur_state < STATE_MIN_VALUE);

if cur_state > not_after {

break Err(cur_state);

}

match self.state.compare_exchange(

cur_state,

STATE_PENDING_FIRE,

Ordering::AcqRel,

Ordering::Acquire,

) {

Ok(_) => {

break Ok(());

}

Err(actual_state) => {

cur_state = actual_state;

}

}

However, I'm happy to change this if you prefer.

hawkw · 2021-11-12T22:00:20Z

As a note, I can run my repro from #4225 (comment) with this branch for over 40 minutes without seeing any segfaults or mangled strings.

hawkw · 2021-11-13T00:23:31Z

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code.

Yeah, I was also thinking about trying to clean up more of the existing implementation (maybe in a follow-up...). But, I'm pretty confident that this change fixes the specific crash in question.

#4229 should help make some of the oneshot implementation's invariants a bit clearer, I hope!

Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

# 1.14.0 (November 15, 2021) ### Fixed - macros: fix compiler errors when using `mut` patterns in `select!` ([#4211]) - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) - sync: make `AtomicWaker` panic safe ([#3689]) - runtime: fix basic scheduler dropping tasks outside a runtime context ([#4213]) ### Added - stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223]) ### Changed - io: updated `copy` buffer size to match `std::io::copy` ([#4209]) ### Documented - io: rename buffer to file in doc-test ([#4230]) - sync: fix Notify example ([#4212]) [#4211]: #4211 [#4226]: #4226 [#3689]: #3689 [#4213]: #4213 [#4179]: #4179 [#4223]: #4223 [#4209]: #4209 [#4230]: #4230 [#4212]: #4212

Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

# 1.13.1 (November 15, 2021) ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) [#4226]: #4226

Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

# 1.13.1 (November 15, 2021) ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) [#4226]: #4226

Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

# 1.8.4 (November 15, 2021) This release backports a bug fix from 1.13.1. ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) [#4226]: #4226

# 1.14.0 (November 15, 2021) ### Fixed - macros: fix compiler errors when using `mut` patterns in `select!` ([#4211]) - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) - sync: make `AtomicWaker` panic safe ([#3689]) - runtime: fix basic scheduler dropping tasks outside a runtime context ([#4213]) ### Added - stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223]) ### Changed - io: updated `copy` buffer size to match `std::io::copy` ([#4209]) ### Documented - io: rename buffer to file in doc-test ([#4230]) - sync: fix Notify example ([#4212]) [#4211]: #4211 [#4226]: #4226 [#3689]: #3689 [#4213]: #4213 [#4179]: #4179 [#4223]: #4223 [#4209]: #4209 [#4230]: #4230 [#4212]: #4212

Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

# 1.8.4 (November 15, 2021) This release backports a bug fix from 1.13.1. ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) [#4226]: #4226

A data race condition was discovered in tokio, which can lead to memory corruption. This vulnerability affects our fork. See: - https://rustsec.org/advisories/RUSTSEC-2021-0124 - tokio-rs/tokio#4225 - tokio-rs/tokio#4226 - fengalin/tokio#1 Fixes: https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs/-/issues/174

cynecx · 2021-11-17T18:56:27Z

tokio/src/sync/oneshot.rs

+            }
+            // TODO: This could be `Release`, followed by an `Acquire` fence *if*
+            // the `RX_TASK_SET` flag is set. However, `loom` does not support
+            // fences yet.


I think loom has partial support for fences iiuc (tokio-rs/loom#220), so technically this comment isn't correct anymore :D.

Yeah, I think there might be a couple other places in Tokio where we also can implement some things nicer now that loom supports fences. I didn't want to actually change that aspect of the implementation in the bugfix branch, though. Might be worth doing in a subsequent one!

# 1.8.4 (November 15, 2021) This release backports a bug fix from 1.13.1. ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed (tokio-rs#4226)

Arnavion · 2022-01-04T19:40:10Z

Is there any plan to backport this to 0.1.x ?

Darksonn · 2022-01-04T20:18:47Z

We don't have any plans, no.

Arnavion · 2022-01-04T20:27:54Z

Okay, thanks.

hawkw added 8 commits November 12, 2021 09:42

loom test for #4225

844dc9b

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

oneshot: prioritize closed over complete

c8f82d4

This fixes the race. Signed-off-by: Eliza Weisman <eliza@buoyant.io>

add a similar test for poll_recv

5ae003c

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

also prioritize closed in poll_recv

a6b60ed

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

remove dbgs

54b740d

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

fix test failing when the channel's closed

2324301

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

restore prev behavior: honor completed over closed

ef15fba

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

add comments in set_complete

897e71e

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

hawkw added C-bug Category: This is a bug. M-sync Module: tokio/sync I-crash Problems and improvements related to program crashes/panics. R-loom Run loom tests on this PR labels Nov 12, 2021

hawkw requested review from carllerche and Darksonn November 12, 2021 18:36

hawkw self-assigned this Nov 12, 2021

github-actions bot removed the R-loom Run loom tests on this PR label Nov 12, 2021

hawkw commented Nov 12, 2021

View reviewed changes

Darksonn added the R-loom Run loom tests on this PR label Nov 12, 2021

Darksonn reviewed Nov 12, 2021

View reviewed changes

hawkw requested a review from Darksonn November 12, 2021 20:32

hawkw mentioned this pull request Nov 13, 2021

oneshot: document UnsafeCell invariants #4229

Merged

Darksonn approved these changes Nov 14, 2021

View reviewed changes

Darksonn merged commit 4b78ed4 into master Nov 14, 2021

Darksonn deleted the eliza/oneshot-race branch November 14, 2021 17:03

hawkw mentioned this pull request Nov 15, 2021

chore: prepare Tokio v1.14.0 #4234

Merged

hawkw added a commit that referenced this pull request Nov 15, 2021

sync: fix racy UnsafeCell access on a closed oneshot (#4226)

1774831

hawkw added a commit that referenced this pull request Nov 15, 2021

chore: prepare Tokio v1.13.1

338fed8

# 1.13.1 (November 15, 2021) ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) [#4226]: #4226

hawkw mentioned this pull request Nov 15, 2021

chore: prepare Tokio v1.13.1 #4235

Merged

hawkw added a commit that referenced this pull request Nov 15, 2021

sync: fix racy UnsafeCell access on a closed oneshot (#4226)

ab0e60d

hawkw added a commit that referenced this pull request Nov 16, 2021

sync: fix racy UnsafeCell access on a closed oneshot (#4226)

0c83c37

hawkw mentioned this pull request Nov 16, 2021

chore: prepare Tokio v1.8.4 #4237

Merged

hawkw added a commit that referenced this pull request Nov 16, 2021

sync: fix racy UnsafeCell access on a closed oneshot (#4226)

e0cf906

fengalin pushed a commit to fengalin/tokio that referenced this pull request Nov 17, 2021

sync: fix racy UnsafeCell access on a closed oneshot (tokio-rs#4226)

e6756d0

fengalin mentioned this pull request Nov 17, 2021

Apply fix for RUSTSEC-2021-0124 fengalin/tokio#1

Merged

fengalin pushed a commit to fengalin/tokio that referenced this pull request Nov 17, 2021

sync: fix racy UnsafeCell access on a closed oneshot (tokio-rs#4226)

4b68481

cynecx reviewed Nov 17, 2021

View reviewed changes

wyhgoodjob mentioned this pull request Dec 31, 2021

CVE-2021-45710 issue in actix-web's dependency - tokio actix/actix-net#432

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync: fix racy `UnsafeCell` access on a closed oneshot #4226

sync: fix racy `UnsafeCell` access on a closed oneshot #4226

hawkw commented Nov 12, 2021

hawkw Nov 12, 2021

Darksonn Nov 12, 2021

Darksonn left a comment

Darksonn Nov 12, 2021

hawkw commented Nov 12, 2021

Darksonn Nov 12, 2021

hawkw Nov 12, 2021

hawkw commented Nov 12, 2021

hawkw commented Nov 13, 2021

cynecx Nov 17, 2021 •

edited

hawkw Nov 17, 2021

Arnavion commented Jan 4, 2022

Darksonn commented Jan 4, 2022

Arnavion commented Jan 4, 2022

	let mut state = cell.load(Ordering::Relaxed);
	let mut state = cell.load(Ordering::Acquire);

	// Quick initial debug check to see if the timer is already fired. Since
	// firing the timer can only happen with the driver lock held, we know
	// we shouldn't be able to "miss" a transition to a fired state, even
	// with relaxed ordering.
	let mut cur_state = self.state.load(Ordering::Relaxed);

	loop {
	debug_assert!(cur_state < STATE_MIN_VALUE);

	if cur_state > not_after {
	break Err(cur_state);
	}

	match self.state.compare_exchange(
	cur_state,
	STATE_PENDING_FIRE,
	Ordering::AcqRel,
	Ordering::Acquire,
	) {
	Ok(_) => {
	break Ok(());
	}
	Err(actual_state) => {
	cur_state = actual_state;
	}
	}

sync: fix racy UnsafeCell access on a closed oneshot #4226

sync: fix racy UnsafeCell access on a closed oneshot #4226

Conversation

hawkw commented Nov 12, 2021

Motivation

Solution

hawkw Nov 12, 2021

Choose a reason for hiding this comment

Darksonn Nov 12, 2021

Choose a reason for hiding this comment

Darksonn left a comment

Choose a reason for hiding this comment

Darksonn Nov 12, 2021

Choose a reason for hiding this comment

hawkw commented Nov 12, 2021

Darksonn Nov 12, 2021

Choose a reason for hiding this comment

hawkw Nov 12, 2021

Choose a reason for hiding this comment

hawkw commented Nov 12, 2021

hawkw commented Nov 13, 2021

cynecx Nov 17, 2021 • edited

Choose a reason for hiding this comment

hawkw Nov 17, 2021

Choose a reason for hiding this comment

Arnavion commented Jan 4, 2022

Darksonn commented Jan 4, 2022

Arnavion commented Jan 4, 2022

sync: fix racy `UnsafeCell` access on a closed oneshot #4226

sync: fix racy `UnsafeCell` access on a closed oneshot #4226

cynecx Nov 17, 2021 •

edited