Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watcher::send can deadlock #4246

Closed
leoyvens opened this issue Nov 18, 2021 · 11 comments
Closed

Watcher::send can deadlock #4246

leoyvens opened this issue Nov 18, 2021 · 11 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-sync Module: tokio/sync

Comments

@leoyvens
Copy link

leoyvens commented Nov 18, 2021

Version
1.13, this also reproduced in earlier versions but 1.13 seemed to make it more frequent.

Platform
Debian buster-slim

Description
watch::Sender:::send can deadlock. Unfourtunately I do not have a reproduction case as I've only observed it in our production environment, where there are hundreds of receivers for the watcher, and the system is generally under a heavy load in terms of number of tasks and overall number of threads. It deadlocks only after the system has been running for a few hours, so it must be some form of timing condition. But I know for sure that it is deadlocking because a log was inserted immediately before and immediately after the call to send, and the log after is never printed while all other tasks continue to make progress. The only related report I could find in this repo is #3661. Perhaps relevant is that the tasks woken up by the watcher may clone the receiver of that same watcher and call Receiver::changed on it.

@leoyvens leoyvens added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Nov 18, 2021
@Darksonn Darksonn added the M-sync Module: tokio/sync label Nov 18, 2021
@Darksonn
Copy link
Contributor

Well, the watch channel is internally using an RwLock, which is locked inside watch::Sender::send. The same RwLock is locked when you are holding an watch::Ref, which is the type returned by the borrow and borrow_and_update methods. Hence, if you are deadlocking, that is probably because the same thread is also holding an watch::Ref object.

How could this happen? Well, the obvious way is if the same task calls borrow first and then sends, but if it is a rare event, then that doesn't sound like the situation in your case. The other way it can happen is if one task holds the Ref object alive while performing an .await, since your task might yield at that .await, and the other task that sends might be scheduled on the same thread while it has yielded.

That said, the above mistake is normally not possible to hit — the Ref type is not Send, so any attempt to keep a Ref alive across an .await will fail to compile if the task is spawned with tokio::spawn. However, there are some cases where it can happen without the compiler complaining: when tasks are spawned on a LocalSet, or if the tasks are running in a block_on call.

Does any of this ring a bell for your application?

@leoyvens
Copy link
Author

Hi @Darksonn, thank you for your support. borrow and borrow_for_update are not used anywhere in the application.

@Darksonn
Copy link
Contributor

I really don't see how send can block if it's not on the RwLock. Can you please try figuring out which statement in the send_replace method is causing the deadlock? E.g. you can do this by using a modified Tokio with [patch.crates-io] as described here and adding logging around each statement.

@leoyvens
Copy link
Author

I can try to. From reading #3661, is it possible that notify_waiters() is blocking?

@Darksonn
Copy link
Contributor

Well, it's not impossible that there's some problem in it, but the specific issue you have mentioned can only be related if its a livelock, and not a deadlock, and such a livelock would have to involve other threads calling changed() so often that the send call is busy-looping on waking up the new calls to changed() since the last iteration of the loop, which I find unlikely.

@leoyvens
Copy link
Author

leoyvens commented Nov 19, 2021

That does seems unlikely. If relevant, we call changed() from a task::unconstrained run with Handle::block_on on a dedicated OS thread.

@Darksonn
Copy link
Contributor

Were you able to determine which line it gets stuck at?

@arnauorriols
Copy link

Apologies in advance for hijacking this issue. I landed here because I did myself find in the situation described by @Darksonn (#4246 (comment)), where holding the borrow across an await lead to a deadlock.

I find it a bit counter-intuitive that a sync primitive offered by tokio is in fact not suitable for being held across await. I understand the performance implications and the usual recommendation to use std sync unless necessary, and I also realize, in hindsight, that the documented RwLock couldn't possibly be a tokio::sync::RwLock since borrow() is not async. But nonetheless, this kinda breaks the implicit expectation that Tokio primitives are fully compatible with async code, and so you can lower your guard when using them.

I'm confident my use case is reasonable: I have the sender spawned into a separate Task, while the receiver is in the main thread, thus not needing to be Send. Also, in my particular case, I was using the lock logically to avoid the sender from changing the value while I did some async operations on it.

Ideally I would recommend switching to a tokio::sync::RwLock but I imagine that's too much of a breaking change. I do think that at least it should be warned in the documentation of borrow that the Ref cannot be held across await, and showing an example on how to use a separate tokio::sync::RwLock to synchronize Sender and Receiver if it needs to be used across awaits.

@Darksonn
Copy link
Contributor

Darksonn commented Jun 2, 2022

Yes, it's unfortunate that the watch channel can be used incorrectly in spawn_local. It's not possible to fix it, unfortunately. I would be happy to see more documentation about it.

I was using the lock logically to avoid the sender from changing the value while I did some async operations on it.

That is not a supported use-case of the watch channel. If you need to block the value from changing, then you can't use a watch channel. Consider cloning the value instead so that you keep the old value even when a new one is sent.

@leoyvens
Copy link
Author

leoyvens commented Jun 3, 2022

On the original issue, I haven't been able to gather more info, so feel free to close this as unactionable.

@Darksonn
Copy link
Contributor

Darksonn commented Jun 4, 2022

Okay. I'll close this as unactionable.

I opened #4741 for the warning about usage in non-Send code.

@Darksonn Darksonn closed this as completed Jun 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-sync Module: tokio/sync
Projects
None yet
Development

No branches or pull requests

3 participants