Dropping threaded runtime with time and IO enabled results in memory leaks #2535
How reliably do you reproduce this? I have attempted to run the example but have not seen that error. |
One potential problem that I do see is, unless we collect all the join handles of spawned threads, there is potential for valgrind to complain. |
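For illustration, a minimal sketch of what collecting the join handles looks like (plain std threads, made up for this point, not tokio's internals). Without the final join loop, valgrind may flag the still-running threads' stacks and thread-local storage as leaked:

use std::thread;

fn main() {
    // Keep every JoinHandle instead of letting the threads detach.
    let handles: Vec<thread::JoinHandle<()>> = (0..4)
        .map(|i| thread::spawn(move || println!("worker {} done", i)))
        .collect();

    // Joining guarantees each thread has fully exited and released its
    // resources before the process terminates, so valgrind stays quiet.
    for handle in handles {
        handle.join().unwrap();
    }
}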
This is reliably reproducible for me. My colleague also reports the same results on his machine.
I wonder what setup you have in terms of software versions (assuming you're running x86_64)? |
It's going to be a racy repro. I'm running it in a VirtualBox VM. |
Out of curiosity, if you do the following, does the leak still happen?

fn main() {
    let rt = tokio::runtime::Builder::new()
        .threaded_scheduler()
        .max_threads(1)
        .enable_time()
        .enable_io()
        .build()
        .unwrap();
    rt.spawn(my_loop());
    std::thread::sleep(std::time::Duration::from_secs(1));
    // New code here
    drop(rt);
    std::thread::sleep(std::time::Duration::from_secs(1));
} |
It is indeed racy. Now I only get a leak in 50% of the cases. |
Seems to be related to #1830. |
I can reproduce this locally. I don't see how it is related to #1830; you don't even spawn tasks anywhere. |
Eh? |
Ah, I can't read. Anyway, it is not related. The runtime should drop running tasks when dropped. |
If folks are still having trouble, this reproduces really reliably for me with the rust-rdkafka test suite. The only caveat is that there is a dependency on Docker/Docker Compose to get Kafka running.

git clone https://github.com/fede1024/rust-rdkafka.git
cd rust-rdkafka
docker-compose up -d
cargo test --no-run
valgrind --error-exitcode=100 --leak-check=full target/debug/test_high_consumers-* --nocapture --test-threads=1 |
It is unclear to me if this is an actual bug or an unfortunate race. One way to check would be to track all spawned threads and join them before the process exits. If valgrind still complains after, then something else is going on. |
I tried your suggestion @carllerche and Valgrind still complained. And indeed, if it were just a shutdown race, I'd really expect the code sample you posted above not to exhibit the race:

fn main() {
    let rt = tokio::runtime::Builder::new()
        .threaded_scheduler()
        .max_threads(1)
        .enable_time()
        .enable_io()
        .build()
        .unwrap();
    rt.spawn(my_loop());
    std::thread::sleep(std::time::Duration::from_secs(1));
    // New code here
    drop(rt);
    std::thread::sleep(std::time::Duration::from_secs(1));
}

And yet it does, with quite a bit of regularity. As best as I can tell, an |
It seems like there might be a reference counting cycle. I'm not very familiar with this code, and there are more than a few layers of indirection, so it's hard for me to investigate much further. I wanted to try telling the runtime to drain the |
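As an illustration of the kind of cycle being suspected, here is a minimal sketch (Worker and Task are hypothetical stand-ins, not tokio's actual types). Two Arcs that point at each other never drop their strong counts to zero, so neither allocation is ever freed:

use std::sync::{Arc, Mutex};

struct Worker {
    // The worker holds a strong reference to the task...
    task: Mutex<Option<Arc<Task>>>,
}

struct Task {
    // ...and the task holds a strong reference back to the worker.
    scheduler: Mutex<Option<Arc<Worker>>>,
}

fn main() {
    let worker = Arc::new(Worker { task: Mutex::new(None) });
    let task = Arc::new(Task {
        scheduler: Mutex::new(Some(worker.clone())),
    });
    *worker.task.lock().unwrap() = Some(task.clone());

    // Breaking the cycle by clearing one side would allow both to be
    // freed; without it, each Arc's strong count stays at 1 after the
    // local handles are dropped, and valgrind reports a leak:
    // *task.scheduler.lock().unwrap() = None;
}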
Thanks for the investigation. I will try to dig in some shortly and will report back what I find. |
Awesome, thanks @carllerche! One thing that occurs to me: when you were attempting to repro initially, were you running Valgrind with |
More clues! This patch "fixes" the leak.

diff --git a/tokio/src/runtime/task/core.rs b/tokio/src/runtime/task/core.rs
index f4756c23..08d0fe83 100644
--- a/tokio/src/runtime/task/core.rs
+++ b/tokio/src/runtime/task/core.rs
@@ -257,11 +257,15 @@ impl<T: Future, S: Schedule> Core<T, S> {
         let task = ManuallyDrop::new(task);
-        self.scheduler.with(|ptr| {
+        self.scheduler.with_mut(|ptr| {
             // Safety: Can only be called after initial `poll`, which is the
             // only time the field is mutated.
             match unsafe { &*ptr } {
-                Some(scheduler) => scheduler.release(&*task),
+                Some(scheduler) => {
+                    let out = scheduler.release(&*task);
+                    unsafe { *ptr = None };
+                    out
+                }
                 // Task was never polled
                 None => None,
             }

This ensures that when the worker shuts down and releases all associated tasks, those tasks no longer hold a reference to the worker. That way, even if the time driver is holding a reference to one of those tasks, that task doesn't keep the worker alive. I say "fixes" because there are very clearly complicated safety requirements around mutating the |
Sorry for the delay, I was setting up a new Linux box and now I can repro 💯 |
I can also see that your patch fixes the memory leak. Unfortunately, it is not thread safe. The scheduler arc should be dropped when the task is dropped, but it seems like this is not happening. I will try to dig into why the leak is happening. |
Yep, definitely didn’t expect that patch to be mergeable as is. Just thought it might prove a helpful clue. I didn’t see any other quick way to break what seems to be a circular dependency between the worker shared state and the time driver. I’ve no doubt you have some better ideas on that front than I.
|
The problem is that the time driver holds wakers for the task, but we currently don't force the time driver to purge them on drop. |
More specifically, the time driver handles this problem by using a weak reference in the unpark handle. However, the threaded runtime uses a custom Parker which does not. Fixing this will probably require a bit of a cleanup of this subsystem. |
In threaded runtime, the unparker now owns a weak reference to the inner data. This breaks the cycle of Arc and properly releases the io driver and its worker threads.
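A simplified sketch of the idea behind that fix (illustrative only, not the actual tokio code; Inner and Unparker are stand-ins). Because the unpark handle holds only a Weak, a late unpark after shutdown becomes a no-op instead of keeping the shared state, and with it the io driver, alive:

use std::sync::{Arc, Weak};

// Stand-in for the parker's shared state (condvar, driver handle, ...).
struct Inner {}

impl Inner {
    fn wake(&self) {
        println!("waking the parked worker thread");
    }
}

struct Unparker {
    // Weak instead of Arc: the unparker no longer participates in the
    // reference cycle that kept the runtime's internals alive.
    inner: Weak<Inner>,
}

impl Unparker {
    fn unpark(&self) {
        // If the runtime has been dropped, upgrade() returns None and
        // this call does nothing rather than extend Inner's lifetime.
        if let Some(inner) = self.inner.upgrade() {
            inner.wake();
        }
    }
}

fn main() {
    let inner = Arc::new(Inner {});
    let unparker = Unparker { inner: Arc::downgrade(&inner) };
    unparker.unpark(); // wakes: the runtime is still alive
    drop(inner);       // runtime shut down; Inner is freed immediately
    unparker.unpark(); // no-op: nothing is kept alive
}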
Looking at this with tokio 1.0.1 and an updated reproducer, there are no more leaks reported.

Updated reproducer:

use tokio::time::sleep;

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .enable_time()
        .enable_io()
        .build()
        .unwrap();
    rt.spawn(my_loop());
    std::thread::sleep(std::time::Duration::from_secs(1));
    /*
    drop(rt);
    std::thread::sleep(std::time::Duration::from_secs(1));
    */
}

async fn my_loop() {
    println!("I'm alive!");
    loop {
        sleep(std::time::Duration::from_millis(100)).await;
    }
}
Did not look into the implementation(s), just stumbled upon the issue and wanted to see if this was accidentally left open. |
Agreed, this issue is also confirmed resolved for me in rust-rdkafka with Tokio v1.0.1. |
Great, closing 👍 |
Version
Platform
Description
This code:
When run under Valgrind produces the following:
From the description of Runtime.shutdown_timeout():
I was expecting the runtime to either:
- cancel my_loop() at the next await and drop it, or
- wait for my_loop() to be ready
..and then to terminate cleanly on Drop (contrary to what Runtime.shutdown_timeout() might do when the timeout expires).
Instead, the runtime wasn't terminated cleanly for some reason.
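For contrast with the Drop behavior described above, here is a sketch of the explicit-shutdown path (written against the tokio 1.x API used by the updated reproducer; assumes the full feature set is enabled):

use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.spawn(async {
        loop {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });
    std::thread::sleep(Duration::from_secs(1));
    // shutdown_timeout consumes the runtime and waits up to the given
    // duration for background work to stop before returning.
    rt.shutdown_timeout(Duration::from_secs(1));
}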