
Deadlock on drop of reqwest::blocking::Response sometimes #746

Closed
karlri opened this issue Dec 19, 2019 · 21 comments · Fixed by tokio-rs/tokio#2062
Labels
bug (This is bug!) · upstream (Blocked: upstream. Depends on a dependency to make a change first.)

Comments

@karlri

karlri commented Dec 19, 2019

Sometimes, when a reqwest::blocking::Response is dropped, it appears to deadlock and the drop call never returns. On my setup, it works roughly 70% of the time. Tested on the latest master code as of right now.

fn main() {
    for _ in 0..10 {
        may_deadlock();
    }
}

fn may_deadlock() {
    let stream_url = "https://live-bauerse-fm.sharp-stream.com/nrj_instreamtest_se_mp3";

    let mut stream = {
        // setup
        let mut builder = reqwest::blocking::Client::builder();
        // note, I used a proxy:
        // builder = builder.proxy(reqwest::Proxy::all("----").unwrap());
        // get the stream
        builder = builder.connect_timeout(std::time::Duration::from_secs(2));
        builder = builder.timeout(std::time::Duration::from_secs(2));
        let client = builder.build().unwrap();
        let builder = client.get(stream_url);
        builder.send().unwrap()
    };

    println!("ok, reading some data!");
    let mut buffer = [0; 1400];
    use std::io::Read;
    stream.read(&mut buffer).unwrap();
    println!("attempting to drop stream. This will deadlock SOMETIMES...");
    std::mem::drop(stream);
    println!("we get here as expected ~70% of the time");
}

ok, reading some data!
attempting to drop stream. This will deadlock SOMETIMES...
we get here as expected ~70% of the time
ok, reading some data!
attempting to drop stream. This will deadlock SOMETIMES...
we get here as expected ~70% of the time
ok, reading some data!
attempting to drop stream. This will deadlock SOMETIMES...
we get here as expected ~70% of the time
ok, reading some data!
attempting to drop stream. This will deadlock SOMETIMES...

then, stuck.

@seanmonstar
Owner

Interesting... I noticed the blocking client is using futures-channel, which may still have a race condition when closing (rust-lang/futures-rs#909). Would you be able to check if #748 fixes this for you?

@karlri
Author

karlri commented Dec 20, 2019

It indeed looks like the problem is resolved! Thanks for the quick fix!
Edit: Sorry, spoke too soon; see below.

@karlri
Author

karlri commented Dec 20, 2019

Actually no, it only seems to make the deadlock less likely. Main thread backtrace:

#0  0x00007f4081a5f6dd in pthread_join (
    threadid=139915023746816, thread_return=0x0)
    at pthread_join.c:90
#1  0x00005569486add5d in std::sys::unix::thread::Thread::join::h0dd4634e5c3886ac ()
    at src/libstd/sys/unix/thread.rs:179
#2  0x00005569482e3f51 in std::thread::JoinInner$LT$T$GT$::join::h1f2d05cc866272cd (
    self=0x7ffd3c600b48)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libstd/thread/mod.rs:1327
#3  0x00005569482e3fc2 in std::thread::JoinHandle$LT$T$GT$::join::he18ad55cd4a5ccd1 (self=...)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libstd/thread/mod.rs:1460
#4  0x000055694834b5f2 in _$LT$reqwest..blocking..client..InnerClientHandle$u20$as$u20$core..ops..drop..Drop$GT$::drop::_$u7b$$u7b$closure$u7d$$u7d$::h6322640f342aadad (h=...)
    at reqwest/src/blocking/client.rs:581
#5  0x000055694828a6d0 in core::option::Option$LT$T$GT$::map::hd88b77d2e830f5b9 (self=..., f=...)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/option.rs:447
#6  0x00005569482e1a50 in _$LT$reqwest..blocking..client..InnerClientHandle$u20$as$u20$core..ops..drop..Drop$GT$::drop::hb6fe9cab38400c64 (
    self=0x55694a62b8e0)
    at reqwest/src/blocking/client.rs:581
#7  0x00005569481c69b5 in core::ptr::real_drop_in_place::hf1dea16a8e08991a ()
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:175
#8  0x00005569482a7a14 in core::ptr::drop_in_place::h387838fc20a3f1ac (to_drop=0x55694a62b8e0)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:165
#9  alloc::sync::Arc$LT$T$GT$::drop_slow::hfa5e9608c48264e2 (self=0x7ffd3c6024a8)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/liballoc/sync.rs:703
#10 0x00005569482a8958 in _$LT$alloc..sync..Arc$LT$T$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::hb7e0b751c8637854 (self=0x7ffd3c6024a8)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/liballoc/sync.rs:1225
#11 0x00005569481a9b3f in core::ptr::real_drop_in_place::h3462ae8feecfdbef ()
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:175
#12 0x00005569481aa37b in core::ptr::real_drop_in_place::ha31e0549b0554145 ()
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:175
#13 0x00005569481a9e6e in core::ptr::real_drop_in_place::h4fd0c8f05dcadaa2 ()
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:175
#14 0x00005569481aa19e in core::ptr::real_drop_in_place::h8b9a13bff1d36747 ()
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/ptr/mod.rs:175
#15 0x00005569481a9466 in core::mem::drop::hc9a625d36b0c0444 (_x=...)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/mem/mod.rs:706
#16 0x00005569481ac8e3 in reqwest_lockup::may_deadlock::h9b97e9fee10e3133 ()
    at src/main.rs:27
#17 0x00005569481ac40b in reqwest_lockup::main::h5f0d353239e74092 () at src/main.rs:3
#18 0x00005569481ac190 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hdf09220ccec15447 () at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libstd/rt.rs:64
#19 0x00005569486aad83 in std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::ha04574f12d97cbe2 () at src/libstd/rt.rs:49
#20 std::panicking::try::do_call::h7c2a8488f72db90c () at src/libstd/panicking.rs:292
#21 0x00005569486ae8fa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:80
#22 0x00005569486ab90d in std::panicking::try::hc3a9b5da4250385d ()
    at src/libstd/panicking.rs:271
#23 std::panic::catch_unwind::hf27600bf8c37809a () at src/libstd/panic.rs:394
#24 std::rt::lang_start_internal::h409d4f2fe51133b0 () at src/libstd/rt.rs:48
#25 0x00005569481ac169 in std::rt::lang_start::h2d8bb13c8f7b0694 (
    main=0x5569481ac390 <reqwest_lockup::main::h5f0d353239e74092>, argc=1, 
    argv=0x7ffd3c602858)
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libstd/rt.rs:64
#26 0x00005569481aca5a in main ()

It appears to hang at client.rs line 581 on that branch, at: self.thread.take().map(|h| h.join());

other threads:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f408297d900 (LWP 16842))]
#0  0x00007f4081a5f6dd in pthread_join (threadid=139915023746816, thread_return=0x0)
    at pthread_join.c:90
90      in pthread_join.c
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f40814a0700 (LWP 16844))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
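
For context, a minimal sketch of the pattern behind that join (not reqwest's actual source, just an illustration): a blocking handle that owns a background worker thread and joins it in Drop. If the worker never exits its loop, the join at the equivalent of client.rs:581 blocks forever, which matches the main-thread backtrace above.

use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for reqwest's inner client handle: it owns a worker
// thread plus a channel sender, and dropping it closes the channel and then
// joins the worker (the equivalent of `self.thread.take().map(|h| h.join());`).
struct ClientHandle {
    tx: Option<mpsc::Sender<&'static str>>,
    thread: Option<thread::JoinHandle<()>>,
}

impl ClientHandle {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel();
        let thread = thread::spawn(move || {
            // Worker loop: runs until all senders are dropped. If this thread
            // instead blocked on something else (say, a contended runtime lock
            // during shutdown), the join in Drop would hang forever.
            for msg in rx {
                println!("worker got: {}", msg);
            }
        });
        ClientHandle { tx: Some(tx), thread: Some(thread) }
    }
}

impl Drop for ClientHandle {
    fn drop(&mut self) {
        // Dropping the sender signals the worker to shut down...
        drop(self.tx.take());
        // ...and this join is the step that never returns in the report above.
        if let Some(h) = self.thread.take() {
            let _ = h.join();
        }
    }
}

fn main() {
    let handle = ClientHandle::new();
    handle.tx.as_ref().unwrap().send("ping").unwrap();
    drop(handle); // joins the worker; only hangs if the worker never exits
}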

@Shnatsel

I can reproduce this in v0.10.0 (git commit 35c6ddd). Happens 6% of the time (measured on a sample of 1,000,000 requests).

I've modified the blocking example to read the URL from the command line and use a short timeout, compiled it with cargo build --release --features="native-tls,gzip,blocking,cookies" --example blocking, and invoked it in a loop like this:

cut -d , -f 3 ../majestic-million-31dec2019.csv | while read url; do target/release/examples/blocking "$url"; done

Majestic million CSV obtained here.

Normally this prints "Done.", the binary exits, and the next iteration is immediately invoked, printing "GET ...". Within about a minute you can see it print "Done." without a following "GET ...", and progress halts.

I'm on Ubuntu 18.04, rustc stable 1.40.0.
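
The exact modifications aren't shown in the thread, but a minimal sketch of such a modified example, assuming reqwest 0.10's blocking API, might look roughly like this:

use std::time::Duration;

// Assumes reqwest 0.10 with the "blocking" feature enabled in Cargo.toml.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Take the URL from the command line, e.g. `blocking example.com`.
    let arg = std::env::args().nth(1).expect("usage: blocking <url>");
    let url = if arg.starts_with("http") { arg } else { format!("http://{}", arg) };

    println!("GET {}", url);
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(2)) // short timeout, as described above
        .build()?;

    match client.get(url.as_str()).send() {
        Ok(resp) => println!("status: {}", resp.status()),
        Err(e) => println!("request failed: {}", e),
    }

    // If the process hangs, it does so after this line, while the response
    // and client are dropped and the runtime shuts down.
    println!("Done.");
    Ok(())
}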

@Shnatsel

Shnatsel commented Jan 1, 2020

A deadlock also happens on Windows, so this bug is probably not platform-specific. Thanks to @DutchGhost for testing.

@seanmonstar added the bug label Jan 2, 2020
@seanmonstar
Owner

I'm looking into this now. As a first step, I'm adding logs in #759 to help narrow down where the hang is. My current test environment only has 1 CPU, so it's hard for me to reproduce races involving threads.

@seanmonstar
Owner

After some deep code diving, I have a slight hunch. Are you able to try your test with a list of URLs that have IP addresses for hosts, so that DNS isn't invoked?

@seanmonstar
Owner

Also, if possible, it'd be hugely helpful to see the stacks of the other threads. reqwest is hanging waiting for thread.join, so the other thread isn't finishing up for some reason.

@Shnatsel

Shnatsel commented Jan 2, 2020

It is still reproducible using IP addresses only. What's more, I got another, different hang this way!

The first time around the very first IP 172.217.20.46 hung immediately for me, before it even downloaded the content. This might be a different bug?

After that I've done a bunch more runs with IPs and it does hang the same way it did with domain names. I've simply run the first 3000 entries from majestic_million.csv through nslookup to get the IP list.
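
For illustration (the thread used nslookup for this step), a rough Rust sketch of pre-resolving hostnames to IP addresses with std::net::ToSocketAddrs, so the test URLs can use IPs and the requests never invoke DNS:

use std::net::ToSocketAddrs;

fn main() {
    // Placeholder hostnames; the thread used the first 3000 Majestic Million entries.
    let hosts = ["example.com", "example.org"];
    for &host in &hosts {
        // Port 80 is only needed to satisfy ToSocketAddrs; the port is discarded.
        match (host, 80u16).to_socket_addrs() {
            Ok(addrs) => {
                for addr in addrs {
                    println!("{} {}", host, addr.ip());
                }
            }
            Err(e) => eprintln!("{}: resolution failed: {}", host, e),
        }
    }
}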

@seanmonstar
Owner

Well, shucks. Is it possible to investigate the process with gdb when it hangs and see the stacks of the other threads?

@Shnatsel

Shnatsel commented Jan 2, 2020

I've never learned to use gdb, so I'm afraid I cannot provide anything meaningful here. And remote debugging through a human proxy is generally inefficient.

The code to reproduce the issue is linked above. If you need access to a multi-threaded machine, Google Cloud gives everyone $300 in free trial credit; you can just ssh into one of their VMs to debug this. I've already checked that the issue is reproducible there (on a 4-core machine, specifically).

@seanmonstar
Owner

OK, I managed to reproduce this and get a backtrace of the other thread:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff70b3bb5 in __GI___pthread_mutex_lock (mutex=0x7ffff00023a0) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005555556d8704 in tokio::runtime::basic_scheduler::SchedulerPriv::schedule::hd36e54f11b45fb84 ()
#3  0x00005555555ea09f in tokio::task::waker::wake_by_val::h24030994c56cf063 ()
#4  0x00005555555e4482 in want::Taker::signal::hff5e4bb4f98d0e72 ()
#5  0x00005555556058c5 in core::ptr::real_drop_in_place::h5d7ab3cea2a38b95 ()
#6  0x0000555555608ee7 in core::ptr::real_drop_in_place::hcdf945c54d6363fa ()
#7  0x000055555560dd04 in tokio::task::core::Core$LT$T$GT$::transition_to_consumed::h97b8b14f387dce59 ()
#8  0x0000555555715aaa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:78
#9  0x000055555565f172 in tokio::task::harness::Harness$LT$T$C$S$GT$::do_cancel::h823276277b2d0293 ()
#10 0x00005555555d1956 in tokio::task::queue::MpscQueues$LT$S$GT$::close_remote::h039153b4ef2d9c63 ()
#11 0x00005555555d1c7d in tokio::task::queue::MpscQueues$LT$S$GT$::shutdown::ha2ff88c001266a59 ()
#12 0x00005555555f5d3d in _$LT$tokio..runtime..basic_scheduler..BasicScheduler$LT$P$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::he8b1ff9afa3edcfc ()
#13 0x0000555555631748 in core::ptr::real_drop_in_place::h091f7f6e24d2b1f4 ()

@seanmonstar added the upstream label Jan 6, 2020
@carllerche

Looks like it is caused by this: tokio-rs/tokio#2058

@seanmonstar
Owner

Should hopefully be fixed in the next version of Tokio (0.2.7).

@seanmonstar reopened this Jan 7, 2020
@im-n1

im-n1 commented Jan 7, 2020

I have 2 threads, and in each I run a blocking get(). I get a deadlock 100% of the time.
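
A minimal sketch of that scenario (placeholder URL, reqwest 0.10 blocking API assumed):

use std::thread;

fn main() {
    // Two threads, each performing a blocking GET; with the affected tokio
    // versions this reportedly deadlocked every time.
    let handles: Vec<_> = (0..2)
        .map(|i| {
            thread::spawn(move || {
                let resp = reqwest::blocking::get("http://example.com")
                    .expect("request failed");
                println!("thread {} got status {}", i, resp.status());
                // The hang was on drop of the response/client after this point.
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}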

bpetlert added a commit to bpetlert/pacman-mirrorup that referenced this issue Jan 7, 2020
@carllerche

@im-n1 The fix should have been released today. Can you try now? (Make sure you have Tokio 0.2.7+.)

@im-n1

im-n1 commented Jan 8, 2020

Works with Tokio 0.2.8.

@karlri
Author

karlri commented Jan 8, 2020

Works on Linux, both on single-core ARM and quad-core x86_64.

@karlri closed this as completed Jan 8, 2020
samford added a commit to samford/zola that referenced this issue Jan 8, 2020
The issue with the check_site test timing out seems to be related to
a similar reqwest issue (seanmonstar/reqwest#746). This was due to an
upstream bug in tokio and should be fixed in tokio 0.2.7 onward.
samford added a commit to samford/zola that referenced this issue Jan 8, 2020
The issue with the check_site test timing out seems to be related to
a similar reqwest issue (seanmonstar/reqwest#746). This was due to an
upstream bug in tokio and may be fixed in tokio 0.2.7 onward.
@gnzlbg

gnzlbg commented Jan 17, 2020

parking_lot primitives have deadlock detection that can be enabled with a cargo feature, so making reqwest or tokio panic on a deadlock should be "as simple" as enabling that feature. The panic will output the backtraces of all threads involved in the deadlock.
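
For illustration, a rough sketch of how that could look on the consuming side, assuming the locks involved are parking_lot locks and the deadlock_detection cargo feature is enabled. parking_lot does not panic by itself; you poll its detector from a watchdog thread:

// Cargo.toml (assumed): parking_lot = { version = "0.12", features = ["deadlock_detection"] }
use std::thread;
use std::time::Duration;

fn spawn_deadlock_watchdog() {
    thread::spawn(|| loop {
        thread::sleep(Duration::from_secs(5));
        // Returns one Vec per detected deadlock cycle; empty if none.
        let deadlocks = parking_lot::deadlock::check_deadlock();
        if deadlocks.is_empty() {
            continue;
        }
        eprintln!("{} deadlock(s) detected", deadlocks.len());
        for (i, threads) in deadlocks.iter().enumerate() {
            eprintln!("deadlock #{}", i);
            for t in threads {
                eprintln!("thread id {:#?}", t.thread_id());
                eprintln!("{:#?}", t.backtrace());
            }
        }
        // Panics only this watchdog thread; call std::process::abort() instead
        // if the whole process should go down when a deadlock is found.
        panic!("deadlock detected");
    });
}

fn main() {
    spawn_deadlock_watchdog();
    // ... application code that uses parking_lot mutexes ...
}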

@carllerche

Thanks for the tip. Unfortunately, figuring out which threads are involved in the deadlock is usually the easy part. The hard part is figuring out why, since the actual cause is almost never at the location of the deadlock.

@gnzlbg

gnzlbg commented Jan 17, 2020

Sure, debugging deadlocks is hard.

I just meant to say that exposing such a feature would:

  • allow users who want to protect themselves against deadlocks, and who are willing to pay the runtime cost, to do so (this turns deadlocks into panics, so they could be used to restart things)
  • help people file better bug reports: one often discovers a deadlock because the application does not terminate, and at least I am often unsure whether some work is still ongoing or it really is a deadlock. Being able to flip a flag and reproduce with a backtrace is a huge time saver and allows filing a more informative bug report. Although, as you mention, the road from there to actually finding the problem is long.

What would be the appropriate place to discuss adding such a feature?
