
UnixStream::connect() doesn't handle EPOLLHUP #3879

Closed
bnoordhuis opened this issue Jun 21, 2021 · 11 comments · Fixed by #3898
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-net Module: tokio/net

Comments

@bnoordhuis
Contributor

Version
1.6.1 - issue appears to be present in ToT as well though

Platform
Linux bender 5.8.0-53-lowlatency #60-Ubuntu SMP PREEMPT Thu May 6 08:53:53 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description

I regret that I don't have a small reproduction, as the issue only pops up infrequently. The program does quite literally this:

let mut conn = tokio::net::UnixStream::connect(path).await?;
conn.write_all(data).await?;

Sporadically, this happens:

2278274 connect(15<UNIX:[62343351]>, {sa_family=AF_UNIX, sun_path="<elided>"}, 60) = -1 EAGAIN (Resource temporarily unavailable)
2278274 epoll_ctl(8<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 15<UNIX:[62343351]>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=587202568, u64=587202568}}) = 0
2278274 epoll_wait(4<anon_inode:[eventpoll]>, [{EPOLLIN, {u32=2147483648, u64=2147483648}}, {EPOLLOUT|EPOLLHUP, {u32=587202568, u64=587202568}}], 1024, 4) = 2

Note the EPOLLOUT|EPOLLHUP. If I read Tokio's source code right, it considers the presence of EPOLLOUT to mean success and ignores the EPOLLHUP. The connecting process then proceeds to write but that fails as follows:

2278274 write(15<UNIX:[62343351]>, "1\n", 2) = -1 ENOTCONN (Transport endpoint is not connected)

My expectation is that UnixStream::connect() fails with an error when EPOLLHUP is set.
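For completeness, here are those two lines as a self-contained function (path and data are placeholders, not the real values from the application):

use tokio::io::AsyncWriteExt;
use tokio::net::UnixStream;

// Minimal shape of the failing code path: connect, then write.
async fn send(path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut conn = UnixStream::connect(path).await?;
    conn.write_all(data).await?;
    Ok(())
}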

@bnoordhuis bnoordhuis added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Jun 21, 2021
@Darksonn Darksonn added the M-net Module: tokio/net label Jun 22, 2021
@Thomasdezeeuw
Contributor

I'm not too familiar with Tokio's source code. @Darksonn, do you know if PollEvented::poll_write_ready checks mio::Event::is_write_closed? Mio should return true from mio::Event::is_write_closed when EPOLLHUP is set.
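For reference, a minimal sketch (not Tokio's code) of where that flag surfaces at the Mio level, using Mio 0.7's public event API; whether Tokio consults it before declaring the connect successful is exactly the open question:

use std::io;
use mio::{Events, Poll};

// Sketch: after a connecting mio::net::UnixStream has been registered with
// writable interest, inspect the readiness flags that epoll reported.
fn saw_write_closed(poll: &mut Poll, events: &mut Events) -> io::Result<bool> {
    poll.poll(events, None)?;
    for event in events.iter() {
        // is_write_closed() reports true when epoll sets EPOLLHUP,
        // even if EPOLLOUT is set in the same event.
        if event.is_writable() && event.is_write_closed() {
            return Ok(true); // writable, but the connection is already gone
        }
    }
    Ok(false)
}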

@Darksonn
Contributor

Hm, it's rather difficult to trace whether it does.

@carllerche
Member

I believe that poll_write_ready does not do anything special on HUP. The key here is that the connect should fail, not all calls to writable().

@Thomasdezeeuw
Contributor

Looking at:

poll_fn(|cx| stream.io.registration().poll_write_ready(cx)).await?;
Ok(stream)

It doesn't seem to check any errors set on the stream, something that TcpStream does do:

poll_fn(|cx| stream.io.registration().poll_write_ready(cx)).await?;
if let Some(e) = stream.io.take_error()? {
    return Err(e);
}
Ok(stream)

Could that be the problem? In any case, I think epoll/Mio should return an error that is detectable (somewhere) by Tokio before the stream is returned by UnixStream::connect.
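For callers hitting this today, here is a sketch of doing the same SO_ERROR check at the application level, since tokio::net::UnixStream exposes take_error publicly; whether SO_ERROR is actually populated in this particular EPOLLHUP case is part of what needs confirming:

use std::io;
use tokio::net::UnixStream;

// Sketch of a user-side workaround: connect, then surface any pending socket
// error (SO_ERROR) before handing the stream to the rest of the program.
async fn connect_checked(path: &str) -> io::Result<UnixStream> {
    let stream = UnixStream::connect(path).await?;
    if let Some(e) = stream.take_error()? {
        return Err(e);
    }
    Ok(stream)
}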

@Darksonn
Contributor

I opened #3898 to make the change that @Thomasdezeeuw suggested.

@bnoordhuis
Contributor Author

I found out this weekend that EPOLLHUP isn't necessarily a terminal state - it can mean the listen socket's accept backlog is full and that it'll start accepting new connections again once the backlog clears.

I've never seen this issue with libuv (I'm a maintainer) but maybe that's because libuv doesn't register the file descriptor with EPOLLRDHUP.

I can try removing EPOLLRDHUP from Tokio/mio and see if that makes a difference, or do you expect that to break things?

@Thomasdezeeuw
Contributor

I found out this weekend that EPOLLHUP isn't necessarily a terminal state - it can mean the listen socket's accept backlog is full and that it'll start accepting new connections again once the backlog clears.

I'm a little confused; the original issue is about UnixStream, which doesn't have a backlog. Do you mean the backlog of the socket that UnixStream::connect is connecting to?

Also, "can mean" sounds rather troublesome. How would we distinguish a "real" problem from a "backlog is full, try again" problem? Can we use take_error (SO_ERROR) for this?

I've never seen this issue with libuv (I'm a maintainer) but maybe that's because libuv doesn't register the file descriptor with EPOLLRDHUP.

We use it to determine if the read side is closed: https://docs.rs/mio/0.7.13/mio/event/struct.Event.html#method.is_read_closed.

I can try removing EPOLLRDHUP from Tokio/mio and see if that makes a difference, or do you expect that to break things?

Well, it would break the public API of Mio for sure. If you want to try, the relevant code is here: https://github.com/tokio-rs/mio/blob/21ddf94d3191dc2385d53ca8f2ac5a7b5ca5af96/src/sys/unix/selector/epoll.rs#L136. However, I think it should be handled in Tokio, as other users might depend on this flag.
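For context, the interest-to-epoll-flag mapping at that link looks roughly like the sketch below (paraphrased from Mio 0.7.x, not copied verbatim); EPOLLRDHUP is added for every readable registration, which is what the experiment above would remove:

use libc::{EPOLLET, EPOLLIN, EPOLLOUT, EPOLLRDHUP};
use mio::Interest;

// Paraphrase of Mio's epoll selector: readable interest maps to EPOLLIN plus
// EPOLLRDHUP (so read-side closure is observable), writable maps to EPOLLOUT,
// and all registrations are edge-triggered.
fn interests_to_epoll(interests: Interest) -> u32 {
    let mut kind = EPOLLET;
    if interests.is_readable() {
        kind |= EPOLLIN | EPOLLRDHUP;
    }
    if interests.is_writable() {
        kind |= EPOLLOUT;
    }
    kind as u32
}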

@bnoordhuis
Contributor Author

Do you mean the backlog of the socket that UnixStream::connect is connecting to?

Yes, that's right.

Also, "can mean" sounds rather troublesome. How would we distinguish a "real" problem from a "backlog is full, try again" problem? Can we use take_error (SO_ERROR) for this?

I don't know TBH but I'll investigate. :-)

@Thomasdezeeuw
Contributor

I'm not sure this should be closed yet. Yes, the issue itself is resolved, but @bnoordhuis suggested we might be able to recover from this error. Maybe we should wait until he has had time to investigate further.

@carllerche
Member

I think GitHub automatically closed it.

@bnoordhuis
Contributor Author

Do you mean the backlog of the socket that UnixStream::connect is connecting to?

Yes, that's right.

Also, "can mean" sounds rather troublesome. How would we distinguish a "real" problem from a "backlog is full, try again" problem? Can we use take_error (SO_ERROR) for this?

I've been trying to investigate this, but the application I'm working on has changed significantly enough that the original issue no longer appears, and trying to mold it back to its previous incarnation was too much of a time sink. I'm good with closing this.
