"unable to enqueue message" when AsyncClient<UdpResponse> sends too many requests #1276

Closed
LEXUGE opened this issue Nov 8, 2020 · 36 comments · Fixed by #1356
Closed

"unable to enqueue message" when AsyncClient<UdpResponse> sends too many requests #1276

LEXUGE opened this issue Nov 8, 2020 · 36 comments · Fixed by #1356

Comments

LEXUGE (Contributor) commented Nov 8, 2020

Describe the bug
During load testing at around 890 qps, I found:

2020-11-08 18:24:11,428 DEBUG [trust_dns_proto::xfer] enqueueing message: [Query { name: Name { is_fqdn: true, labels: [baidu, com] }, query_type: A, query_class: IN }]
2020-11-08 18:24:11,428 DEBUG [trust_dns_proto::xfer] unable to enqueue message
2020-11-08 18:24:11,428 WARN  [droute::router] Upstream encountered error: could not send request, returning SERVFAIL
2020-11-08 18:24:11,428 DEBUG [trust_dns_proto::xfer::dns_exchange] io_stream is done, shutting down

where droute is the name of my project.

To Reproduce
It's hard to reduce this to a minimal reproducible code snippet. However, my setup does the following:

  1. Receive DNS queries in an event loop, spawning a new task for each query.
  2. In each task, clone the AsyncClient and send the query through it.

There is only one AsyncClient, but it is cloned several times.
I tested with a delay between requests: 1 millisecond doesn't help (the "unable to enqueue" error persists), and 2 milliseconds results in timeouts (I set a timeout of roughly 2 seconds per query).
I also tried using multiple AsyncClients, which results in a high rate of timeouts.

The related code can be found here.

Expected behavior
No error

System:

  • OS: [e.g. macOS]
  • Architecture: [e.g. x86_64]
  • Version [e.g. 22]
  • rustc version: [e.g. 1.28]

Version:
Crate: client
Version: 0.19.5

Additional context
I also tried tokio-compat; the issue doesn't seem to occur with it.

LEXUGE (Contributor, Author) commented Nov 8, 2020

It might be caused by CHANNEL_BUFFER_SIZE. Can we have an option to use an unbounded mpsc channel?

LEXUGE (Contributor, Author) commented Nov 9, 2020

I found out this is not related to the buffer size; rather, I was holding an async Mutex across the send, which causes the problem. I don't know why holding the async Mutex across the await causes it, but it is solved.

LEXUGE closed this as completed Nov 9, 2020
djc (Collaborator) commented Nov 9, 2020

Can you explain a bit more about the changes that fixed the problem you were seeing? It sounds like there might be a potential deadlock in trust-dns that you triggered?

LEXUGE (Contributor, Author) commented Nov 9, 2020

Sure, here is the problematic code. I have, roughly, an async Mutex<HashMap<usize, AsyncClient<UdpResponse>>>, and I locked it across the send().await operation. After running error-free for a while, it starts to time out, and finally gives me unable to enqueue: couldn't find receivers or receiver is gone (I don't remember exactly).

Either locking the async Mutex only before and after the send operation (not across it) or using a sync Mutex mitigates the issue. In conclusion, I can't hold a MutexGuard<AsyncClient> across send().await, otherwise the issue occurs.
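
A minimal sketch of that mitigation, assuming the HashMap of clients described above and the 0.20-style AsyncClient used later in this thread (the send_query helper and its signature are illustrative, not part of droute or trust-dns):

use std::collections::HashMap;
use tokio::sync::Mutex;
use trust_dns_client::client::AsyncClient;
use trust_dns_proto::op::Message;
use trust_dns_proto::xfer::dns_handle::DnsHandle;

// Illustrative sketch: lock only long enough to clone the client, then drop
// the guard before awaiting the send, instead of holding it across the await.
async fn send_query(
    clients: &Mutex<HashMap<usize, AsyncClient>>,
    key: usize,
    msg: Message,
) -> Option<Message> {
    let mut client = {
        let guard = clients.lock().await;
        guard.get(&key)?.clone()
        // the guard is dropped at the end of this block, before the await below
    };
    Some(Message::from(client.send(msg).await.ok()?))
}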


Also, whether the mpsc channel is bounded or unbounded is not related to this issue: I tried an unbounded channel and got the same error. In addition, both 0.19.5 and 0.20-alpha3 give me the issue with the same error: receiver is gone.

The HashMap<usize, AsyncClient> is a mapping between the configurations the user specified and the clients; it only has 4 or 5 keys.

djc (Collaborator) commented Nov 9, 2020

Why are you holding on to so many separate AsyncClient instances in the first place?

LEXUGE (Contributor, Author) commented Nov 9, 2020

No, I am not. I created a HashMap to "cache" an AsyncClient per configuration, and I clone that same AsyncClient for each incoming query to avoid the cost of creating an extra client.

LEXUGE reopened this Nov 12, 2020
LEXUGE (Contributor, Author) commented Nov 29, 2020

I tested the same code with Tokio 0.3 on the main branch; it seems it is able to enqueue messages, but it has to wait for a long time. Can we change the channel to be unbounded, or have an option for that?

djc (Collaborator) commented Nov 29, 2020

Effectively this is the AsyncClient applying backpressure towards your application, because the internal queue isn't being emptied fast enough. I think the solution here is (a) investigating what the performance problem is on the other side of the queue, or (b) keeping a buffer in your application. We could make the channel larger, but in the limit that would just mean that your queries start to time out.
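
One way to keep such a buffer on the application side, sketched here assuming the tokio 1.x Semaphore API and the AsyncClient usage shown later in this thread (the limit of 128 and the helper name are arbitrary examples, not trust-dns recommendations):

use std::sync::Arc;
use tokio::sync::Semaphore;
use trust_dns_client::client::AsyncClient;
use trust_dns_proto::op::Message;
use trust_dns_proto::xfer::dns_handle::DnsHandle;

// Illustrative sketch: cap the number of in-flight queries so the application
// slows down before the AsyncClient's internal queue fills up.
async fn forward_with_limit(
    limit: Arc<Semaphore>, // e.g. Arc::new(Semaphore::new(128)), shared by all tasks
    mut client: AsyncClient,
    msg: Message,
) -> Option<Message> {
    // Wait here if too many queries are already in flight.
    let _permit = limit.acquire_owned().await.ok()?;
    Some(Message::from(client.send(msg).await.ok()?))
    // _permit is dropped on return, freeing one slot for the next query.
}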

LEXUGE (Contributor, Author) commented Nov 29, 2020

Does having multiple AsyncClients help? I suppose each client has its own queue (background)?

djc (Collaborator) commented Nov 29, 2020

If you set up multiple async clients separately, that should mitigate the problem, yes. But again, that just means you're potentially opening multiple connections to the same backend servers, which might cause your client to get rate-limited sooner.

It would be useful to know what the bottleneck in draining the queue is, in your application.

LEXUGE (Contributor, Author) commented Nov 29, 2020

Thanks, I will investigate further to see where the exact bottleneck is.

djc (Collaborator) commented Nov 29, 2020

dcompass looks like a cool project, BTW!

Is there a particular reason you're using trust-dns-client here rather than trust-dns-resolver?

LEXUGE (Contributor, Author) commented Nov 29, 2020

To tell you the truth, the first prototype used the resolver because I didn't understand how the client works. However, that meant I had to take the resolved IPs and build a new response packet to send back, which is cumbersome. Using the client means I only need to forward the query and send back the answers, which feels more natural.

LEXUGE (Contributor, Author) commented Nov 29, 2020

It seems it was only a network issue. It can now top 3000 qps!

djc (Collaborator) commented Nov 30, 2020

So this can be closed again, right?

LEXUGE (Contributor, Author) commented Nov 30, 2020

So this can be closed again, right?

I think so. However, regarding the queue size, I hope it could be increased or exposed as a configurable parameter.

djc (Collaborator) commented Nov 30, 2020

I don't really see a good reason to do that: if there's a network delay for example, it's better for your application to become aware of that sooner rather than later (through timeouts).

LEXUGE (Contributor, Author) commented Jan 14, 2021

Update:
Although backpressure is expected, on 0.20 the underlying channel seems to stay full permanently once it has been filled, i.e. it's irrecoverable. I suspect this is a bug.

djc (Collaborator) commented Jan 14, 2021

Would be nice to have a minimal reproduction that demonstrates the issue.

LEXUGE (Contributor, Author) commented Jan 14, 2021

use std::net::SocketAddr;
use std::sync::Arc;
use tokio::net::UdpSocket;
use trust_dns_client::{client::AsyncClient, udp::UdpClientStream};
use trust_dns_proto::op::Message;
use trust_dns_proto::xfer::dns_handle::DnsHandle;

#[tokio::main]
async fn main() {
    // Bind a UDP socket for receiving incoming DNS queries
    let socket = Arc::new(
        UdpSocket::bind("127.0.0.1:2053".parse::<SocketAddr>().unwrap())
            .await
            .unwrap(),
    );

    // Create one shared client for 8.8.8.8 and spawn its background task.
    let client = {
        let stream = UdpClientStream::<UdpSocket>::new("8.8.8.8:53".parse().unwrap());
        let (client, bg) = AsyncClient::connect(stream).await.unwrap();
        tokio::spawn(bg);
        client
    };

    // Event loop
    loop {
        let mut buf = [0; 1232];

        let (len, src) = socket.recv_from(&mut buf).await.unwrap();

        // Parse only the bytes that were actually received.
        let msg = Message::from_vec(&buf[..len]).unwrap();
        let socket = socket.clone();

        // Clone the shared client and answer this query in its own task.
        let mut client = client.clone();
        tokio::spawn(async move {
            let id = msg.id();
            let mut r = Message::from(client.send(msg).await.unwrap());
            r.set_id(id);
            socket.send_to(&r.to_vec().unwrap(), src).await
        });
    }
}

I believe this is the minimal sample. However, because my network environment is "good" now, I cannot really reproduce the problem with it (it works in the current environment).

djc (Collaborator) commented Jan 14, 2021

I don't think it would be a bug if that fails. The application needs to be able to "handle" backpressure from the underlying library, by slowing down the rate of requests if necessary. The channel always being full just means the receiver cannot process requests as fast as the sender is trying to send them. If that happens with your application, you should find some way for your application to "handle" the backpressure, for example by temporarily delaying further requests or by load balancing across another channel.

Maybe do some reading on backpressure if you're not familiar with it:

https://medium.com/@jayphelps/backpressure-explained-the-flow-of-data-through-software-2350b3e77ce7
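
For illustration, one crude way to apply such a delay at the call site, assuming the tokio 1.x timer API and the AsyncClient usage from the sample above (the retry count and the 50 ms pause are arbitrary example values):

use std::time::Duration;
use tokio::time::sleep;
use trust_dns_client::client::AsyncClient;
use trust_dns_proto::op::Message;
use trust_dns_proto::xfer::dns_handle::DnsHandle;

// Illustrative sketch: if a send fails (for example because the queue is full),
// pause briefly so the background task can drain its queue, then retry.
async fn send_with_retry(mut client: AsyncClient, msg: Message) -> Option<Message> {
    for _ in 0..3 {
        match client.send(msg.clone()).await {
            Ok(resp) => return Some(Message::from(resp)),
            Err(_) => sleep(Duration::from_millis(50)).await,
        }
    }
    None
}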

LEXUGE (Contributor, Author) commented Jan 14, 2021

However, the situation here is that the channel stays full even though the sender has stopped sending any messages for a considerable amount of time. That is what I didn't expect. I think trust-dns should be able to empty the channel somehow, whether by cancelling, dropping, or handling it internally.

LEXUGE (Contributor, Author) commented Jan 14, 2021

[screenshot: debug log output]
For example, after I stressed the requester out, the channel is still full even after 1 minute, when I tested with a query for www.example.cn (I tested longer intervals as well).

But I still cannot reproduce it with that minimal sample. Maybe this issue is related to my codebase; however, I don't see any substantial difference between the sample and my codebase.

djc (Collaborator) commented Jan 15, 2021

Okay, that does sound like a bug. Without some way to reproduce it or a more detailed problem report, I'm not sure I have any avenues for fixing it, though.

LEXUGE (Contributor, Author) commented Jan 16, 2021

[screenshots: debug log output]
I tweaked the logging a little and found this bizarre situation.
This might be related to my codebase; I am not sure whether I misused UdpClientStream. What I do is create a client once and clone it for later use:

struct Udp {
    // The single AsyncClient created in new() and cloned for each request.
    client: AsyncClient,
}

impl Udp {
    pub async fn new(addr: SocketAddr) -> Result<Self> {
        let stream = UdpClientStream::<UdpSocket>::new(addr);
        let (client, bg) = AsyncClient::connect(stream).await?;
        tokio::spawn(bg);
        Ok(Self { client })
    }
}

#[async_trait]
impl ClientPool for Udp {
    async fn get_client(&self) -> Result<AsyncClient> {
        Ok(self.client.clone())
    }
}

It's weird to see the receiver being dropped, though.

LEXUGE (Contributor, Author) commented Jan 16, 2021

It seems that before the receiver was gone, there was always an error like failed to associate send_message response to the sender. This may cause the background task and the receiver to be dropped, resulting in this issue.

LEXUGE (Contributor, Author) commented Jan 17, 2021

The root cause of this issue is that the background task encountered some error and exited, and the client then tried to send a message but failed because the background task no longer existed. My workaround is to create a new AsyncClient whenever a send fails.
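
A sketch of that workaround, reusing the connection setup from the earlier sample (the helper names are examples, not trust-dns APIs, and error handling is simplified):

use std::net::SocketAddr;
use tokio::net::UdpSocket;
use trust_dns_client::{client::AsyncClient, udp::UdpClientStream};
use trust_dns_proto::op::Message;
use trust_dns_proto::xfer::dns_handle::DnsHandle;

// Same connection setup as in the earlier sample: spawn the background task
// and hand back the client half.
async fn connect(addr: SocketAddr) -> AsyncClient {
    let stream = UdpClientStream::<UdpSocket>::new(addr);
    let (client, bg) = AsyncClient::connect(stream).await.unwrap();
    tokio::spawn(bg);
    client
}

// Illustrative sketch: if a send fails (e.g. because the background task has
// exited), replace the client with a freshly connected one and retry once.
async fn send_or_reconnect(
    client: &mut AsyncClient,
    addr: SocketAddr,
    msg: Message,
) -> Option<Message> {
    match client.send(msg.clone()).await {
        Ok(resp) => Some(Message::from(resp)),
        Err(_) => {
            *client = connect(addr).await;
            client.send(msg).await.ok().map(Message::from)
        }
    }
}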

LEXUGE closed this as completed Jan 17, 2021
djc (Collaborator) commented Jan 17, 2021

Did you figure out what error the background task encountered? We should maybe make the background task more resilient to failure if the error is at all recoverable.

djc reopened this Jan 17, 2021
LEXUGE (Contributor, Author) commented Jan 17, 2021

Mainly failed to associate send_message response to the sender.
For HTTPS clients, there may also be io_stream hit an error, shutting down: not an error (approximately; it happens as the underlying streams receive a CloseNotify).

bluejekyll (Member) commented

Thanks for all the research on this. I'm wondering if the issue here is that we are hitting a network error, but the request future waiting for the result isn't able to be notified, b/c there is no ID associated back to the stream (thus forcing the timeout to expire before resolving itself).

Do we need a better method of binding the request id and the IO stream together, such that when the IO stream fails we can immediately return a result to the channel waiting for a response? (I'm guessing this might be the issue based on @LEXUGE's research.)

LEXUGE (Contributor, Author) commented Jan 18, 2021

There are two separate cases.
One is that the background task encounters some PERMANENT error; in that case the AsyncClient should make the caller aware of the situation, or perhaps tear itself down, since a client without a background task makes no sense.
The other case is that the AsyncClient is simply dropped; in that case, I expect the background task to exit by itself (as it does currently).

However, if the error is transient (like a network issue, if the background task is able to tell), I expect the background task to survive those errors.

djc (Collaborator) commented Jan 18, 2021

The behavior in the case of failing to associate seems wrong; see if #1356 improves the situation?

For the other one, is that the exact message? I cannot find where the "not an error" phrase would have come from.

LEXUGE (Contributor, Author) commented Jan 18, 2021

The behavior in the case of failing to associate seems wrong; see if #1356 improves the situation?

For the other one, is that the exact message? I cannot find where the "not an error" phrase would have come from.

It is not exact. The exact message (for DoH clients) is io_stream hit an error, shutting down: h2 stream errored: protocol error: not a result of an error. I suppose this is actually not an error, as it is part of the action h2 takes on CloseNotify.

LEXUGE (Contributor, Author) commented Jan 18, 2021

And for failing to associate, that PR doesn't seem to be the right behavior either, as I pointed out. Currently, if there are multiple AsyncClients and a single DnsExchangeBackground, and one of the AsyncClients sends a query but then goes away, the DnsExchangeBackground dies. However, with that PR, even if all the clients are dropped, the background task still carries on, which isn't quite right either.

Can we figure out how to let the background task exit if and only if all clients are dropped or some irrecoverable error is encountered (ideally letting all the clients know in that case)?

djc (Collaborator) commented Jan 18, 2021

That's not how I understand the change I made. It seems to me that the change only lets the task go on if sending a response to a receiver fails. However, the DnsExchangeBackground also polls the outbound_messages queue of requests, and if that queue is exhausted (outbound_messages.as_mut().poll_next(cx) returns Poll::Ready(None)) the underlying stream will shut down, and the DnsExchangeBackground will return Poll::Ready(Ok(())) once shutdown is complete. This should take care of the scenario you describe.

LEXUGE (Contributor, Author) commented Jan 18, 2021

Thanks, that seems right to me now.
