
Optimizing latency? #82

Closed
dpc opened this issue Dec 20, 2020 · 14 comments

Comments

@dpc

dpc commented Dec 20, 2020

I was just recently investigating an issue caused by https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment and it occurred to me to investigate latency optimization in the project I use attohttpc in.

I think right now the library doesn't reuse the connection at all. Should there be a way to do it in the future or is it already out of scope of this library?

Since there's no connection reuse, is delayed ACK etc. a problem at all? Would it make sense for attohttpc to set (or allow setting) nodelay + use a large buffer and flush it at the end?

https://doc.rust-lang.org/std/net/struct.TcpStream.html#method.set_nodelay
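The linked knob can be applied to std's TcpStream directly; a minimal sketch (the connect_nodelay helper is hypothetical, not part of attohttpc):

```rust
use std::io;
use std::net::TcpStream;

/// Hypothetical helper: connect and disable Nagle's algorithm so that
/// small writes are sent immediately instead of being coalesced.
fn connect_nodelay(addr: &str) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?; // sets the TCP_NODELAY socket option
    Ok(stream)
}
```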

@dpc changed the title from Optimizing for latency? to Optimizing latency? on Dec 20, 2020
@sbstp
Owner

sbstp commented Dec 20, 2020

At the moment I don't have plans to implement connection pooling. I haven't looked at using TCP no delay either. It's possible that latency could be lowered by using a BufWriter and buffering the request line and headers. I'd be open to a change that implements this if latency is indeed improved by the change.
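The buffering idea can be sketched with std's BufWriter; the write_request_head helper below is illustrative (not attohttpc's actual code) and works over any Write sink:

```rust
use std::io::{self, BufWriter, Write};

/// Illustrative sketch: collect the request line and headers in a
/// BufWriter so they reach the socket in one write instead of one
/// system call per header.
fn write_request_head<W: Write>(sink: W, host: &str, path: &str) -> io::Result<()> {
    let mut w = BufWriter::new(sink);
    write!(w, "GET {} HTTP/1.1\r\n", path)?;
    write!(w, "Host: {}\r\n", host)?;
    write!(w, "Connection: close\r\n\r\n")?;
    w.flush() // everything buffered above leaves in a single write
}
```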

@dpc
Author

dpc commented Dec 20, 2020

I'm doing my own research here (though I used to know this better a long time ago), so take everything with a grain of salt.

Re: connection reuse, I totally understand not adding it; in many setups making a new connection is a ~1 ms affair, so it truly doesn't matter much.

TCP delaying the sending of data is a bigger problem. I am not sure whether attohttpc, the way it is being used now, is affected. I did see such delays in the past in other software.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/reducing_the_tcp_delayed_ack_timeout

It seems like the socket flags help on some setups. Without connection pooling, I would expect that simply shutting down the sending side of the socket with https://doc.rust-lang.org/std/net/struct.TcpStream.html#method.shutdown after all the data has been written would work as a flush. Maybe. I would hope.
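The shutdown-as-flush idea can be sketched like this (an assumption from the discussion, not attohttpc's implementation):

```rust
use std::io::{self, Write};
use std::net::{Shutdown, TcpStream};

/// Sketch: once the whole request has been written, shutting down the
/// send half signals EOF to the peer and pushes out any data the
/// kernel was still holding; the read half stays open for the response.
fn finish_request(stream: &mut TcpStream, data: &[u8]) -> io::Result<()> {
    stream.write_all(data)?;
    stream.shutdown(Shutdown::Write) // no more writes will follow
}
```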

Using a buffered writer would potentially save on syscalls, and combined with nodelay would fix this hypothetical issue.

It seems like other HTTP client libraries do set nodelay:

@dpc
Author

dpc commented Dec 20, 2020

Also an interesting read: https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable , though considering that after shutdown the client still reads the response, that read should take care of all of this.

@sbstp
Owner

sbstp commented Dec 29, 2020

Would an option to enable tcp no delay be okay with you? We let the users choose what best fits their use case.

@adamreichold
Contributor

Would an option to enable tcp no delay be okay with you? We let the users choose what best fits their use case.

Personally, I think this is a compromise at best as disabling Nagle's algorithm also means sending out half-full packets when the default buffer size (the 8K used by std::io::BufWriter in our case) does not match the TCP segment size (which is dynamically adjusted).

I think using TCP_CORK would be what we want here: set it before starting to write out the request using buffered writes (the buffering only decreases system call overhead). If our request is large, we'll send as many full packets as we can while pushing it into the kernel. (Nagle will not delay full packets.) After we are done, uncork the socket so any pending data gets sent immediately (even if that means the last packet is half-full, but that cannot be helped).

As we currently use one connection per request, i.e. no pipelining is possible, corking and uncorking will never be wrong and we would not need to expose an option. The main downside is that it is a Linux-specific API. Another issue is that it requires unsafe code to call into libc::setsockopt directly AFAIK. My initial testing suggests that this can reduce latency measurably, especially its variance.

(Setting TCP_NODELAY exactly before the last write system call, i.e. before flushing the BufWriter could achieve the same effect AFAIU, but this has one pathological case: If we just wrote a full buffer and it is empty, there will be no more system call (with more than zero bytes) and hence the last buffers written might still be delayed. But that would require that the full request length is a multiple of 8K bytes. Will try that next...)

@adamreichold
Contributor

adamreichold commented Dec 29, 2020

It is hard to measure this using a real network due to noise, but it seems like using TCP_NODELAY is almost always as good as using TCP_CORK, without the portability issues.

I would therefore suggest unconditionally calling set_nodelay(true) right before calling writer.flush(), instead of adding an option to enable it.

The edge case of the full request length hitting a multiple of the buffer size exactly seems unfortunate, but I am not sure fixing that is worth the downsides of integrating TCP_CORK even though that always does exactly the right thing.
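A sketch of that suggestion, assuming the writer is a std BufWriter wrapping the TcpStream (flush_nodelay is a hypothetical name, not an attohttpc API):

```rust
use std::io::{self, BufWriter, Write};
use std::net::TcpStream;

/// Enable TCP_NODELAY only for the final flush: earlier 8 kB writes can
/// still be coalesced into full segments by Nagle's algorithm, but the
/// trailing partial segment goes out immediately.
fn flush_nodelay(w: &mut BufWriter<TcpStream>) -> io::Result<()> {
    w.get_ref().set_nodelay(true)?; // reach the TcpStream inside the BufWriter
    w.flush()
}
```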

@adamreichold
Contributor

adamreichold commented Dec 29, 2020

Ok, I have done some more measurements using a container running kennethreitz/httpbin with

tc qdisc add dev eth0 root netem delay 1s

to enforce a 1s delay before any response packet is sent.

At least running kernel 5.10, I see no measurable effect of using TCP_NODELAY or TCP_CORK when using a std::io::BufWriter with its default 8kB buffer size. At least for our one connection per request, the kernel seems to have sufficient heuristics to do the right thing (e.g. driven by us doing our first read immediately after the last write).

So at least with contemporary kernels, it does not seem worth it to try to second guess the TCP stack. The interaction of Nagle's algorithm and delayed acknowledgements does not seem to be an issue either as the first segment is never delayed but our request line is always the first segment as we do not reuse connections.

@dpc Could you describe a specific setup which one could reproduce to investigate your initial issue?

P.S.: So concerning your initial questions:

Since there's no connection reuse, is delayed ACK etc. a problem at all?

As written above, I think this is not a problem at all indeed.

Would it make sense for attohttpc to set (or allow setting) nodelay + use a large buffer and flush it at the end?

It's possible that latency could be lowered by using a BufWriter and buffering the request line and headers.

We already use Rust's default 8 kB buffer and flush it at the end. Setting TCP_NODELAY immediately would only make things worse as maximum segment size and buffer size can be mismatched. I think that setting TCP_NODELAY only before the last write might be useful on older kernels but seems unnecessary on contemporary ones.

P.P.S.: One interesting effect happens when data is larger than the internal buffer. Rust's std::io::BufWriter will avoid copying those large writes into its internal buffer and flushes the request line and headers first instead to then directly write to the socket in 8 kB increments. This means when the body is really large, we can end up sending the request line and headers in a separate packet from the first part of the body. But then again, I do not see how this would imply any additional delays as long as the maximum segment size is below 8 kB as the second write happens immediately after the first with no chance of blocking. It will add the overhead of an additional packet which TCP_NODELAY will not avoid as it will not delay the first segment, but this should not matter if the body is larger than 8 kB. Using TCP_CORK does resolve this properly, but it seems hardly worth the integration issues for a library that does one request per connection.
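The bypass behaviour described above can be observed with a small recording sink (the Recorder type and chunk_sizes helper are illustrative, not attohttpc code):

```rust
use std::io::{self, BufWriter, Write};

/// A sink that records the size of each underlying write call, to show
/// how std's BufWriter forwards writes larger than its buffer directly.
#[derive(Debug)]
struct Recorder {
    chunks: Vec<usize>,
}

impl Write for Recorder {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.chunks.push(buf.len());
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn chunk_sizes(head: &[u8], body: &[u8], cap: usize) -> Vec<usize> {
    let mut w = BufWriter::with_capacity(cap, Recorder { chunks: Vec::new() });
    w.write_all(head).unwrap(); // small: goes into the internal buffer
    w.write_all(body).unwrap(); // larger than cap: buffer is flushed first,
                                // then the body bypasses it entirely
    w.into_inner().unwrap().chunks
}
```

With an 8-byte buffer, a 5-byte head followed by a 20-byte body produces two separate underlying writes of 5 and 20 bytes, mirroring the headers-then-body split described above.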

@sbstp
Owner

sbstp commented Dec 29, 2020

Thanks for investigating. Behavior might be different on macOS/Windows, so it might be worth setting TCP_NODELAY there. Since we use a BufWriter, it would probably be fine to do it. But I'm happy to let the OS handle it; kernels surely have optimizations for this case.

@adamreichold
Contributor

it might be worth setting TCP_NODELAY for those. Since we use a BufWriter it would probably be fine to do it.

But only before the last write system call then? Internal buffer size and maximum segment size not matching up looks like an issue with how TCP works to me, not with the implementation of TCP that is part of the Linux kernel.

@sbstp
Owner

sbstp commented Dec 29, 2020

I'm not sure I understand the issue with the maximum segment size and the BufWriter. The BufWriter buffers up to 8 kB, so when we flush, it should always result in full segments except for the last one.

I'm mostly thinking of how we write the headers here. I haven't checked how we write the body.

@adamreichold
Contributor

adamreichold commented Dec 29, 2020

I'm mostly thinking of how we write the headers here. I haven't checked how we write the body.

Everything goes through the same BufWriter meaning we flush request line, headers and if applicable the beginning of the body together. If your request is small, this is good as the complete request including the body will be sent in a single packet. If your request is large, this is not an issue as the first packet will still contain request line and headers.

I'm not sure I understand the issue with the maximum segment size and the BufWriter. The BufWriter buffers up to 8kb, so when we flush it should always result in full segments except for the last one.

The problem is that usually each write system call (say, handing over 8 kB of data to the kernel) will result in multiple packets being sent, each up to the maximum segment size (MSS). But 8 kB is usually not a multiple of the MSS, so each write will produce several full segments and one less-than-full segment.

If TCP_NODELAY is set, this last less-than-full segment will be sent immediately instead of being filled up by the next write system call. If it is not set, the next 8 kB will first fill up that segment and then produce multiple full segments and possibly one less-than-full segment.

Only for the very last write system call, we want this last less-than-full segment to be sent immediately. Because we know that we will not write anymore data. (Which is also why we would want to "uncork" the socket after that last write.)

Maybe as an illustration where W means write system call issued, F means full segment sent and L means less-than-full segment sent, writing a large request with TCP_NODELAY set might look like

W       W       W
F F F L F F F L F L

whereas without it, we will get

W     W     W
F F F F F F F L

@sbstp
Owner

sbstp commented Dec 29, 2020

Ah yes, thank you. I see the issue.

Best to just leave it like it is now then. Like you said before, the kernels should be smart enough to flush the nagle buffer once we try reading from the socket.

@dpc
Author

dpc commented Jan 2, 2021

@dpc Could you describe a specific setup which one could reproduce to investigate your initial issue?

Just to be clear - I have no issue with my Rust project. We've been suspecting an issue in another project (one that was reusing a plain TCP connection for multiple requests), and it made me think about the Rust project in which I'm using attohttpc, where latency is also somewhat important (but we never did any measurements, nor was latency ever an issue).

So, I just wanted to put this into consideration.

Thanks! If there are no actionable items, we might just close this issue.

@adamreichold
Contributor

So yes, I think we should close this for now. The whole question needs reconsideration if we start to reuse connections.

@sbstp sbstp closed this as completed Jan 3, 2021