
Replace tokio-timer with futures-timer for higher timer resolution #5

Merged
merged 1 commit into mre:master on Nov 18, 2019

Conversation

doyoubi
Contributor

@doyoubi doyoubi commented Oct 23, 2019

The current timer resolution is too low for some use cases: #4

This PR replaces tokio-timer with futures-timer for higher resolution.
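
For context, here is a minimal sketch of the kind of swap involved at the call site (my own illustration, assuming the futures 0.1-era APIs tokio_timer::sleep and futures_timer::Delay::new; this is not the crate's actual code):

use std::time::Duration;

// Both timers expose a future that resolves after a duration, so the batching
// code can race "buffer is full" against "deadline has passed".
use futures::Future; // futures 0.1: Item/Error instead of Output

// Before: tokio-timer's delay, driven by the runtime's timer wheel.
fn sleep_old(d: Duration) -> impl Future<Item = (), Error = tokio_timer::Error> {
    tokio_timer::sleep(d)
}

// After: futures-timer 0.1's delay, driven by futures-timer's own timer thread.
fn sleep_new(d: Duration) -> impl Future<Item = (), Error = std::io::Error> {
    futures_timer::Delay::new(d)
}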

I still need to compare their implementations to be clear about the following questions:

  • (1) Why doesn't tokio-timer support a higher-resolution timer?
  • (2) Are there any problems with futures-timer providing higher resolution than tokio-timer?
  • (3) What's the resolution of futures-timer?

Cargo.toml Outdated
@@ -9,7 +9,7 @@ edition = "2018"
 [dependencies]
 tokio = "0.1.14"
 futures = "0.1.25"
-tokio-timer = "0.2.8"
+futures-timer = "0.1"
Contributor Author

Any higher version of futures-timer requires switching from futures::Future to std::future::Future.
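
For reference, a simplified sketch of the two trait shapes in question (not the real definitions, just the parts that force the migration):

use std::pin::Pin;
use std::task::{Context, Poll};

// futures 0.1: poll takes &mut self, and a future has both an Item and an
// Error type; readiness is expressed as Result<Async<Item>, Error>.
mod futures01_shape {
    pub enum Async<T> {
        Ready(T),
        NotReady,
    }
    pub trait Future {
        type Item;
        type Error;
        fn poll(&mut self) -> Result<Async<Self::Item>, Self::Error>;
    }
}

// std::future: poll takes Pin<&mut Self> plus a task Context, and there is a
// single Output type with no built-in error.
mod std_future_shape {
    use super::{Context, Pin, Poll};
    pub trait Future {
        type Output;
        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
    }
}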

@doyoubi
Contributor Author

doyoubi commented Oct 25, 2019

(1) Why doesn't tokio-timer support a higher-resolution timer?
https://tokio.rs/blog/2018-03-timers/

(2) Are there any problems with futures-timer providing higher resolution than tokio-timer?
For waiting, futures-timer uses thread::park_timeout while tokio-timer uses Park::park_timeout; neither gives any guarantee on resolution. In practice, though, futures-timer is likely to have higher resolution.

(3) What's the resolution of futures-timer?
It should depend on the underlying OS. In my Linux environment, it has a granularity of about 10 microseconds.
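
As a rough, std-only way to eyeball that number on a given machine (my own sketch, not part of this PR), one can measure how far thread::park_timeout, the primitive the timer thread waits on, overshoots a tiny timeout:

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let requested = Duration::from_micros(20);
    let iterations: u32 = 1000;
    let mut total_overshoot = Duration::from_secs(0);

    for _ in 0..iterations {
        let start = Instant::now();
        thread::park_timeout(requested);
        // park_timeout may also return early on spurious wakeups,
        // so clamp the overshoot at zero.
        total_overshoot += start
            .elapsed()
            .checked_sub(requested)
            .unwrap_or(Duration::from_secs(0));
    }

    println!("average overshoot: {:?}", total_overshoot / iterations);
}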

For my use case of batching syscalls, simply replacing the timer with futures-timer does not work.
When the duration is around 20 microseconds, the Delay triggers lots of condition variable calls that ruin the performance.
My solution is to use two timers with different timeout parameters, which may be too specific to my use case.

@mre
Owner

mre commented Oct 25, 2019

Thanks for the summary.

When the duration is about 20 microseconds, the Delay will trigger lots of conditional variable calls that ruin the performance.

I'd be interested in a more detailed explanation of this.
Did you benchmark that? What are the condition variable calls exactly? Is there a code snippet I could look at? Are the maintainers aware, or shall we create an issue to discuss possible options on their side?

My solution is to use two timers with different timeout parameters, which may be too specific to my use case.

Glad you found a solution. If you'd like to continue working on your WIP, you'd be more than welcome to do so. A higher-resolution timer might be beneficial in many other cases.

@doyoubi
Contributor Author

doyoubi commented Oct 28, 2019

I'll post more details later on the performance problem of a high-resolution timer with a low timeout.

Did you benchmark that?
What are the conditionals exactly?

I used the modified tokio-batch in my Redis proxy, which is a layer 7 network protocol proxy, and benchmarked it. In the sampled flamegraph I can see a bunch of condition variable calls that probably come from the tokio runtime's implementation of task wake-ups.

Is there a code snippet I could look at?

I will write some sample code to reproduce it.

Are the maintainers aware or shall we create an issue to discuss possible options on their side?

This may be a common problem for any coroutine runtime. We encountered this problem before in the Golang runtime as well. A low timeout value triggers lots of wake-up operations, resulting in a bunch of pthread_cond_wait or pthread_cond_timedwait calls in the runtime. I think it's more of a usage problem on the user's side.

@mre
Owner

mre commented Oct 28, 2019

You've been running the benchmarks in release mode, right? (Just to be sure. 😉)

I might be wrong, but I think futures-timer uses parking_lot under the hood, right?
I think that's the pthread_cond_wait you're talking about.
There's also a discussion about Futures support here.

Thanks for your investigation. I'd be really curious about your writeup. You could also raise this on the user forum or the subreddit if you like.

@doyoubi
Contributor Author

doyoubi commented Oct 28, 2019

You've been running the benchmarks in release mode or? (Just to be sure. 😉)

Yes, it's in release mode.

Might be wrong, but I think futures-timer uses parking_lot under the hood, right?

It seems that 0.1 does not depend on the parking_lot crate: see the Cargo.toml of 0.1.

@doyoubi
Contributor Author

doyoubi commented Nov 2, 2019

  • Remove tokio-timer so we don't need futures01 anymore.
  • Add an additional lightweight passive timer to optimize the case where the timeout duration needs to be very small (see the sketch below).

TODO:

  • More explanation
  • Benchmark
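
Here is a minimal sketch of one way to read "lightweight passive timer" (the name PassiveDelay and the design are mine, not necessarily what this PR implements): a delay that never registers a wakeup and only reports expiry when the task happens to be polled anyway, e.g. because new items arrived on the batched stream.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

/// Hypothetical passive delay: cheap because it never arms a real timer,
/// but it only fires when something else polls the task.
struct PassiveDelay {
    deadline: Instant,
}

impl PassiveDelay {
    fn new(after: Duration) -> Self {
        PassiveDelay {
            deadline: Instant::now() + after,
        }
    }
}

impl Future for PassiveDelay {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if Instant::now() >= self.deadline {
            Poll::Ready(())
        } else {
            // Deliberately no waker registration: expiry is detected lazily,
            // so a "real" timer (e.g. futures-timer) is still needed as an
            // upper bound on latency.
            Poll::Pending
        }
    }
}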

@doyoubi
Contributor Author

doyoubi commented Nov 9, 2019

It seems that in the new tokio runtime, the performance problem with pthread_cond_wait no longer exists!
All we need to do here is replace tokio-timer with futures-timer.

Here's an example: tokio-batch-benchmark.
I wrote a simple Redis server with my customized tokio-batch so that we can easily use redis-benchmark to measure the result.
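
Conceptually, the server batches replies roughly like the following sketch (assuming the chunks_timeout extension method shown in tokio-batch's README and a futures 0.3 executor; the actual benchmark code differs):

use std::time::Duration;

use futures::executor::block_on;
use futures::stream::{self, StreamExt};
use tokio_batch::ChunksTimeoutStreamExt;

fn main() {
    block_on(async {
        // Pretend each item is one "+PONG\r\n" reply waiting to be written.
        let replies = stream::iter(0..10).map(|_| b"+PONG\r\n".to_vec());

        // Flush whenever 4 replies are buffered or 20 microseconds elapse,
        // whichever comes first.
        let batches = replies.chunks_timeout(4, Duration::from_micros(20));
        futures::pin_mut!(batches);

        while let Some(batch) = batches.next().await {
            // In the real server this becomes a single write (sendto) call.
            println!("flushing {} replies in one write", batch.len());
        }
    });
}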

Build the example

$ git clone https://github.com/doyoubi/tokio-batch-benchmark.git
$ cd tokio-batch-benchmark
$ make

Benchmark without batching

# Run the server without batching
$ make nobatch

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 230771.00 requests per second
PING_BULK: 232780.09 requests per second

Benchmark with just one timer driven by futures-timer

# Run the server with just a timer
$ make max

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 1338329.75 requests per second
PING_BULK: 1335470.12 requests per second

Benchmark with two timers, a light-weight timer and a timer driven by futures-timer

# Run the server with the light-weight timer
$ make minmax

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 1498351.88 requests per second
PING_BULK: 1368176.25 requests per second

Adding the lightweight timer makes no significant difference.

By the way:

  • I've found a performance problem in SplitSink of futures-rs which disables batching: issue. This example uses a customized SplitSink to work around it. (UPDATE: it has been fixed, but not released yet.)
  • It's even 2 times faster than Redis on my Mac.

@doyoubi doyoubi changed the title from "[WIP] Replace tokio-timer with futures-timer for higher timer resolution" to "Replace tokio-timer with futures-timer for higher timer resolution" on Nov 9, 2019
@mre
Owner

mre commented Nov 9, 2019

Wow. That's great news! Looking forward to reviewing and merging this. I'm at RustFest at the moment, so it'll take some time until I get to it, but I'll try my best to get it in.

@mre
Owner

mre commented Nov 17, 2019

For comparison, here are my numbers:

# make nobatch
PING_INLINE: 192927.28 requests per second
PING_BULK: 193031.56 requests per second

# make max
PING_INLINE: 213867.14 requests per second
PING_BULK: 216104.08 requests per second

# make minmax
PING_INLINE: 249731.55 requests per second
PING_BULK: 250237.72 requests per second

(Macbook Pro Mid 2015, 2.2 GHz Intel Core i7, 16 GB RAM)

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

How do you run redis-benchmark? It needs -P <pipeline number larger than 1> to enable pipelining:

$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 3

Have you pulled the latest code of tokio-batch-benchmark?

Try running this to check whether there is any error message:

$ redis-benchmark -e -n 10000000 -t ping -c 32 -P 3

I will also check it on other computers.

@mre
Owner

mre commented Nov 17, 2019

Yup, just like you described:

redis-benchmark -q -n 10000000 -t ping -c 32 -P 3

(In all three cases.)

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

Oh, my fault, it's -P 32:

$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

It needs higher throughput to see the difference.
The trailing 2 got cut off from those two -P 32 flags when I pasted them.

@mre
Owner

mre commented Nov 17, 2019

Ah right. Makes more sense now; although I'm confused about nobatch.

# make nobatch
PING_INLINE: 250871.78 requests per second
PING_BULK: 246761.25 requests per second

# make max
PING_INLINE: 1704448.62 requests per second
PING_BULK: 1707941.88 requests per second

# make minmax
PING_INLINE: 1661405.50 requests per second
PING_BULK: 1570598.38 requests per second

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

  • nobatch means not using tokio-batch at all.
  • max has the same implementation as this pull request.
  • minmax adds another lightweight timer to reduce the number of pthread_cond_wait calls and optimize further, but the result is almost the same as max.

@mre
Owner

mre commented Nov 17, 2019

But why is nobatch so much slower than the others on my machine? Shouldn't it be faster or at least equal?

@doyoubi
Contributor Author

doyoubi commented Nov 18, 2019

Well, if we set the batching timeout to a large number, it will be slower. But here the timeout is 20 microseconds, which can even improve throughput.

Let's ignore the lightweight-timer case first. Using tokio-batch on the sendto syscall (the max case) should achieve higher throughput than not using tokio-batch at all (the nobatch case).

The key point of batching syscalls is that sending more data per syscall does not significantly increase its latency: sending 16 bytes takes almost the same time as sending 8 bytes. So the more data we pack into each sendto syscall, the more throughput we get.
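
To make the principle concrete, here is a purely illustrative, plain-std sketch (not tokio-batch's implementation; the batch size and timeout are arbitrary examples): buffer replies and issue one write for the whole batch instead of one write per reply.

use std::io::{self, Write};
use std::net::TcpStream;
use std::time::{Duration, Instant};

// Flush `replies` to `stream`, batching up to `max_batch` replies or
// `max_delay` of waiting, whichever limit is hit first.
fn flush_batched(
    stream: &mut TcpStream,
    replies: impl Iterator<Item = Vec<u8>>,
    max_batch: usize,
    max_delay: Duration,
) -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    let mut batched = 0;
    let mut deadline = Instant::now() + max_delay;

    for reply in replies {
        buf.extend_from_slice(&reply);
        batched += 1;
        if batched >= max_batch || Instant::now() >= deadline {
            stream.write_all(&buf)?; // one sendto for the whole batch
            buf.clear();
            batched = 0;
            deadline = Instant::now() + max_delay;
        }
    }
    if !buf.is_empty() {
        stream.write_all(&buf)?; // flush whatever is left
    }
    Ok(())
}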

On macOS, we can use dtrace to collect some statistics on syscall behavior.

Nobatch case

We can collect the number of syscalls and the data sent by sendto while benchmarking:

# sudo dtrace -n 'syscall::sendto:entry { @num["count"] = count(); @bytes["bytes"] = sum(arg2); } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

dtrace: description 'syscall::sendto:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  18373                        :tick-10s

  count                                                       1454801
  bytes                                                      10184009

Within 10 seconds, it triggered 1454801 sendto calls and sent 10184009 bytes in total, which is about 7 bytes per syscall on average.

We can also collect the syscall latency:

# sudo dtrace -n 'syscall::sendto:entry { self->ts = timestamp; } syscall::sendto:return /self->ts/ { @ns["latency"] = quantize(timestamp - self->ts); self->ts = 0; } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

CPU     ID                    FUNCTION:NAME
  2  18373                        :tick-10s

  latency
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@                         33097
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  48365
            8192 |@                                        2284
           16384 |                                         731
           32768 |                                         221
           65536 |                                         42
          131072 |                                         13
          262144 |                                         12
          524288 |                                         5
         1048576 |                                         2
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         0

Most sendto syscalls spend around 2048 to 4096 nanoseconds inside the kernel.

Batching with timers

Now collect the same data for the case using tokio-batch:

# sudo dtrace -n 'syscall::sendto:entry { @num["count"] = count(); @bytes["bytes"] = sum(arg2); } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

dtrace: description 'syscall::sendto:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  18373                        :tick-10s

  count                                                        341701
  bytes                                                      76533848

Within 10 seconds, we have fewer syscalls (341701 compared to the previous 1454801) but higher throughput (76533848 bytes compared to 10184009). Since the size of a Redis PING packet is fixed, this matches the roughly 7x increase in QPS we saw with redis-benchmark. On average, we send 223.9 bytes per syscall this time.

And the latency of sendto didn't increase much.

# sudo dtrace -n 'syscall::sendto:entry { self->ts = timestamp; } syscall::sendto:return /self->ts/ { @ns["latency"] = quantize(timestamp - self->ts); self->ts = 0; } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

CPU     ID                    FUNCTION:NAME
  2  18373                        :tick-10s

  latency
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        1096
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      68887
            8192 |@@@@                                     8056
           16384 |@                                        1143
           32768 |                                         279
           65536 |                                         53
          131072 |                                         26
          262144 |                                         20
          524288 |                                         11
         1048576 |                                         3
         2097152 |                                         1
         4194304 |                                         0
         8388608 |                                         2
        16777216 |                                         0

The flamegraph I generated in my Linux test environment shows that 88% of the time in this fake Redis server is spent on syscalls. It keeps firing sendto syscalls, so the more data we can process in each syscall, the higher the throughput.

@mre
Copy link
Owner

mre commented Nov 18, 2019

That makes a lot of sense. Thanks for the very detailed explanation and for creating this pull request in general. I'll merge it now. 🎉 I'll publish a new version right after, but we should also bump the version once your fix for SplitSink has landed.

@mre mre merged commit 661b5b4 into mre:master Nov 18, 2019
@doyoubi
Contributor Author

doyoubi commented Nov 18, 2019

Thanks for creating this crate! This is very helpful for optimizing most network services. I'm so glad to work on it.
