
tokio::fs + async is 1-2 orders of magnitude slower than a blocking version #3664

Open
artempyanykh opened this issue Mar 30, 2021 · 13 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs

Comments

@artempyanykh
Contributor

artempyanykh commented Mar 30, 2021

Version
1.4.0

Platform
64-bit WSL2 Linux:
Linux 4.19.104-microsoft-standard #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Description
The code is in this repo. The setup is explained in the README.

TL;DR:

  • Implement toy clone of du -hs with blocking and async APIs.
  • Blocking std::fs is about 35% slower than du, not bad.
  • An async version that uses tokio::fs but processes files sequentially is 64x (!) slower than the blocking version.
  • An async version that tries to do as many things concurrently as possible using FuturesUnordered and select! is 2.5x faster than the sequential version, but still 25x slower than a simple blocking version.

I understand that tokio::fs uses std::fs under the hood, that there's no non-blocking system API for filesystem operations (modulo io_uring, but 🤷‍♂️), and that async has inherent overhead, especially if the disk cache is hot and there's not much waiting on blocking calls.

However, 25x (never mind 64x) just feels like too extreme a slowdown, so I wonder

  1. if this is actually expected,
  2. or some tokio::fs code needs tuning/optimization,
  3. or something else entirely (wrong setup?)
@artempyanykh artempyanykh added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Mar 30, 2021
@Darksonn Darksonn added the M-fs Module: tokio/fs label Mar 30, 2021
@Darksonn
Contributor

I mean, we already know that it is never going to be as fast as using the blocking APIs directly. Did you try with a non-blocking std::fs::read_dir?

@artempyanykh
Contributor Author

@Darksonn

Did you try with a non-blocking std::fs::read_dir?

Sorry, not sure what you mean by non-blocking std::fs::read_dir. std::fs provides a blocking API.

My setup is described here in detail, with code and perf data.

I mean, we already know that it is never going to be as fast as using the blocking APIs directly.

There are several things at play here. First, there's overhead from async, then from the tokio::fs wrappers, but then there's a speed-up from the parallel processing of files in the async-par implementation.

In any case, a 25x to 64x slowdown from moving to tokio::fs + async compared to a blocking version is pretty extreme, isn't it? We're talking about the difference between 200ms (feels instant) and 12sec (feels like an eternity).

@Darksonn
Contributor

What I meant to suggest was to replace std::fs::read_dir in the linked code with tokio::fs::read_dir.

It is a big slowdown. There have been several examples of people building really slow benchmarks and then finding some trivial change to their code that yields a massive speedup, but those were all about reading the contents of files. I think ultimately you are just running into a lot of back-and-forth between a bunch of threads, and that is simply expensive.

@artempyanykh
Contributor Author

artempyanykh commented Mar 30, 2021

@Darksonn let me try to clarify. As I explained in the README.md, there are several branches, each with its own implementation:

  1. sync branch uses std::fs; it can be considered a baseline,
  2. async-seq branch uses tokio::fs (incl. tokio::fs::read_dir and tokio::fs::symlink_metadata) and does the processing sequentially (so option 1, but with tokio::fs and .await where necessary). This is 64x slower than option 1. The numbers are pretty much the same for both the single- and multi-threaded runtimes, and the number of context switches is huge for both runtimes too.
  3. async-par branch uses tokio::fs, but also does as many things concurrently as possible by utilising FuturesUnordered and select!.

If there is a trivial change to my code that can make, say, the async-seq version perform at least within a 2x margin of the sync version, I'd be more than happy to learn what it is 🙂

@artempyanykh
Contributor Author

artempyanykh commented Mar 31, 2021

I've done more testing on other platforms:

  • On macOS Big Sur the 'async-seq' version is ~3x slower than 'sync', but 'async-par' is ~15% faster than 'sync'.
  • On Windows 10 the 'async-seq' version is ~2.25x slower than 'sync', but 'async-par' is ~2x faster than 'sync' (makes sense, since my desktop has more cores than my MBP and can benefit more from 'async-par').

This means the issue is either Linux-specific (unlikely) or WSL2-specific (seems more likely). I don't have a native Linux box at hand to test this right now.

I also tried different versions of rustc (1.49, 1.50, 1.51) but observed similar behaviour.

@Darksonn
Contributor

I tried running it on my laptop which is a native Linux box, but async-par failed with "too many open files". Here are the others:

Benchmark #1: du -hs ~/src
  Time (mean ± σ):     813.7 ms ±  21.5 ms    [User: 249.6 ms, System: 557.6 ms]
  Range (min … max):   785.1 ms … 853.3 ms    10 runs
 
Benchmark #2: builds/sync ~/src
  Time (mean ± σ):     884.7 ms ±   8.9 ms    [User: 239.9 ms, System: 638.6 ms]
  Range (min … max):   871.0 ms … 896.5 ms    10 runs
 
Benchmark #3: builds/async-seq ~/src
  Time (mean ± σ):      5.603 s ±  0.059 s    [User: 2.810 s, System: 4.733 s]
  Range (min … max):    5.537 s …  5.735 s    10 runs

These were all built with --release, of course.

@artempyanykh
Contributor Author

Great, so async-seq is 6.3x slower, not 64x; that's reassuring! 🙂

Could you try to increase the nofile limit and try async-par again (e.g. ulimit -S -n 4096 may help)?

@Darksonn
Contributor

Sure.

Benchmark #1: builds/async-par ~/src
  Time (mean ± σ):      4.462 s ±  1.566 s    [User: 5.288 s, System: 7.233 s]
  Range (min … max):    2.740 s …  7.184 s    10 runs

@artempyanykh
Contributor Author

Thank you! async-par performs better, but not to the extent I hoped. Both async versions are quite slow (good that it's not 60x, but 6x is still a considerable slowdown).
I'm tempted to set up native Linux on my PC over the weekend and run it on the same set of files on Windows, WSL2, and native Linux to get an apples-to-apples comparison.

@Darksonn
Contributor

My main opinion on issues like this one is that if someone submits a PR that improves the speed of filesystem operations, I am happy to add those improvements (#3518 is an example), but it is not a sufficiently large priority for me to spend time looking for fixes myself. People who need speedups for their fs ops can already get them now by moving the operation into a single spawn_blocking call.

@artempyanykh
Contributor Author

@Darksonn that’s fair. To be clear, I don’t expect you to spend time diagnosing the issue and coming up with a fix; we all have different priorities, and that’s fine.

The way I see it, these perf characteristics are surprising at the very least, so creating an issue is like putting a stake in the ground to say “We’re aware of this”, and then maybe

  1. there will be an improvement PR: either someone stumbles upon this issue and comes up with an improvement, or I will dig deeper when I have spare time.
  2. Or we will confirm that for this sort of workload the perf hit is just inherent and there’s nothing fishy going on. In that case the good outcome would probably be a section in the docs, so that new users are at least aware.

However, I can also see that this type of issue may be seen as not directly actionable, which is totally fair. If that is the case for the tokio project, I’d be fine with closing the issue.

And in any case, I apologise for the inconvenience if I missed something about this in the guidelines.

@artempyanykh
Contributor Author

Updated benchmarks in https://github.com/artempyanykh/rdu:

  • Same machine, same set of files.
  • Ran on Native Linux, WSL2, and Native Windows;
  • On Linux with warm and cold disk cache:

On Windows the perf profile is very different from Linux; the naive async version is ~2.2x slower, which is kind of acceptable.
On native Linux with a warm disk cache the naive async version is 9x slower, and on WSL2 it's 55x slower.

@ahmedriza

ahmedriza commented Sep 21, 2023

This is referred to in the talk Java and Rust by Yishai Galatzer. They benchmarked Tokio's async fs operations and compared them with Java NIO.

IMO, it unfairly portrays Rust as being too slow compared to Java, which is of course not really true.
