
Bad I/O overhead - related to Mutex in File? see #2930 #2994

Open
stsydow opened this issue Oct 19, 2020 · 7 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. E-help-wanted Call for participation: Help is requested to fix this issue. M-fs Module: tokio/fs

Comments

@stsydow
Contributor

stsydow commented Oct 19, 2020

File I/O overhead has increased by roughly a factor of three since the 0.3 release (on AArch64, Linux kernel v5.7).
I suspect the Mutex protecting the file that was introduced in 12bba6f / #2930.

I noticed that for my application the number of sys_enter_futex syscalls increased from 15 (in v0.1) to 248,792 (in v0.3),
and I am still searching for a better solution.

Benchmarks

use tokio::runtime::Runtime;
use futures::future::{self, FutureExt};
use futures::TryFutureExt;
use futures::stream::{self, Stream, StreamExt};

// Requires nightly Rust with `#![feature(test)]` and `extern crate test;` at the crate root.
use test::Bencher;
use test::black_box;

const BLOCK_COUNT: usize = 1_000;
const BUFFER_SIZE: usize = 4096;

  #[bench]
  fn async_read_codec(b: &mut Bencher) {
      use tokio_util::codec::{BytesCodec, FramedRead, /*FramedWrite*/};
      use tokio::fs::{File};
      let runtime = Runtime::new().expect("can not start runtime");
  
      b.iter(|| {
          let file_future = File::open("/dev/zero");
          let file: tokio::fs::File = runtime
                  .block_on(file_future)
                  .expect("Can't open input file.");
        let input_stream = FramedRead::with_capacity(file, BytesCodec::new(), BUFFER_SIZE);
          let task = input_stream.take(BLOCK_COUNT).for_each(|_i| async {()});
          runtime.block_on(task);
      });
  
  }

  #[bench]
  fn async_read(b: &mut Bencher) {
      use tokio::fs::{File};
      use tokio::io::{AsyncReadExt};
      let runtime = Runtime::new().expect("can not start runtime");
  
      b.iter(|| {
          let task = File::open("/dev/zero")
              .then(|r_file| async move {
                  let mut file = r_file.expect("Can't open input file.");
                  let mut buffer = [0u8; BUFFER_SIZE];
  
                  for _i in 0 .. BLOCK_COUNT {
                      let count = file.read(&mut buffer).await
                          .expect("Can't read file.");
                      if count == 0 { break; }
                  }
              });
          runtime.block_on(task);
      });
  
  }

 #[bench]
  fn sync_read(b: &mut Bencher) {
      use std::fs::File;
      use std::io::Read;
  
      b.iter(|| {
          let mut file = File::open("/dev/zero").expect("Unable to open file");
  
          let mut buffer = [0u8; BUFFER_SIZE];
  
          for _i in 0 .. BLOCK_COUNT {
              file.read_exact(&mut buffer)
                  .expect("err reading file");
          }
      });
  
  }
 
 #[bench]
 fn async_read_std_file(b: &mut Bencher) {
     use std::fs::File;
     use std::io::Read;
     let runtime = Runtime::new().expect("can not start runtime");
 
     b.iter(|| {
         let mut file = Box::pin(File::open("/dev/zero").expect("Unable to open file"));
 
         let input_stream = stream::unfold(file.as_mut(), |mut file| async move {
             let mut buffer = [0u8; BUFFER_SIZE];
 
             let file = tokio::task::block_in_place(move || {
                 file.read_exact(&mut buffer)
                     .expect("err reading file");
                 file
             });
 
             let next_state = file;
             let yielded = buffer;
             Some( (yielded, next_state) )
         });
 
         let task = input_stream.take(BLOCK_COUNT).for_each(|_i| async {()});
         runtime.block_on(task);
     });
 
 }

Results

$ RUSTFLAGS="-C force-frame-pointers -C target-cpu=native" cargo bench

test tests::async_read ... bench: 21,046,747 ns/iter (+/- 312,690)
test tests::async_read_codec ... bench: 20,161,426 ns/iter (+/- 307,704)
test tests::async_read_std_file ... bench: 6,429,688 ns/iter (+/- 15,344)
test tests::sync_read ... bench: 1,852,882 ns/iter (+/- 4,623)

$ RUSTFLAGS="-C` force-frame-pointers -C target-cpu=native" perf stat -e syscalls:sys_enter_futex,syscalls:sys_enter_read,syscalls:sys_enter_sigaltstack,syscalls:sys_enter_sched_yield,syscalls:sys_enter_munmap,syscalls:sys_enter_mprotect,syscalls:sys_enter_mmap cargo bench tests::sync_read

Performance counter stats for 'cargo bench tests::sync_read':

           65      syscalls:sys_enter_futex                                    
      301,889      syscalls:sys_enter_read                                     
          248      syscalls:sys_enter_sigaltstack                                   
            1      syscalls:sys_enter_sched_yield                                   
          160      syscalls:sys_enter_munmap                                   
          186      syscalls:sys_enter_mprotect                                   
          212      syscalls:sys_enter_mmap                                     

 0.832083750 seconds time elapsed

 0.155121000 seconds user
 0.678299000 seconds sys

$ RUSTFLAGS="-C force-frame-pointers -C target-cpu=native" perf stat -e syscalls:sys_enter_futex,syscalls:sys_enter_read,syscalls:sys_enter_sigaltstack,syscalls:sys_enter_sched_yield,syscalls:sys_enter_munmap,syscalls:sys_enter_mprotect,syscalls:sys_enter_mmap cargo bench tests::async_read_codec

Performance counter stats for 'cargo bench tests::async_read_codec':

    1,808,958      syscalls:sys_enter_futex                                    
      301,892      syscalls:sys_enter_read                                     
          539      syscalls:sys_enter_sigaltstack                                   
          289      syscalls:sys_enter_sched_yield                                   
          385      syscalls:sys_enter_munmap                                   
          476      syscalls:sys_enter_mprotect                                   
          501      syscalls:sys_enter_mmap                                     

  9.696616310 seconds time elapsed

  3.239587000 seconds user
  9.045317000 seconds sys
@stsydow stsydow added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Oct 19, 2020
@Darksonn Darksonn added the M-fs Module: tokio/fs label Oct 19, 2020
@Darksonn
Contributor

I'd be happy to accept PRs with an improved implementation, but it's not my priority at the moment. File IO has never been async/await's strong suit.

@Darksonn
Contributor

Having another look, this does not call any of the methods that lock the mutex from the linked issue. That mutex can't be the cause, though I don't immediately know what else it could be.

@stsydow
Contributor Author

stsydow commented Oct 20, 2020

I have integrated the benchmarks into tokio/benches and can provide a PR later this evening.

@Darksonn
Contributor

You say that you tested on 0.1 and 0.3. Did you try 0.2?

@carllerche
Member

Has there been a performance regression from 0.2 -> 0.3? I don't believe 0.1 had any FS apis.

Incidentally, given how FS ops work, the best strategy for optimal performance right now is going to be to spawn blocking tasks that batch FS work.
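
For reference, a minimal sketch of that batching approach (not from this issue; the path, buffer size and block count are purely illustrative): the whole read loop runs inside one spawn_blocking call, so the async runtime is involved once per batch instead of once per read.

use std::io::Read;

// Sketch: batch all reads for one file inside a single blocking task
// instead of awaiting each read() on the async runtime.
async fn read_blocks_batched() -> std::io::Result<usize> {
    tokio::task::spawn_blocking(|| -> std::io::Result<usize> {
        let mut file = std::fs::File::open("/dev/zero")?;
        let mut buffer = [0u8; 4096];
        let mut total = 0;
        for _ in 0..1_000 {
            total += file.read(&mut buffer)?;
        }
        Ok(total)
    })
    .await
    .expect("blocking task panicked")
}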

@stsydow
Contributor Author

stsydow commented Oct 20, 2020

Good point, it had already happened in v0.2.

see v0.2.x...stsydow:fs-benchmarks-v0.2

test async_read_buf ... bench: 19,405,485 ns/iter (+/- 15,516,299)
test async_read_codec ... bench: 19,605,318 ns/iter (+/- 12,464,927)
test async_read_std_file ... bench: 3,504,704 ns/iter (+/- 6,423)
test sync_read ... bench: 1,847,179 ns/iter (+/- 4,318)

Here are the profiling results as callgraph:
https://github.com/stsydow/tokio/blob/fs-benchmarks-v0.2/async-read.svg

v0.1 for reference:
v0.1.x...stsydow:fs-benchmarks-v0.1

test async_read (_buf) ... bench: 2,304,018 ns/iter (+/- 16,327)
test async_read_codec ... bench: 3,430,182 ns/iter (+/- 19,466)
test sync_read ... bench: 1,847,927 ns/iter (+/- 4,205)

@stsydow
Contributor Author

stsydow commented Oct 28, 2020

I now suspect it is a problem with the runtime rather than with file I/O directly.
For each read operation we see one read system call and four futex calls.

Here is a minimal standalone example for profiling:

use tokio::fs::File;
use tokio::io::AsyncReadExt;

fn rt() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_current_thread()
    //tokio::runtime::Builder::new_multi_thread().worker_threads(2)
        .build()
        .unwrap()
}

const BLOCK_COUNT: usize = 100_000;

const BUFFER_SIZE: usize = 4096;
const DEV_ZERO: &'static str = "/dev/zero";

pub fn main() -> () {
    let rt = rt();
    let task = || async {
        let mut file = File::open(DEV_ZERO).await.unwrap();
        let mut buffer = [0u8; BUFFER_SIZE];

        for _i in 0..BLOCK_COUNT {
            let count = file.read(&mut buffer).await.unwrap();
            if count == 0 {
                break;
            }
        }
    };

    rt.block_on(task());
}

$ perf stat -r3 -e syscalls:sys_enter_futex,syscalls:sys_enter_read,context-switches,cycles:k,cycles:u,instructions,task-clock ./target/release/fs_standalone

Performance counter stats for './target/release/fs_standalone' (3 runs):

       400,007      syscalls:sys_enter_futex  #    0.062 M/sec                    ( +-  0.00% )
       100,007      syscalls:sys_enter_read   #    0.015 M/sec
       200,008      context-switches          #    0.031 M/sec                    ( +-  0.00% )
 7,105,184,825      cycles:k                  #    1.095 GHz                      ( +-  0.04% )
 5,044,722,992      cycles:u                  #    0.777 GHz                      ( +-  0.40% )
 2,215,426,983      instructions                                                  ( +-  0.07% )
      6,491.24 msec task-clock                #    0.929 CPUs utilized            ( +-  0.13% )

        6.9869 +- 0.0131 seconds time elapsed  ( +-  0.19% )

From the profile, an important location is:

let lock_result = self.condvar.wait_timeout(shared, self.keep_alive).unwrap();

I know it is the idle case, but it becomes ready "sooner than expected" for files and probably also channels,
which creates a lot of overhead. I'm not sure if that helps.
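
If someone wants to experiment with that code path: on recent tokio versions the runtime Builder exposes the blocking-pool keep-alive, which is the timeout fed into the wait_timeout above. A sketch, not from this issue; the 10 ms value is arbitrary.

use std::time::Duration;

// Sketch: tune how long idle blocking-pool threads park on the condvar
// before exiting. 10 ms here is arbitrary, purely for experimentation.
fn rt_with_short_keep_alive() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_keep_alive(Duration::from_millis(10))
        .enable_all()
        .build()
        .unwrap()
}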

@carllerche carllerche added the E-help-wanted Call for participation: Help is requested to fix this issue. label Nov 12, 2020