
Benchmarking file formats for cloud storage

This repository holds test scripts for benchmarking different file formats. CSV is a relatively uncompressed, sparse format, but it is very common for data tasks such as import, export, and storage. When it comes to the performance of creating, reading, and writing files, how does CSV stand up against other formats?

Using Python

Covered formats with Python


  1. CSV
  2. AVRO
  3. Parquet
  4. Pickle
  5. ORC
  6. TXT

Python scripts for the benchmark test

import timeit

# create_df(), the WRITE_*_fun_timeIt helpers, CLEAN_files() and
# number_of_runs are defined in the repository's benchmark script
create_df()

# results for write
print(timeit.Timer(WRITE_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PICKLE_fun_timeIt).timeit(number=number_of_runs))

# remove the generated test files
CLEAN_files()
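
Each WRITE_*_fun_timeIt helper simply serializes the same DataFrame to one format. A minimal sketch of how such helpers could look, assuming a hypothetical pandas DataFrame df (the repository's create_df() builds its own test data):

import numpy as np
import pandas as pd

# hypothetical test data; the repository's create_df() builds its own
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

def WRITE_CSV_fun_timeIt():
    df.to_csv("test.csv", index=False)

def WRITE_ORC_fun_timeIt():
    df.to_orc("test.orc")          # requires pyarrow and pandas >= 1.5

def WRITE_PARQUET_fun_timeIt():
    df.to_parquet("test.parquet")  # requires pyarrow or fastparquet

def WRITE_PICKLE_fun_timeIt():
    df.to_pickle("test.pkl")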

Using R

Covered formats with R

  1. CSV
  2. Parquet
  3. Feather
  4. RDS / RData (also timed in the script below)

R scripts for benchmarking

benchmark_write <- data.frame(summary(microbenchmark(
        "test_df.csv"           = write.csv(test_df, file = file_csv),
        "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
        "test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable),
        "test_df.feather"       = write_feather(test_df, file_feather),
        "test_df.parquet"       = write_parquet(test_df, file_parquet),
        "test_df.RData"         = save(test_df, file = file_rdata),
        "test_df.rds"           = saveRDS(test_df, file_rds),
  times = nof_repeat)))

Comparing read and write times

Read and write times are compared for each file format to see which one performs better for a given task.
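
On the Python side, the read benchmark can mirror the write benchmark above. A minimal sketch, assuming the files produced by the write helpers and hypothetical READ_* functions:

import timeit
import pandas as pd

# hypothetical READ_* helpers mirroring the WRITE_* functions above
def READ_CSV_fun_timeIt():
    pd.read_csv("test.csv")

def READ_ORC_fun_timeIt():
    pd.read_orc("test.orc")          # requires pyarrow

def READ_PARQUET_fun_timeIt():
    pd.read_parquet("test.parquet")

def READ_PICKLE_fun_timeIt():
    pd.read_pickle("test.pkl")

number_of_runs = 10  # assumed repetition count

# results for read
print(timeit.Timer(READ_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(READ_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(READ_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(READ_PICKLE_fun_timeIt).timeit(number=number_of_runs))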

Example of results when testing with R:

[Figure: benchmark results comparing read and write times in R]

Cloning the repository

You can clone the repository with the command below.

git clone https://github.com/tomaztk/Benchmarking-file-formats-for-cloud.git

Using Azure Blob storage as a data lake

For running SQL Server on-premises and uploading data to the data lake, there is a Python script (Jupyter notebook) with detailed steps.

Link
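
As a rough illustration of the upload step, here is a minimal sketch using the azure-storage-blob package; the connection string, container name and file name are placeholders, not the values used in the notebook:

from azure.storage.blob import BlobServiceClient

# hypothetical connection string and container; use your own storage account
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("benchmark-data")

# upload one of the generated benchmark files to the data lake container
with open("test.parquet", "rb") as data:
    container.upload_blob(name="test.parquet", data=data, overwrite=True)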

Related blog posts

Contributors and co-authors

Thanks to these wonderful R community people for upgrading and improving these benchmarks. Your contributions are highly appreciated!

Ryan Duryea
