diff --git a/README.md b/README.md
index b11607a..4f25f84 100644
--- a/README.md
+++ b/README.md
@@ -2,23 +2,26 @@
 
 A simple benchmark for python-based data processing libraries.
 
-Guiding Principles:
+## Guiding Principles:
 
 - Python Only: I have no intention of using another language to process data. That said, non-Python implementations with Python bindings are perfectly fine (e.g. Polars).
 - Single Node: I'm not interested in benchmarking performance on clusters of multiple machines.
 - Simple Implementation: I'm not interested in heavily optimizing the benchmarking implementations. My target user is myself: somebody who largely uses pandas but would like to either speed up their computations or work with larger datasets than fit in memory on their machine.
 
-Libraries:
+## Libraries:
 
-- ~~[Dask](https://www.dask.org/)~~ Not currently working on my machine (OOM)
+- [Dask*](https://www.dask.org/)
 - [dask-sql](https://dask-sql.readthedocs.io/en/latest/)
 - [DuckDB](https://duckdb.org/)
 - [Polars](https://www.pola.rs/)
 - [Spark](https://spark.apache.org/docs/latest/api/python/)
 - [Vaex](https://vaex.io/)
 
-Results:
+\*Dask required a slight, non-beginner optimization to run successfully on my machine. Specifically, I had to do "manual column projection" by passing the columns relevant to the calculation when reading in the dataset. I consider the Dask results to be slightly cheating, although this hack [may be resolved](https://github.com/dask/dask/issues/7933) natively in Dask in the not-so-distant future.
+
+
+## Results:
 
 The following results are from running the benchmark locally on my desktop that has:
 
 - Intel Core i7-7820X 3.6 GHz 8-core Processor (16 virtual cores, I think)
@@ -27,7 +30,7 @@ The following results are from running the benchmark locally on my desktop that
 
 Original 50-partition dataset:
 
-![assets/benchmark.png](assets/benchmark.png)
+![assets/benchmark_50.png](assets/benchmark_50.png)
 
 Bakeoff as a function of partitions:
 
@@ -43,7 +46,7 @@ Since 2016, I have been pinging the public API every 2 minutes and storing the r
 
 For the benchmark, I first convert the CSVs to snappy-compressed parquet files. This reduces the dataset down to ~4GB in size on disk.
 
-# Computation
+## Computation
 
 To start, the bakeoff computation is extremely simple. It's almost "word count". I calculate the average number of bikes available at each station across the full time period of the dataset. In SQL, this is basically
 
@@ -108,7 +111,6 @@ docker cp medium-data-bakeoff:/app/results/* .
 
 - Run benchmarks on AWS for reproducibility.
 - Run benchmarks as a function of number of CPUs.
-- Run benchmarks as a function of number of parquet files.
 - Add some harder benchmark computations.
 - Add more libraries (e.g. `cudf`).
 - Benchmark memory usage.
diff --git a/assets/benchmark.png b/assets/benchmark.png
deleted file mode 100644
index 6a27ee3..0000000
Binary files a/assets/benchmark.png and /dev/null differ
diff --git a/assets/benchmark_50.png b/assets/benchmark_50.png
new file mode 100644
index 0000000..9f05c3d
Binary files /dev/null and b/assets/benchmark_50.png differ
diff --git a/assets/partition_benchmark.png b/assets/partition_benchmark.png
index e2e607f..b82de6c 100644
Binary files a/assets/partition_benchmark.png and b/assets/partition_benchmark.png differ
diff --git a/assets/results.csv b/assets/results.csv
deleted file mode 100644
index 72b5605..0000000
--- a/assets/results.csv
+++ /dev/null
@@ -1,6 +0,0 @@
-Library,Time (s),multiple
-dask_sql,16.145472764968872,7.445934935749845
-duckdb,2.168360710144043,1.0
-polars,3.5379786491394043,1.6316375004343158
-spark,10.843884706497192,5.000959783013611
-vaex,22.44853186607361,10.352766382942978
diff --git a/assets/results_50.csv b/assets/results_50.csv
new file mode 100644
index 0000000..c07cfee
--- /dev/null
+++ b/assets/results_50.csv
@@ -0,0 +1,7 @@
+Library,Time (s),multiple
+dask* (slightly optimized),11.725253820419312,5.416939032107234
+dask_sql,15.683753967285156,7.245722807927704
+duckdb,2.16455340385437,1.0
+polars,3.482872486114502,1.6090489982426082
+spark,10.895400285720825,5.03355577474764
+vaex,22.58011507987976,10.431766220076563
diff --git a/src/medium_data_bakeoff/bakeoff.py b/src/medium_data_bakeoff/bakeoff.py
index c73db83..ed340f3 100644
--- a/src/medium_data_bakeoff/bakeoff.py
+++ b/src/medium_data_bakeoff/bakeoff.py
@@ -56,10 +56,7 @@ def bakeoff(num_partitions: int) -> None:
     bakeoff = {}
 
     recipe = [
-        # Dask was working on 2022.10.2. It got downgraded to 2022.10.0 after
-        # installing dask-sql, and the benchmark no longer works on my machine. I think
-        # it's running out of memory.
-        # ("dask", bake_dask),
+        ("dask* (slightly optimized)", bake_dask),
         ("dask_sql", bake_dask_sql),
         ("duckdb", bake_duckdb),
         ("polars", bake_polars),
diff --git a/src/medium_data_bakeoff/ingredients/dask.py b/src/medium_data_bakeoff/ingredients/dask.py
index 71912d9..f55896e 100644
--- a/src/medium_data_bakeoff/ingredients/dask.py
+++ b/src/medium_data_bakeoff/ingredients/dask.py
@@ -10,7 +10,13 @@ def bake(dataset: str) -> float:
     client = Client(cluster)
     ProgressBar().register()
     start = time.time()
-    df = dd.read_parquet(dataset, index=False)
+    # NOTE: We pass in only the columns needed for the calculation because Dask
+    # does not currently do automatic column projection. This is admittedly a bit
+    # of a cheat and not fully in the spirit of the bakeoff, since it requires
+    # somewhat advanced knowledge that a user coming from pandas may not have.
+    df = dd.read_parquet(
+        dataset, index=False, columns=["station_id", "num_bikes_available"]
+    )
     df.groupby("station_id")["num_bikes_available"].mean().compute()
     stop = time.time()
     client.close()
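
For readers skimming the patch, here is a minimal, standalone sketch of the "manual column projection" workaround that this change applies in `src/medium_data_bakeoff/ingredients/dask.py`. The dataset path and cluster settings below are hypothetical placeholders, not the benchmark's actual configuration; only the `columns=` argument reflects the change in the diff.

```python
# Standalone sketch of the manual column projection workaround. Paths and
# cluster settings are placeholders; adjust them to your own environment.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Hypothetical location of the snappy-compressed parquet partitions.
DATASET = "data/nyc-bikeshare/*.parquet"

cluster = LocalCluster()
client = Client(cluster)

# Reading only the two columns used by the group-by keeps the working set small.
# Without `columns=...`, dd.read_parquet loads every column of every partition,
# which appears to be what previously exhausted memory on the author's machine.
df = dd.read_parquet(
    DATASET, index=False, columns=["station_id", "num_bikes_available"]
)

# Average number of bikes available per station across the full time period.
result = df.groupby("station_id")["num_bikes_available"].mean().compute()

client.close()
cluster.close()
```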