Add column projection to dask
EthanRosenthal committed Nov 16, 2022
1 parent df25e79 commit 6b7452e
Showing 8 changed files with 24 additions and 18 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -2,23 +2,26 @@

A simple benchmark for Python-based data processing libraries.

Guiding Principles:
## Guiding Principles:

- Python Only: I have no intention of using another language to process data. That said, non-Python implementations with Python bindings are perfectly fine (e.g. Polars).
- Single Node: I'm not interested in benchmarking performance on clusters of multiple machines.
- Simple Implementation: I'm not interested in heavily optimizing the benchmarking implementations. My target user is myself: somebody who largely uses pandas but would like to either speed up their computations or work with larger datasets than fit in memory on their machine.


Libraries:
## Libraries:

- ~~[Dask](https://www.dask.org/)~~ Not currently working on my machine (OOM)
- [Dask*](https://www.dask.org/)
- [dask-sql](https://dask-sql.readthedocs.io/en/latest/)
- [DuckDB](https://duckdb.org/)
- [Polars](https://www.pola.rs/)
- [Spark](https://spark.apache.org/docs/latest/api/python/)
- [Vaex](https://vaex.io/)

Results:
\*Dask required a slight, non-beginner optimization to run successfully on my machine. Specifically, I had to do "manual column projection" by passing the relevant calculation columns to the reader when loading the dataset. I consider the Dask results to be slightly cheating, although this hack [may be resolved](https://github.com/dask/dask/issues/7933) natively in Dask in the not-so-distant future.
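
In code, the tweak amounts to passing `columns=` to Dask's parquet reader, e.g. (a minimal sketch; the dataset path is illustrative):

```python
import dask.dataframe as dd

# Read only the two columns the computation needs rather than the full schema.
df = dd.read_parquet(
    "data/*.parquet",  # illustrative path
    index=False,
    columns=["station_id", "num_bikes_available"],
)
df.groupby("station_id")["num_bikes_available"].mean().compute()
```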


## Results:

The following results are from running the benchmark locally on my desktop that has:
- Intel Core i7-7820X 3.6 GHz 8-core Processor (16 virtual cores, I think)
@@ -27,7 +30,7 @@ The following results are from running the benchmark locally on my desktop that

Original 50-partition dataset:

![assets/benchmark.png](assets/benchmark.png)
![assets/benchmark_50.png](assets/benchmark_50.png)

Bakeoff as a function of partitions:

@@ -43,7 +46,7 @@ Since 2016, I have been pinging the public API every 2 minutes and storing the r

For the benchmark, I first convert the CSVs to snappy-compressed parquet files. This reduces the dataset down to ~4GB in size on disk.
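
That conversion boils down to something like the following sketch (assuming pandas with pyarrow installed; the paths are illustrative, not the repo's actual ingest code):

```python
from pathlib import Path

import pandas as pd

# Convert each raw CSV into a snappy-compressed parquet file.
for csv_path in sorted(Path("data/csv").glob("*.csv")):
    df = pd.read_csv(csv_path)
    df.to_parquet(
        Path("data/parquet") / f"{csv_path.stem}.parquet",
        compression="snappy",
    )
```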

# Computation
## Computation

To start, the bakeoff computation is extremely simple. It's almost "word count". I calculate the average number of bikes available at each station across the full time period of the dataset. In SQL, this is basically
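(a sketch; `bike_data` is a placeholder name for the dataset, and the column names match those used in the computation):

```sql
SELECT
    station_id,
    AVG(num_bikes_available) AS avg_num_bikes_available
FROM bike_data
GROUP BY station_id;
```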

@@ -108,7 +111,6 @@ docker cp medium-data-bakeoff:/app/results/* .

- Run benchmarks on AWS for reproducibility.
- Run benchmarks as a function of number of CPUs.
- Run benchmarks as a function of number of parquet files.
- Add some harder benchmark computations.
- Add more libraries (e.g. `cudf`).
- Benchmark memory usage.
Binary file removed assets/benchmark.png
Binary file added assets/benchmark_50.png
Binary file modified assets/partition_benchmark.png
6 changes: 0 additions & 6 deletions assets/results.csv

This file was deleted.

7 changes: 7 additions & 0 deletions assets/results_50.csv
@@ -0,0 +1,7 @@
Library,Time (s),multiple
dask* (slightly optimized),11.725253820419312,5.416939032107234
dask_sql,15.683753967285156,7.245722807927704
duckdb,2.16455340385437,1.0
polars,3.482872486114502,1.6090489982426082
spark,10.895400285720825,5.03355577474764
vaex,22.58011507987976,10.431766220076563
5 changes: 1 addition & 4 deletions src/medium_data_bakeoff/bakeoff.py
@@ -56,10 +56,7 @@ def bakeoff(num_partitions: int) -> None:
bakeoff = {}

recipe = [
# Dask was working on 2022.10.2. It got downgraded to 2022.10.0 after
# installing dask-sql, and the benchmark no longer works on my machine. I think
# it's running out of memory.
# ("dask", bake_dask),
("dask* (slightly optimized)", bake_dask),
("dask_sql", bake_dask_sql),
("duckdb", bake_duckdb),
("polars", bake_polars),
8 changes: 7 additions & 1 deletion src/medium_data_bakeoff/ingredients/dask.py
@@ -10,7 +10,13 @@ def bake(dataset: str) -> float:
client = Client(cluster)
ProgressBar().register()
start = time.time()
df = dd.read_parquet(dataset, index=False)
# NOTE: We pass in the relevant columns for the calculation because Dask
# currently does not do automatic column projection for this query pattern.
# This is admittedly a bit of a cheat and not fully in the spirit of the
# bakeoff, given that it is somewhat advanced knowledge that a user coming
# from pandas may not have.
df = dd.read_parquet(
    dataset, index=False, columns=["station_id", "num_bikes_available"]
)
df.groupby("station_id")["num_bikes_available"].mean().compute()
stop = time.time()
client.close()

2 comments on commit 6b7452e

@rjzamora

Thanks for sharing this code @EthanRosenthal! I didn't realize that dask was missing column projection for this case, so I submitted dask/dask#9667 with a possible fix.

Side Note: You will get column projection for the `groupby().agg` API. E.g.

```python
df.groupby("station_id").agg({"num_bikes_available": "mean"}).compute()
```

@EthanRosenthal
Owner Author

Oh cool, thanks for submitting a fix, and good to know about `groupby().agg`! Maybe I should change the implementation here 🤔
