
Failing Out-of-Core Merge #1143

Open
quasiben opened this issue Mar 27, 2023 · 0 comments

I have a somewhat representative (and currently failing) example of merging two dataframes in a resource-constrained environment:

df_base = 295GB and 10674 partitions
df_other = 466GB and 2576 partitions

Each dataframe has two columns of random integers, `key` and `payload`, both int64:

In [7]: ddf_base.head()
Out[7]:
            key   payload
shuffle
0        113664  38855413
0        113671  13729563
0        113673   2885245
0        113681  14508293
0        113689   9661924
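For context, the average partition sizes implied by the totals above can be checked with a quick back-of-the-envelope calculation (plain Python; not part of the original report, and only an estimate since partitions are rarely perfectly uniform):

```python
# Rough average partition sizes implied by the reported totals.
GB = 1024**3

base_total, base_parts = 295 * GB, 10674    # df_base
other_total, other_parts = 466 * GB, 2576   # df_other

base_avg_mb = base_total / base_parts / 1024**2
other_avg_mb = other_total / other_parts / 1024**2

print(f"df_base : ~{base_avg_mb:.0f} MB per partition")   # ~28 MB
print(f"df_other: ~{other_avg_mb:.0f} MB per partition")  # ~185 MB
```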

Here's a more complete script:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

def main():
    # 4 GPUs, one worker thread each, with a 30 GB RMM async pool per GPU
    cluster = LocalCUDACluster(protocol="ucx",
                               CUDA_VISIBLE_DEVICES="4,5,6,7",
                               threads_per_worker=1,
                               rmm_pool_size="30GB",
                               rmm_async=True,
                               rmm_release_threshold="25GB")
    client = Client(cluster)
    ddf_base = dask_cudf.read_parquet('/datasets/bzaitlen/GitRepos/random-parquet-data/df_base.parquet/')
    ddf_other = dask_cudf.read_parquet('/datasets/bzaitlen/GitRepos/random-parquet-data/df_other.parquet/')
    # inner join on the shared key column, then write the result back out
    ddf_join = ddf_base.merge(ddf_other, on=["key"], how="inner")
    ddf_join.to_parquet("/datasets/bzaitlen/GitRepos/random-parquet-data/df_join.parquet")


if __name__ == "__main__":
    main()
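Logically, the merge above is an inner hash join on `key`. A minimal pure-Python sketch of that semantics (illustration only; dask_cudf performs the same join distributed across partitions, with a shuffle on the join key):

```python
# Minimal illustration of inner-join-on-key semantics (pure Python).
def inner_join(base, other, key="key"):
    """Inner join two lists of dicts on `key` via a hash table."""
    # Build a hash index over the `other` side, keyed by the join column.
    index = {}
    for row in other:
        index.setdefault(row[key], []).append(row)
    # Probe with each `base` row; emit one joined row per match.
    joined = []
    for row in base:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

base = [{"key": 1, "payload": 10}, {"key": 2, "payload": 20}]
other = [{"key": 1, "other_payload": 99}, {"key": 3, "other_payload": 7}]
print(inner_join(base, other))
# [{'key': 1, 'other_payload': 99, 'payload': 10}]
```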

The above fails with OOM and sometimes random UCX errors (after hours of waiting) with the fairly new cuDF spilling manager, using both explicit-comms and regular tasks. Note that the script limits the run to 4 GPUs -- the equivalent of 128GB of GPU memory, significantly less than the total data size:

time DASK_EXPLICIT_COMM=False CUDF_SPILL=1 python ooc-merge.py 2>&1 | tee merge-res.txt
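To put the out-of-core aspect in numbers (32 GB per GPU is an assumption, consistent with the 128GB figure above):

```python
# Total data vs. available GPU memory in the 4-GPU configuration.
data_gb = 295 + 466          # df_base + df_other
gpu_mem_gb = 4 * 32          # 4 visible devices, assumed 32 GB each
print(f"data/GPU-memory ratio: {data_gb / gpu_mem_gb:.1f}x")  # ~5.9x
```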

The hope is that p2p shuffling provides stable infrastructure for accomplishing this out-of-core shuffle/merge.
