Replies: 1 comment
Hmm, I dug around the documentation, issues, etc. a bit more, and found:
I did the following to experiment with the built-in group by,
and was able to successfully generate data from my 2GB CSV of nonsense random text, but when I try the same thing with a 5GB file, the process ends up getting killed.
Reading around Stack Overflow, it seems like this might be very task-specific depending on what the person is doing. Does anyone have any tips on how to handle such scenarios? I mean, making a 5GB CSV of random data with 10 possible random …
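Not the actual generator script, just a minimal sketch of how a random test CSV like the one described (nonsense text, with a `user_id` column drawn from 10 pre-generated values) might be produced; the file name, column names, and target size are assumptions.

```python
import csv
import random
import string

# 10 pre-generated random user ids, as described in the thread.
user_ids = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(10)]

target_bytes = 2 * 1024**3  # roughly 2GB; raise this for the 5GB test
written = 0

with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "order_id", "payload"])
    while written < target_bytes:
        row = [
            random.choice(user_ids),
            "".join(random.choices(string.ascii_uppercase + string.digits, k=12)),
            "".join(random.choices(string.ascii_letters, k=64)),  # nonsense random text
        ]
        writer.writerow(row)
        written += sum(len(v) for v in row) + 3  # rough byte count incl. separators
```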
-
I'm trying a small POC that groups by and aggregates a large CSV to reduce the data, in both pandas and Dask, and I'm observing high memory usage and/or slower-than-expected processing times. Does anyone have any tips for a Python/pandas/Dask noob on how to improve this?
Background
I have a request to build a file ingestion tool that would reduce a CSV of records down to a mapping like { user -> [collection of info] }.
Based on my research, since the files are only a few GBs, I found that Spark, etc. would likely be overkill and that pandas/Dask may be a good fit, hence the POC.
Problem
What am I doing wrong here? I'm reading the CSV with Dask's `blocksize` option, and as such I expected the RAM usage to be roughly `blocksize * size per block`, but I wouldn't expect the total to be 9GB when the block size is only 6.4MB. I don't know why the RAM usage skyrockets to 9GB for a 1GB CSV input. Did I miss something while going through the docs?
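For context, `blocksize` here is the partition size accepted by `dask.dataframe.read_csv`; a minimal sketch of how it is typically passed (the file name and value are assumptions, not the POC's actual settings):

```python
import dask.dataframe as dd

# blocksize controls how the CSV is split into partitions; each partition
# is parsed into its own in-memory pandas DataFrame when computed.
df = dd.read_csv("orders.csv", blocksize="6.4MB")

print(df.npartitions)  # how many partitions the file was split into
```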
Notes about the data in the CSV (can't share it, unfortunately):
- `user_id` is the column used in the group by
- the desired output is something like `user 1 -> [(A1, B1, C1), (A2, B2, C2)], user 2 -> [(A3, B3, C3), (A4, B4, C4)], ...`, where the tuples hold fields such as `order_id`
- I generated these CSVs with random data; for the `user_id` column I randomly picked from 10 pre-randomly-generated values, so I'd expect the final output to be 10 user ids, each with a collection of who knows how many order ids.
In my research, I found this issue #4001, which said group by doesn't scale very well when there are a lot of groups, but since `user_id` only has 10 possible values, I figured that wouldn't be much of an issue... welp.

My code
pandas
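A rough sketch of what a pandas version of the described group-by-and-collect might look like (file and column names are assumptions, not the actual POC code):

```python
import pandas as pd

# Read the whole CSV into memory (the pandas baseline of the POC).
df = pd.read_csv("orders.csv")

# Group by user_id and collect each user's order_ids into a list.
result = df.groupby("user_id")["order_id"].apply(list)

print(result)
```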
dask
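And a rough sketch of the equivalent Dask version, again with assumed names and an assumed `blocksize`:

```python
import dask.dataframe as dd

# Read the CSV lazily in blocksize-sized partitions.
df = dd.read_csv("orders.csv", blocksize="6.4MB")

# Same aggregation as the pandas sketch, expressed lazily;
# meta tells Dask the shape/dtype of apply's output.
result = df.groupby("user_id")["order_id"].apply(list, meta=("order_id", "object"))

print(result.compute())
```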
[screenshots of memory usage as Dask runs the 1GB file]
😭
Someone suggested setting an index on the group-by column `user_id`. While I understood how that would help from a performance standpoint, I wasn't sure how it would help with memory usage (wouldn't adding an index use more memory? I'm already hitting 9GB of usage with a 1GB input file here, and I need to be able to handle a few GBs). I tried adding `df = df.set_index('user_id')` to the code, and unfortunately it didn't help.
Any tips / pointers on what I could research would be greatly appreciated.
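For reference, a sketch of where that line would sit in a Dask pipeline like the one above (same assumed names); note that in Dask, `set_index` on an unsorted column shuffles and sorts the data, so it is a fairly heavy operation in its own right:

```python
import dask.dataframe as dd

df = dd.read_csv("orders.csv", blocksize="6.4MB")

# set_index shuffles/sorts the data by user_id across partitions.
df = df.set_index("user_id")

# Grouping by the index level name still works after set_index.
result = df.groupby("user_id")["order_id"].apply(list, meta=("order_id", "object"))
print(result.compute())
```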