Replies: 1 comment
Hmm, I dug around the documentation, issues, etc. a bit more, and found:
I did the following to experiment with the built-in group by,
and was able to successfully generate data from my 2GB CSV of nonsense random text, but when I try the same thing with a 5GB file, the process ends up getting killed.
Reading around Stack Overflow, it seems like this might be very task-specific depending on what the person is doing. Does anyone have any tips on how to handle such scenarios? I mean, making a 5GB CSV of random data with 10 possible random …
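Not the actual generator script, just a minimal sketch of how a random test CSV like the one described (nonsense text, with a `user_id` column drawn from 10 pre-generated values) might be produced; the file name, column names, and target size are assumptions.

```python
import csv
import random
import string

# 10 pre-generated random user ids, as described in the thread.
user_ids = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(10)]

target_bytes = 2 * 1024**3  # roughly 2GB; raise this for the 5GB test
written = 0

with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "order_id", "payload"])
    while written < target_bytes:
        row = [
            random.choice(user_ids),
            "".join(random.choices(string.ascii_uppercase + string.digits, k=12)),
            "".join(random.choices(string.ascii_letters, k=64)),  # nonsense random text
        ]
        writer.writerow(row)
        written += sum(len(v) for v in row) + 3  # rough byte count incl. separators
```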
-
I'm trying a small POC that groups by and aggregates a large CSV to reduce the data, in both pandas and Dask, and I'm observing high memory usage and/or slower-than-expected processing times. Does anyone have any tips for a Python/pandas/Dask noob on how to improve this?
Background
I have a request to build a file ingestion tool that would reduce a CSV of records down to a mapping like { user -> [collection of info] }.
Based on my research, since the files are only a few GBs, I found that Spark, etc. would likely be overkill and that pandas/Dask may be a good fit, hence the POC.
Problem
What am I doing wrong here? I'm reading the CSV with Dask's `blocksize` option, and as such I expected the RAM usage to be roughly `blocksize * size per block`, but I wouldn't expect the total to be 9GB when the block size is only 6.4MB. I don't know why the RAM usage skyrockets to 9GB for a 1GB CSV input. Did I miss something while going through the docs?
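For context, `blocksize` here is the partition size accepted by `dask.dataframe.read_csv`; a minimal sketch of how it is typically passed (the file name and value are assumptions, not the POC's actual settings):

```python
import dask.dataframe as dd

# blocksize controls how the CSV is split into partitions; each partition
# is parsed into its own in-memory pandas DataFrame when computed.
df = dd.read_csv("orders.csv", blocksize="6.4MB")

print(df.npartitions)  # how many partitions the file was split into
```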
Notes about the data in the CSV (can't share it, unfortunately):
- `user_id` is the column used in the group by
- the desired output is something like `user 1 -> [(A1, B1, C1), (A2, B2, C2)], user 2 -> [(A3, B3, C3), (A4, B4, C4)], ...`, where the tuples hold fields such as `order_id`
- I generated these CSVs with random data; for the `user_id` column I randomly picked from 10 pre-randomly-generated values, so I'd expect the final output to be 10 user ids, each with a collection of who knows how many order ids.
In my research, I found this issue #4001, which said group by doesn't scale very well when there are a lot of groups, but since `user_id` only has 10 possible values, I figured that wouldn't be much of an issue... welp.

My code
pandas
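A rough sketch of what a pandas version of the described group-by-and-collect might look like (file and column names are assumptions, not the actual POC code):

```python
import pandas as pd

# Read the whole CSV into memory (the pandas baseline of the POC).
df = pd.read_csv("orders.csv")

# Group by user_id and collect each user's order_ids into a list.
result = df.groupby("user_id")["order_id"].apply(list)

print(result)
```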
dask
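And a rough sketch of the equivalent Dask version, again with assumed names and an assumed `blocksize`:

```python
import dask.dataframe as dd

# Read the CSV lazily in blocksize-sized partitions.
df = dd.read_csv("orders.csv", blocksize="6.4MB")

# Same aggregation as the pandas sketch, expressed lazily;
# meta tells Dask the shape/dtype of apply's output.
result = df.groupby("user_id")["order_id"].apply(list, meta=("order_id", "object"))

print(result.compute())
```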
[screenshots of memory usage as Dask runs the 1GB file]
😭
Someone suggested setting an index on the group-by column `user_id`. While I understood how that would help from a performance standpoint, I wasn't sure how it would help with memory usage (wouldn't adding an index use more memory? I'm already hitting 9GB of usage with a 1GB input file here, and I need to be able to handle a few GBs). I tried adding `df = df.set_index('user_id')` to the code, and unfortunately it didn't help.
Any tips / pointers on what I could research would be greatly appreciated.
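For reference, a sketch of where that line would sit in a Dask pipeline like the one above (same assumed names); note that in Dask, `set_index` on an unsorted column shuffles and sorts the data, so it is a fairly heavy operation in its own right:

```python
import dask.dataframe as dd

df = dd.read_csv("orders.csv", blocksize="6.4MB")

# set_index shuffles/sorts the data by user_id across partitions.
df = df.set_index("user_id")

# Grouping by the index level name still works after set_index.
result = df.groupby("user_id")["order_id"].apply(list, meta=("order_id", "object"))
print(result.compute())
```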