write_parquet performs very badly on large files compared to write_csv #3845

Closed
vikigenius opened this issue Jun 28, 2022 · 19 comments · Fixed by #5366
Labels
performance Performance issues or improvements

Comments

@vikigenius

What language are you using?

Python

Which feature gates did you use?

This can be ignored by Python & JS users.

Have you tried latest version of polars?

  • yes

If the problem was resolved, please update polars. :)

What version of polars are you using?

0.13.50

What operating system are you using polars on?

Linux data-vm 4.19.0-17-cloud-amd64 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

What language version are you using?

python 3.9

Describe your bug.

I have a large dataframe that I am computing and trying to write as parquet. It has around 100 million rows and 4 columns.

I tried two approaches: write_csv and write_parquet.

write_csv takes around 10s to write the file and uses around 10GB of RAM. The CSV file occupies 5GB of disk space.

write_parquet takes more than a minute and uses more than 50GB of RAM. The parquet file occupies 4.9GB of disk space.

I tried to set the compression method to uncompressed but that did not improve things at all.

What are the steps to reproduce the behavior?

Here is an example to reproduce the behaviour.

Example

import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=1000000000),
    'passage_id': np.random.randint(0, high=100, size=1000000000),
    'score': np.random.random(size=1000000000)
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

The dataframe occupies around 35 GB in RAM (not very reliable since I am relying on top to see process memory usage).

df.write_csv('../data/test_scores.csv') takes around 120 s and does not seem to use any additional RAM.
df.write_parquet('../data/test_scores.bin') runs out of memory; I am using a 128 GB RAM machine.

@vikigenius vikigenius added the bug Something isn't working label Jun 28, 2022
@jorgecarleitao
Collaborator

@ritchie46, the dataframe gets written in a single row group, right? In general we should split it into multiple groups, otherwise we need to hold the whole file in memory.

@ritchie46
Member

ritchie46 commented Jun 28, 2022

@ritchie46, the dataframe gets written in a single row group, right? In general we should split it into multiple groups, otherwise we need to hold the whole file in memory.

Yes, we do. I shall add that.

Note that write_csv is parallel and write_parquet is not (yet).

Edit: that's not correct. I will have to investigate this one.

@ritchie46
Member

I see we heap-allocate the pages for every write. I will see if we can improve that.

@zundertj zundertj added performance Performance issues or improvements and removed bug Something isn't working labels Aug 13, 2022
@zundertj
Collaborator

Updated the labels to reflect that the output is as expected, just not at the desired performance level.

@vikigenius : could you provide feedback on whether the row_group_size parameter added in #3852 helps you in speeding up your specific use case?

@vikigenius
Author

@zundertj I just ran

import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=1000000000),
    'passage_id': np.random.randint(0, high=100, size=1000000000),
    'score': np.random.random(size=1000000000)
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

df.write_parquet('test.bin', row_group_size=1000000)

This basically runs out of memory too, just more slowly than when not passing row_group_size.

@ritchie46
Member

@vikigenius can you try a range of row_group_sizes? 1M rows is still very large; that's one million rows per thread. I was thinking more in the order of 2^10..2^16.
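
(For reference, a sweep over several row group sizes on a smaller sample could look like the sketch below. This is not part of the original exchange; the 10-million-row sample size and file names are arbitrary choices for illustration.)

import time

import numpy as np
import polars as pl

# Smaller sample so each run finishes quickly; scale up once the trend is clear.
n = 10_000_000
df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=n),
    'passage_id': np.random.randint(0, high=100, size=n),
    'score': np.random.random(size=n),
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

for exp in range(10, 17):  # row group sizes 2^10 .. 2^16
    size = 2 ** exp
    start = time.perf_counter()
    df.write_parquet(f'test_{size}.parquet', row_group_size=size)
    print(f'row_group_size={size}: {time.perf_counter() - start:.1f}s')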

@vikigenius
Author

@ritchie46 Tried again with df.write_parquet('test.bin', row_group_size=1024). Somehow this runs out of memory too, but even more slowly than df.write_parquet('test.bin', row_group_size=1000000). Maybe the memory is not being freed after each row group write? I see the memory just slowly keeps building until I reach 100% memory usage (128 GB RAM) and the process is killed.

@ritchie46 ritchie46 self-assigned this Aug 16, 2022
@ritchie46
Member

Thanks for the additional info. I will take a look.

@ritchie46
Member

I did a local test and can confirm the memory stays stable if we set the row group size to something like 1 million. Is this the case for you/others as well?

@vikigenius
Author

vikigenius commented Oct 6, 2022

@ritchie46 Nope, still getting the same problem with 0.14.18. It ran out of memory after about 2 hours and it still didn't manage to finish writing the parquet.

@vikigenius
Author

@ritchie46 Is there anything else I can help with to reproduce the issue? The fact that it only happens for parquet and not csv is strange. Also, it doesn't seem to matter what row_group_size you set: a lower row_group_size means RAM usage stays low for a long time, but it slowly builds up and runs out of memory.

A very high row_group_size just means the RAM runs out more quickly. My only guess is some issue with buffering or a memory leak, because with a low row_group_size I would never expect the RAM usage to keep building over time and run out.
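
(One way to narrow down whether the writer itself is accumulating memory is to slice the frame manually and write each slice to its own parquet file; if memory stays flat with this approach but not with a single write_parquet call, the buildup is happening inside the writer. The sketch below is illustrative only; the chunk size and file names are arbitrary, and it relies on the standard DataFrame.slice and height APIs.)

import polars as pl

def write_in_chunks(df: pl.DataFrame, chunk_rows: int = 10_000_000) -> None:
    """Write df as a series of independent parquet files, one per slice."""
    for i, offset in enumerate(range(0, df.height, chunk_rows)):
        # Each slice is written and then dropped, so peak memory should stay
        # close to the size of one chunk on top of the original frame.
        df.slice(offset, chunk_rows).write_parquet(f'test_scores_{i}.parquet')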

@ritchie46
Member

Strange that I cannot reproduce it. We only keep around the metadata from the row groups. If your row groups are too small, it could be that the amount of metadata we store causes the OOM. How did your run go with a row group size of 1 million? That should only be about 1000 metadata entries (1 billion rows / 1 million rows per group), right?

@vikigenius
Author

vikigenius commented Oct 29, 2022

@ritchie46 I am testing the code again with a row group size of 1 million, but this time with a memory profiler to track memory usage over time. I have also switched to a 512 GB VM so that I don't hit OOM and can actually find the max memory usage.

I am using a row group size of 1 million, as you suggested. The whole snippet took > 10 hrs to execute. The final parquet size is 43GB.

The max memory usage reported by the profiler is 234114 MB (> 200 GB).

This is not good at all. It should not be taking 200GB of RAM, and it should not take 10 hours to write a 40 GB parquet file either. One thing I noticed was that it was using just one CPU.

Here is the python script I used to reproduce this

import polars as pl
import numpy as np
from memory_profiler import memory_usage


def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_parquet('test.bin', row_group_size=1000000)

mem_usage = memory_usage(ram_test, interval=10) # Measure memory usage every 10 seconds
print('Maximum memory usage: %s' % max(mem_usage))
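
(The mem_usage list returned by the script above can also be plotted to show whether RAM ramps up steadily during the write or spikes at the end; a sketch appended to the script, assuming matplotlib is installed:)

import matplotlib.pyplot as plt

# One sample (in MiB) every 10 seconds, so the x axis is elapsed time.
plt.plot([i * 10 for i in range(len(mem_usage))], mem_usage)
plt.xlabel('elapsed time (s)')
plt.ylabel('memory usage (MiB)')
plt.savefig('parquet_write_memory.png')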

@vikigenius
Author

vikigenius commented Oct 30, 2022

Doing the same thing for CSV:

import polars as pl
import numpy as np
from memory_profiler import memory_usage


def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_csv('test.csv')

mem_usage = memory_usage(ram_test, interval=10)
print('Maximum memory usage: %s' % max(mem_usage))

Time Taken = 10s (Just 10 seconds?)
Maximum memory usage = 43494MB (43GB)

Obviously the disparity is ridiculous: writing the CSV takes 10s and uses 43GB of RAM, while writing the parquet takes 10hrs and uses up 200GB of RAM.

I have no idea what's causing this disparity. If you are not able to reproduce it, maybe it has something to do with the arrow version?

For reference, I am using polars 0.14.24 (the latest on PyPI).

@ritchie46
Member

Yes, it certainly is. I shall investigate when I can get my hands on a bigger machine. It seems like something quadratic to me. Any insight from the profiler?

@ritchie46
Member

Ok, I have been able to reproduce this at the edge of my RAM capacity. I am exploring now.

@ritchie46
Member

ritchie46 commented Nov 4, 2022

@jorgecarleitao could you help me a bit with this one?

I ran this snippet:

use polars::prelude::*;


fn main() -> PolarsResult<()> {
    let n = 500_000_000u32;
    let ca = Utf8Chunked::from_iter((0..n).map(|_| "40240gh32152n"));
    let s = ca.into_series();
    let s1 = Series::new("", (0..n).collect::<Vec<_>>());
    let mut df = df![
        "a" => s.clone(),
        "b" => s1.clone(),
        "c" => s1.clone(),
        "d" => s1.clone(),
    ]?;

    let f = std::fs::File::create("test.parquet").unwrap();
    ParquetWriter::new(f)
        .with_row_group_size(Some(1_000_000))
        .finish(&mut df)?;
    Ok(())
}

I found that this code is fast until the Utf8 column reaches a certain size: 1e8 rows was still fast, 5e8 had a huge slowdown. I did not let the run finish, but ran it long enough that we can see in the flamegraph where most time is lost, and it seems to be parquet_format::DataPageHeaderV2::write_to_out_protocol.

The culprit seems to be the entry_SYSCALL_64_after_hwframe syscall.

Here is the flamegraph:

[flamegraph image]

EDIT:

Only the Utf8 column seems to be the culprit. The numerical columns don't influence this.
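
(For anyone reproducing from Python, the same isolation can be done with only the string column. The sketch below follows the earlier snippets; the int8 dummy column is just a cheap way to get the row count and is dropped before writing.)

import numpy as np
import polars as pl

# Only the Utf8 column, since the numeric columns did not influence the slowdown.
n = 500_000_000
df = (
    pl.DataFrame({'dummy': np.zeros(n, dtype=np.int8)})
    .with_column(pl.lit('40240gh32152n').alias('memo_id'))
    .select('memo_id')
)
df.write_parquet('utf8_only.parquet', row_group_size=1_000_000)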

@ritchie46
Member

I did a run where I checked debug_assertions and overflow-checks, but that did not influence this.

@ritchie46
Member

ritchie46 commented Nov 5, 2022

@vikigenius I found and fixed the issue upstream: jorgecarleitao/arrow2#1293.

I will release a patch this weekend. Thanks for being patient and helping me remember this issue when it got stale. :)
