write_parquet performs very badly on large files compared to write_csv #3845

Closed
vikigenius opened this issue Jun 28, 2022 · 19 comments · Fixed by #5366
Labels
performance Performance issues or improvements

Comments

@vikigenius

What language are you using?

Python

Which feature gates did you use?

This can be ignored by Python & JS users.

Have you tried latest version of polars?

  • yes

If the problem was resolved, please update polars. :)

What version of polars are you using?

0.13.50

What operating system are you using polars on?

Linux data-vm 4.19.0-17-cloud-amd64 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

What language version are you using?

python 3.9

Describe your bug.

I have a large dataframe that I am computing and trying to write as parquet. It has around 100 million rows and 4 columns.

I tried two approaches: write_csv and write_parquet.

write_csv takes around 10s to write the file and uses around 10GB of RAM. The CSV file occupies 5GB of disk space.

write_parquet takes more than a minute and uses more than 50GB of RAM. The parquet file occupies 4.9GB of disk space.

I tried to set the compression method to uncompressed but that did not improve things at all.

What are the steps to reproduce the behavior?

Here is an example to reproduce the behaviour.

Example

import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=1000000000),
    'passage_id': np.random.randint(0, high=100, size=1000000000),
    'score': np.random.random(size=1000000000)
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

The dataframe occupies around 35 GB in RAM (not very reliable since I am relying on top to see process memory usage).

df.write_csv('../data/test_scores.csv') takes around 120 s and does not seem to use any additional RAM.
df.write_parquet('../data/test_scores.bin') runs out of memory; I am using a 128 GB RAM machine.

@vikigenius vikigenius added the bug Something isn't working label Jun 28, 2022
@jorgecarleitao
Collaborator

@ritchie46, the dataframe gets written in a single row group, right? In general we should split it into multiple groups, otherwise we need to hold the whole file in memory.

@ritchie46
Member

ritchie46 commented Jun 28, 2022

@ritchie46, the dataframe gets written in a single row group, right? In general we should split it into multiple groups, otherwise we need to hold the whole file in memory.

Yes, we do. I shall add that.

Note that write_csv is parallel and write_parquet is not (yet).

Edit: that's not correct. I will have to investigate this one.

@ritchie46
Member

I see we heap-allocate the pages for every write. I will see if we can improve that.

@zundertj zundertj added performance Performance issues or improvements and removed bug Something isn't working labels Aug 13, 2022
@zundertj
Collaborator

Updated the labels to reflect that the output is as expected, just not at the desired performance level.

@vikigenius : could you provide feedback on whether the row_group_size parameter added in #3852 helps you in speeding up your specific use case?

@vikigenius
Author

@zundertj I just ran

import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=1000000000),
    'passage_id': np.random.randint(0, high=100, size=1000000000),
    'score': np.random.random(size=1000000000)
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

df.write_parquet('test.bin', row_group_size=1000000)

This basically runs out of memory too, just more slowly than when not passing row_group_size.

@ritchie46
Member

@vikigenius can you try a range of row_group_sizes? 1M rows is still very large; that's one million rows per thread. I was thinking more in the order of 2^10..2^16.
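
(For reference, a sweep over several row group sizes on a smaller sample could look like the sketch below. This is not part of the original exchange; the 10-million-row sample size and file names are arbitrary choices for illustration.)

import time

import numpy as np
import polars as pl

# Smaller sample so each run finishes quickly; scale up once the trend is clear.
n = 10_000_000
df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=n),
    'passage_id': np.random.randint(0, high=100, size=n),
    'score': np.random.random(size=n),
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

for exp in range(10, 17):  # row group sizes 2^10 .. 2^16
    size = 2 ** exp
    start = time.perf_counter()
    df.write_parquet(f'test_{size}.parquet', row_group_size=size)
    print(f'row_group_size={size}: {time.perf_counter() - start:.1f}s')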

@vikigenius
Author

@ritchie46 Tried again with df.write_parquet('test.bin', row_group_size=1024). Somehow this runs out of memory too, but even more slowly than df.write_parquet('test.bin', row_group_size=1000000). Maybe the memory is not being freed after each row group write? I see the memory just slowly keeps building until I reach 100% memory usage (128 GB RAM) and the process is killed.

@ritchie46 ritchie46 self-assigned this Aug 16, 2022
@ritchie46
Member

Thanks for the additional info. I will take a look.

@ritchie46
Member

I did a local test and can confirm the memory stays stable if we set the row group size to something like 1 million. Is this the case for you/others as well?

@vikigenius
Author

vikigenius commented Oct 6, 2022

@ritchie46 Nope, still getting the same problem with 0.14.18. It ran out of memory after about 2 hours and it still didn't manage to finish writing the parquet.

@vikigenius
Author

@ritchie46 Is there anything else I can help with to reproduce the issue? The fact that it only happens for parquet and not csv is strange. Also, it doesn't seem to matter what row_group_size you set: a lower row_group_size means RAM usage stays low for a long time, but it slowly builds up and runs out of memory.

A very high row_group_size just means the RAM runs out more quickly. My only guess is some issue with buffering or a memory leak, because with a low row_group_size I would never expect the RAM usage to keep building over time and run out.
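
(One way to narrow down whether the writer itself is accumulating memory is to slice the frame manually and write each slice to its own parquet file; if memory stays flat with this approach but not with a single write_parquet call, the buildup is happening inside the writer. The sketch below is illustrative only; the chunk size and file names are arbitrary, and it relies on the standard DataFrame.slice and height APIs.)

import polars as pl

def write_in_chunks(df: pl.DataFrame, chunk_rows: int = 10_000_000) -> None:
    """Write df as a series of independent parquet files, one per slice."""
    for i, offset in enumerate(range(0, df.height, chunk_rows)):
        # Each slice is written and then dropped, so peak memory should stay
        # close to the size of one chunk on top of the original frame.
        df.slice(offset, chunk_rows).write_parquet(f'test_scores_{i}.parquet')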

@ritchie46
Member

Strange that I cannot reproduce it. We only keep around the metadata from the row groups. If your row groups are too small, it could be that the amount of metadata we store causes the OOM. How did your run go with a row group size of 1 million? That should only be about 1000 metadata entries (1 billion rows / 1 million rows per group), right?

@vikigenius
Author

vikigenius commented Oct 29, 2022

@ritchie46 I am testing the code again with a row group size of 1 million, but this time with a memory profiler to track memory usage over time. I have also switched to a 512 GB VM so that I don't hit OOM and can actually find the max memory usage.

I am using a row group size of 1 million, as you suggested. The whole snippet took > 10 hrs to execute. The final parquet size is 43GB.

The max memory usage reported by the profiler is 234114 MB (> 200 GB).

This is not good at all. It should not be taking 200GB of RAM, and it should not take 10 hours to write a 40 GB parquet file either. One thing I noticed was that it was using just one CPU.

Here is the python script I used to reproduce this

import polars as pl
import numpy as np
from memory_profiler import memory_usage


def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_parquet('test.bin', row_group_size=1000000)

mem_usage = memory_usage(ram_test, interval=10) # Measure memory usage every 10 seconds
print('Maximum memory usage: %s' % max(mem_usage))
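
(The mem_usage list returned by the script above can also be plotted to show whether RAM ramps up steadily during the write or spikes at the end; a sketch appended to the script, assuming matplotlib is installed:)

import matplotlib.pyplot as plt

# One sample (in MiB) every 10 seconds, so the x axis is elapsed time.
plt.plot([i * 10 for i in range(len(mem_usage))], mem_usage)
plt.xlabel('elapsed time (s)')
plt.ylabel('memory usage (MiB)')
plt.savefig('parquet_write_memory.png')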

@vikigenius
Author

vikigenius commented Oct 30, 2022

Doing the same thing for CSV:

import polars as pl
import numpy as np
from memory_profiler import memory_usage


def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_csv('test.csv')

mem_usage = memory_usage(ram_test, interval=10)
print('Maximum memory usage: %s' % max(mem_usage))

Time Taken = 10s (Just 10 seconds?)
Maximum memory usage = 43494MB (43GB)

Obviously the disparity is ridiculous: writing the CSV takes 10s and uses 43GB of RAM, while writing the parquet takes 10hrs and uses up 200GB of RAM.

I have no idea what's causing this disparity. If you are not able to reproduce it, maybe it has something to do with the arrow version?

For reference, I am using polars 0.14.24 (the latest on PyPI).

@ritchie46
Member

Yes, it certainly is. I shall investigate when I can get my hands on a bigger machine. It seems like something quadratic to me. Any insight from the profiler?

@ritchie46
Member

Ok, I have been able to reproduce this at the edge of my RAM capacity. I am exploring now.

@ritchie46
Member

ritchie46 commented Nov 4, 2022

@jorgecarleitao could you help me a bit with this one?

I ran this snippet:

use polars::prelude::*;


fn main() -> PolarsResult<()> {
    let n = 500_000_000u32;
    let ca = Utf8Chunked::from_iter((0..n).map(|_| "40240gh32152n"));
    let s = ca.into_series();
    let s1 = Series::new("", (0..n).collect::<Vec<_>>());
    let mut df = df![
        "a" => s.clone(),
        "b" => s1.clone(),
        "c" => s1.clone(),
        "d" => s1.clone(),
    ]?;

    let f = std::fs::File::create("test.parquet").unwrap();
    ParquetWriter::new(f)
        .with_row_group_size(Some(1_000_000))
        .finish(&mut df)?;
    Ok(())
}

I found that this code is fast until the Utf8 column reaches a certain size: 1e8 rows was still fast, 5e8 had a huge slowdown. I did not let the run finish, but ran it long enough that we can see in the flamegraph where most time is lost, and it seems to be parquet_format::DataPageHeaderV2::write_to_out_protocol.

The culprit seems to be the entry_SYSCALL_64_after_hwframe syscall.

Here is the flamegraph:

[flamegraph image]

EDIT:

Only the Utf8 column seems to be the culprit. The numerical columns don't influence this.
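
(For anyone reproducing from Python, the same isolation can be done with only the string column. The sketch below follows the earlier snippets; the int8 dummy column is just a cheap way to get the row count and is dropped before writing.)

import numpy as np
import polars as pl

# Only the Utf8 column, since the numeric columns did not influence the slowdown.
n = 500_000_000
df = (
    pl.DataFrame({'dummy': np.zeros(n, dtype=np.int8)})
    .with_column(pl.lit('40240gh32152n').alias('memo_id'))
    .select('memo_id')
)
df.write_parquet('utf8_only.parquet', row_group_size=1_000_000)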

@ritchie46
Member

I did a run where I checked debug_assertions and overflow-checks, but that did not influence this.

@ritchie46
Member

ritchie46 commented Nov 5, 2022

@vikigenius I found and fixed the issue upstream: jorgecarleitao/arrow2#1293.

I will release a patch this weekend. Thanks for being patient and helping me remember this issue when it got stale. :)
