This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Parquet writes all values of sliced arrays? #1323

Closed
ritchie46 opened this issue Dec 8, 2022 · 3 comments · Fixed by #1326
Labels
bug Something isn't working

Comments

@ritchie46
Collaborator

During parquet writes we split columns into smaller pages, both because of the i32 byte limit and because smaller pages perform better when reading.

Nested structures such as lists and utf8 are then sliced by their offsets, but the whole values buffer is sent to the page writer. I haven't confirmed this yet, but I believe this is what happens, and that it is the reason for:

1. invalid parquet files
2. extreme memory usage
3. extreme file sizes

all reported in: pola-rs/polars#4393
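If that hypothesis holds, the mechanism is easy to model in plain Python (a sketch of the suspected behaviour, not arrow2's actual code): slicing an Arrow list array narrows the offsets window but shares the values buffer, so a writer that forwards the values untouched serializes the whole buffer for every page.

```python
# Model of an Arrow list array: offsets + a flat, shared values buffer.
offsets = [0, 3, 6, 9]                       # three lists of 3 values each
values = [1, 1, 1, 1, 2, 3, None, 2, None]

def slice_list(offsets, values, start, length):
    """Arrow-style slice: narrow the offsets, keep the values buffer whole."""
    return offsets[start:start + length + 1], values

# A page holding only the first row:
page_offsets, page_values = slice_list(offsets, values, 0, 1)
print(page_offsets)       # [0, 3] -- one list covering three values
print(len(page_values))   # 9 -- the full buffer is still attached, and a
                          # writer that ignores the offsets writes all of it
```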

@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays Parquet writes all values of sliced arrays? Dec 8, 2022
@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays? Parquet writes all values of sliced arrays? Nested columns cannot be read by pyarrow. Dec 10, 2022
@ritchie46 ritchie46 added the bug Something isn't working label Dec 10, 2022
@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays? Nested columns cannot be read by pyarrow. Parquet writes all values of sliced arrays? Dec 10, 2022
@ritchie46
Collaborator Author

ritchie46 commented Dec 11, 2022

Exponential size

The parquet file size also seems to grow exponentially after a certain row count.
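A back-of-envelope model (my assumption about the mechanism, not measured arrow2 internals) would explain this: once the row count exceeds one page, a writer that re-emits the whole values buffer for every page writes pages × n values in total, i.e. quadratic growth, while below one page the output stays linear.

```python
# Model (an assumption, not arrow2 internals): n rows are split into pages of
# `page_rows` rows. A correct writer emits ~n * values_per_row values in
# total; a writer that re-emits the whole values buffer for every page emits
# pages * n * values_per_row values instead.

def total_values_written(n_rows, page_rows=1_000, values_per_row=3, buggy=False):
    pages = -(-n_rows // page_rows)  # ceil division
    per_page = n_rows * values_per_row if buggy else page_rows * values_per_row
    return pages * per_page

for n in (1_000, 10_000, 100_000):
    ok, bad = total_values_written(n), total_values_written(n, buggy=True)
    print(f"{n=}: correct={ok:,} buggy={bad:,} blow-up={bad // ok}x")
# below one page (n <= page_rows) the two coincide, so growth looks linear
```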

import os
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1,1,1], [1,2,3], [None,2,None]]*int(i * 1e4)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)

    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)

print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("mem usage")
plt.xlabel("df size")
plt.ylabel("parquet size")
plt.legend()

[image: plot of parquet file size vs df size, pyarrow vs arrow2]

Linear

This doesn't seem to be the case for really small row counts:

import os
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1,1,1], [1,2,3], [None,2,None]]*int(i * 1e2)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)

    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)

print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("mem usage")
plt.xlabel("df size")
plt.ylabel("parquet size")
plt.legend()

[image: plot of parquet file size vs df size, pyarrow vs arrow2]

@ritchie46
Collaborator Author

Confirmed that it writes all values and doesn't take the offsets into account. The written files are incorrect and write/read a lot of unneeded values. Working on a fix.
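Taking the offsets into account amounts to cutting the values buffer down to the window the sliced offsets cover before it reaches the page writer. A minimal sketch of that idea (hypothetical helper, not the actual arrow2 fix in #1326):

```python
def page_values(offsets, values, start, length):
    """Cut the values buffer to the window covered by the sliced offsets."""
    window = offsets[start:start + length + 1]
    return values[window[0]:window[-1]]

offsets = [0, 3, 6, 9]                      # three lists of 3 values each
values = [1, 1, 1, 1, 2, 3, None, 2, None]

# each single-row page now carries exactly its own values
assert page_values(offsets, values, 0, 1) == [1, 1, 1]
assert page_values(offsets, values, 1, 1) == [1, 2, 3]
assert page_values(offsets, values, 2, 1) == [None, 2, None]
```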

[image: supporting screenshot]

@tjwilson90

This still appears to be a problem, see #1356 (comment)
