This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Parquet writes all values of sliced arrays? #1323

Closed
ritchie46 opened this issue Dec 8, 2022 · 3 comments · Fixed by #1326
Labels
bug Something isn't working

Comments

@ritchie46
Collaborator

During parquet writes we split columns into smaller pages, both because of the i32 byte limit and because smaller pages perform better when reading.

Nested structures such as lists and utf8 are then sliced by their offsets, but the whole values buffer is sent to the page writer. I haven't confirmed this yet, but I believe this is what happens, and that it is the reason for:

1. invalid parquet files
2. extreme memory usage
3. extreme file sizes

all reported in: pola-rs/polars#4393
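If that hypothesis holds, the mechanism is easy to model in plain Python (a sketch of the suspected behaviour, not arrow2's actual code): slicing an Arrow list array narrows the offsets window but shares the values buffer, so a writer that forwards the values untouched serializes the whole buffer for every page.

```python
# Model of an Arrow list array: offsets + a flat, shared values buffer.
offsets = [0, 3, 6, 9]                       # three lists of 3 values each
values = [1, 1, 1, 1, 2, 3, None, 2, None]

def slice_list(offsets, values, start, length):
    """Arrow-style slice: narrow the offsets, keep the values buffer whole."""
    return offsets[start:start + length + 1], values

# A page holding only the first row:
page_offsets, page_values = slice_list(offsets, values, 0, 1)
print(page_offsets)       # [0, 3] -- one list covering three values
print(len(page_values))   # 9 -- the full buffer is still attached, and a
                          # writer that ignores the offsets writes all of it
```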

@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays Parquet writes all values of sliced arrays? Dec 8, 2022
@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays? Parquet writes all values of sliced arrays? Nested columns cannot be read by pyarrow. Dec 10, 2022
@ritchie46 ritchie46 added the bug Something isn't working label Dec 10, 2022
@ritchie46 ritchie46 changed the title Parquet writes all values of sliced arrays? Nested columns cannot be read by pyarrow. Parquet writes all values of sliced arrays? Dec 10, 2022
@ritchie46
Collaborator Author

ritchie46 commented Dec 11, 2022

Exponential size

The parquet file size also seems to grow exponentially after a certain row count.
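A back-of-envelope model (my assumption about the mechanism, not measured arrow2 internals) would explain this: once the row count exceeds one page, a writer that re-emits the whole values buffer for every page writes pages × n values in total, i.e. quadratic growth, while below one page the output stays linear.

```python
# Model (an assumption, not arrow2 internals): n rows are split into pages of
# `page_rows` rows. A correct writer emits ~n * values_per_row values in
# total; a writer that re-emits the whole values buffer for every page emits
# pages * n * values_per_row values instead.

def total_values_written(n_rows, page_rows=1_000, values_per_row=3, buggy=False):
    pages = -(-n_rows // page_rows)  # ceil division
    per_page = n_rows * values_per_row if buggy else page_rows * values_per_row
    return pages * per_page

for n in (1_000, 10_000, 100_000):
    ok, bad = total_values_written(n), total_values_written(n, buggy=True)
    print(f"{n=}: correct={ok:,} buggy={bad:,} blow-up={bad // ok}x")
# below one page (n <= page_rows) the two coincide, so growth looks linear
```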

import os
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1,1,1], [1,2,3], [None,2,None]]*int(i * 1e4)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)

    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)

print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("mem usage")
plt.xlabel("df size")
plt.ylabel("parquet size")
plt.legend()

[image: plot of parquet file size vs df size, pyarrow vs arrow2]

Linear

This doesn't seem to be the case for really small row counts:

import os
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1,1,1], [1,2,3], [None,2,None]]*int(i * 1e2)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)

    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)

print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("mem usage")
plt.xlabel("df size")
plt.ylabel("parquet size")
plt.legend()

[image: plot of parquet file size vs df size, pyarrow vs arrow2]

@ritchie46
Collaborator Author

Confirmed that it writes all values and doesn't take the offsets into account. The written files are incorrect and write/read a lot of unneeded values. Working on a fix.
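Taking the offsets into account amounts to cutting the values buffer down to the window the sliced offsets cover before it reaches the page writer. A minimal sketch of that idea (hypothetical helper, not the actual arrow2 fix in #1326):

```python
def page_values(offsets, values, start, length):
    """Cut the values buffer to the window covered by the sliced offsets."""
    window = offsets[start:start + length + 1]
    return values[window[0]:window[-1]]

offsets = [0, 3, 6, 9]                      # three lists of 3 values each
values = [1, 1, 1, 1, 2, 3, None, 2, None]

# each single-row page now carries exactly its own values
assert page_values(offsets, values, 0, 1) == [1, 1, 1]
assert page_values(offsets, values, 1, 1) == [1, 2, 3]
assert page_values(offsets, values, 2, 1) == [None, 2, None]
```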

[image: supporting screenshot]

@tjwilson90

This still appears to be a problem, see #1356 (comment)
