try to replace parquet with parquet2? #53

chunshao90 · 2022-06-17T06:48:33Z

Description

replace parquet with parquet2

The main differentiators in comparison with parquet are:

- it uses `#![forbid(unsafe_code)]`
- it delegates parallelism downstream
- it decouples reading (IO intensive) from computing (CPU intensive)
- it is faster (10-20x when reading to arrow format)
- it supports async read and write
- it is integration-tested against pyarrow and (py)spark 3
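
To make the "decouples reading from computing" and "delegates parallelism downstream" points concrete, here is a minimal sketch of the general pattern using only the Rust standard library. It is not parquet2's API: `read_chunks`, `decode_chunk`, and `decode_in_parallel` are hypothetical names, and the "decoding" is a placeholder that just sums bytes. The idea is that the IO step only fetches raw bytes, the CPU step is a pure function, and the caller decides how (or whether) to parallelize.

```rust
use std::io::{self, Read};
use std::thread;

/// IO step (hypothetical): pull raw chunks of at most `chunk_len` bytes
/// from a reader without decoding anything.
pub fn read_chunks<R: Read>(mut reader: R, chunk_len: usize) -> io::Result<Vec<Vec<u8>>> {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    let mut chunks = Vec::new();
    loop {
        let mut buf = Vec::with_capacity(chunk_len);
        // `take` caps this read at one chunk; `read_to_end` copes with short reads.
        let n = reader.by_ref().take(chunk_len as u64).read_to_end(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        chunks.push(buf);
    }
    Ok(chunks)
}

/// CPU step (placeholder for real page decoding): sum the bytes of a chunk.
pub fn decode_chunk(chunk: &[u8]) -> u64 {
    chunk.iter().map(|&b| u64::from(b)).sum()
}

/// The caller, not the reader, picks the parallelism strategy.
/// Here: one scoped thread per chunk, results returned in chunk order.
pub fn decode_in_parallel(chunks: &[Vec<u8>]) -> Vec<u64> {
    thread::scope(|scope| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| scope.spawn(move || decode_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("decode thread panicked"))
            .collect()
    })
}
```

Because `decode_chunk` knows nothing about IO, the same function could be driven by a thread pool or an async runtime instead of scoped threads. That choice stays with the application.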

Proposal

Additional context


waynexia · 2022-06-17T06:53:27Z

One more benefit: switching to parquet2 would decouple our parquet dependency from datafusion, so we could update them separately.

But migrating a dependency like this is a lot of work...

jiacai2050 · 2022-08-17T12:59:53Z

One more reason to migrate to parquet2.

`ArrowWriter` has no method to retrieve the inner writer, but parquet2 does have `into_inner`.
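
For readers unfamiliar with why this matters, here is a small sketch of the `into_inner` pattern using only the standard library. `CountingWriter` and `encode_to_bytes` are hypothetical stand-ins invented for illustration. They are not the `ArrowWriter` or parquet2 APIs. The point is that a wrapper that takes ownership of its sink needs a way to give it back, otherwise an in-memory buffer written through the wrapper cannot be recovered.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper that owns another writer, in the same role
/// ArrowWriter plays when it owns its output sink.
pub struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> CountingWriter<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    pub fn bytes_written(&self) -> usize {
        self.bytes_written
    }

    /// Flush, then hand the wrapped writer back to the caller.
    /// Without this method the sink would be dropped with the wrapper.
    pub fn into_inner(mut self) -> io::Result<W> {
        self.inner.flush()?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.bytes_written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Write a payload through the wrapper into an in-memory buffer,
/// then recover the buffer via `into_inner`.
pub fn encode_to_bytes(payload: &[u8]) -> io::Result<Vec<u8>> {
    let mut writer = CountingWriter::new(Vec::new());
    writer.write_all(payload)?;
    writer.into_inner()
}
```

The standard library's `std::io::BufWriter::into_inner` follows the same convention.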

jiacai2050 · 2022-08-18T06:32:44Z

Reported upstream:

add into_inner method to ArrowWriter arrow-rs#2491

tustvold · 2022-08-18T08:25:15Z

👏 parquet maintainer here. FWIW, I have been able to reproduce little read performance difference these days. There is mature support for decoupled IO (async), we integration test against pyarrow, and recent work by myself and others to add page and row-level filter pushdown should dramatically improve the performance of filtered scans.

There are definitely areas to improve, most notably the writer hasn't had the same degree of attention, but by working together we can pull the whole ecosystem along 😀

Anyway enough from me, just thought I'd provide an alternative narrative to the parquet2/arrow2 FUD...

jiacai2050 · 2022-08-19T01:34:35Z

@tustvold Awesome work. It seems we need to re-evaluate the performance of parquet.

> but by working together we can pull the whole ecosystem along 😀

We would love to share what we learn while building CeresDB, and to keep communicating with the upstream ecosystem to make it better. 🍺

chunshao90 added the feature (New feature or request) label Jun 17, 2022

jiacai2050 mentioned this issue Dec 19, 2022

fix: remove EncodingWriter #493

Merged

ShiKaiWi closed this as completed in #493 Dec 20, 2022
