try to replace parquet with parquet2? #53

Closed
chunshao90 opened this issue Jun 17, 2022 · 5 comments · Fixed by #493
Labels
feature New feature or request

Comments

@chunshao90
Contributor

Description

replace parquet with parquet2

The main differentiators in comparison with parquet are:

  • it uses #![forbid(unsafe_code)]
  • delegates parallelism downstream
  • decouples reading (IO intensive) from computing (CPU intensive)
  • it is faster (10-20x when reading to arrow format)
  • supports async read and write
  • it is integration-tested against pyarrow and (py)spark 3
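
To make the "decouples reading (IO intensive) from computing (CPU intensive)" point concrete, here is a minimal sketch of the general pattern using only the Rust standard library. It is not parquet2's API: `read_pages`, `decode_page`, and `decode_all` are hypothetical names, the "pages" are hard-coded byte vectors, and "decoding" is just a byte sum standing in for real decompression and deserialization.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the IO-bound step (hypothetical): returns raw "pages" as bytes.
/// A real reader would fetch compressed pages from a file or object store.
fn read_pages() -> Vec<Vec<u8>> {
    vec![vec![1, 2, 3], vec![4, 5], vec![6]]
}

/// Stand-in for the CPU-bound step (hypothetical): "decodes" a page by
/// summing its bytes.
fn decode_page(page: &[u8]) -> u64 {
    page.iter().map(|&b| u64::from(b)).sum()
}

/// Runs the IO step on its own thread and decodes on the calling thread.
/// The channel is the only coupling between the two stages, so the caller
/// is free to choose how (and how much) to parallelize the decoding.
fn decode_all() -> Vec<u64> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let io = thread::spawn(move || {
        for page in read_pages() {
            tx.send(page).expect("receiver dropped");
        }
        // `tx` is dropped when this closure returns, which closes the channel.
    });

    // `rx.iter()` ends once the sender has been dropped.
    let decoded: Vec<u64> = rx.iter().map(|page| decode_page(&page)).collect();
    io.join().expect("IO thread panicked");
    decoded
}
```

Calling `decode_all()` yields one decoded value per page, in the order the pages were read. In this sketch the decoding stage could be swapped for a thread pool without touching the IO stage, which is the sense in which parallelism is "delegated downstream".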

Proposal

Additional context

@chunshao90 added the feature (New feature or request) label Jun 17, 2022
@waynexia
Member

One more benefit: switching to parquet2 decouples our parquet dependency from datafusion's, so we can update the two separately.

But migrating a dependency like this is a lot of work...

@jiacai2050
Contributor

One more reason to migrate to parquet2.

ArrowWriter has no method to retrieve the inner writer, but parquet2 does have into_inner.
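
For readers unfamiliar with the pattern, here is a minimal sketch of what an `into_inner` method buys you. `SketchWriter` and `encode_to_buffer` are hypothetical and are not the parquet or parquet2 API; the sketch only assumes a writer that owns its sink, which is the situation described above.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper that owns its underlying sink.
/// NOT the parquet or parquet2 API; it only illustrates the pattern.
struct SketchWriter<W: Write> {
    inner: W,
}

impl<W: Write> SketchWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner }
    }

    /// Writes one chunk of already-encoded bytes to the sink.
    fn write_chunk(&mut self, chunk: &[u8]) -> io::Result<()> {
        self.inner.write_all(chunk)
    }

    /// Flushes and hands the sink back to the caller.
    /// Without a method like this, an in-memory buffer moved into the
    /// writer could not be recovered once writing is done.
    fn into_inner(mut self) -> io::Result<W> {
        self.inner.flush()?;
        Ok(self.inner)
    }
}

/// Encodes into an in-memory buffer and returns the bytes.
fn encode_to_buffer() -> io::Result<Vec<u8>> {
    let mut writer = SketchWriter::new(Vec::new());
    writer.write_chunk(b"PAR1")?;
    writer.write_chunk(b"data")?;
    writer.into_inner()
}
```

When a writer takes ownership of its sink and exposes no such method, the usual workaround is to pass a `&mut` reference or a shared handle instead, which is more awkward when the buffer has to be handed to another component afterwards.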

@jiacai2050
Contributor

Report to upstream:

@tustvold

👏 parquet maintainer here. FWIW, there is little read performance difference these days that I have been able to reproduce, there is mature support for decoupled IO (async), and we integration test against pyarrow. Recent work by myself and others to add page and row-level filter pushdown should also dramatically improve the performance of filtered scans.

There are definitely areas to improve, most notably the writer hasn't had the same degree of attention, but by working together we can pull the whole ecosystem along 😀

Anyway enough from me, just thought I'd provide an alternative narrative to the parquet2/arrow2 FUD...

@jiacai2050
Contributor

@tustvold Awesome work. It seems we need to re-evaluate the performance of parquet.

"but by working together we can pull the whole ecosystem along 😀"

We would love to share what we learn while building CeresDB, and to keep communicating with the upstream ecosystem to make it better. 🍺
