Optionally support flexible column lengths #5678

Closed
Posnet opened this issue Apr 22, 2024 · 0 comments · Fixed by #5679
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

Contributor

Posnet commented Apr 22, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Add the ability to parse CSV files that have a flexible number of columns. Specifically, a large subset of CSV files omit columns from the ends of rows and expect the missing values to be treated as null.
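To make the requested semantics concrete, here is a minimal sketch using only the standard library. It assumes a naive comma split with no quoting or escaping, and the function name `pad_row` is hypothetical; it is not part of arrow-csv.

```rust
/// Split one CSV line on commas and pad missing trailing fields with `None`.
/// Hypothetical helper for illustration only: quoting and escaping are ignored,
/// and fields beyond `num_columns` are silently dropped.
fn pad_row(line: &str, num_columns: usize) -> Vec<Option<String>> {
    let mut fields: Vec<Option<String>> = line
        .split(',')
        .take(num_columns)
        .map(|field| Some(field.to_string()))
        .collect();
    // Rows that end early get a null for every column they omit.
    fields.resize(num_columns, None);
    fields
}
```

With a three-column schema, `pad_row("4,5", 3)` yields two values followed by one null, which is the behaviour this request asks for on short rows.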

Describe the solution you'd like
An option on the format or reader builder that enables flexible columns.
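One possible shape for such an opt-in, sketched with the standard library only. Every name here (`RowParserBuilder`, `with_flexible`, `RowParser`) is hypothetical and chosen for illustration; this is not the arrow-csv API. The sketch assumes strict parsing stays the default, so existing callers still get an error on short rows unless they enable the flag.

```rust
/// Hypothetical builder, for illustration only (not the arrow-csv API).
struct RowParserBuilder {
    num_columns: usize,
    flexible: bool,
}

impl RowParserBuilder {
    /// Strict by default: short rows are rejected unless opted in.
    fn new(num_columns: usize) -> Self {
        Self { num_columns, flexible: false }
    }

    /// Opt in to treating missing trailing columns as null.
    fn with_flexible(mut self, flexible: bool) -> Self {
        self.flexible = flexible;
        self
    }

    fn build(self) -> RowParser {
        RowParser { num_columns: self.num_columns, flexible: self.flexible }
    }
}

struct RowParser {
    num_columns: usize,
    flexible: bool,
}

impl RowParser {
    /// Parse one line with a naive comma split (no quoting or escaping).
    fn parse(&self, line: &str) -> Result<Vec<Option<String>>, String> {
        let mut fields: Vec<Option<String>> =
            line.split(',').map(|f| Some(f.to_string())).collect();
        let found = fields.len();
        // Too many columns is always an error in this sketch.
        // Too few is an error only when the flexible flag is off.
        if found > self.num_columns || (found < self.num_columns && !self.flexible) {
            return Err(format!(
                "expected {} columns, found {}",
                self.num_columns, found
            ));
        }
        // Pad short rows with nulls (a no-op when the row is complete).
        fields.resize(self.num_columns, None);
        Ok(fields)
    }
}
```

Keeping the flag off by default would preserve the current strict behaviour, while `RowParserBuilder::new(3).with_flexible(true).build()` accepts a row such as `1,2` and fills the third column with null.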

Describe alternatives you've considered
I have tried Python and Rust solutions. While pandas and Polars work for the general case, both have poor support for streaming reads of CSV into Arrow buffers: they require either memory-mapped files or buffering most of the file in memory, unlike the convenient build/build_buffered methods offered by arrow-csv. The Rust csv crate is excellent, but it is limited to parsing one row at a time, and in the basic testing I've done, arrow-csv outperforms it when loading large datasets into Arrow buffers.

Additional context
Examples from other implementations:

Rust csv
https://docs.rs/csv/latest/csv/struct.ReaderBuilder.html#method.flexible

pandas
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv

The pandas documentation doesn't specify this, but testing shows that it allows missing trailing columns by default.

Polars
https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html

Polars behaves the same way as pandas and the Rust csv crate.

One conflict, however, is that the arrow-cpp CSV parser, like the current arrow-csv, doesn't allow ragged/flexible columns.

Posnet added the enhancement label Apr 22, 2024