Optionally support flexible column lengths #5678

Posnet · 2024-04-22T15:30:07Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Add the ability to parse CSV files that have a flexible number of columns. Specifically, a large subset of CSV files omit columns from the ends of rows and expect the missing values to be treated as null.
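To make the requested semantics concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the desired behavior, not arrow-csv code: the function name `read_flexible` and its `num_columns` parameter are hypothetical, and rejecting rows with too many fields is an assumption made for the example.

```python
import csv
import io


def read_flexible(text, num_columns):
    """Parse CSV text, padding rows that lack trailing columns with None.

    Hypothetical helper for illustration only; not part of any library.
    """
    rows = []
    for line_number, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(record) > num_columns:
            # Assumption: only missing trailing columns are tolerated,
            # extra columns are still treated as an error.
            raise ValueError(
                f"line {line_number}: expected at most {num_columns} fields, "
                f"found {len(record)}"
            )
        # Fill the missing trailing positions with None (null).
        rows.append(record + [None] * (num_columns - len(record)))
    return rows


sample = "a,b,c\n1,2,3\n4,5\n6\n"
rows = read_flexible(sample, num_columns=3)
```

With the sample above, the rows `4,5` and `6` come back as `['4', '5', None]` and `['6', None, None]`.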

Describe the solution you'd like
The ability to enable flexible columns through an option on the format or reader builder.

Describe alternatives you've considered
I have tried Python and Rust solutions. While pandas and Polars work for the general case, both have poor support for streaming reads of CSV into Arrow buffers: they require either memory-mapped files or buffering most of the file in memory, unlike the convenient build/build_buffered methods offered by arrow-csv. And while the Rust csv crate is excellent, it is limited to row-at-a-time parsing, and in the basic testing I've done, arrow-csv outperforms it when loading large datasets into Arrow buffers.
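For context, the streaming pattern I'm after looks roughly like the sketch below: read a bounded number of rows at a time and turn each chunk into column-oriented buffers, so memory use depends on the batch size rather than the file size. This is plain Python with the standard library, not arrow-csv, pandas, or Polars code; `iter_column_batches`, `num_columns`, and `batch_size` are hypothetical names chosen for the example.

```python
import csv
import io
from itertools import islice


def iter_column_batches(stream, num_columns, batch_size):
    """Yield column-oriented batches from a CSV text stream.

    Hypothetical helper for illustration only. At most `batch_size`
    rows are held in memory at once.
    """
    reader = csv.reader(stream)
    while True:
        chunk = list(islice(reader, batch_size))
        if not chunk:
            return
        columns = [[] for _ in range(num_columns)]
        for record in chunk:
            for index, column in enumerate(columns):
                # Missing trailing fields become None; fields beyond
                # num_columns are ignored in this sketch.
                column.append(record[index] if index < len(record) else None)
        yield columns


stream = io.StringIO("1,2,3\n4,5\n6\n7,8,9\n")
batches = list(iter_column_batches(stream, num_columns=3, batch_size=2))
```

Any file-like object that yields lines can stand in for the `io.StringIO` here, including an open file or a decoded network stream.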

Additional context
Examples from other implementations:

Rust csv
https://docs.rs/csv/latest/csv/struct.ReaderBuilder.html#method.flexible

Pandas
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv

The pandas documentation doesn't specify this, but testing shows it allows missing trailing columns by default.

Polars
https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html

Polars behaves the same way as pandas and the Rust csv crate.

However, one conflict is that the arrow-cpp CSV parser, like the current arrow-csv, doesn't allow ragged/flexible columns.

Posnet added the enhancement label Apr 22, 2024

Posnet mentioned this issue Apr 22, 2024

Add support for flexible column lengths #5679

Merged

Jefffrey closed this as completed in #5679 May 8, 2024
