Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Add the ability to parse CSV files that have a flexible number of columns. Specifically, a large subset of CSV files have columns missing from the ends of rows, and their producers expect the missing values to be treated as null.
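To make the requested semantics concrete, here is a minimal sketch using only the standard library. `parse_flexible_line` is a hypothetical helper, not part of arrow-csv, and it assumes unquoted, comma-separated fields. It pads missing trailing columns with `None` and still rejects rows that have too many columns:

```rust
/// Hypothetical helper (not an arrow-csv API): split one unquoted CSV line
/// and pad any missing trailing columns with `None` (null).
fn parse_flexible_line(
    line: &str,
    num_columns: usize,
) -> Result<Vec<Option<String>>, String> {
    // Naive split: assumes no quoting or escaped delimiters.
    // Empty fields are kept as `Some("")`, not converted to null.
    let mut fields: Vec<Option<String>> = line
        .split(',')
        .map(|field| Some(field.to_string()))
        .collect();

    // Rows with extra columns are still treated as an error.
    if fields.len() > num_columns {
        return Err(format!(
            "expected at most {} columns, found {}",
            num_columns,
            fields.len()
        ));
    }

    // Short rows are padded with nulls up to the schema width.
    fields.resize(num_columns, None);
    Ok(fields)
}
```

With a four-column schema, the line `a,b` would come back as `[Some("a"), Some("b"), None, None]`, while `a,b,c` against a two-column schema would return an error.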
Describe the solution you'd like
An option, configurable via the format or the reader builder, to enable flexible columns.
Describe alternatives you've considered
I have tried Python and Rust solutions. pandas and Polars work for the general case, but both have poor support for streaming reads of CSV into Arrow buffers: they require either memory-mapped files or buffering most of the file in memory, unlike the convenient build/build_buffered methods offered by arrow-csv. The Rust csv crate is excellent, but it is limited to row-at-a-time parsing, and in the basic testing I've done, arrow-csv outperforms it when loading large datasets into Arrow buffers.
Additional context
Example from other implementations:
Rust csv
https://docs.rs/csv/latest/csv/struct.ReaderBuilder.html#method.flexible
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv
The pandas documentation doesn't specify this, but testing shows that it allows missing trailing columns by default.
Polars behaves the same way as pandas, and as the Rust csv crate does with `flexible` enabled.
https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html
One conflict, however, is that the arrow-cpp CSV parser, like the current arrow-csv, does not allow ragged/flexible columns.