Allow specifying comment character for CSV reader #5759
This patch adds reader support for a comment character when reading CSV files. While comments, like almost everything else about the CSV format, are not truly standardized, a common convention supported by many CSV readers[^1][^2] is to ignore full lines starting with a comment character (often `#`); inline or end-of-line comments are not supported.

Example:

    # This is a comment in a CSV file without header.
    1,2
    # Comment inside the data block.
    11,22

The implementation of this for Arrow is straightforward, as all we need to do is expose the existing `comment` option of `csv_core`, which is used to read CSV files.

Closes apache#5758.

[^1]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[^2]: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
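To make the full-line comment semantics concrete, here is a minimal, std-only sketch. The function `parse_csv_skipping_comments` is a hypothetical helper written for illustration; it is not the code in this patch, which instead delegates to the `comment` option of `csv_core`. The sketch assumes unquoted fields and a `,` delimiter, so it ignores the quoting and escaping rules a real reader has to handle.

```rust
/// Parse CSV text into rows of fields, skipping full-line comments.
///
/// Hypothetical helper for illustration only; not part of arrow-rs or csv_core.
/// Assumes unquoted fields separated by ','.
fn parse_csv_skipping_comments(input: &str, comment: char) -> Vec<Vec<String>> {
    input
        .lines()
        // A line counts as a comment only if its very first character is the marker.
        .filter(|line| !line.starts_with(comment))
        // Drop empty lines so blank input lines do not produce empty rows.
        .filter(|line| !line.is_empty())
        // Split the remaining lines into fields; a marker later in the line
        // is kept as ordinary field data (no inline comment support).
        .map(|line| line.split(',').map(str::to_string).collect())
        .collect()
}
```

Called on the example input above with `'#'` as the comment character, this yields the two rows `["1", "2"]` and `["11", "22"]`. A line such as `1,2 # note` is not treated as a comment, so its second field is kept verbatim as `2 # note`.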
The CI failure for |
This commit switches the version of arrow-rs in use to the one from apache/arrow-rs#5759, which introduces support for comments in CSV input files.
Looks good to me, thank you.
The integration test failure is unrelated.