Allow specifying comment character for CSV reader #5759

Merged
merged 1 commit into from
May 13, 2024

bbannier
Contributor

This patch adds support for a comment character to the CSV reader. Comments, like almost everything about the CSV format, are not truly standardized, but a common convention supported by many CSV readers[1][2] is to ignore full lines starting with a comment character (often #); inline or end-of-line comments are not supported.

Example:

# This is a comment in a CSV file without header.
1,2
# Comment inside the data block.
11,22

The implementation for Arrow is straightforward: all we need to do is expose the existing comment option of csv_core, which is used to read CSV files.
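
To make the semantics described above concrete, here is a minimal sketch of full-line comment skipping using only the Rust standard library. The function `parse_skipping_comments` is a hypothetical name used for illustration; it is not the arrow or csv_core API, and it naively splits on commas without handling quoting or escapes.

```rust
/// Hypothetical helper (not part of arrow-rs or csv_core): parses
/// comma-separated rows and drops every line whose first character is the
/// comment marker. Quoting and escaping are deliberately not handled.
fn parse_skipping_comments(input: &str, comment: Option<char>) -> Vec<Vec<String>> {
    input
        .lines()
        // A line is a comment only if it *starts* with the marker;
        // a marker appearing later in the line is ordinary data.
        .filter(|line| match comment {
            Some(c) => !line.starts_with(c),
            None => true,
        })
        // Ignore empty lines so that a trailing newline yields no row.
        .filter(|line| !line.is_empty())
        // Naive field splitting, sufficient for this illustration.
        .map(|line| line.split(',').map(str::to_string).collect())
        .collect()
}
```

With the example input above and `Some('#')`, only the rows `1,2` and `11,22` remain. A `#` appearing later in a line stays part of its field, since inline comments are not supported.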

Closes #5758.

Footnotes

  1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

  2. https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

@github-actions bot added the "arrow" (Changes to the arrow crate) label May 12, 2024
@bbannier bbannier marked this pull request as ready for review May 12, 2024 08:40
@bbannier
Contributor Author

The CI failure for "integration / Archery test With other arrows (pull_request)" seems to be preexisting; it also fails on current master, e.g., https://github.com/apache/arrow-rs/actions/runs/9043234896/job/24850651652.

bbannier added a commit to bbannier/datafusion that referenced this pull request May 12, 2024
This commit switches the version of arrow-rs in use to that of
apache/arrow-rs#5759, which introduces support for comments in CSV input
files.
Contributor

@tustvold tustvold left a comment

Looks good to me, thank you.

Integration test failure is unrelated

@tustvold tustvold merged commit 6ab67df into apache:master May 13, 2024
21 of 22 checks passed
@bbannier bbannier deleted the t/comment branch June 3, 2024 10:22
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Support skipping comments in CSV files
2 participants