DEPR: error_bad_lines and warn_bad_lines for read_csv #40413

Merged
merged 30 commits into from May 28, 2021


lithomas1
Member

@lithomas1 lithomas1 commented Mar 13, 2021

Summary of contents
- Adds a new on_bad_lines parameter (I found on_bad_lines a more explicit name than bad_lines)
- This defaults to None at first in order to preserve compatibility; however, it should be changed to error in 2.0, after error_bad_lines and warn_bad_lines are removed.
- Cleanup of some C/Python parser code (add an enum instead of using 2 variables for C, and only use on_bad_lines in Python)
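
For readers who want to see the new parameter from the caller's side, here is a minimal sketch. It assumes pandas 1.3 or later is installed; the sample data and variable names are made up for illustration. Only the "skip" and "error" modes are shown, since how "warn" reports bad lines has differed between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample data: the third line has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# "skip" drops the malformed line and keeps parsing.
skipped = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# "error" raises a ParserError at the first malformed line.
try:
    pd.read_csv(io.StringIO(raw), on_bad_lines="error")
    raised = False
except pd.errors.ParserError:
    raised = True
```

With the data above, `skipped` should contain only the two well-formed rows, and `raised` should end up `True`.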

@lithomas1 lithomas1 added Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv labels Mar 13, 2021
@lithomas1 lithomas1 requested a review from WillAyd March 14, 2021 17:21
@lithomas1 lithomas1 requested a review from gfyoung March 16, 2021 23:13
@pep8speaks

pep8speaks commented Mar 16, 2021

Hello @lithomas1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-28 00:23:35 UTC

@lithomas1 lithomas1 requested review from jreback and removed request for jreback March 22, 2021 16:08
Contributor

@jreback jreback left a comment

cool, am +1 on this change, some comments

doc/source/user_guide/io.rst
pandas/_libs/parsers.pyx Outdated
pandas/io/parsers/python_parser.py Outdated
pandas/io/parsers/python_parser.py Outdated
pandas/io/parsers/readers.py
@@ -382,6 +402,7 @@
"memory_map": False,
"error_bad_lines": True,
"warn_bad_lines": True,
"on_bad_lines": None,
Contributor

yeah should remove error/warn_bad_lines from here

Member Author

@lithomas1 lithomas1 Apr 6, 2021

_get_options_with_defaults is really spaghetti-fied right now, so removing this would leave the args not passed to the parser. I will try to clean up _get_options_with_defaults in a future PR if I have time.

@jreback
Contributor

jreback commented Apr 2, 2021

cc @gfyoung if you'd have a look

Contributor

@jreback jreback left a comment

we need to set the original options to None to avoid way more complexity.

doc/source/user_guide/io.rst Outdated
doc/source/user_guide/io.rst Outdated
doc/source/user_guide/io.rst Outdated
doc/source/user_guide/io.rst Outdated
doc/source/user_guide/io.rst Outdated
pandas/_libs/parsers.pyx
pandas/io/parsers/base_parser.py Outdated
else:
raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
else:
if kwds.get("error_bad_lines"):
Contributor

these need a deprecation warning

Member Author

_deprecated_defaults handles this for us.

pandas/io/parsers/readers.py Outdated
pandas/io/parsers/readers.py Outdated
Contributor

@jreback jreback left a comment

maybe I am not seeing it, but it seems that the old options (error_bad_lines / warn_bad_lines) are still passed around. These should be handled in exactly 1 place, then immediately discarded (with a possible warning / error if multiple things are specified), and then only on_bad_lines should exist.
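
The "handle it in exactly one place" idea from this comment can be sketched roughly as follows. This is not the code from the PR: `resolve_on_bad_lines` is a hypothetical helper, and the mapping of the legacy flags (error_bad_lines and warn_bad_lines both behaving as True when left unset) is an assumption based on the defaults shown in the diff above. Only the enum name `BadLineHandleMethod` comes from the PR.

```python
import warnings
from enum import Enum


class BadLineHandleMethod(Enum):
    ERROR = 0
    WARN = 1
    SKIP = 2


def resolve_on_bad_lines(on_bad_lines=None, error_bad_lines=None, warn_bad_lines=None):
    """Hypothetical helper: collapse the three options into one enum value."""
    legacy_used = error_bad_lines is not None or warn_bad_lines is not None

    if on_bad_lines is not None:
        # Mixing the new option with the deprecated flags is ambiguous.
        if legacy_used:
            raise ValueError("Set only on_bad_lines, not the deprecated flags too")
        try:
            return BadLineHandleMethod[str(on_bad_lines).upper()]
        except KeyError:
            raise ValueError(
                f"Argument {on_bad_lines} is invalid for on_bad_lines"
            ) from None

    if not legacy_used:
        # Nothing specified: keep the historical behaviour of raising.
        return BadLineHandleMethod.ERROR

    warnings.warn(
        "error_bad_lines and warn_bad_lines are deprecated; use on_bad_lines",
        FutureWarning,
        stacklevel=2,
    )
    # Assumed legacy semantics: an unset flag behaves as True.
    if error_bad_lines is None or error_bad_lines:
        return BadLineHandleMethod.ERROR
    if warn_bad_lines is None or warn_bad_lines:
        return BadLineHandleMethod.WARN
    return BadLineHandleMethod.SKIP
```

After a call like this, downstream parser code would only ever need to look at the returned enum value, which is the property the review asks for.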

self.on_bad_lines = self.BadLineHandleMethod.SKIP
else:
raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
# Override on_bad_lines w/ deprecated args for backward compatibility
Contributor

shouldn't all of these cases show a deprecation warning? L227-238

Member Author

Like I said before, this is handled by _deprecated_defaults. Ideally _deprecated_defaults wouldn't exist, but as we all know, read_csv is very bloated in terms of API, so it's nice to show the deprecation warnings from one place. If you are still unsure, this is tested in this PR here https://github.com/pandas-dev/pandas/pull/40413/files#diff-2f09043040b80b8e52cb8525b43b1653de0ce7d04e0999ca8e0dbe69b05b88faR758-R770
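
To illustrate the general pattern of emitting deprecation warnings from one place, here is a rough sketch. It is not the pandas implementation of _deprecated_defaults: the table `DEPRECATED_DEFAULTS`, the function `warn_on_deprecated`, and the message text are all hypothetical. It assumes each deprecated keyword uses None as its "not passed" sentinel, as the review above suggests.

```python
import warnings

# Hypothetical table: deprecated keyword -> sentinel meaning "not passed".
DEPRECATED_DEFAULTS = {"error_bad_lines": None, "warn_bad_lines": None}


def warn_on_deprecated(kwds):
    """Warn once per deprecated keyword the caller set; return their names."""
    used = []
    for name, sentinel in DEPRECATED_DEFAULTS.items():
        # A value other than the sentinel means the caller passed it explicitly.
        if kwds.get(name, sentinel) is not sentinel:
            warnings.warn(
                f"The {name} argument is deprecated; use on_bad_lines instead.",
                FutureWarning,
                stacklevel=3,
            )
            used.append(name)
    return used
```

Because every deprecated keyword goes through the same loop, adding or retiring one only means editing the table.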

@lithomas1 lithomas1 requested a review from jreback May 25, 2021 00:17
@lithomas1 lithomas1 added this to the 1.3 milestone May 25, 2021
@jreback jreback merged commit fd346ae into pandas-dev:master May 28, 2021
@jreback
Contributor

jreback commented May 28, 2021

thanks @lithomas1 this was a bear! thanks for sticking with it

@lithomas1 lithomas1 deleted the depr-bad-lines branch May 28, 2021 20:36
@lithomas1
Member Author

Thanks for the reviews @jreback. Would it be possible to mention this in the deprecations tracker #30228? (I can't edit it myself.)

@jreback
Contributor

jreback commented May 28, 2021

updated

TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
zhengruifeng pushed a commit to apache/spark that referenced this pull request Aug 22, 2023
… from `read_csv` & enabling more tests

### What changes were proposed in this pull request?

This PR proposes to remove `squeeze` parameter from `read_csv` to follow the behavior of latest pandas. See pandas-dev/pandas#40413 and pandas-dev/pandas#43427 for detail.

This PR also enables more tests for pandas 2.0.0 and above.

### Why are the changes needed?

To follow the behavior of latest pandas, and increase the test coverage.

### Does this PR introduce _any_ user-facing change?

`squeeze` will be no longer available from `read_csv`. Otherwise, it's test-only.

### How was this patch tested?

Enabling & updating the existing tests.

Closes #42551 from itholic/pandas_remaining_tests.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
Development

Successfully merging this pull request may close these issues.

API/ENH: read_csv handling of bad lines (too many/few fields)
4 participants