Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC replace pandas with Polars in examples/gaussian_process/plot_gpr_co2.py #28804

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

raisadz
Copy link
Contributor

@raisadz raisadz commented Apr 10, 2024

Reference Issues/PRs

Related to #28341

What does this implement/fix? Explain your changes.

Replacing pandas with Polars allows to avoid checking pandas version when resampling monthly CO2.

Copy link

github-actions bot commented Apr 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c423b91. Link to the linter CI: here

@raisadz raisadz changed the title DOC replacing pandas with Polars in examples/gaussian_process/plot_gpr_co2.py DOC replace pandas with Polars in examples/gaussian_process/plot_gpr_co2.py Apr 10, 2024
@raisadz raisadz marked this pull request as ready for review April 10, 2024 14:47
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

co2_data = co2.frame
co2_data["date"] = pd.to_datetime(co2_data[["year", "month", "day"]])
co2_data = co2_data[["date", "co2"]].set_index("date")
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a round trip here? pl.DataFrame(pd.DataFrame) doesn't work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. Converting from/to pandas requires pyarrow as a dependency, and because scikit-learn doesn't have it, the following error will occur when trying to convert a pandas dataframe to Polars:

ModuleNotFoundError: pa.array requires 'pyarrow' module to be installed

There is an old related issue converting pl.to_pandas pola-rs/polars#3398

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue on the polars side: pola-rs/polars#15845

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we can do:

Suggested change
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})
co2_data = pl.from_dataframe(co2.frame)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like pl.from_dataframe still needs pyarrow for older Polars versions and the checks fail for the minimum Polars version 0.19.12,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't really depend on polars and it's only used in our CI for docs, I'd be in favor of moving the min version to something that supports this.

WDYT @ogrisel @thomasjpfan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that pl.from_dataframe works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that pl.from_dataframe works without pyarrow starting from Polars version 0.20.4. I can increase the minimum required Polars version from 0.19.12 to 0.20.4 if you prefer this option.

I am okay with this option.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this PR now uses pola-rs/polars#15933 and directly calls pl.DataFrame which doesn't require PyArrow anymore for the types in the input here.

However, that means minimum required version is the one released yesterday (28.04.2024). I'm personally okay with this. But I'd rather have another set of eyes on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants