wip-feat: pandas as soft dependency #3384

Open
wants to merge 10 commits into
base: main

Conversation

mattijn
Contributor

@mattijn mattijn commented Mar 25, 2024

This PR is an attempt to make pandas a soft dependency. I hope it can be used as inspiration, as I was not able to make the types happy. I've no real idea how it should be done, but I've been trying a few things, some with success and others without.

I also tried to prioritize the DataFrameLike approach over the pandas routine, but decided against it, because otherwise using a pandas DataFrame within Altair would require pyarrow to infer and serialize the data. My current feeling is that using pandas itself to infer and serialize the data is still preferable, as pandas does not yet depend on pyarrow.

@binste
Contributor

binste commented Mar 29, 2024

Great to get the ball rolling on this, thank you @mattijn! I haven't had time to review yet, but I just wanted to say that I'm happy to have a look at the types once I get to it. As long as the package works, I'm optimistic that we can make mypy happy.

@mattijn
Contributor Author

mattijn commented Mar 29, 2024

Thanks @binste! No rush! Maybe something for version 5.4.

Contributor

@binste binste left a comment

Just some first comments. I haven't had the chance to run mypy on this PR yet (I reviewed it in the browser), but I have some ideas for how to make it work, which I want to try out depending on the errors it throws.



def import_pandas() -> ModuleType:
    min_version = "0.25"
Contributor

Could you add a comment in pyproject.toml, next to the pandas requirement, noting that if the pandas version is updated there, it also needs to be changed here? Although I'm realizing now that that file needs to be changed anyway to make pandas optional.

        return curried.pipe(data, data_transformers.get())
    elif isinstance(data, str):
        return {"url": data}
    elif _is_pandas_dataframe(data):
Contributor

Is my understanding correct that this line is only reached with an old pandas version that does not support the dataframe interchange protocol? Otherwise it would already stop at line 43, right?

If yes, could you add a comment about this?

@@ -53,6 +52,11 @@ def __dataframe__(
) -> DfiDataFrame: ...


def _is_pandas_dataframe(obj: Any) -> bool:
Contributor

Could this function be a simple isinstance(obj, pd.DataFrame)?

Contributor Author

Thanks for starting to review this PR, @binste! I don't think I can do this without importing pandas first.

I tried setting up a function that I could use for some duck typing:

def instance(obj):
    return type(obj).__name__

But I found out that polars and pandas both name their dataframe class DataFrame, so the type name alone cannot tell them apart.

Contributor

Maybe I'm missing something, but couldn't we call the pandas import function you created in here? If it raises an ImportError, we know it's not a pandas DataFrame anyway.

Contributor Author

It's pragmatic, I admit. But it would trigger an unnecessary import of pandas whenever pandas is available in the environment and the data object is something else.
I wish we could sniff the type without importing modules first.

Contributor

Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py. If should_load is False, it won't perform the import even if the library is installed. This was used with isinstance checks: if pandas hasn't been loaded yet, you know the object you're dealing with isn't a pandas DataFrame, even if pandas is installed.

        return pd
    except ImportError as err:
        raise ImportError(
            f"Serialization of the DataFrame requires\n"
Contributor

Suggested change
- f"Serialization of the DataFrame requires\n"
+ f"Serialization of this data requires\n"

The data can also be a dict, as in _data_to_csv_string in data.py. Furthermore, if it's a DataFrame, pandas is already known to be installed.

Comment on lines +47 to +49
if TYPE_CHECKING:
    pass

Contributor

Suggested change
- if TYPE_CHECKING:
-     pass

Aware that it's just a wip PR, thought I'd just note it anyway :)

Comment on lines +51 to +53
class _PandasTimestamp:
    def isoformat(self):
        return "dummy_isoformat"  # Return a dummy ISO format string
Contributor

I think this should inherit from Protocol, since a pd.Timestamp is not an instance of _PandasTimestamp. You'll then also need to add the @runtime_checkable decorator from typing. Also, could we test directly for a pandas timestamp in a function similar to _is_pandas_dataframe, to keep the approaches consistent?
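A minimal sketch of the Protocol idea, with the hypothetical names `SupportsIsoformat` and `to_iso_string` standing in for whatever the PR ends up using. Note that a runtime-checkable protocol only verifies that the method exists, so it also matches `datetime.datetime` and `datetime.date` objects, not just pandas timestamps.

```python
import datetime
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsIsoformat(Protocol):
    """Structural type: anything with an isoformat() method."""

    def isoformat(self) -> str: ...


def to_iso_string(value: Any) -> str:
    """Serialize timestamp-like objects (hypothetical helper)."""
    if isinstance(value, SupportsIsoformat):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__} as a timestamp")


print(to_iso_string(datetime.datetime(2024, 3, 25)))  # 2024-03-25T00:00:00
```

If matching only pandas timestamps matters, a dedicated check in the style of _is_pandas_dataframe would be stricter than the structural one.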

@@ -4,11 +4,11 @@

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype
Contributor

Let's make the tests also run without pandas installed, so that we can run the whole test suite once with pandas and once without. That prevents us from accidentally reintroducing a hard dependency in the future.
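As a rough sketch of how a test module could run in both environments, the tests that need pandas can be skipped when it is absent, while the rest always run. The function `to_records` is a hypothetical stand-in for code under test and is not part of Altair.

```python
import importlib.util
import unittest

# True when pandas can be found, without actually importing it.
HAS_PANDAS = importlib.util.find_spec("pandas") is not None


def to_records(data):
    """Hypothetical function under test: lists pass through untouched."""
    if isinstance(data, list):
        return data
    return data.to_dict(orient="records")


class TestWithoutPandas(unittest.TestCase):
    def test_list_of_dicts(self):
        self.assertEqual(to_records([{"a": 1}]), [{"a": 1}])


@unittest.skipUnless(HAS_PANDAS, "pandas is not installed")
class TestWithPandas(unittest.TestCase):
    def test_dataframe(self):
        import pandas as pd  # imported inside the test, never at module level

        df = pd.DataFrame({"a": [1, 2]})
        self.assertEqual(to_records(df), [{"a": 1}, {"a": 2}])
```

With pytest, `pytest.importorskip("pandas")` serves the same purpose. Running the suite in one environment with pandas and one without would then surface any module-level `import pandas` that slips back in.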

4 participants