wip-feat: pandas as soft dependency #3384
base: main
…nd index attributes over type
Great to get the ball rolling on this, thank you @mattijn! I did not yet have time to review but just wanted to say that I'm happy to have a look at the types once I get to it. As long as the package works, I'm optimistic that we can make mypy happy.
Thanks @binste! No rush! Maybe something for version 5.4.
Just some first comments. I haven't had the chance yet to run mypy on this PR (I reviewed it in the browser), but I have some ideas for how to make it work, which I want to try out depending on the errors it throws.
def import_pandas() -> ModuleType:
    min_version = "0.25"
Could you add a comment in pyproject.toml, next to the pandas requirement, noting that if the pandas version is updated, it also needs to be changed here? Although I'm realizing now that that file needs to be changed anyway to make pandas optional.
        return curried.pipe(data, data_transformers.get())
    elif isinstance(data, str):
        return {"url": data}
    elif _is_pandas_dataframe(data):
Is my understanding correct that this line is only reached if it's an old Pandas version which does not support the dataframe interchange protocol? Else it would already stop at line 43, right?
If yes, could you add a comment about this?
@@ -53,6 +52,11 @@ def __dataframe__(
    ) -> DfiDataFrame: ...


def _is_pandas_dataframe(obj: Any) -> bool:
Could this function be a simple isinstance(obj, pd.DataFrame)?
Thanks for starting to review this PR, @binste! I don't think I can do this without importing pandas first.
I tried setting up a function on which I can do some duck typing:
def instance(obj):
    return type(obj).__name__
But I found out that both polars and pandas use the type name DataFrame for their dataframe.
Maybe I'm missing something, but couldn't we call the pandas import function you created in here? If it raises an ImportError, we know it's not a pandas dataframe anyway.
It's pragmatic, I admit. But that would trigger an unnecessary import of pandas whenever it is available in the environment, even when the data object is something else.
I wish we could sniff the type without importing modules first.
Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py. If should_load is False, then it won't perform the import even if the library is installed. This was used with isinstance checks, because if pandas hasn't been loaded yet, you know the object you're dealing with isn't a pandas DataFrame, even if pandas is installed.
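A minimal sketch of that idea, assuming nothing beyond the standard library: look pandas up in `sys.modules` instead of importing it. The function name `is_pandas_dataframe` is illustrative and this is not the plotly.py implementation, only the same principle.

```python
import sys
from typing import Any


def is_pandas_dataframe(obj: Any) -> bool:
    """Return True only if pandas is already loaded and obj is a pandas DataFrame.

    pandas is never imported here: if it is absent from sys.modules, no object
    in this process can be an instance of pandas.DataFrame.
    """
    pd = sys.modules.get("pandas")
    if pd is None:
        return False
    # getattr guards against a partially initialised module without DataFrame.
    dataframe_cls = getattr(pd, "DataFrame", None)
    return dataframe_cls is not None and isinstance(obj, dataframe_cls)
```

Because the check is a real isinstance against the pandas class, a polars DataFrame is not mistaken for a pandas one, which was the problem with comparing type names.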
        return pd
    except ImportError as err:
        raise ImportError(
            f"Serialization of the DataFrame requires\n"
- f"Serialization of the DataFrame requires\n"
+ f"Serialization of this data requires\n"
It can also be a dict, as in data.py: _data_to_csv_string. Furthermore, if it's a dataframe, it's already given that Pandas is installed.
if TYPE_CHECKING:
    pass
- if TYPE_CHECKING:
-     pass
Aware that it's just a wip PR, thought I'd just note it anyway :)
class _PandasTimestamp:
    def isoformat(self):
        return "dummy_isoformat"  # Return a dummy ISO format string
I think this should inherit from a Protocol, as a pd.Timestamp is not an instance of _PandasTimestamp. You'll then also need to add the @runtime_checkable decorator from typing. Also, we could directly test for a pandas timestamp in a similar function to is_pandas_dataframe to keep these approaches consistent?
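As a rough illustration of the Protocol suggestion, the sketch below defines a runtime-checkable protocol for anything with an `isoformat` method. The name `SupportsIsoformat` is made up for this example and is not part of the PR.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsIsoformat(Protocol):
    """Structural type: any object that provides an isoformat() method."""

    def isoformat(self) -> str: ...


def to_iso(value: object) -> str:
    """Serialize timestamp-like values without importing pandas."""
    if isinstance(value, SupportsIsoformat):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__} as a timestamp")
```

One caveat: a runtime-checkable isinstance check only verifies that the method exists, so `datetime.date` and `datetime.datetime` objects match as well. If the code must single out pandas timestamps specifically, a dedicated check in the style of is_pandas_dataframe would be the stricter option.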
@@ -4,11 +4,11 @@

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype
Let's make the tests also run without pandas installed so that we can run the whole test suite once with pandas installed and once without. This prevents us from accidentally reintroducing a hard dependency in the future.
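Running the suite in an environment where pandas is truly absent is the most faithful check. As a lighter complement, a test could simulate the missing package. The sketch below is a hypothetical helper (`module_unavailable` is not an existing pytest or Altair utility) that relies on the documented behaviour that a `None` entry in `sys.modules` makes the import raise `ModuleNotFoundError`.

```python
import contextlib
import sys
from typing import Iterator


@contextlib.contextmanager
def module_unavailable(name: str) -> Iterator[None]:
    """Temporarily make `import name` fail, as if the package were not installed."""
    # Remember and remove the module and its submodules if already loaded.
    saved = {
        key: mod
        for key, mod in sys.modules.items()
        if key == name or key.startswith(name + ".")
    }
    for key in saved:
        del sys.modules[key]
    # A None entry makes the import system raise ModuleNotFoundError.
    sys.modules[name] = None  # type: ignore[assignment]
    try:
        yield
    finally:
        sys.modules.pop(name, None)
        sys.modules.update(saved)
```

Note the limitation: modules that already hold a reference to pandas from an earlier top-level import are unaffected, so the code under test has to perform its import inside the `with module_unavailable("pandas"):` block for the simulation to mean anything.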
This PR is an attempt to make pandas a soft dependency. I hope it can be used as inspiration, as I was not able to make the types happy. I've no real idea how it should be done, but I've been trying a few things, some with success and others without.
I also made an attempt to prioritize the DataFrameLike approach over the pandas routine, but decided against it, since using a pandas DataFrame within Altair would then require pyarrow to infer and serialize the data. My current feeling is that using pandas to infer and serialize the data is still preferable, as it does not yet depend on pyarrow.