Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support pandas 2 in flytekit #1818

Merged
merged 63 commits into from
Dec 18, 2023
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
6c15fda
remove pandas
pingsutw Sep 3, 2023
e4a5e55
update dev requirement
pingsutw Sep 16, 2023
1b49769
update dev requirement
pingsutw Sep 16, 2023
909fbce
update dev requirement
pingsutw Sep 16, 2023
f01e3b7
fix tests
pingsutw Sep 16, 2023
3962fe6
return dummy module if module is not installed
pingsutw Oct 10, 2023
015a0ac
Merge branch 'master' of github.com:flyteorg/flytekit into remove-pandas
pingsutw Oct 10, 2023
68c032c
nit
pingsutw Oct 10, 2023
30eee83
fixed tests
pingsutw Oct 10, 2023
0347700
nit
pingsutw Oct 10, 2023
8e0b58f
fixed tests
pingsutw Oct 10, 2023
4fb0c03
fixed tests
pingsutw Oct 10, 2023
4d30607
fixed tests
pingsutw Oct 10, 2023
e2b5450
fixed tests
pingsutw Oct 10, 2023
3e64dcb
nit
pingsutw Oct 10, 2023
bf968f3
Add tests
pingsutw Oct 10, 2023
ae53480
lint
pingsutw Oct 10, 2023
c2921d0
lint
pingsutw Oct 10, 2023
df88479
merged master
pingsutw Nov 6, 2023
6d5d9fb
merged master
pingsutw Nov 16, 2023
d3b8e5f
add pyarrow to dev
pingsutw Nov 16, 2023
2e88e74
merged master
pingsutw Nov 17, 2023
4edec78
merged master
pingsutw Nov 21, 2023
5f477c7
nit
pingsutw Nov 21, 2023
fdaa5f6
rename
pingsutw Nov 21, 2023
f6ab978
Test pandas 1 & 2
pingsutw Nov 25, 2023
10b00c3
Merged master
pingsutw Nov 25, 2023
e68da19
update GA workflow
pingsutw Nov 25, 2023
c57c0fe
update GA workflow
pingsutw Nov 25, 2023
94c9aca
unit test without pandas
pingsutw Nov 25, 2023
10d8466
fix tests
pingsutw Nov 25, 2023
e15910d
lint
pingsutw Nov 25, 2023
d5669da
fix tests
pingsutw Nov 25, 2023
85a431b
Add pyarrow back
pingsutw Nov 25, 2023
97bd5e9
Skip some tests
pingsutw Nov 26, 2023
242436d
Skip some tests
pingsutw Nov 26, 2023
bb7ab63
Fix tests
pingsutw Nov 26, 2023
22ba6c0
fix tests
pingsutw Nov 26, 2023
dc6119f
nit
pingsutw Nov 26, 2023
1dcba35
lint
pingsutw Nov 26, 2023
0e8ea2c
Fix tests
pingsutw Nov 26, 2023
b29ec36
Fix tests
pingsutw Nov 26, 2023
e6c3a0d
Skip specific test
pingsutw Nov 27, 2023
8a2d047
nit
pingsutw Nov 27, 2023
b311082
update-pytest-mark
pingsutw Nov 27, 2023
37af712
fix-test
pingsutw Nov 27, 2023
68fcc8f
lint
pingsutw Nov 27, 2023
b47f436
Fix tests
pingsutw Nov 28, 2023
788f918
fix-test
pingsutw Nov 28, 2023
18c4fb8
lint
pingsutw Nov 28, 2023
97b8547
fix-test
pingsutw Nov 28, 2023
1185631
uninstall -y
pingsutw Nov 28, 2023
351d673
fixed test
pingsutw Dec 15, 2023
d05a11f
Merged master
pingsutw Dec 15, 2023
30a986a
ni
pingsutw Dec 15, 2023
55c8847
update test_union_in_dataclass
pingsutw Dec 15, 2023
1da781d
lint
pingsutw Dec 15, 2023
cae2746
Merge branch 'master' of github.com:flyteorg/flytekit into remove-pandas
pingsutw Dec 15, 2023
a155a0e
Update github workflow
pingsutw Dec 15, 2023
915818f
nit
pingsutw Dec 15, 2023
cef8fae
force reinstall
pingsutw Dec 15, 2023
9a8278b
nit
pingsutw Dec 15, 2023
52a1143
nit
pingsutw Dec 15, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
8 changes: 6 additions & 2 deletions dev-requirements.in
Original file line number Diff line number Diff line change
Expand Up @@ -35,11 +35,15 @@ torch<=2.0.0; python_version>='3.11' or platform_system!='Windows'
# Once a solution is found, this should be updated to support Windows as well.
python-magic; (platform_system=='Darwin' or platform_system=='Linux')

pillow
scikit-learn
types-protobuf
types-croniter
types-mock
autoflake

pillow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this would be too annoying for local development? We'll remember to install pandas + pyarrow to enable some for the test suite.

I am okay with it. For others maybe a comment:

Suggested change
pillow
# Optional runtime dependnecies: Uncomment to enable tests on pandas
# pandas
pillow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added pandas to dev-requirement, and uninstall it in the GA workflow instead.

pandas
pyarrow
numpy
scikit-learn
types-requests
prometheus-client
14 changes: 14 additions & 0 deletions flytekit/lazy_import/lazy_module.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,19 @@
import importlib.util
import sys
import types

LAZY_MODULES = []


class LazyModule(types.ModuleType):
def __init__(self, module_name: str):
super().__init__(module_name)
self._module_name = module_name

def __getattribute__(self, attr):
raise ImportError(f"Module {object.__getattribute__(self, '_module_name')} is not yet installed.")


def is_imported(module_name):
"""
This function is used to check if a module has been imported by the regular import.
Expand All @@ -24,6 +34,10 @@ def lazy_module(fullname):
return sys.modules[fullname]
# https://docs.python.org/3/library/importlib.html#implementing-lazy-imports
spec = importlib.util.find_spec(fullname)
if spec is None:
# Return a dummy module if the module is not found in the python environment,
# so that we can raise a proper error when the user tries to access an attribute in the module.
return LazyModule(fullname)
loader = importlib.util.LazyLoader(spec.loader)
spec.loader = loader
module = importlib.util.module_from_spec(spec)
Expand Down
26 changes: 16 additions & 10 deletions flytekit/types/structured/basic_dfs.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,11 @@
from pathlib import Path
from typing import TypeVar

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from botocore.exceptions import NoCredentialsError
from fsspec.core import split_protocol, strip_protocol
from fsspec.utils import get_protocol

from flytekit import FlyteContext, logger
from flytekit import FlyteContext, lazy_module, logger
from flytekit.configuration import DataConfig
from flytekit.core.data_persistence import get_fsspec_storage_options
from flytekit.models import literals
Expand All @@ -24,6 +21,13 @@
StructuredDatasetEncoder,
)

if typing.TYPE_CHECKING:
import pandas as pd
import pyarrow as pa

Check warning on line 26 in flytekit/types/structured/basic_dfs.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/structured/basic_dfs.py#L25-L26

Added lines #L25 - L26 were not covered by tests
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
else:
pd = lazy_module("pandas")
pa = lazy_module("pyarrow")

T = TypeVar("T")


Expand Down Expand Up @@ -70,7 +74,7 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pd.DataFrame:
) -> "pd.DataFrame":
uri = flyte_value.uri
columns = None
kwargs = get_pandas_storage_options(uri=uri, data_config=ctx.file_access.data_config)
Expand Down Expand Up @@ -121,7 +125,7 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pd.DataFrame:
) -> "pd.DataFrame":
uri = flyte_value.uri
columns = None
kwargs = get_pandas_storage_options(uri=uri, data_config=ctx.file_access.data_config)
Expand All @@ -145,9 +149,9 @@
structured_dataset: StructuredDataset,
structured_dataset_type: StructuredDatasetType,
) -> literals.StructuredDataset:
uri = typing.cast(str, structured_dataset.uri) or ctx.file_access.join(
ctx.file_access.raw_output_prefix, ctx.file_access.get_random_string()
)
import pyarrow.parquet as pq
pingsutw marked this conversation as resolved.
Show resolved Hide resolved

uri = typing.cast(str, structured_dataset.uri) or ctx.file_access.get_random_remote_directory()
if not ctx.file_access.is_remote(uri):
Path(uri).mkdir(parents=True, exist_ok=True)
path = os.path.join(uri, f"{0:05}")
Expand All @@ -165,7 +169,9 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pa.Table:
) -> "pa.Table":
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
import pyarrow.parquet as pq

uri = flyte_value.uri
if not ctx.file_access.is_remote(uri):
Path(uri).parent.mkdir(parents=True, exist_ok=True)
Expand Down
3 changes: 0 additions & 3 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,11 +53,8 @@
# TODO: remove upper-bound after fixing change in contract
"marshmallow-jsonschema>=0.12.0",
"mashumaro>=3.9.1",
"numpy",
"pandas>=1.0.0,<2.0.0",
# TODO: Remove upper-bound after protobuf community fixes it. https://github.com/flyteorg/flyte/issues/4359
"protobuf<4.25.0",
"pyarrow>=4.0.0,<11.0.0",
"python-json-logger>=2.0.0",
"pytimeparse>=1.1.8,<2.0.0",
"pyyaml!=6.0.0,!=5.4.0,!=5.4.1", # pyyaml is broken with cython 3: https://github.com/yaml/pyyaml/issues/601
Expand Down
2 changes: 1 addition & 1 deletion tests/flytekit/unit/core/test_type_hints.py
Original file line number Diff line number Diff line change
Expand Up @@ -858,7 +858,7 @@ def t1(a: int) -> pandas.DataFrame:

@task
def t2(df: pandas.DataFrame) -> pandas.DataFrame:
return df.append(pandas.DataFrame(data={"col1": [5, 10], "col2": [5, 10]}))
return pd.concat([df, pandas.DataFrame(data={"col1": [5, 10], "col2": [5, 10]})])

@workflow
def my_wf(a: int) -> pandas.DataFrame:
Expand Down
Empty file.
12 changes: 12 additions & 0 deletions tests/flytekit/unit/lazy_module/test_lazy_module.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
import pytest

from flytekit.lazy_import.lazy_module import LazyModule, lazy_module


def test_lazy_module():
mod = lazy_module("pandas")
assert mod.__name__ == "pandas"
mod = lazy_module("fake_module")
assert isinstance(mod, LazyModule)
with pytest.raises(ImportError, match="Module fake_module is not yet installed."):
print(mod.attr)