Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support pandas 2 in flytekit #1818

Merged
merged 63 commits into from
Dec 18, 2023
Merged
Show file tree
Hide file tree
Changes from 42 commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
6c15fda
remove pandas
pingsutw Sep 3, 2023
e4a5e55
update dev requirement
pingsutw Sep 16, 2023
1b49769
update dev requirement
pingsutw Sep 16, 2023
909fbce
update dev requirement
pingsutw Sep 16, 2023
f01e3b7
fix tests
pingsutw Sep 16, 2023
3962fe6
return dummy module if module is not installed
pingsutw Oct 10, 2023
015a0ac
Merge branch 'master' of github.com:flyteorg/flytekit into remove-pandas
pingsutw Oct 10, 2023
68c032c
nit
pingsutw Oct 10, 2023
30eee83
fixed tests
pingsutw Oct 10, 2023
0347700
nit
pingsutw Oct 10, 2023
8e0b58f
fixed tests
pingsutw Oct 10, 2023
4fb0c03
fixed tests
pingsutw Oct 10, 2023
4d30607
fixed tests
pingsutw Oct 10, 2023
e2b5450
fixed tests
pingsutw Oct 10, 2023
3e64dcb
nit
pingsutw Oct 10, 2023
bf968f3
Add tests
pingsutw Oct 10, 2023
ae53480
lint
pingsutw Oct 10, 2023
c2921d0
lint
pingsutw Oct 10, 2023
df88479
merged master
pingsutw Nov 6, 2023
6d5d9fb
merged master
pingsutw Nov 16, 2023
d3b8e5f
add pyarrow to dev
pingsutw Nov 16, 2023
2e88e74
merged master
pingsutw Nov 17, 2023
4edec78
merged master
pingsutw Nov 21, 2023
5f477c7
nit
pingsutw Nov 21, 2023
fdaa5f6
rename
pingsutw Nov 21, 2023
f6ab978
Test pandas 1 & 2
pingsutw Nov 25, 2023
10b00c3
Merged master
pingsutw Nov 25, 2023
e68da19
update GA workflow
pingsutw Nov 25, 2023
c57c0fe
update GA workflow
pingsutw Nov 25, 2023
94c9aca
unit test without pandas
pingsutw Nov 25, 2023
10d8466
fix tests
pingsutw Nov 25, 2023
e15910d
lint
pingsutw Nov 25, 2023
d5669da
fix tests
pingsutw Nov 25, 2023
85a431b
Add pyarrow back
pingsutw Nov 25, 2023
97bd5e9
Skip some tests
pingsutw Nov 26, 2023
242436d
Skip some tests
pingsutw Nov 26, 2023
bb7ab63
Fix tests
pingsutw Nov 26, 2023
22ba6c0
fix tests
pingsutw Nov 26, 2023
dc6119f
nit
pingsutw Nov 26, 2023
1dcba35
lint
pingsutw Nov 26, 2023
0e8ea2c
Fix tests
pingsutw Nov 26, 2023
b29ec36
Fix tests
pingsutw Nov 26, 2023
e6c3a0d
Skip specific test
pingsutw Nov 27, 2023
8a2d047
nit
pingsutw Nov 27, 2023
b311082
update-pytest-mark
pingsutw Nov 27, 2023
37af712
fix-test
pingsutw Nov 27, 2023
68fcc8f
lint
pingsutw Nov 27, 2023
b47f436
Fix tests
pingsutw Nov 28, 2023
788f918
fix-test
pingsutw Nov 28, 2023
18c4fb8
lint
pingsutw Nov 28, 2023
97b8547
fix-test
pingsutw Nov 28, 2023
1185631
uninstall -y
pingsutw Nov 28, 2023
351d673
fixed test
pingsutw Dec 15, 2023
d05a11f
Merged master
pingsutw Dec 15, 2023
30a986a
ni
pingsutw Dec 15, 2023
55c8847
update test_union_in_dataclass
pingsutw Dec 15, 2023
1da781d
lint
pingsutw Dec 15, 2023
cae2746
Merge branch 'master' of github.com:flyteorg/flytekit into remove-pandas
pingsutw Dec 15, 2023
a155a0e
Update github workflow
pingsutw Dec 15, 2023
915818f
nit
pingsutw Dec 15, 2023
cef8fae
force reinstall
pingsutw Dec 15, 2023
9a8278b
nit
pingsutw Dec 15, 2023
52a1143
nit
pingsutw Dec 15, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 16 additions & 1 deletion .github/workflows/pythonbuild.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,12 @@ jobs:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
python-version: ["3.8", "3.11"]
install-pandas: ["true", "false"]
exclude:
- os: macos-latest
install-pandas: "false"
- os: windows-latest
install-pandas: "false"
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
Expand All @@ -34,6 +40,12 @@ jobs:
path: ~/.cache/pip
# Look to see if there is a cache hit for the corresponding requirements files
key: ${{ format('{0}-pip-{1}', runner.os, hashFiles('dev-requirements.in', 'requirements.in')) }}
- name: Install pandas v1
if: ${{ (matrix.python-version=='3.8') && (matrix.install-pandas=='true') }}
run: pip install "pandas<2.0.0"
- name: Install pandas v2
if: ${{ matrix.python-version=='3.11' && (matrix.install-pandas=='true') }}
run: pip install "pandas>=2.0.0"
- name: Install dependencies
run: |
make setup && pip freeze
Expand Down Expand Up @@ -106,7 +118,9 @@ jobs:
# Look to see if there is a cache hit for the corresponding requirements files
key: ${{ format('{0}-pip-{1}', runner.os, hashFiles('dev-requirements.in', 'requirements.in')) }}
- name: Install dependencies
run: make setup && pip freeze
run: |
pip install pandas
make setup && pip freeze
- name: Install FlyteCTL
uses: unionai-oss/flytectl-setup-action@master
- name: Setup Flyte Sandbox
Expand Down Expand Up @@ -217,6 +231,7 @@ jobs:
key: ${{ format('{0}-pip-{1}', runner.os, hashFiles('dev-requirements.txt', format('plugins/{0}/requirements.txt', matrix.plugin-names ))) }}
- name: Install dependencies
run: |
pip install pandas
make setup
cd plugins/${{ matrix.plugin-names }}
pip install .
Expand Down
6 changes: 4 additions & 2 deletions dev-requirements.in
Original file line number Diff line number Diff line change
Expand Up @@ -35,11 +35,13 @@ torch<=2.0.0; python_version>='3.11' or platform_system!='Windows'
# Once a solution is found, this should be updated to support Windows as well.
python-magic; (platform_system=='Darwin' or platform_system=='Linux')

pillow
scikit-learn
types-protobuf
types-croniter
types-mock
autoflake

pillow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this would be too annoying for local development? We'll remember to install pandas + pyarrow to enable some for the test suite.

I am okay with it. For others maybe a comment:

Suggested change
pillow
# Optional runtime dependnecies: Uncomment to enable tests on pandas
# pandas
pillow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added pandas to dev-requirement, and uninstall it in the GA workflow instead.

numpy
scikit-learn
types-requests
prometheus-client
11 changes: 7 additions & 4 deletions flytekit/extras/sqlite3/task.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,17 +5,20 @@
import tempfile
import typing
from dataclasses import dataclass

import pandas as pd

from flytekit import FlyteContext, kwtypes
from flytekit import FlyteContext, kwtypes, lazy_module
from flytekit.configuration import DefaultImages, SerializationSettings
from flytekit.core.base_sql_task import SQLTask
from flytekit.core.python_customized_container_task import PythonCustomizedContainerTask
from flytekit.core.shim_task import ShimTaskExecutor
from flytekit.models import task as task_models


if typing.TYPE_CHECKING:
import pandas as pd

Check warning on line 17 in flytekit/extras/sqlite3/task.py

View check run for this annotation

Codecov / codecov/patch

flytekit/extras/sqlite3/task.py#L17

Added line #L17 was not covered by tests
else:
pd = lazy_module("pandas")


def unarchive_file(local_path: str, to_dir: str):
"""
Unarchive given archive and returns the unarchived file name. It is expected that only one file is unarchived.
Expand Down
14 changes: 14 additions & 0 deletions flytekit/lazy_import/lazy_module.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,19 @@
import importlib.util
import sys
import types

LAZY_MODULES = []


class LazyModule(types.ModuleType):
def __init__(self, module_name: str):
super().__init__(module_name)
self._module_name = module_name

def __getattribute__(self, attr):
raise ImportError(f"Module {object.__getattribute__(self, '_module_name')} is not yet installed.")


def is_imported(module_name):
"""
This function is used to check if a module has been imported by the regular import.
Expand All @@ -24,6 +34,10 @@ def lazy_module(fullname):
return sys.modules[fullname]
# https://docs.python.org/3/library/importlib.html#implementing-lazy-imports
spec = importlib.util.find_spec(fullname)
if spec is None:
# Return a dummy module if the module is not found in the python environment,
# so that we can raise a proper error when the user tries to access an attribute in the module.
return LazyModule(fullname)
loader = importlib.util.LazyLoader(spec.loader)
spec.loader = loader
module = importlib.util.module_from_spec(spec)
Expand Down
26 changes: 16 additions & 10 deletions flytekit/types/structured/basic_dfs.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,11 @@
from pathlib import Path
from typing import TypeVar

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from botocore.exceptions import NoCredentialsError
from fsspec.core import split_protocol, strip_protocol
from fsspec.utils import get_protocol

from flytekit import FlyteContext, logger
from flytekit import FlyteContext, lazy_module, logger
from flytekit.configuration import DataConfig
from flytekit.core.data_persistence import get_fsspec_storage_options
from flytekit.models import literals
Expand All @@ -24,6 +21,13 @@
StructuredDatasetEncoder,
)

if typing.TYPE_CHECKING:
import pandas as pd
import pyarrow as pa

Check warning on line 26 in flytekit/types/structured/basic_dfs.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/structured/basic_dfs.py#L25-L26

Added lines #L25 - L26 were not covered by tests
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
else:
pd = lazy_module("pandas")
pa = lazy_module("pyarrow")

T = TypeVar("T")


Expand Down Expand Up @@ -70,7 +74,7 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pd.DataFrame:
) -> "pd.DataFrame":
uri = flyte_value.uri
columns = None
kwargs = get_pandas_storage_options(uri=uri, data_config=ctx.file_access.data_config)
Expand Down Expand Up @@ -121,7 +125,7 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pd.DataFrame:
) -> "pd.DataFrame":
uri = flyte_value.uri
columns = None
kwargs = get_pandas_storage_options(uri=uri, data_config=ctx.file_access.data_config)
Expand All @@ -145,9 +149,9 @@
structured_dataset: StructuredDataset,
structured_dataset_type: StructuredDatasetType,
) -> literals.StructuredDataset:
uri = typing.cast(str, structured_dataset.uri) or ctx.file_access.join(
ctx.file_access.raw_output_prefix, ctx.file_access.get_random_string()
)
import pyarrow.parquet as pq
pingsutw marked this conversation as resolved.
Show resolved Hide resolved

uri = typing.cast(str, structured_dataset.uri) or ctx.file_access.get_random_remote_directory()
if not ctx.file_access.is_remote(uri):
Path(uri).mkdir(parents=True, exist_ok=True)
path = os.path.join(uri, f"{0:05}")
Expand All @@ -165,7 +169,9 @@
ctx: FlyteContext,
flyte_value: literals.StructuredDataset,
current_task_metadata: StructuredDatasetMetadata,
) -> pa.Table:
) -> "pa.Table":
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
import pyarrow.parquet as pq

uri = flyte_value.uri
if not ctx.file_access.is_remote(uri):
Path(uri).parent.mkdir(parents=True, exist_ok=True)
Expand Down
11 changes: 8 additions & 3 deletions flytekit/types/structured/bigquery.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,10 @@
import re
import typing

import pandas as pd
import pyarrow as pa
from google.cloud import bigquery, bigquery_storage
from google.cloud.bigquery_storage_v1 import types

from flytekit import FlyteContext
from flytekit import FlyteContext, lazy_module

Check warning on line 7 in flytekit/types/structured/bigquery.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/structured/bigquery.py#L7

Added line #L7 was not covered by tests
from flytekit.models import literals
from flytekit.models.types import StructuredDatasetType
from flytekit.types.structured.structured_dataset import (
Expand All @@ -16,6 +14,13 @@
StructuredDatasetMetadata,
)

if typing.TYPE_CHECKING:
import pandas as pd
import pyarrow as pa

Check warning on line 19 in flytekit/types/structured/bigquery.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/structured/bigquery.py#L18-L19

Added lines #L18 - L19 were not covered by tests
else:
pd = lazy_module("pandas")
pa = lazy_module("pyarrow")

Check warning on line 22 in flytekit/types/structured/bigquery.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/structured/bigquery.py#L21-L22

Added lines #L21 - L22 were not covered by tests

BIGQUERY = "bq"


Expand Down
3 changes: 2 additions & 1 deletion flytekit/types/structured/structured_dataset.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from __future__ import annotations

import collections
import sys
import types
import typing
from abc import ABC, abstractmethod
Expand Down Expand Up @@ -150,7 +151,7 @@ def extract_cols_and_format(
if ordered_dict_cols is not None:
raise ValueError(f"Column information was already found {ordered_dict_cols}, cannot use {aa}")
ordered_dict_cols = aa
elif isinstance(aa, pa.lib.Schema):
elif "pyarrow" in sys.modules and isinstance(aa, pa.lib.Schema):
if pa_schema is not None:
raise ValueError(f"Arrow schema was already found {pa_schema}, cannot use {aa}")
pa_schema = aa
Expand Down
2 changes: 1 addition & 1 deletion plugins/flytekit-duckdb/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "duckdb"]
plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "duckdb", "pandas"]

__version__ = "0.0.0+develop"

Expand Down
2 changes: 1 addition & 1 deletion plugins/flytekit-mlflow/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

# TODO: support mlflow 2.0+
plugin_requires = ["flytekit>=1.1.0,<2.0.0", "plotly", "mlflow<2.0.0"]
plugin_requires = ["flytekit>=1.1.0,<2.0.0", "plotly", "mlflow<2.0.0", "pandas"]

__version__ = "0.0.0+develop"

Expand Down
5 changes: 1 addition & 4 deletions plugins/flytekit-polars/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,7 @@

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = [
"flytekit>=1.3.0b2,<2.0.0",
"polars>=0.8.27,<0.17.0",
]
plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "polars>=0.8.27,<0.17.0", "pandas"]

__version__ = "0.0.0+develop"

Expand Down
2 changes: 1 addition & 1 deletion plugins/flytekit-spark/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "pyspark>=3.0.0", "aiohttp", "flyteidl>=1.10.0"]
plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "pyspark>=3.0.0", "aiohttp", "flyteidl>=1.10.0", "pandas"]

__version__ = "0.0.0+develop"

Expand Down
2 changes: 1 addition & 1 deletion plugins/flytekit-sqlalchemy/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "sqlalchemy>=1.4.7"]
plugin_requires = ["flytekit>=1.3.0b2,<2.0.0", "sqlalchemy>=1.4.7", "pandas"]

__version__ = "0.0.0+develop"

Expand Down
4 changes: 1 addition & 3 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,11 +53,9 @@
# TODO: remove upper-bound after fixing change in contract
"marshmallow-jsonschema>=0.12.0",
"mashumaro>=3.9.1",
"numpy",
"pandas>=1.0.0,<2.0.0",
# TODO: Remove upper-bound after protobuf community fixes it. https://github.com/flyteorg/flyte/issues/4359
"protobuf<4.25.0",
"pyarrow>=4.0.0,<11.0.0",
"pyarrow",
"python-json-logger>=2.0.0",
"pytimeparse>=1.1.8,<2.0.0",
"pyyaml!=6.0.0,!=5.4.0,!=5.4.1", # pyyaml is broken with cython 3: https://github.com/yaml/pyyaml/issues/601
Expand Down
3 changes: 2 additions & 1 deletion tests/flytekit/unit/cli/pyflyte/test_run.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
import pathlib
import sys
from enum import Enum

import mock
import pytest
from click.testing import CliRunner
Expand All @@ -16,6 +15,8 @@
from flytekit.interaction.click_types import FileParamType
from flytekit.remote import FlyteRemote

pytest.importorskip("pandas")

WORKFLOW_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "workflow.py")
REMOTE_WORKFLOW_FILE = "https://raw.githubusercontent.com/flyteorg/flytesnacks/8337b64b33df046b2f6e4cba03c74b7bdc0c4fb1/cookbook/core/flyte_basics/basic_workflow.py"
IMPERATIVE_WORKFLOW_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "imperative_wf.py")
Expand Down
2 changes: 2 additions & 0 deletions tests/flytekit/unit/core/test_checkpoint.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@
from flytekit.core.checkpointer import SyncCheckpoint
from flytekit.core.local_cache import LocalTaskCache

pytest.importorskip("pandas")


def test_sync_checkpoint_write(tmpdir):
td_path = Path(tmpdir)
Expand Down
2 changes: 2 additions & 0 deletions tests/flytekit/unit/core/test_complex_nesting.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile

pytest.importorskip("pandas")


@dataclass
class MyProxyConfiguration(DataClassJsonMixin):
Expand Down
5 changes: 3 additions & 2 deletions tests/flytekit/unit/core/test_data_persistence.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import pytest
import io
import os
import pathlib
Expand All @@ -6,11 +7,11 @@
import tempfile

import mock
import pandas as pd
from azure.identity import ClientSecretCredential, DefaultAzureCredential

from flytekit.core.data_persistence import FileAccessProvider

pd = pytest.importorskip("pandas")


def test_get_manual_random_remote_path():
fp = FileAccessProvider("/tmp", "s3://my-bucket")
Expand Down
3 changes: 3 additions & 0 deletions tests/flytekit/unit/core/test_dataclass.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,14 @@
from dataclasses import dataclass
from typing import List

import pytest
from dataclasses_json import DataClassJsonMixin

from flytekit.core.task import task
from flytekit.core.workflow import workflow

pytest.importorskip("pandas")


def test_dataclass():
@dataclass
Expand Down
13 changes: 8 additions & 5 deletions tests/flytekit/unit/core/test_imperative.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,6 @@
import pytest
import typing
from collections import OrderedDict

import pandas as pd
import pytest

import flytekit.configuration
from flytekit.configuration import Image, ImageConfig
from flytekit.core.base_task import kwtypes
Expand All @@ -15,7 +12,13 @@
from flytekit.models import literals as literal_models
from flytekit.tools.translator import get_serializable
from flytekit.types.file import FlyteFile
from flytekit.types.schema import FlyteSchema

if typing.TYPE_CHECKING:
import pandas as pd
else:
pd = pytest.importorskip("pandas")
from flytekit.types.schema import FlyteSchema # noqa: E402


default_img = Image(name="default", fqn="test", tag="tag")
serialization_settings = flytekit.configuration.SerializationSettings(
Expand Down