Add support pandas 2 in flytekit #1818

Merged
merged 63 commits into from Dec 18, 2023

Member

@pingsutw pingsutw commented Sep 3, 2023

TL;DR

Remove pandas from default dependencies, and support pandas 2 in flytekit

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

^^^

Tracking Issue

flyteorg/flyte#3928

Follow-up issue

NA

Signed-off-by: Kevin Su <pingsutw@apache.org>
@cosmicBboy
Contributor

We should probably add pandas to dev-requirements.in.

@cosmicBboy
Contributor

I think we'll also need to add pandas to the Dockerfile

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw marked this pull request as ready for review September 16, 2023 07:31
@codecov

codecov bot commented Sep 16, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (67470a3) 85.98% compared to head (52a1143) 85.97%.
Report is 1 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| flytekit/types/structured/bigquery.py | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1818      +/-   ##
==========================================
- Coverage   85.98%   85.97%   -0.01%     
==========================================
  Files         308      308              
  Lines       22997    23027      +30     
  Branches     3474     3480       +6     
==========================================
+ Hits        19773    19798      +25     
+ Misses       2620     2618       -2     
- Partials      604      611       +7     

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw changed the title from "Remove pandas" to "Add support pandas 2 in flytekit" Sep 18, 2023
cosmicBboy previously approved these changes Sep 27, 2023
@eapolinario
Collaborator

Very interesting that we don't have to change any of the code in structured dataset (that probably means we were already leaning on pyarrow in the way pandas 2 expects).

How else did you test this, @pingsutw ? @cosmicBboy , did you take this for a spin?

@kumare3
Contributor

kumare3 commented Oct 5, 2023

So does this mean that flytekit will simply work with either pandas 1.x or pandas 2, depending on what the user installs?
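
To make the "depending on what the user installs" idea concrete, here is a minimal sketch of how code could detect whether pandas is present and which major version it is, using only the standard library. The helper names `installed_major_version` and `describe_pandas_support` are hypothetical and are not part of flytekit; this is not a description of what the PR itself does.

```python
import re
from importlib import metadata
from typing import Optional


def installed_major_version(package: str) -> Optional[int]:
    """Return the major version of an installed distribution, or None if absent.

    Hypothetical helper for illustration only; not a flytekit API.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None
    if not version:
        return None
    # Take the leading run of digits, e.g. "2.1.4" -> 2.
    match = re.match(r"\d+", version)
    return int(match.group()) if match else None


def describe_pandas_support() -> str:
    """Summarize which pandas major version, if any, is available."""
    major = installed_major_version("pandas")
    if major is None:
        return "pandas is not installed; DataFrame support is unavailable"
    return f"pandas {major}.x detected"


if __name__ == "__main__":
    print(describe_pandas_support())
```

A check like this only reads package metadata, so it does not import pandas and stays cheap in environments where pandas is absent.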

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw
Member Author

I've run some e2e tests for it.

from flytekit import task, workflow, ImageSpec

new_flytekit = "git+https://github.com/flyteorg/flytekit.git@68c032ce06e5d05686955c22d7a50a762f8c1bb0"
image_spec = ImageSpec(base_image="python:3.8-slim-buster", packages=[new_flytekit], apt_packages=["git"], registry="pingsutw")


@task(disable_deck=False, container_image=image_spec)
def t1() -> str:
    md_text = "#Hello Flyte\n##dHeeeello Flyte\n###Hello Flyte"
    return md_text


@task(disable_deck=False, container_image=image_spec)
def t2() -> str:
    return "hello"


@workflow
def wf():
    t1()
    t2()
  • Task outputs an Arrow table
import pyarrow as pa
from flytekit import task, workflow, ImageSpec
from flytekit.deck.renderer import ArrowRenderer
from typing_extensions import Annotated

new_flytekit = "git+https://github.com/flyteorg/flytekit.git@3e64dcbfdb518814baa9f0ca07358cf0af82905d"
image_spec = ImageSpec(base_image="python:3.8-slim-buster", packages=[new_flytekit], apt_packages=["git"], registry="pingsutw")


@task(disable_deck=False, container_image=image_spec)
def t1() -> Annotated[pa.Table, ArrowRenderer()]:
    n_legs = pa.array([2, 4, 5, 100])
    animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    names = ["n_legs", "animals"]
    return pa.Table.from_arrays([n_legs, animals], names=names)


@workflow
def wf():
    t1()
  • Task outputs a pandas.DataFrame
import pandas as pd
from flytekit import task, workflow, ImageSpec
from flytekit.deck.renderer import TopFrameRenderer
from typing_extensions import Annotated


new_flytekit = "git+https://github.com/flyteorg/flytekit.git@3e64dcbfdb518814baa9f0ca07358cf0af82905d"

image_spec = ImageSpec(base_image="python:3.8-slim-buster", packages=[new_flytekit, "pandas"], apt_packages=["git"], registry="pingsutw")


@task(disable_deck=False, container_image=image_spec)
def t1() -> Annotated[pd.DataFrame, TopFrameRenderer()]:
    return pd.DataFrame({"col1": [1, 2, 3], "col2": list("abc")})


@workflow
def wf():
    t1()
  • Task outputs a numpy array
import numpy as np
from flytekit import task, workflow, ImageSpec

new_flytekit = "git+https://github.com/flyteorg/flytekit.git@3e64dcbfdb518814baa9f0ca07358cf0af82905d"
image_spec = ImageSpec(base_image="python:3.8-slim-buster", packages=[new_flytekit, "numpy"], apt_packages=["git"], registry="pingsutw")


@task(disable_deck=False, container_image=image_spec)
def t1() -> np.ndarray:
    return np.array([1, 2, 3])


@workflow
def wf():
    t1()

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@thomasjpfan
Member

From other projects, I have heard that lazy loading can break workflows. This has come up in flytekit's Databricks plugin: flyteorg/flyte#3853
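
For context on what "lazy loading" usually means in Python, here is a minimal sketch based on the standard library's `importlib.util.LazyLoader` recipe. The function name `lazy_import` is hypothetical, and whether flytekit implements its lazy imports exactly this way is an assumption. The point is that the module body runs on first attribute access rather than at import time, so import-time side effects and some errors surface later than a reader of the code might expect.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module whose body is executed on first attribute access.

    Hypothetical helper following the stdlib LazyLoader recipe;
    not necessarily how flytekit does it.
    """
    # Reuse an already-imported module instead of replacing it.
    if name in sys.modules:
        return sys.modules[name]

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        # A missing top-level module is still reported right away.
        raise ModuleNotFoundError(f"No module named {name!r}")

    # Wrap the real loader so execution is postponed.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # Sets up laziness; does not run the module body yet.
    return module


if __name__ == "__main__":
    colorsys = lazy_import("colorsys")   # Module body has not run yet.
    print(callable(colorsys.rgb_to_hsv))  # First attribute access triggers the real import.
```

Because execution is deferred, code that relies on a module's import-time side effects (for example, registering handlers) may behave differently under lazy loading, which is one plausible way workflows could break.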

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw marked this pull request as ready for review December 15, 2023 12:07
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw merged commit 74f2f53 into master Dec 18, 2023
76 of 77 checks passed
@cameronraysmith
Contributor

Thank you!

@ringohoffman
Contributor

@pingsutw Congrats on getting this merged. Do you know when this will be released?

@pingsutw
Member Author

pingsutw commented Jan 4, 2024

@ringohoffman we plan to start cutting a new release next week

9 participants