[SPIKE] Investigate whether Woodwork can be expanded to handle incoming string dtypes #1617

Open
ParthivNaresh opened this issue Jan 13, 2023 · 0 comments

Comments

@ParthivNaresh
Collaborator

Currently, if a user creates a pandas DataFrame and passes it into Woodwork, pandas has already inferred certain dtypes, which makes Woodwork's inference significantly easier. However, there may be cases where all incoming data is text and every column has a dtype of string.

For a dataframe initialized like this:

import numpy as np
import pandas as pd

# One column per kind of data, 100 rows each
df = pd.DataFrame()
df["ints"] = [i for i in range(100)]
df["floats"] = [i*1.1 for i in range(100)]
df["bools"] = [True, False, False, True, False] * 20
df["bools_nan"] = [True, False, False, True, pd.NA] * 20
df["strings"] = [f"{i}" for i in range(100)]
df["categoricals"] = np.random.choice(["Yellow", "Blue", "Red"], 100)

Subsequent Woodwork initialization yields the expected result:
[Screenshot: Woodwork output for the original dataframe]

However, converting every column to the string dtype prior to Woodwork initialization:

for col in df.columns:
    df[col] = df[col].astype("string")

Yields this:
[Screenshot: Woodwork output after converting all columns to string]

This spike covers investigating what solution(s) exist for this and how, and in what order, the problem should be tackled: one logical type at a time, or with a single approach that handles all of them at once.
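
To make the "all at once" option concrete, here is a rough sketch of a single pass that tries progressively looser parsers on a column's text values. It is not how Woodwork does inference. The function name `guess_type_from_strings`, the label strings, and the `max_categories` cutoff are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical token set for illustration; real data may use other spellings.
BOOL_TOKENS = {"true", "false"}


def _all_parse(values: List[str], parser: Callable[[str], object]) -> bool:
    """Return True if every value is accepted by `parser` (e.g. int or float)."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return True


def guess_type_from_strings(
    values: Iterable[Optional[str]], max_categories: int = 10
) -> str:
    """Guess a coarse type label for a column whose values are all text.

    The labels returned here are illustrative strings, not Woodwork
    logical types. Missing values must be passed as None and are skipped.
    """
    present = [v.strip() for v in values if v is not None]
    if not present:
        return "unknown"
    # Strictest checks first: booleans, then integers, then floats.
    if all(v.lower() in BOOL_TOKENS for v in present):
        return "boolean"
    if _all_parse(present, int):
        return "integer"
    if _all_parse(present, float):
        return "double"
    # Few distinct values suggests a categorical; the cutoff is an assumption.
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"


print(guess_type_from_strings(["True", "False", None, "True"]))
print(guess_type_from_strings(["Yellow", "Blue", "Red", "Blue"]))
```

A column could be fed in with something like `df[col].dropna().tolist()`. One thing the sketch makes visible: after conversion, the "strings" column above (digits stored as text) is indistinguishable from the "ints" column, so any approach will have to decide how to treat that ambiguity.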
