[SPIKE] Investigate whether Woodwork can be expanded to handle incoming `string` dtypes #1617

ParthivNaresh · 2023-01-13T21:06:48Z

Currently, when a user creates a pandas DataFrame and passes it to Woodwork, pandas has already inferred many of the dtypes, which makes Woodwork's logical type inference significantly easier. However, there may be cases where all incoming data is text and every column has the `string` dtype.

For a dataframe initialized like this:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df["ints"] = list(range(100))
df["floats"] = [i * 1.1 for i in range(100)]
df["bools"] = [True, False, False, True, False] * 20
df["bools_nan"] = [True, False, False, True, pd.NA] * 20
df["strings"] = [f"{i}" for i in range(100)]
df["categoricals"] = np.random.choice(["Yellow", "Blue", "Red"], 100)

Subsequent Woodwork initialization infers the logical types as expected:

But converting every column to the `string` dtype prior to Woodwork initialization

for col in df.columns:
    df[col] = df[col].astype("string")

Yields this:

This spike covers investigating what solutions exist for this and how, and in what order, the problem should be tackled: one logical type at a time, or with a single approach that handles all of them at once.
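To make the "by logical type" option more concrete, here is a minimal sketch of what an ordered, parse-based pass over string values might look like. It is not Woodwork's inference code and does not use Woodwork or pandas. The function `guess_logical_type`, its helpers, and the `categorical_threshold` parameter are hypothetical names introduced for illustration. The returned labels only borrow the names of Woodwork logical types, and `None` stands in for a missing value.

```python
from typing import Callable, Iterable, Optional, Sequence


def _all_parse(values: Sequence[str], parser: Callable[[str], object]) -> bool:
    """Return True if `parser` accepts every value without raising ValueError."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def _parse_bool(text: str) -> bool:
    """Accept only 'true'/'false' (case-insensitive); anything else is an error."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return lowered == "true"


def guess_logical_type(
    values: Iterable[Optional[str]],
    categorical_threshold: float = 0.2,  # hypothetical cutoff, chosen arbitrarily
) -> str:
    """Guess a type label for string values by trying parsers in a fixed order."""
    values = list(values)
    present = [v for v in values if v is not None]
    if not present:
        return "Unknown"
    # Missing values push booleans and integers to their nullable variants.
    suffix = "Nullable" if len(present) < len(values) else ""

    # Order matters: every integer string also parses as a float, so the
    # stricter parsers run first.
    if _all_parse(present, _parse_bool):
        return "Boolean" + suffix
    # Caveat: Python's int() and float() accept forms such as "1_000" or "nan"
    # that a real implementation may want to reject.
    if _all_parse(present, int):
        return "Integer" + suffix
    if _all_parse(present, float):
        return "Double"
    # Few distinct values relative to the column length suggests a categorical.
    if len(set(present)) / len(present) <= categorical_threshold:
        return "Categorical"
    return "Unknown"


# Plain-list stand-ins for the stringified columns in the example above.
columns = {
    "ints": [str(i) for i in range(100)],
    "floats": [str(i * 1.1) for i in range(100)],
    "bools_nan": ["True", "False", "False", "True", None] * 20,
    "categoricals": ["Yellow", "Blue", "Red"] * 20,
}
for name, column in columns.items():
    print(name, guess_logical_type(column))
```

One thing the sketch surfaces is an ambiguity the spike would need to settle: a column like `strings` above, which holds digits stored as text, is indistinguishable from a stringified `ints` column once both arrive with the `string` dtype.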
