Predict does not work because of data type mismatch for same dataframe. #4124

Open
Mhsh opened this issue Apr 5, 2023 · 10 comments

Mhsh commented Apr 5, 2023

I trained a model on my dataset using EvalML; the best-fit pipeline is below.

pipeline = RegressionPipeline(
    component_graph={
        'Replace Nullable Types Transformer': ['Replace Nullable Types Transformer', 'X', 'y'],
        'Imputer': ['Imputer', 'Replace Nullable Types Transformer.x', 'Replace Nullable Types Transformer.y'],
        'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Replace Nullable Types Transformer.y'],
        'Random Forest Regressor': ['Random Forest Regressor', 'One Hot Encoder.x', 'Replace Nullable Types Transformer.y'],
    },
    parameters={
        'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None},
        'One Hot Encoder': {'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'},
        'Random Forest Regressor': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},
    },
    random_seed=0,
)

I saved the model and load it with the code below, and everything works fine.

Model Store
best_pipeline.save(MODEL_NAME + '.pkl')

Model Load
import pickle

with open('BidPrediction.pkl', "rb") as f:
    model = pickle.load(f)
try:
    df = X_train
    print(model.predict(df))
except Exception as e:
    print(e)

Output:
501 117.129174
90 343.367964
527 153.735972
576 225.164953
200 140.293371
...
277 222.844593
9 1127.711225
359 1385.688900
192 146.658027
559 833.751599

Name: Ideal Bid Price/Cts ($), Length: 485, dtype: float64

ISSUE
Now I am trying to use this model to predict on a single row from the dataframe, and it gives the error below.

try:
    df = X_train.head(1)
    print(model.predict(df))
except Exception as e:
    print(e)

ERROR:
Input X data types are different from the input types the pipeline was fitted on.

When I inspected the error, the problem appears to be that the data types recorded by the pipeline differ from those inferred for the input features. My guess is that prediction works on the whole dataset because there are null entries in 'Target Days' and 'Actual Days', whereas there are none when a single row is passed.

ERROR details:

{'input_features_types':
                               Logical Type     Semantic Tag(s)
Column
Rough Type                     Categorical      ['category']
Source                         Categorical      ['category']
Avg Size                       Double           ['numeric']
Manufaturing Rate Per Cts ($)  Double           ['numeric']
Expected Color Variation %     Double           ['numeric']
Expected Polish Variation %    Double           ['numeric']
Profit Margin %                Double           ['numeric']
Sales Rate Per Cts             Double           ['numeric']
Target Days                    Integer          ['numeric']
Actual Days                    Integer          ['numeric']
Interest Paid/Cts              Double           ['numeric'],
 'pipeline_features_types':
                               Logical Type     Semantic Tag(s)
Column
Rough Type                     Categorical      ['category']
Source                         Categorical      ['category']
Avg Size                       Double           ['numeric']
Manufaturing Rate Per Cts ($)  Double           ['numeric']
Expected Color Variation %     Double           ['numeric']
Expected Polish Variation %    Double           ['numeric']
Profit Margin %                Double           ['numeric']
Sales Rate Per Cts             Double           ['numeric']
Target Days                    IntegerNullable  ['numeric']
Actual Days                    IntegerNullable  ['numeric']
Interest Paid/Cts              Double           ['numeric']}

I am not sure how to convert Integer to IntegerNullable, since I get this error whenever I use a single row of the dataframe for prediction.

Note: I will deploy this model, so mostly single records will come in for prediction.

@Mhsh Mhsh changed the title Predict does not work because of . Predict does not work because of data type mismatch for same dataframe. Apr 5, 2023
@eccabay
Contributor

eccabay commented Apr 6, 2023

Thanks for filing this @Mhsh! We'll take a look at this issue and get back to you soon.

@Mhsh
Author

Mhsh commented Apr 7, 2023

I had the same problem with categorical data. When I passed a single record for prediction, the same error occurred, and I had to manually convert the data type to 'category' as shown below.

X_train['Rough Type'].head(1).astype('category')

@jeremyliweishih
Contributor

@Mhsh do you mind showing the stack trace for the categorical data case like for the integer nullable case? Thanks!

@tamargrey
Contributor

We should be able to handle the Integer/IntegerNullable case with #4077.

Seeing @Mhsh's stack trace for the categorical data will be helpful to decide if/how to handle this since the Category logical type can already handle nullable types. I wonder if the presence of nullable types is changing the woodwork type that's inferred? If that's the case, I don't think any fix for #4077 would apply here, as we couldn't use the two types interchangeably. We could, however, discuss other changes to _schema_is_equal that we could allow that would help in this situation.

@Mhsh
Author

Mhsh commented Apr 10, 2023

I got the error below when I passed a single-row dataframe for prediction. It was not related to nullable types; rather, the categorical data I passed in the dataframe was not recognised as the Categorical logical type.

image

When I looked at the details, this was the response.

image

The above code works when I pass the X_test dataframe, whose records resemble the X_train data.

image

@tamargrey
Contributor

@Mhsh the Categorical data is being inferred as Unknown when there is only one row. A workaround that should work for any of these type differences while we implement a fix via #4133 would be to initialize Woodwork on df with the types from X_train.

df.ww.init(schema=X_train.ww.schema)
model.predict(df)
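For a deployment that scores one record at a time, the workaround above could be wrapped in a small helper that always applies the training schema before predicting. This is only a sketch: predict_with_training_schema is a hypothetical name, and it assumes that training_schema is the X_train.ww.schema object captured at training time and that Woodwork has been imported so the ww accessor is available on the dataframe.

```python
def predict_with_training_schema(model, df, training_schema):
    """Initialize Woodwork on `df` with the training schema, then predict.

    Hypothetical helper. Assumes Woodwork has already been imported so that
    `df.ww` exists. Woodwork is initialized on `df` itself, not on a copy.
    """
    df.ww.init(schema=training_schema)
    return model.predict(df)


# Usage, following the names used in this thread:
# schema = X_train.ww.schema   # capture once, at training time
# row = X_train.head(1)        # or a one-row DataFrame built at serving time
# print(predict_with_training_schema(model, row, schema))
```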

@Mhsh
Author

Mhsh commented Apr 17, 2023

Thanks @tamargrey. The code is working with the workaround that you provided.

@gautamborad

@tamargrey, I am getting the error below when I try to set the types from X_train:

TypeConversionError: Error converting datatype for SKEW(orders.NUM_UNIQUE(order_products.department)) from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.

I am following the tutorial here: https://compose.alteryx.com/en/stable/examples/predict_next_purchase.html. My changes are as follows:

fm = ft.calculate_feature_matrix(
    features=fd,
    entityset=es,
    cutoff_time=ft.pd.Timestamp("2015-07-02"),
    cutoff_time_in_index=True,
    verbose=False,
)

display(fm.head())

fm.ww.init(schema=X_train.ww.schema)  # <== Giving error on this line

y_pred = best_pipeline.predict(fm)
y_pred = y_pred.values

prediction = fm[[]]
prediction["bought_product (estimate)"] = y_pred
prediction.head()

@tamargrey
Contributor

@gautamborad I expect this error is happening because fm contains columns that aren't in X_train (for example - SKEW(orders.NUM_UNIQUE(order_products.department)) is presumably an engineered feature defined in fd), so fm.ww.init(schema=X_train.ww.schema) will perform type inference on any columns that aren't in X_train.

If SKEW(orders.NUM_UNIQUE(order_products.department)) would get inferred as IntegerNullable but had the float64 dtype coming out of calculate_feature_matrix, that would cause this error.

If you want to update the logical types of only the columns in fm that are also in X_train, I'd suggest using set_types instead: fm.ww.set_types(logical_types=X_train.ww.logical_types). This of course will not work if there are columns in X_train that aren't in fm, so that may be something you have to account for.
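One way to account for columns that appear in only one of the two tables is to filter the logical types down to the shared columns before calling set_types. The helper below is a sketch with a hypothetical name (shared_logical_types); it works on any mapping of column name to logical type, such as X_train.ww.logical_types.

```python
def shared_logical_types(train_logical_types, target_columns):
    """Keep only the logical types whose column also exists in the target."""
    target = set(target_columns)
    return {
        col: ltype
        for col, ltype in train_logical_types.items()
        if col in target
    }


# Plain strings stand in for Woodwork logical types in this demo.
train_types = {"a": "Integer", "b": "Double", "only_in_train": "Categorical"}
fm_columns = ["a", "b", "only_in_fm"]
print(shared_logical_types(train_types, fm_columns))
# {'a': 'Integer', 'b': 'Double'}

# With real data this would be used as (names from this thread):
# fm.ww.set_types(
#     logical_types=shared_logical_types(X_train.ww.logical_types, fm.columns)
# )
```

Filtering only resolves the column mismatch. The underlying data in each shared column must still be convertible to the requested logical type.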

@gautamborad

gautamborad commented Jun 13, 2023

@tamargrey thanks for the quick reply! I think the columns in fm and X_train match; it's just that some of the types in fm are not inferred properly. I can set them manually, but it would be great if I could just copy the schema.

set(fm.columns) == set(X_train.columns)

True
ww_df = fm.ww.schema.types.reset_index()
X_df = X_train.ww.schema.types
for i, r in ww_df.iterrows():
    col = r["Column"]
    lt = r["Logical Type"]
    if not any((X_df.index == col) & (X_df['Logical Type'] == lt)):
        print(f"Does Not Match [{col}]: {lt} -> {X_df.loc[X_df.index == col, 'Logical Type'].values[0]}")

Gives the output:

Does Not Match [COUNT(orders)]: IntegerNullable -> Integer
Does Not Match [COUNT(order_products)]: IntegerNullable -> Integer
Does Not Match [NUM_UNIQUE(order_products.department)]: IntegerNullable -> Integer
Does Not Match [NUM_UNIQUE(orders.MODE(order_products.department))]: IntegerNullable -> Integer
Does Not Match [SKEW(orders.MAX(order_products.reordered))]: Double -> IntegerNullable
Does Not Match [SKEW(orders.MIN(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [SKEW(orders.NUM_UNIQUE(order_products.department))]: Double -> IntegerNullable
Does Not Match [STD(orders.MAX(order_products.reordered))]: Double -> IntegerNullable
Does Not Match [STD(orders.MIN(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [STD(orders.SKEW(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [COUNT(order_products WHERE department = produce)]: IntegerNullable -> Integer
Does Not Match [COUNT(order_products WHERE product_name = Banana)]: IntegerNullable -> Integer

I hope I am not missing something obvious here.

Also, fm.ww.set_types(logical_types=X_train.ww.logical_types) gave this error:

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/table_accessor.py:567, in WoodworkTableAccessor.set_types(self, logical_types, semantic_tags, retain_index_tags, null_invalid_values)
    565 for col_name, logical_type in logical_types.items():
    566     series = self._dataframe[col_name]
--> 567     updated_series = logical_type.transform(
    568         series,
    569         null_invalid_values=null_invalid_values,
    570     )
    571     if updated_series is not series:
    572         self._dataframe[col_name] = updated_series

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/logical_types.py:475, in IntegerNullable.transform(self, series, null_invalid_values)
    473 if null_invalid_values:
    474     series = _coerce_integer(series)
--> 475 return super().transform(series)

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/logical_types.py:76, in LogicalType.transform(self, series, null_invalid_values)
     74         series = series.astype(new_dtype)
     75     except (TypeError, ValueError):
---> 76         raise TypeConversionError(series, new_dtype, type(self))
     77 return series

TypeConversionError: Error converting datatype for SKEW(orders.NUM_UNIQUE(order_products.department)) from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.


5 participants