Predict does not work because of data type mismatch for same dataframe. #4124

Mhsh · 2023-04-05T18:28:49Z

I trained a model on a dataset using EvalML, and below is the best-fit pipeline.

pipeline = RegressionPipeline(component_graph={'Replace Nullable Types Transformer': ['Replace Nullable Types Transformer', 'X', 'y'], 'Imputer': ['Imputer', 'Replace Nullable Types Transformer.x', 'Replace Nullable Types Transformer.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Replace Nullable Types Transformer.y'], 'Random Forest Regressor': ['Random Forest Regressor', 'One Hot Encoder.x', 'Replace Nullable Types Transformer.y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Regressor':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0)

I saved the model and load it with the code below, and everything works fine.

Model Store

import pickle
best_pipeline.save(MODEL_NAME + '.pkl')

Model Load

with open('BidPrediction.pkl', "rb") as f:
    model = pickle.load(f)
try:
    df = X_train
    print(model.predict(df))
except Exception as e:
    print(e)

Output:
501 117.129174
90 343.367964
527 153.735972
576 225.164953
200 140.293371
...
277 222.844593
9 1127.711225
359 1385.688900
192 146.658027
559 833.751599
Name: Ideal Bid Price/Cts ($), Length: 485, dtype: float64

ISSUE
Now I am trying to use this model to predict on a single row of the dataframe, and it gives the error below.

try:
    df = X_train.head(1)
    print(model.predict(df))
except Exception as e:
    print(e)

ERROR:
Input X data types are different from the input types the pipeline was fitted on.

When I inspected the error, it seems that the data types the pipeline was fitted on differ from those inferred for the input features. My guess is that it works with the whole dataset because there are null entries in 'Target Days' and 'Actual Days', whereas there are none when a single row is passed.
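To make that guess concrete, here is a minimal pandas-only sketch. It does not call EvalML or Woodwork, and the column values are invented for illustration. It shows that the dtype pandas picks for an integer column depends on whether nulls are present, which is the kind of difference type inference can pick up when a full training frame is compared with a single row.

```python
import pandas as pd

# Hypothetical stand-in for a column such as 'Target Days' (values are made up).
full = pd.DataFrame({"Target Days": [30, None, 45]})
# A missing value forces pandas to fall back to float64.
print(full["Target Days"].dtype)

# The nullable integer dtype keeps whole numbers alongside missing values.
nullable = full["Target Days"].astype("Int64")
print(nullable.dtype)

# A single row without nulls is stored as a plain, non-nullable integer.
single = pd.DataFrame({"Target Days": [30]})
print(single["Target Days"].dtype)
```

The three dtypes differ even though the underlying values are all whole numbers, so a check that compares inferred types between fit time and predict time can fail for the same data.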

ERROR details:

{'input_features_types':
Column                          Logical Type     Semantic Tag(s)
Rough Type                      Categorical      ['category']
Source                          Categorical      ['category']
Avg Size                        Double           ['numeric']
Manufaturing Rate Per Cts ($)   Double           ['numeric']
Expected Color Variation %      Double           ['numeric']
Expected Polish Variation %     Double           ['numeric']
Profit Margin %                 Double           ['numeric']
Sales Rate Per Cts              Double           ['numeric']
Target Days                     Integer          ['numeric']
Actual Days                     Integer          ['numeric']
Interest Paid/Cts               Double           ['numeric'],
 'pipeline_features_types':
Column                          Logical Type     Semantic Tag(s)
Rough Type                      Categorical      ['category']
Source                          Categorical      ['category']
Avg Size                        Double           ['numeric']
Manufaturing Rate Per Cts ($)   Double           ['numeric']
Expected Color Variation %      Double           ['numeric']
Expected Polish Variation %     Double           ['numeric']
Profit Margin %                 Double           ['numeric']
Sales Rate Per Cts              Double           ['numeric']
Target Days                     IntegerNullable  ['numeric']
Actual Days                     IntegerNullable  ['numeric']
Interest Paid/Cts               Double           ['numeric']}

I am not sure how to convert Integer to IntegerNullable, since I get the error whenever I use a single row of the dataframe for prediction.

Note: I will deploy this model, so mostly single records will come in for prediction.


eccabay · 2023-04-06T13:37:55Z

Thanks for filing this @Mhsh! We'll take a look at this issue and get back to you soon.

Mhsh · 2023-04-07T04:51:42Z

I had the same problem with categorical data. When I passed a single record for prediction, the same error occurred, and I had to manually convert the data type to 'category' as below.

X_train['Rough Type'].head(1).astype('category')
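A slightly more general form of this manual cast, as a pandas-only sketch: copy every column's dtype from the training frame onto the single record. `align_dtypes` is a hypothetical helper name and the data is invented. Whether EvalML then accepts the frame still depends on how Woodwork infers logical types from those dtypes, which this sketch does not check.

```python
import pandas as pd


def align_dtypes(record: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Cast the columns of `record` to the dtypes used in `reference`.

    Columns that exist only in `record` are left untouched.
    """
    shared = [col for col in record.columns if col in reference.columns]
    return record.astype({col: reference[col].dtype for col in shared})


# Invented stand-in for the training data.
X_train_demo = pd.DataFrame({
    "Rough Type": pd.Series(["A", "B", "A"], dtype="category"),
    "Target Days": pd.Series([30, None, 45], dtype="Int64"),
})

# A single incoming record, as it might arrive at prediction time.
record = pd.DataFrame({"Rough Type": ["B"], "Target Days": [12]})

aligned = align_dtypes(record, X_train_demo)
print(aligned.dtypes)
```

Casting to the training column's categorical dtype also carries over the full set of training categories, which a bare `astype('category')` on one row would not.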

jeremyliweishih · 2023-04-10T14:10:09Z

@Mhsh do you mind showing the stack trace for the categorical data case like for the integer nullable case? Thanks!

tamargrey · 2023-04-10T15:35:54Z

We should be able to handle the Integer/IntegerNullable case with #4077.

Seeing @Mhsh's stack trace for the categorical data will be helpful to decide if/how to handle this since the Category logical type can already handle nullable types. I wonder if the presence of nullable types is changing the woodwork type that's inferred? If that's the case, I don't think any fix for #4077 would apply here, as we couldn't use the two types interchangeably. We could, however, discuss other changes to _schema_is_equal that we could allow that would help in this situation.

Mhsh · 2023-04-10T16:19:25Z

I got the error below when I passed a single-row dataframe for prediction. This was not related to nullable types; the categorical data I was passing in the dataframe was not recognised as the Categorical logical type.

When I tried to see the details, the response was as below.

The same code works when I pass the X_test dataframe, which contains records that resemble the X_train data.

tamargrey · 2023-04-10T19:12:43Z

@Mhsh the Categorical data is being inferred as Unknown when there is only one row. A workaround that should work for any of these type differences while we implement a fix via #4133 would be to initialize Woodwork on df with the types from X_train.

df.ww.init(schema=X_train.ww.schema)
model.predict(df)

Mhsh · 2023-04-17T05:32:12Z

Thanks @tamargrey. The code is working with the workaround that you provided.

gautamborad · 2023-06-13T14:05:54Z

@tamargrey, I am getting the error below when I try to set the types from X_train:

TypeConversionError: Error converting datatype for SKEW(orders.NUM_UNIQUE(order_products.department)) from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.

I am following the tutorial here: https://compose.alteryx.com/en/stable/examples/predict_next_purchase.html. These are my changes:

fm = ft.calculate_feature_matrix(
    features=fd,
    entityset=es,
    cutoff_time=ft.pd.Timestamp("2015-07-02"),
    cutoff_time_in_index=True,
    verbose=False,
)

display(fm.head())

fm.ww.init(schema=X_train.ww.schema)  # <== this line raises the error

y_pred = best_pipeline.predict(fm)
y_pred = y_pred.values

prediction = fm[[]]
prediction["bought_product (estimate)"] = y_pred
prediction.head()

tamargrey · 2023-06-13T15:03:02Z

@gautamborad I expect this error is happening because fm contains columns that aren't in X_train (for example - SKEW(orders.NUM_UNIQUE(order_products.department)) is presumably an engineered feature defined in fd), so fm.ww.init(schema=X_train.ww.schema) will perform type inference on any columns that aren't in X_train.

If SKEW(orders.NUM_UNIQUE(order_products.department)) would get inferred as IntegerNullable but had the float64 dtype coming out of calculate_feature_matrix, that would cause this error.
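The conversion failure itself can be reproduced with pandas alone. This is a sketch with invented values, not the actual feature matrix: a float64 column converts to the nullable `Int64` dtype only when every non-missing value is a whole number, and statistics such as SKEW or STD usually are not.

```python
import numpy as np
import pandas as pd

# Whole-number floats with a missing value convert cleanly to Int64.
whole = pd.Series([1.0, np.nan, 3.0])
print(whole.astype("Int64").tolist())

# Fractional floats cannot be represented as integers, so the cast is refused.
fractional = pd.Series([0.25, np.nan, 1.5])
try:
    fractional.astype("Int64")
    converted = True
except (TypeError, ValueError) as err:
    converted = False
    print(f"conversion failed: {err}")
```

Applying an IntegerNullable logical type to a column like `fractional` would be expected to hit the same kind of failure.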

If you want to update the logical types of only the columns in fm that are also in X_train, I'd suggest using set_types instead: fm.ww.set_types(logical_types=X_train.ww.logical_types). This of course will not work if there are columns in X_train that aren't in fm, so that may be something you have to account for.
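One way to account for columns that exist in only one of the two frames is to filter the type mapping first. The sketch below uses plain strings as stand-ins for logical type objects so that it runs without Woodwork; `shared_types` is a hypothetical helper name and the column names are shortened for illustration.

```python
def shared_types(logical_types: dict, columns) -> dict:
    """Keep only the entries whose column name appears in `columns`."""
    present = set(columns)
    return {col: lt for col, lt in logical_types.items() if col in present}


# Stand-in for X_train.ww.logical_types (strings instead of LogicalType objects).
train_types = {
    "COUNT(orders)": "Integer",
    "SKEW(x)": "IntegerNullable",
    "only_in_train": "Double",
}
# Stand-in for fm.columns.
fm_columns = ["COUNT(orders)", "SKEW(x)", "only_in_fm"]

subset = shared_types(train_types, fm_columns)
print(subset)

# With Woodwork initialised on both frames, the filtered mapping would be passed as:
# fm.ww.set_types(logical_types=shared_types(X_train.ww.logical_types, fm.columns))
```

Filtering only avoids the missing-column problem; a column whose values cannot be converted to the target type will still raise.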

gautamborad · 2023-06-13T17:28:11Z

@tamargrey thanks for the quick reply! I think the columns in fm and X_train match; it's just that some of the types in fm are not inferred properly. I can set them manually, but it would be great if I could just copy the schema.

set(fm.columns) == set(X_train.columns)

True

ww_df = fm.ww.schema.types.reset_index()
X_df = X_train.ww.schema.types
for i, r in ww_df.iterrows():
    col = r["Column"]
    lt = r["Logical Type"]
    if not any((X_df.index == col) & (X_df['Logical Type'] == lt)):
        print(f"Does Not Match [{col}]: {lt} -> {X_df.loc[X_df.index == col, 'Logical Type'].values[0]}")

Gives the output:

Does Not Match [COUNT(orders)]: IntegerNullable -> Integer
Does Not Match [COUNT(order_products)]: IntegerNullable -> Integer
Does Not Match [NUM_UNIQUE(order_products.department)]: IntegerNullable -> Integer
Does Not Match [NUM_UNIQUE(orders.MODE(order_products.department))]: IntegerNullable -> Integer
Does Not Match [SKEW(orders.MAX(order_products.reordered))]: Double -> IntegerNullable
Does Not Match [SKEW(orders.MIN(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [SKEW(orders.NUM_UNIQUE(order_products.department))]: Double -> IntegerNullable
Does Not Match [STD(orders.MAX(order_products.reordered))]: Double -> IntegerNullable
Does Not Match [STD(orders.MIN(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [STD(orders.SKEW(order_products.add_to_cart_order))]: Double -> IntegerNullable
Does Not Match [COUNT(order_products WHERE department = produce)]: IntegerNullable -> Integer
Does Not Match [COUNT(order_products WHERE product_name = Banana)]: IntegerNullable -> Integer

I hope I am not missing something obvious here.

Also, fm.ww.set_types(logical_types=X_train.ww.logical_types) gave this error:

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/table_accessor.py:567, in WoodworkTableAccessor.set_types(self, logical_types, semantic_tags, retain_index_tags, null_invalid_values)
    565 for col_name, logical_type in logical_types.items():
    566     series = self._dataframe[col_name]
--> 567     updated_series = logical_type.transform(
    568         series,
    569         null_invalid_values=null_invalid_values,
    570     )
    571     if updated_series is not series:
    572         self._dataframe[col_name] = updated_series

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/logical_types.py:475, in IntegerNullable.transform(self, series, null_invalid_values)
    473 if null_invalid_values:
    474     series = _coerce_integer(series)
--> 475 return super().transform(series)

File ~/opt/anaconda3/envs/alteryx3.9/lib/python3.9/site-packages/woodwork/logical_types.py:76, in LogicalType.transform(self, series, null_invalid_values)
     74         series = series.astype(new_dtype)
     75     except (TypeError, ValueError):
---> 76         raise TypeConversionError(series, new_dtype, type(self))
     77 return series

TypeConversionError: Error converting datatype for SKEW(orders.NUM_UNIQUE(order_products.department)) from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.

