[BUG] pyfunc cannot predict dataframe with None value properly #4827

yxiong · 2021-09-17T23:41:57Z

System information

Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Databricks ML runtime 9.0

Describe the problem

My training data contains None values, so I built a sklearn pipeline with an imputer to handle them. I then trained the pipeline model with MLflow tracking enabled:

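import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

transformers = [
    ("numerical", SimpleImputer(strategy="mean"), ["foo", "bar"]),
    ......
]
# Pass the transformer list directly; the original had a stray "[" here.
preprocessor = ColumnTransformer(
    transformers, remainder="passthrough", sparse_threshold=0)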
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(...)),
])

with mlflow.start_run(run_name="my_run") as mlflow_run:
    model.fit(X_train, y_train)
    mlflow.sklearn.eval_and_log_metrics(model, X_val, y_val)

The trained model itself can predict without any problem:

model.predict(X_train)  # ==> OK

But the pyfunc object I got from MLflow doesn't work:

model_pyfunc = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=mlflow_run.info.run_id
  )
)

model_pyfunc.predict(X_train)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-4119235480823764> in <module>
      5 )
      6 
----> 7 model_pyfunc.predict(X_train)

/databricks/python/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py in predict(self, data)
    594         if input_schema is not None:
    595             data = _enforce_schema(data, input_schema)
--> 596         return self._model_impl.predict(data)
    597 
    598     @property

/databricks/python/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

This happens only with data that contains None values. If I run inference on a subset of X_train without None, model_pyfunc.predict also works.
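To check the observation above, the input can be split into rows with and without missing values. This is a minimal sketch using pandas only; the X_train frame is borrowed from the reproduction posted later in this thread, and the names X_complete and X_missing are illustrative.

```python
import pandas as pd

# Same single-column frame as in the reproduction further down the thread.
X_train = pd.DataFrame({"feature": ["a", "a", "b", "b", None, "a", "a", "b", "b"]})

# Rows with no missing values: the subset on which pyfunc predict was reported to work.
X_complete = X_train.dropna()

# Rows with at least one missing value: the subset that triggers the failure.
X_missing = X_train[X_train.isna().any(axis=1)]

print(len(X_complete), len(X_missing))
```

Calling predict separately on each subset narrows the failure down to the rows that contain None.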

dbczumar · 2021-09-17T23:51:06Z

@yxiong Thanks for raising this issue! Can you provide a more complete stacktrace for the failure?

yxiong · 2021-09-18T00:41:53Z

Hi @dbczumar ,

What I posted is actually the entire stack trace (only two function calls and then the exception). I also wrote a small piece of code that reproduces the issue. Hope that helps.

import pandas as pd
import mlflow

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

one_hot_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
transformers = [("onehot", one_hot_pipeline, ["feature"])]
preprocessor = ColumnTransformer(transformers, remainder="passthrough", sparse_threshold=0)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier()),
])

mlflow.autolog()
with mlflow.start_run(run_name='decision_tree') as run:
  model.fit(X_train, y_train)

model.predict(X_train)  # ==> ok

model_pyfunc = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(run_id=run.info.run_id))
model_pyfunc.predict(X_train)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-1260285005520992> in <module>
      2   'runs:/{run_id}/model'.format(run_id=run.info.run_id))
      3 
----> 4 model_pyfunc.predict(X_train)

/databricks/python/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py in predict(self, data)
    594         if input_schema is not None:
    595             data = _enforce_schema(data, input_schema)
--> 596         return self._model_impl.predict(data)
    597 
    598     @property

/databricks/python/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

dbczumar · 2021-09-18T01:07:40Z

@yxiong Got it. This seems to be an issue with the dataset being erroneously transformed during model schema enforcement that occurs within the pyfunc inference procedure. I'll loop in an area expert who can take a look.

yxiong · 2021-09-19T17:43:25Z

Thanks for triaging, @dbczumar !

I looked into this a little more, and confirmed that this is an issue with ModelSignature. Here is a more concise code snippet that reproduces the issue:

import mlflow
import pandas as pd
import random
import string

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", DecisionTreeClassifier()),
])

model.fit(X_train, y_train)
model.predict(X_train)  # ==> ok

# Save the model with a signature, then load it as a pyfunc model.
path = "/tmp/sklearn-model/" + ''.join(
    random.choice(string.ascii_lowercase) for _ in range(6))
print("Model path =", path)
signature = mlflow.models.infer_signature(model_input=X_train)
mlflow.sklearn.save_model(model, path, signature=signature)
pyfunc_model = mlflow.pyfunc.load_model(path)
pyfunc_model.predict(X_train)   # AttributeError: 'bool' object has no attribute 'any'

If I remove the signature argument from the save_model call, the code executes without any error.

yxiong · 2021-09-21T22:41:39Z

I believe I have identified the root cause:

In PyFuncModel.predict, the call to _enforce_schema casts the column from NumPy object dtype (np.object) to the pandas string dtype. In this process, native None objects are converted to pd.NA (pandas._libs.missing.NAType).
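The effect of that cast can be reproduced with pandas alone. The following is a minimal sketch, not MLflow's _enforce_schema code; the variable names raw and cast are illustrative.

```python
import pandas as pd

# Object-dtype column holding a native None, as in the reproduction above.
raw = pd.Series(["a", None, "b"], dtype=object)

# Casting to the pandas string extension dtype replaces None with pd.NA.
cast = raw.astype("string")

print(raw.iloc[1] is None)    # the original column still holds None
print(cast.iloc[1] is pd.NA)  # the cast column holds pd.NA instead
print(type(cast.iloc[1]))     # the NAType class mentioned above
```

A column left as object dtype keeps its None values, so an imputer configured with missing_values=None still sees the value it expects.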

However, the SimpleImputer(missing_values=None, strategy="constant", fill_value="") I used cannot handle the NAType:

~/opt/anaconda3/envs/mlflow-dev-env/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    108     # for object dtype data, we only check for NaNs (GH-13254)
    109     elif X.dtype == np.dtype('object') and not allow_nan:
--> 110         if _object_dtype_isnan(X).any():
    111             raise ValueError("Input contains NaN")
    112
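The traceback stops at a check of the form _object_dtype_isnan(X).any(). Assuming that helper relies on self-inequality to detect NaN (an assumption about scikit-learn's internals, not confirmed in this thread), the sketch below shows how the three missing-value markers behave under that comparison. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NaN is the only one of the three markers that compares unequal to itself.
nan_check = float("nan") != float("nan")  # a plain bool
none_check = None != None                 # a plain bool
na_check = pd.NA != pd.NA                 # pd.NA, not a bool

print(nan_check, none_check, na_check)

# On an object array containing NaN, the elementwise self-comparison
# yields a boolean array, so calling .any() on the result works.
X = np.array(["a", np.nan], dtype=object)
mask = X != X
print(mask.any())
```

Because comparisons involving pd.NA do not produce a plain boolean, a NaN check written for None or NaN can behave differently once the values have been converted to pd.NA.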

@tomasatdatabricks Do you have some suggestions on how this should be fixed?

yxiong · 2021-11-08T01:47:44Z

[Update] After scikit-learn/scikit-learn#21114, the validation can pass if we set missing_values to pd.NA instead of None. See code snippet below:

import mlflow
import pandas as pd
import random
import string

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=pd.NA, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", DecisionTreeClassifier()),
])

model.fit(X_train, y_train)
model.predict(X_train)  # ==> ok

# Save the model with a signature, then load it as a pyfunc model.
path = ("/tmp/sklearn-model/" +
        ''.join(random.choice(string.ascii_lowercase) for _ in range(6)))
print("Model path =", path)
signature = mlflow.models.infer_signature(model_input=X_train)
mlflow.sklearn.save_model(model, path, signature=signature)
pyfunc_model = mlflow.pyfunc.load_model(path)
pyfunc_model.predict(X_train)   # ==> AttributeError: 'bool' object has no attribute 'any'

aarondav · 2021-12-08T01:35:46Z

Thanks for making the change to scikit-learn! It does seem like a good route to me. MLflow's schema enforcement seems to convert None values to pandas.NA, which doesn't seem totally unreasonable -- it just doesn't work with scikit-learn's SimpleImputer.

I'm not sure if there's another "more correct" behavior for MLflow. We could keep "None", but it's not obvious that that's better in all cases, so fixing this on the scikit side seems like a reasonable long-term solution.

cc @tomasatdatabricks

yxiong · 2021-12-08T04:59:37Z

Thanks for the feedback, @aarondav. I was able to connect with @tomasatdatabricks, and we think it's probably best not to cast pandas object columns to string for now, as doing so breaks a few downstream places. #5134 fixed this issue.

aarondav · 2021-12-08T18:05:31Z

Oh, nice, that's great. Thanks for the follow-up.

yxiong added the bug label Sep 17, 2021

github-actions bot added the area/models and integrations/databricks labels Sep 17, 2021

dbczumar assigned tomasatdatabricks Sep 18, 2021

yxiong closed this as completed Dec 8, 2021
