[BUG] pyfunc cannot predict dataframe with None value properly #4827

Closed
3 of 23 tasks
yxiong opened this issue Sep 17, 2021 · 9 comments

@yxiong
Contributor

yxiong commented Sep 17, 2021

Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.

Please fill in this bug report template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Databricks ML runtime 9.0
  • MLflow installed from (source or binary):
  • MLflow version (run mlflow --version):
  • Python version:
  • npm version, if running the dev UI:
  • Exact command to reproduce:

Describe the problem

My training data contains None values, so I built a scikit-learn pipeline with an imputer to handle them. I then trained the pipeline model with MLflow tracking enabled:

transformers = [
    ("numerical", SimpleImputer(strategy="mean"), ["foo", "bar"]),
    ......
]
preprocessor = ColumnTransformer(
    transformers, remainder="passthrough", sparse_threshold=0)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(...)),
])

with mlflow.start_run(run_name="my_run") as mlflow_run:
    model.fit(X_train, y_train)
    mlflow.sklearn.eval_and_log_metrics(model, X_val, y_val)

The trained model itself can predict without any problem:

model.predict(X_train)  # ==> OK

But the pyfunc object I got from MLflow doesn't work:

model_pyfunc = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=mlflow_run.info.run_id
  )
)

model_pyfunc.predict(X_train)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-4119235480823764> in <module>
      5 )
      6 
----> 7 model_pyfunc.predict(X_train)

/databricks/python/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py in predict(self, data)
    594         if input_schema is not None:
    595             data = _enforce_schema(data, input_schema)
--> 596         return self._model_impl.predict(data)
    597 
    598     @property

/databricks/python/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

This happens only with data that contains None values. If I run inference on a subset of X_train without None, the model_pyfunc.predict function also works.

Code to reproduce issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@yxiong yxiong added the bug Something isn't working label Sep 17, 2021
@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors integrations/databricks Databricks integrations labels Sep 17, 2021
@dbczumar
Collaborator

@yxiong Thanks for raising this issue! Can you provide a more complete stacktrace for the failure?

@yxiong
Contributor Author

yxiong commented Sep 18, 2021

Hi @dbczumar ,

What I posted is actually the entire stack trace (only two function calls and then the exception). I also created a small piece of code that reproduces the issue. Hope that helps.

import pandas as pd
import mlflow

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

one_hot_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
transformers = [("onehot", one_hot_pipeline, ["feature"])]
preprocessor = ColumnTransformer(transformers, remainder="passthrough", sparse_threshold=0)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier()),
])

mlflow.autolog()
with mlflow.start_run(run_name='decision_tree') as run:
  model.fit(X_train, y_train)

model.predict(X_train)  # ==> ok

model_pyfunc = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(run_id=run.info.run_id))
model_pyfunc.predict(X_train)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-1260285005520992> in <module>
      2   'runs:/{run_id}/model'.format(run_id=run.info.run_id))
      3 
----> 4 model_pyfunc.predict(X_train)

/databricks/python/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py in predict(self, data)
    594         if input_schema is not None:
    595             data = _enforce_schema(data, input_schema)
--> 596         return self._model_impl.predict(data)
    597 
    598     @property

/databricks/python/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

@dbczumar
Collaborator

@yxiong Got it. This seems to be an issue with the dataset being erroneously transformed during model schema enforcement that occurs within the pyfunc inference procedure. I'll loop in an area expert who can take a look.

@yxiong
Contributor Author

yxiong commented Sep 19, 2021

Thanks for triaging, @dbczumar !

I looked into this a little more, and confirmed that this is an issue with ModelSignature. Here is a more concise code snippet that reproduces the issue:

import mlflow
import pandas as pd
import random
import string

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", DecisionTreeClassifier()),
])

model.fit(X_train, y_train)
model.predict(X_train)  # ==> ok

# Saving the model with a signature and loading it as pyfunc works; predict() then fails.
path = "/tmp/sklearn-model/" + ''.join(
    random.choice(string.ascii_lowercase) for _ in range(6))
print("Model path =", path)
signature = mlflow.models.infer_signature(model_input=X_train)
mlflow.sklearn.save_model(model, path, signature=signature)
pyfunc_model = mlflow.pyfunc.load_model(path)
pyfunc_model.predict(X_train)   # AttributeError: 'bool' object has no attribute 'any'

If I remove the signature argument from the save_model call, the code executes without any error.

@yxiong
Contributor Author

yxiong commented Sep 21, 2021

I believe I have identified the root cause:

  1. PyFuncModel.predict calls _enforce_schema, which casts the column from NumPy object dtype to the pandas string dtype. In this process, native None objects are converted to pd.NA (pandas._libs.missing.NAType).

  2. However, the SimpleImputer(missing_values=None, strategy="constant", fill_value="") I used cannot handle the NAType:

    ~/opt/anaconda3/envs/mlflow-dev-env/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
        108     # for object dtype data, we only check for NaNs (GH-13254)
        109     elif X.dtype == np.dtype('object') and not allow_nan:
    --> 110         if _object_dtype_isnan(X).any():
        111             raise ValueError("Input contains NaN")
        112 

@tomasatdatabricks Do you have some suggestions on how this should be fixed?
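
The cast described in step 1 can be sketched in isolation. This is a minimal illustration, not MLflow's code: it assumes the schema-enforcement cast behaves like `astype("string")`, and the variable names `raw` and `cast` are made up for the example. It also shows that comparisons with `pd.NA` do not yield a plain bool, which is a plausible (unconfirmed) reason the NaN check in the traceback misbehaves.

```python
import pandas as pd

# Object-dtype column holding a native None, as in the reproduction above.
raw = pd.Series(["a", None, "b"], dtype=object)

# Assumption: the schema-enforcement cast behaves like astype("string").
cast = raw.astype("string")

print(raw[1] is None)    # the original missing value is Python's None
print(cast[1] is pd.NA)  # after the cast it is the pandas NA singleton
print(cast[1] is None)   # so a check for None no longer matches it

# Comparisons with pd.NA propagate NA instead of returning a bool ...
print((pd.NA != pd.NA) is pd.NA)

# ... and NA has no truth value, so coercing it to bool raises TypeError.
try:
    bool(pd.NA)
except TypeError as exc:
    print("TypeError:", exc)
```

An imputer configured with `missing_values=None` is looking for the first kind of value, while the data that reaches it after the cast contains the second.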

@yxiong
Contributor Author

yxiong commented Nov 8, 2021

[Update] After scikit-learn/scikit-learn#21114, the validation can pass if we set missing_values to pd.NA instead of None. See code snippet below:

import mlflow
import pandas as pd
import random
import string

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=pd.NA, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", DecisionTreeClassifier()),
])

model.fit(X_train, y_train)
model.predict(X_train)  # ==> ok

# Save the model with a signature, then load it as pyfunc.
path = ("/tmp/sklearn-model/" +
        ''.join(random.choice(string.ascii_lowercase) for _ in range(6)))
print("Model path =", path)
signature = mlflow.models.infer_signature(model_input=X_train)
mlflow.sklearn.save_model(model, path, signature=signature)
pyfunc_model = mlflow.pyfunc.load_model(path)
pyfunc_model.predict(X_train)   # ==> AttributeError: 'bool' object has no attribute 'any'

@aarondav
Contributor

aarondav commented Dec 8, 2021

Thanks for making the change to scikit-learn! It does seem like a good route to me. MLflow's schema enforcement seems to convert None values to pandas.NA, which doesn't seem totally unreasonable -- it just doesn't work with scikit-learn's SimpleImputer.

I'm not sure if there's another "more correct" behavior for MLflow. We could keep "None", but it's not obvious that that's better in all cases, so fixing this on the scikit side seems like a reasonable long-term solution.

cc @tomasatdatabricks

@yxiong
Contributor Author

yxiong commented Dec 8, 2021

Thanks for the feedback, @aarondav. I was able to connect with @tomasatdatabricks, and we think it's probably best not to cast pandas object columns to string for now, since doing so would break a few downstream places. #5134 fixed this issue.

@yxiong yxiong closed this as completed Dec 8, 2021
@aarondav
Contributor

aarondav commented Dec 8, 2021

Oh, nice, that's great. Thanks for the follow-up.
