
BUG: Fix model serving to fail with HTTP 400 on bad requests. #5003

Merged
merged 12 commits into master from issue#4897 on Dec 3, 2021

Conversation

abatomunkuev
Contributor

@abatomunkuev abatomunkuev commented Nov 4, 2021

Signed-off-by: Andrei Batomunkuev <andreibatomunkuev@gmail.com>

What changes are proposed in this pull request?

Solves #4897.

Please refer to my root cause analysis of the issue. As explained there, the root cause is most likely that MALFORMED_REQUEST is passed as the error_code argument to the _handle_serving_error function.

To fix this, I have changed the error_code argument from MALFORMED_REQUEST to BAD_REQUEST.
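
For illustration, the fix is essentially a one-line change of the error-code constant at the call site. The following is a minimal sketch of the pattern, assuming the call shape described above; the exact signature in mlflow/pyfunc/scoring_server/__init__.py may differ:

from mlflow.protos.databricks_pb2 import BAD_REQUEST

# Previously error_code=MALFORMED_REQUEST, which has no HTTP status mapping
# and therefore fell back to 500 INTERNAL SERVER ERROR.
_handle_serving_error(
    error_message=(
        "Failed to parse input from JSON. Ensure that input is a valid "
        "JSON formatted string."
    ),
    error_code=BAD_REQUEST,  # maps to HTTP 400
)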

The HyperText Transfer Protocol (HTTP) 500 Internal Server Error server error response code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request. (Source: MDN Web Docs)

The HyperText Transfer Protocol (HTTP) 400 Bad Request response status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing). (Source: MDN Web Docs)

To reproduce the error, let's run the following command from the issue:

curl -i -X POST -d "{\"data\":0.0199132142]}" -H "Content-Type: application/json" http://localhost:5000/invocations

In this case, the response should be 400 Bad Request, since we are passing a JSON object with invalid syntax, and the error message should reflect that.

How is this patch tested?

Currently, the unit tests in test_scoring_server.py fail since we have updated the value to BAD_REQUEST, so we need to either fix those existing tests or write new ones. I will need some help fixing the unit tests; I am a little confused about which unit tests belong to this issue.
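
To run the affected tests locally, something like the following should work (the module path is an assumption based on the test file name mentioned above):

pytest tests/pyfunc/test_scoring_server.py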

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

When a user sends a JSON object with invalid syntax to the server, the server should return the following response:

HTTP/1.1 400 BAD REQUEST
Server: gunicorn
Date: Thu, 04 Nov 2021 06:29:13 GMT
Connection: close
Content-Type: application/json
Content-Length: 1030

{"error_code": "BAD_REQUEST", "message": "Failed to parse input from JSON. Ensure that input is a valid JSON formatted string.",....

Example of a request with invalid JSON syntax:

curl -i -X POST -d "{\"data\":0.0199132142]}" -H "Content-Type: application/json" http://localhost:5000/invocations

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions

github-actions bot commented Nov 4, 2021

@abatomunkuev Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/4102308049. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.

@github-actions github-actions bot added the area/scoring (MLflow Model server, model deployment tools, Spark UDFs) and rn/bug-fix (Mention under Bug Fixes in Changelogs) labels on Nov 4, 2021
@abatomunkuev abatomunkuev force-pushed the issue#4897 branch 2 times, most recently from fdb9b38 to a6340c8 on November 4, 2021 11:41
Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@dbczumar
Collaborator

dbczumar commented Nov 4, 2021

Hi @abatomunkuev, the failures in https://github.com/mlflow/mlflow/runs/4104942369?check_suite_focus=true appear to be caused by test cases asserting that the response of certain scoring operations with bad input is 500. Now that we've changed these to 400, can we update the test cases as well?

The failures in https://github.com/mlflow/mlflow/runs/4104942145?check_suite_focus=true appear to be unrelated. I'm investigating the cause.

@dbczumar
Copy link
Collaborator

dbczumar commented Nov 4, 2021

The flavors tests have been fixed via https://github.com/tensorflow/tensorflow/pull/52927/files, which was just released with TensorFlow 2.6. I've restarted those tests. We'll still need to address the other failures noted above.

@abatomunkuev
Contributor Author

@dbczumar Hello! I am going to work on updating the tests.

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@abatomunkuev
Contributor Author

@dbczumar Hello! I have made some changes to the following test case in tests/models/test_cli.py:

def _validate_with_rest_endpoint(scoring_proc, host_port, df, x, sk_model, enable_mlserver=False):

I'm wondering: should we update the following line?

assert scoring_response_dict["error_code"] == ErrorCode.Name(MALFORMED_REQUEST)

Should we change ErrorCode.Name(MALFORMED_REQUEST) to ErrorCode.Name(BAD_REQUEST)?

@dbczumar
Collaborator


Hi @abatomunkuev, yes, let's change MALFORMED_REQUEST to BAD_REQUEST.
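
For reference, the updated assertion would look roughly like this (a sketch; the import assumes the error-code constants come from mlflow.protos.databricks_pb2, as elsewhere in the codebase):

from mlflow.protos.databricks_pb2 import BAD_REQUEST, ErrorCode

# Bad client input should now surface as a 400-class error code.
assert scoring_response_dict["error_code"] == ErrorCode.Name(BAD_REQUEST)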

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@abatomunkuev
Contributor Author

@dbczumar done!

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks, @abatomunkuev!

@dbczumar
Collaborator

dbczumar commented Nov 15, 2021

@abatomunkuev Almost there! Looks like test_build_docker and test_build_docker_with_env_override are still expecting 500 responses, so the tests are failing. Can you update them?

@dbczumar
Collaborator

@abatomunkuev Apologies for misinterpreting the test output. It appears that the server is still responding with 500s where 400s are now expected.

@abatomunkuev
Contributor Author

@dbczumar Should we update the error codes in pyfunc/scoring_server/__init__.py?

In some functions we are still returning MALFORMED_REQUEST.

https://github.com/abatomunkuev/mlflow/blob/18e0d2caa8ba7849f4f53ec6c08ce4eb3fe1a1f7/mlflow/pyfunc/scoring_server/__init__.py#L74-L177

I am not sure if it's a good way to solve the issue.

Alternatively, we could update the _validate_with_rest_endpoint() test case. What do you think?

@dbczumar
Collaborator


@abatomunkuev I think it's best to change other instances to BAD_REQUEST. MALFORMED_REQUEST currently produces incorrect behavior, as the correct response type to invalid inputs is a 400 error.
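
To illustrate the pattern being changed throughout the file (a sketch of the idea, not the exact diff), each remaining place that reports MALFORMED_REQUEST would switch to BAD_REQUEST:

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import BAD_REQUEST

# Invalid client input should surface as a 400-class error rather than a 500.
raise MlflowException(
    message=(
        "Failed to parse input from JSON. Ensure that input is a valid "
        "JSON formatted string."
    ),
    error_code=BAD_REQUEST,
)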

@abatomunkuev
Contributor Author

Okay, I am going to update the code in pyfunc/scoring_server/__init__.py

@dbczumar
Collaborator


Thank you!

@dbczumar
Collaborator

@abatomunkuev Looks like there are a few more small failures in test cases that expect MALFORMED_REQUEST instead of BAD_REQUEST. Can we update the test cases?

@abatomunkuev
Contributor Author

@dbczumar Sure, I just need to find which test cases fail.

@abatomunkuev
Contributor Author

@dbczumar Thanks!

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@abatomunkuev
Contributor Author

abatomunkuev commented Nov 23, 2021

@dbczumar The tests were updated; that should resolve the issue in the flavors tests.

However, in the models tests we got 2 failures. It seems to me that we are still getting 500 INTERNAL SERVER ERROR.

[screenshot: failing model tests showing 500 INTERNAL SERVER ERROR responses]

@dbczumar
Collaborator

@abatomunkuev Perhaps there's something special about the empty string case being tested here?

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@abatomunkuev
Contributor Author

@dbczumar Could you review the changes? I hope this time it will work.

I am wondering whether we should leave MALFORMED_REQUEST in this case:
6677ca4#diff-e607d3e6aefa036b1bfe9416fd1aa8b0934120f458cf263ddebab224f832e4a5R99

@dbczumar
Collaborator

Thanks @abatomunkuev. Your change looks good - let's get rid of MALFORMED_REQUEST in this file. Can we also use BAD_REQUEST here? https://github.com/mlflow/mlflow/pull/5003/files#diff-e607d3e6aefa036b1bfe9416fd1aa8b0934120f458cf263ddebab224f832e4a5R99

@abatomunkuev
Contributor Author

@dbczumar Sure!

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@dbczumar
Collaborator

@abatomunkuev Can we resolve conflicts with the master branch for mlflow/pyfunc/scoring_server/__init__.py?

@abatomunkuev
Contributor Author

abatomunkuev commented Nov 23, 2021

@dbczumar Yes.

[screenshot: merge conflict shown in mlflow/pyfunc/scoring_server/__init__.py]

What do I need to do to resolve this conflict?

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
@abatomunkuev
Contributor Author

abatomunkuev commented Nov 23, 2021

@dbczumar
https://github.com/abatomunkuev/mlflow/blob/0439bfb25e94d44814c6cd3af3c4a84af368ee3f/mlflow/pyfunc/scoring_server/__init__.py#L116-L136

I think this portion of code causes the error, but I don't know why, since we changed all MALFORMED_REQUEST instances to BAD_REQUEST.

Or maybe we should update the value of the data argument.
https://github.com/abatomunkuev/mlflow/blob/0439bfb25e94d44814c6cd3af3c4a84af368ee3f/tests/models/test_cli.py#L480

@dbczumar
Collaborator

@abatomunkuev Let me know if you need help investigating this.

@abatomunkuev
Contributor Author

@dbczumar I would like to run the tests locally. However, I got an error building a Docker image to serve the model.

I'm running a test: pytest tests/models/test_cli.py --large

[screenshot: Docker image build error]

@abatomunkuev
Contributor Author

I am going to update the value of the data argument.
https://github.com/abatomunkuev/mlflow/blob/0439bfb25e94d44814c6cd3af3c4a84af368ee3f/tests/models/test_cli.py#L480
Maybe something like this:

scoring_response = endpoint.invoke(data="this is,,,, not valid json", content_type=content_type)

Signed-off-by: Andrei Batomunkuev <abatomunkuev@myseneca.ca>
@dbczumar
Collaborator

@abatomunkuev Did we identify an issue with the empty string?

@abatomunkuev
Contributor Author

@dbczumar Not completely sure, but it might be the cause.

Also, I have a question: how can I test the cases that involve Docker containers and serving the model on my local machine? I am not able to do that; it gives me an error.
[screenshot: error when running Docker-based serving tests locally]

@dbczumar
Collaborator

@abatomunkuev Do you have Docker installed on your machine?

@abatomunkuev
Contributor Author

@dbczumar Yes, I do.

Signed-off-by: harupy <hkawamura0130@gmail.com>
Signed-off-by: harupy <hkawamura0130@gmail.com>
@harupy
Member

harupy commented Dec 2, 2021

@dbczumar @abatomunkuev I pushed a commit to fix the test failures: 4308be6

It looks like mlserver doesn't respect the status code of MlflowException.
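
To make that concrete (an illustrative sketch only; the actual fix is in commit 4308be6): an MlflowException carries both an error code and a JSON serialization, and the serving layer is expected to translate the error code into the HTTP status:

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import BAD_REQUEST

try:
    raise MlflowException("Failed to parse input from JSON.", error_code=BAD_REQUEST)
except MlflowException as e:
    print(e.error_code)           # "BAD_REQUEST"
    print(e.serialize_as_json())  # the JSON error body returned to the client

mlserver apparently returned its own status code instead of deriving it from the exception's error code.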

@dbczumar
Collaborator

dbczumar commented Dec 3, 2021

Fantastic! Thank you so much for your contribution, @abatomunkuev , and thank you for identifying the cause of the failure, @harupy !
