Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching boto client to improve artifact download speed #4695

Merged
merged 12 commits into from Dec 2, 2021

Conversation

Samreay
Copy link
Contributor

@Samreay Samreay commented Aug 12, 2021

What changes are proposed in this pull request?

As per #4668, the S3ArtifactDownloader is excessively slow as each file requires a new boto3 client to be instantiated and verified. For larger models with many files, this represents a significant slowdown, extending the download time by around 200%.

This PR separates out the boto3 client creation into a cached function.

As per the discussion in the linked issue, there appears to be no time based expiry of boto3 S3 clients in the documentation:

How is this patch tested?

I am unsure about how to test mlflow properly and would appreciate some guidance on this.

If the code itself looks acceptable, I can carry out my own testing on this locally over a longer time period (several days to validate the client expiry concerns).

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions
Copy link

@Samreay Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/3310771499. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.

@github-actions github-actions bot added area/artifacts Artifact stores and artifact logging rn/bug-fix Mention under Bug Fixes in Changelogs. labels Aug 12, 2021
Signed-off-by: Samuel Hinton <sh@arenko.group>
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Samreay this looks great! Can you try this out manually and add a description of the manual test results to the PR description? I'll do the same!

@Samreay
Copy link
Contributor Author

Samreay commented Aug 12, 2021

Yup, can do, I'll let it run for a few days in our dev server and report back :)

@dbczumar
Copy link
Collaborator

Yup, can do, I'll let it run for a few days in our dev server and report back :)

Thank you!

@Samreay
Copy link
Contributor Author

Samreay commented Aug 17, 2021

Hi @dbczumar, Ive been unable to test this in our dev system because we can only get the UI to display the warning message below:

Unable to display MLflow UI - landing page (index.html) not found.

You are very likely running the MLflow server using a source installation of the Python MLflow
package.

If you are a developer making MLflow source code changes and intentionally running a source
installation of MLflow, you can view the UI by running the Javascript dev server:
https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#running-the-javascript-dev-server

Otherwise, uninstall MLflow via 'pip uninstall mlflow', reinstall an official MLflow release
from PyPI via 'pip install mlflow', and rerun the MLflow server.

The instructions in the contributing guide only detail how to start a local server with npm in the background and mlflow ui. However, we need to point this to an s3 bucket and remote server for a proper test (which we normally do using mlflow server). Do you know how to mesh the two?

I've also added 5-minute caching which should still resolve bulk artifact download issues, but caters to some of the boto3 endpoint urls which might have 12 hour expiry of pre-signed urls.

EDIT: I can still test the impact of the modifications on the mlflow library when not running it as a server (ie for the packages that talk to the server), I just wanted to test both for completeness. If you believe testing just the latter is sufficient (if the server doesnt use the Artifact downloading at any point) then even easier.

…nt urls

Signed-off-by: Samuel Hinton <sh@arenko.group>
@harupy
Copy link
Member

harupy commented Aug 18, 2021

Hi @Samreay, I think mlflow server ... and cd mlflow/server/js && npm start should work together. Please let us know if they don't.

@Samreay
Copy link
Contributor Author

Samreay commented Aug 18, 2021

Hey @harupy - thats similar to what I tried initially. I had an npm start run in the background and then kicked off mlflow server, but got the same UI in dev message.

@harupy
Copy link
Member

harupy commented Aug 18, 2021

Could you try npm run build (which may take a while to complete), then run mlflow server?

@Samreay
Copy link
Contributor Author

Samreay commented Aug 18, 2021

Ill give it a shot and report back :)

@Samreay
Copy link
Contributor Author

Samreay commented Aug 31, 2021

Hey @dbczumar , @harupy,

Been a couple of weeks, and thought I'd report back that we've been running both dev and since last week, prod, using this PR branch and have had no issues (and a nice performance improvement).

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @Samreay . Can you follow the instructions in https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work to sign your commits?

@Samreay
Copy link
Contributor Author

Samreay commented Sep 9, 2021

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Samreay looks like there are a couple of s3 artifact repo test failures. Can you take a look at those?

Samuel Hinton added 2 commits September 9, 2021 12:32
@Samreay
Copy link
Contributor Author

Samreay commented Sep 9, 2021

Hey @dbczumar - interesting.

I see there are pylint issues, but I dont see anything when I run pylint myself nor are there any details I can see in the error log. Is there a different way to run? Running the actual ./lint.sh says everything needs to be reformatted, so I dont attach its results here.

(base) samreay@Samuels-MacBook-Pro artifact % pylint --rcfile ../../../pylintrc s3_artifact_repo.py

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

(base) samreay@Samuels-MacBook-Pro artifact % black s3_artifact_repo.py 
All done! ✨ 🍰 ✨
1 file left unchanged.

Ive updated the tests as well.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
@dbczumar
Copy link
Collaborator

dbczumar commented Sep 9, 2021

Hi @Samreay , thanks for making these changes! I've pushed formatting tweaks generated by running black --line-length=100 --exclude=mlflow/protos . from the root of the MLflow repo with black version 19.10b0.

@dbczumar
Copy link
Collaborator

@Samreay Looks like there are still a few failures in test_s3_artifact_repo.py

@Samreay
Copy link
Contributor Author

Samreay commented Sep 10, 2021

Hey @dbczumar , apologies, the issues should be fixed now.

@Samreay
Copy link
Contributor Author

Samreay commented Sep 10, 2021

@dbczumar And somehow there are new issues, of course. Im going away on leave, but Ill try and figure out whats happening here with the tests when Im back. If you have time to have a look first, just let me know :)

@harupy
Copy link
Member

harupy commented Nov 30, 2021

@Samreay Are you still working on this PR? If not, can I take it up?

@Samreay
Copy link
Contributor Author

Samreay commented Dec 1, 2021

Hey @harupy , I haven't had a chance to look at it. I put a ticket on my backlog and then it was deprioritised and I have no clue when I'll get time allocation to fix up the tests. If this is something you're able to look at, I would greatly appreciate it!

@harupy
Copy link
Member

harupy commented Dec 1, 2021

Thanks for the reply, I think I can handle this!

Signed-off-by: harupy <hkawamura0130@gmail.com>
Signed-off-by: harupy <hkawamura0130@gmail.com>
@harupy
Copy link
Member

harupy commented Dec 1, 2021

@Samreay I've pushed several changes. Do they look good?

@Samreay
Copy link
Contributor Author

Samreay commented Dec 1, 2021

They look great, I'll have to remember autouse=True for future fixtures!

@harupy
Copy link
Member

harupy commented Dec 2, 2021

autoformat

Signed-off-by: harupy <hkawamura0130@gmail.com>
@harupy
Copy link
Member

harupy commented Dec 2, 2021

autoformat

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Copy link
Member

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@harupy harupy merged commit 19a82fe into mlflow:master Dec 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/artifacts Artifact stores and artifact logging rn/bug-fix Mention under Bug Fixes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants