Caching boto client to improve artifact download speed #4695

Samreay · 2021-08-12T10:55:15Z

What changes are proposed in this pull request?

As per #4668, the S3ArtifactDownloader is excessively slow as each file requires a new boto3 client to be instantiated and verified. For larger models with many files, this represents a significant slowdown, extending the download time by around 200%.

This PR separates out the boto3 client creation into a cached function.
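A minimal sketch of the idea, assuming `functools.lru_cache` as the caching mechanism. The helper name `get_s3_client` and its `endpoint_url` parameter are hypothetical and are not taken from the PR diff. `boto3` is imported inside the function so the module still loads where `boto3` is not installed.

```python
from functools import lru_cache


@lru_cache(maxsize=64)
def get_s3_client(endpoint_url=None):
    """Return a shared S3 client for this endpoint, building it at most once.

    Hypothetical helper for illustration; not the function added by the PR.
    """
    import boto3  # deferred import: only needed when a client is requested

    # Constructing the client is the expensive step, so it runs once per
    # distinct endpoint_url and later calls are served from the cache.
    return boto3.client("s3", endpoint_url=endpoint_url)
```

Repeated calls with the same `endpoint_url` then return the same client object instead of constructing a new one for every file.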

As per the discussion in the linked issue, the documentation does not appear to mention any time-based expiry of boto3 S3 clients.

How is this patch tested?

I am unsure about how to test mlflow properly and would appreciate some guidance on this.

If the code itself looks acceptable, I can carry out my own testing on this locally over a longer time period (several days to validate the client expiry concerns).

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

github-actions · 2021-08-12T10:56:29Z

@Samreay Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/3310771499. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.

Signed-off-by: Samuel Hinton <sh@arenko.group>

dbczumar

@Samreay this looks great! Can you try this out manually and add a description of the manual test results to the PR description? I'll do the same!

Samreay · 2021-08-12T21:47:28Z

Yup, can do, I'll let it run for a few days in our dev server and report back :)

dbczumar · 2021-08-12T23:18:26Z

Yup, can do, I'll let it run for a few days in our dev server and report back :)

Thank you!

Samreay · 2021-08-17T11:42:50Z

Hi @dbczumar, I've been unable to test this in our dev system because the UI only displays the warning message below:

Unable to display MLflow UI - landing page (index.html) not found.

You are very likely running the MLflow server using a source installation of the Python MLflow
package.

If you are a developer making MLflow source code changes and intentionally running a source
installation of MLflow, you can view the UI by running the Javascript dev server:
https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#running-the-javascript-dev-server

Otherwise, uninstall MLflow via 'pip uninstall mlflow', reinstall an official MLflow release
from PyPI via 'pip install mlflow', and rerun the MLflow server.

The instructions in the contributing guide only detail how to start a local server with npm in the background and mlflow ui. However, we need to point this to an s3 bucket and remote server for a proper test (which we normally do using mlflow server). Do you know how to mesh the two?

I've also added 5-minute caching, which should still resolve the bulk artifact download issue while catering to boto3 endpoint URLs that might expire (for example, pre-signed URLs with a 12-hour expiry).

EDIT: I can still test the impact of the modifications on the mlflow library when not running it as a server (i.e. for the packages that talk to the server); I just wanted to test both for completeness. If you believe testing just the latter is sufficient (if the server doesn't use the artifact downloading at any point), then even easier.

harupy · 2021-08-18T10:10:56Z

Hi @Samreay, I think mlflow server ... and cd mlflow/server/js && npm start should work together. Please let us know if they don't.

Samreay · 2021-08-18T10:12:48Z

Hey @harupy - that's similar to what I tried initially. I had npm start running in the background and then kicked off mlflow server, but got the same UI warning message.

harupy · 2021-08-18T10:13:47Z

Could you try npm run build (which may take a while to complete), then run mlflow server?

Samreay · 2021-08-18T10:17:52Z

I'll give it a shot and report back :)

Samreay · 2021-08-31T08:36:56Z

Hey @dbczumar , @harupy,

It's been a couple of weeks, and I thought I'd report back that we've been running dev and, since last week, prod on this PR branch and have had no issues (and a nice performance improvement).

dbczumar

LGTM! Thanks @Samreay . Can you follow the instructions in https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work to sign your commits?

Samreay · 2021-09-09T08:05:31Z

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

Signed-off-by: dbczumar <corey.zumar@databricks.com>

dbczumar

@Samreay looks like there are a couple of s3 artifact repo test failures. Can you take a look at those?

Signed-off-by: Samuel Hinton <samuelreay@gmail.com>

Samreay · 2021-09-09T11:39:12Z

Hey @dbczumar - interesting.

I see there are pylint issues, but I don't see anything when I run pylint myself, nor can I see any details in the error log. Is there a different way to run it? Running the actual ./lint.sh says everything needs to be reformatted, so I haven't attached its results here.

(base) samreay@Samuels-MacBook-Pro artifact % pylint --rcfile ../../../pylintrc s3_artifact_repo.py

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

(base) samreay@Samuels-MacBook-Pro artifact % black s3_artifact_repo.py 
All done! ✨ 🍰 ✨
1 file left unchanged.

I've updated the tests as well.

Signed-off-by: dbczumar <corey.zumar@databricks.com>

dbczumar · 2021-09-09T23:36:02Z

Hi @Samreay , thanks for making these changes! I've pushed formatting tweaks generated by running black --line-length=100 --exclude=mlflow/protos . from the root of the MLflow repo with black version 19.10b0.

dbczumar · 2021-09-10T04:48:08Z

@Samreay Looks like there are still a few failures in test_s3_artifact_repo.py

Samreay · 2021-09-10T09:06:36Z

Hey @dbczumar , apologies, the issues should be fixed now.

Samreay · 2021-09-10T17:57:45Z

@dbczumar And somehow there are new issues, of course. I'm going away on leave, but I'll try to figure out what's happening here with the tests when I'm back. If you have time to have a look first, just let me know :)

harupy · 2021-11-30T23:43:31Z

@Samreay Are you still working on this PR? If not, can I take it up?

Samreay · 2021-12-01T11:37:03Z

Hey @harupy , I haven't had a chance to look at it. I put a ticket on my backlog and then it was deprioritised and I have no clue when I'll get time allocation to fix up the tests. If this is something you're able to look at, I would greatly appreciate it!

harupy · 2021-12-01T16:15:31Z

Thanks for the reply, I think I can handle this!

Signed-off-by: harupy <hkawamura0130@gmail.com>

harupy · 2021-12-01T16:28:08Z

@Samreay I've pushed several changes. Do they look good?

Samreay · 2021-12-01T16:44:57Z

They look great, I'll have to remember autouse=True for future fixtures!
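For context on `autouse=True`: a pytest fixture declared this way is applied to every test in its scope without being requested by name, which is convenient for resetting shared state such as a client cache between tests. The sketch below is illustrative only; `cached_client` and `reset_client_cache` are hypothetical names and are not claimed to match the changes pushed to this PR.

```python
from functools import lru_cache

import pytest


@lru_cache(maxsize=None)
def cached_client(endpoint_url=None):
    """Stand-in (hypothetical) for a cached client factory."""
    return object()


@pytest.fixture(autouse=True)
def reset_client_cache():
    # autouse=True: pytest runs this around every test in scope, so no test
    # sees a client that was cached by an earlier test.
    cached_client.cache_clear()
    yield
    cached_client.cache_clear()


def test_client_is_reused_within_a_test():
    # Within a single test the cache still returns the same object.
    assert cached_client() is cached_client()
```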

harupy · 2021-12-02T01:24:27Z

autoformat

Signed-off-by: harupy <hkawamura0130@gmail.com>

harupy · 2021-12-02T02:14:39Z

autoformat

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

harupy

LGTM!

github-actions bot added the area/artifacts (Artifact stores and artifact logging) and rn/bug-fix (Mention under Bug Fixes in Changelogs) labels Aug 12, 2021

Caching boto client

c2f9a25

Signed-off-by: Samuel Hinton <sh@arenko.group>

Samreay force-pushed the master branch from 943eb8f to c2f9a25 Compare August 12, 2021 10:59

Samreay mentioned this pull request Aug 12, 2021

[BUG] S3 Artifact Downloading is Slow due to Multiple Client Verificiations #4668

Closed

23 tasks

dbczumar reviewed Aug 12, 2021

Adding five minute cache expiry to handle potential temp boto3 endpoint urls

04137af

Signed-off-by: Samuel Hinton <sh@arenko.group>

teimstamp is not used on purpose

1018874

dbczumar approved these changes Sep 9, 2021

Merge remote-tracking branch 'origin/master' into merge_mast

fa133eb

Signed-off-by: dbczumar <corey.zumar@databricks.com>

dbczumar reviewed Sep 9, 2021

Samuel Hinton added 2 commits September 9, 2021 12:32

Fixing tests and adding sign off

25ede63

Signed-off-by: Samuel Hinton <samuelreay@gmail.com>

Merge branch 'master' of https://github.com/Samreay/mlflow

485d5ed

Format

b343d43

Signed-off-by: dbczumar <corey.zumar@databricks.com>

Fixing tests

d61c991

harupy added 2 commits December 2, 2021 01:17

fix tests

71bfea5

Signed-off-by: harupy <hkawamura0130@gmail.com>

Merge branch 'master' into pr/Samreay/4695

6f31067

Signed-off-by: harupy <hkawamura0130@gmail.com>

lint

4835c5a

Signed-off-by: harupy <hkawamura0130@gmail.com>

Autoformat: https://github.com/mlflow/mlflow/actions/runs/1528968619

c1904c9

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

harupy approved these changes Dec 2, 2021

harupy merged commit 19a82fe into mlflow:master Dec 2, 2021
