Caching boto client to improve artifact download speed #4695
Conversation
@Samreay Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/3310771499. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.
Signed-off-by: Samuel Hinton <sh@arenko.group>
@Samreay this looks great! Can you try this out manually and add a description of the manual test results to the PR description? I'll do the same!
Yup, can do, I'll let it run for a few days in our dev server and report back :)
Thank you!
Hi @dbczumar, I've been unable to test this in our dev system because we can only get the UI to display the warning message below:
The instructions in the contributing guide only detail how to start a local server with npm in the background. I've also added 5-minute caching, which should still resolve bulk artifact download issues but caters to boto3 endpoint URLs that might have a 12-hour expiry on pre-signed URLs.
EDIT: I can still test the impact of the modifications on the mlflow library when not running it as a server (i.e. for the packages that talk to the server); I just wanted to test both for completeness. If you believe testing just the latter is sufficient (if the server doesn't use the artifact downloading at any point), then even easier.
…nt urls Signed-off-by: Samuel Hinton <sh@arenko.group>
Hi @Samreay, I think
Hey @harupy - that's similar to what I tried initially. I had an npm start run in the background and then kicked off mlflow server, but got the same UI-in-dev message.
Could you try
I'll give it a shot and report back :)
LGTM! Thanks @Samreay. Can you follow the instructions in https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work to sign your commits?
Signed-off-by: dbczumar <corey.zumar@databricks.com>
@Samreay looks like there are a couple of S3 artifact repo test failures. Can you take a look at those?
Signed-off-by: Samuel Hinton <samuelreay@gmail.com>
Hey @dbczumar - interesting. I see there are pylint issues, but I don't see anything when I run pylint myself, nor are there any details I can see in the error log. Is there a different way to run? Running the actual
I've updated the tests as well.
Hi @Samreay, thanks for making these changes! I've pushed formatting tweaks generated by running
@Samreay Looks like there are still a few failures in
Hey @dbczumar, apologies - the issues should be fixed now.
@dbczumar And somehow there are new issues, of course. I'm going away on leave, but I'll try and figure out what's happening here with the tests when I'm back. If you have time to have a look first, just let me know :)
@Samreay Are you still working on this PR? If not, can I take it up? |
Hey @harupy , I haven't had a chance to look at it. I put a ticket on my backlog and then it was deprioritised and I have no clue when I'll get time allocation to fix up the tests. If this is something you're able to look at, I would greatly appreciate it! |
Thanks for the reply, I think I can handle this! |
Signed-off-by: harupy <hkawamura0130@gmail.com>
@Samreay I've pushed several changes. Do they look good? |
They look great, I'll have to remember
autoformat |
autoformat |
Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
LGTM!
What changes are proposed in this pull request?
As per #4668, the S3ArtifactDownloader is excessively slow as each file requires a new boto3 client to be instantiated and verified. For larger models with many files, this represents a significant slowdown, extending the download time by around 200%.
This PR separates out the boto3 client creation into a cached function.
As per the discussion in the linked issue, there appears to be no time-based expiry of boto3 S3 clients mentioned in the documentation.
How is this patch tested?
I am unsure about how to test mlflow properly and would appreciate some guidance on this. If the code itself looks acceptable, I can carry out my own testing on this locally over a longer time period (several days, to validate the client expiry concerns).
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
How should the PR be classified in the release notes? Choose one:
- rn/breaking-change: The PR will be mentioned in the "Breaking Changes" section
- rn/none: No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature: A new user-facing feature worth mentioning in the release notes
- rn/bug-fix: A user-facing bug fix worth mentioning in the release notes
- rn/documentation: A user-facing documentation change worth mentioning in the release notes