Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.
Please fill in this bug report template to ensure a timely and thorough response.
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
Yes. I can contribute a fix for this bug independently.
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
No. I cannot contribute a bug fix at this time.
System information
Have I written custom code (as opposed to using a stock example script provided in MLflow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS, Linux
MLflow installed from (source or binary): pip installed
MLflow version (run mlflow --version): 1.18.0 (applies to master)
Python version: 3.8.5
npm version, if running the dev UI:
Exact command to reproduce:
Describe the problem
MLflow is very slow when downloading complex models that have many artifacts. Investigation shows that this is because every single file downloaded via S3ArtifactRepository._download_file creates a brand-new client using _get_s3_client, which includes verification.
Having verification is good, but if I am downloading 100 files, creating 100 clients and verifying the same details 100 times seems like a bug.
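To make the cost concrete, here is a minimal toy model of the behavior described above. The class and method names mirror MLflow's for readability, but this is an illustrative sketch, not MLflow's actual code:

```python
# Toy model of the issue: each file download constructs (and "verifies")
# a brand-new client, exactly once per file.
class FakeRepo:
    def __init__(self):
        self.clients_created = 0

    def _get_s3_client(self):
        # Stands in for boto3.client(...) plus the TLS/endpoint checks.
        self.clients_created += 1
        return object()

    def _download_file(self, path):
        # A fresh client per file, as in MLflow 1.18.
        self._get_s3_client()

repo = FakeRepo()
for i in range(100):
    repo._download_file(f"model/artifacts/file{i}")

assert repo.clients_created == 100  # 100 files -> 100 client constructions
```

With any of the caching fixes proposed below, the same loop would construct one client (or one per cache window) instead of one hundred.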
Code to reproduce issue
Any code which invokes _download_file, such as using mlflow.pyfunc.load_model
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
What component(s), interfaces, languages, and integrations does this bug affect?
Components
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
Interface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
Language
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
Integrations
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations
Other info
To debug this, I added a print statement to both the _download_file and _get_s3_client methods in S3ArtifactRepository.
Here is a snippet:
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/MLmodel
Verifying new client
Downloaded URI model/artifacts/model11/conda.yaml
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/custom_objects.cloudpickle
Verifying new client
Downloaded URI model/artifacts/model11/data/keras_module.txt
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/model/saved_model.pb
With a simple lru_cache added to the _get_s3_client, the output became the following (snippet):
Verifying new client
Downloaded URI model/MLmodel
Downloaded URI model/artifacts/model0/MLmodel
Downloaded URI model/artifacts/model0/conda.yaml
Downloaded URI model/artifacts/model0/data/custom_objects.cloudpickle
Downloaded URI model/artifacts/model0/data/keras_module.txt
Downloaded URI model/artifacts/model0/data/model/saved_model.pb
Downloaded URI model/artifacts/model0/data/model/variables/variables.data-00000-of-00001
Downloaded URI model/artifacts/model0/data/model/variables/variables.index
Downloaded URI model/artifacts/model0/data/save_format.txt
It was also approximately three times faster to download the resources when the repeated verification was removed.
I can see a few potential solutions:
Adding something like lru_cache to the S3ArtifactRepository._get_s3_client method
Lazy client loading, i.e. saving the client onto self so that _get_s3_client is only invoked when the client is None. However, the class seems intended to be effectively stateless, so this might not be desired
Instead of one client per file, we could create one client per tree, but this would require changes to the parent ArtifactRepository.download_artifacts, which is probably not desired
Creating a cache specifically around the boto3.client invocation, such that line 60 is separated out into a separate cacheable function:

    from functools import lru_cache

    def _get_s3_client(self):
        s3_endpoint_url = os.environ.get("MLFLOW_S3_ENDPOINT_URL")
        ignore_tls = os.environ.get("MLFLOW_S3_IGNORE_TLS")

        do_verify = True
        if ignore_tls:
            do_verify = ignore_tls.lower() not in ["true", "yes", "1"]

        # The valid verify argument value is None/False/path to cert bundle file. See
        # https://github.com/boto/boto3/blob/73865126cad3938ca80a2f567a1c79cb248169a7/
        # boto3/session.py#L212
        verify = None if do_verify else False

        # NOTE: If you need to specify this env variable, please file an issue at
        # https://github.com/mlflow/mlflow/issues so we know your use-case!
        signature_version = os.environ.get("MLFLOW_EXPERIMENTAL_S3_SIGNATURE_VERSION", "s3v4")
        return self._get_boto3(signature_version, s3_endpoint_url, verify)

    # staticmethod so that lru_cache keys only on the three arguments,
    # not on the repository instance
    @staticmethod
    @lru_cache()
    def _get_boto3(signature_version, s3_endpoint_url, verify):
        import boto3
        from botocore.client import Config

        return boto3.client(
            "s3",
            config=Config(signature_version=signature_version),
            endpoint_url=s3_endpoint_url,
            verify=verify,
        )
Add a time-expiring cache on top of the above solution, such that a new client is generated every minute or so. There's from cachetools import TTLCache, but this may add an unwanted dependency, so we could instead add an extra time-based hash input to _get_boto3. If something like time.time() / some_num is used as an input, it wouldn't guarantee any single TTL expiry, but it would yield at most 2 client verifications in any some_num time range, which is still a large performance gain given that the use case is sequentially downloading one or more models that contain many files.
Happy to implement any of these if a dev can provide guidance on the desired solution.
Hi @Samreay , thank you for raising this issue and providing suggestions. I think that suggestion #4 sounds great, provided that a boto3 client object does not expire after some time. If there's a risk of client expiration, something in the vein of proposal #5 sounds good, though I'd hesitate to bring in a new dependency for this purpose. Let me know what you think with regards to client expiration risk from an instance-level LRU cache.
Hey @dbczumar, I've opened #4695. I'm unsure how we should properly test this and what depth of testing is warranted (as there are no client-facing changes or public methods).