
[BUG] S3 Artifact Downloading is Slow due to Multiple Client Verifications #4668

Closed
2 of 23 tasks
Samreay opened this issue Aug 6, 2021 · 4 comments
Labels
area/artifacts Artifact stores and artifact logging bug Something isn't working

Comments

@Samreay
Contributor

Samreay commented Aug 6, 2021


Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS, Linux
  • MLflow installed from (source or binary): pip installed
  • MLflow version (run mlflow --version): 1.18.0 (applies to master)
  • Python version: 3.8.5
  • npm version, if running the dev UI:
  • Exact command to reproduce:

Describe the problem

MLflow is very slow when downloading complex models that have many artifacts. Investigation shows that this is because every single file downloaded via S3ArtifactRepository._download_file generates a brand new client using _get_s3_client, which includes verification.

Having verification is good, but if I am downloading 100 files, making 100 clients and verifying the same details 100 times seems like a bug.

Code to reproduce issue

Any code which invokes _download_file, such as using mlflow.pyfunc.load_model

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Other info

To debug this, I added print statements to both the _download_file and _get_s3_client methods in S3ArtifactRepository.

Here is a snippet:

Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/MLmodel
Verifying new client
Downloaded URI model/artifacts/model11/conda.yaml
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/custom_objects.cloudpickle
Verifying new client
Downloaded URI model/artifacts/model11/data/keras_module.txt
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/model/saved_model.pb

With a simple lru_cache added to the _get_s3_client, the output became the following (snippet):

Verifying new client
Downloaded URI model/MLmodel
Downloaded URI model/artifacts/model0/MLmodel
Downloaded URI model/artifacts/model0/conda.yaml
Downloaded URI model/artifacts/model0/data/custom_objects.cloudpickle
Downloaded URI model/artifacts/model0/data/keras_module.txt
Downloaded URI model/artifacts/model0/data/model/saved_model.pb
Downloaded URI model/artifacts/model0/data/model/variables/variables.data-00000-of-00001
Downloaded URI model/artifacts/model0/data/model/variables/variables.index
Downloaded URI model/artifacts/model0/data/save_format.txt

With the repeated verification removed, downloading the resources was also approximately three times faster.

I can see a few potential solutions:

  1. Adding something like lru_cache to the S3ArtifactRepository._get_s3_client method
  2. Lazy client loading, i.e. saving the client onto self so that _get_s3_client only builds a client when none exists yet. However, the class seems meant to be effectively stateless, so this might not be desired
  3. Instead of one client per file, making it one client per tree. This would require changes to the parent ArtifactRepository.download_artifacts, which is probably not desired
  4. Creating a cache specifically around the boto3.client invocation, such that line 60 is separated out into a separate cacheable function:
    def _get_s3_client(self):
        s3_endpoint_url = os.environ.get("MLFLOW_S3_ENDPOINT_URL")
        ignore_tls = os.environ.get("MLFLOW_S3_IGNORE_TLS")

        do_verify = True
        if ignore_tls:
            do_verify = ignore_tls.lower() not in ["true", "yes", "1"]

        # The valid verify argument value is None/False/path to cert bundle file, See
        # https://github.com/boto/boto3/blob/73865126cad3938ca80a2f567a1c79cb248169a7/
        # boto3/session.py#L212
        verify = None if do_verify else False

        # NOTE: If you need to specify this env variable, please file an issue at
        # https://github.com/mlflow/mlflow/issues so we know your use-case!
        signature_version = os.environ.get("MLFLOW_EXPERIMENTAL_S3_SIGNATURE_VERSION", "s3v4")

        return self._get_boto3(signature_version, s3_endpoint_url, verify)

    # A static method (requires `from functools import lru_cache` at module
    # level), so the cache is keyed on the arguments alone rather than on
    # the repository instance.
    @staticmethod
    @lru_cache()
    def _get_boto3(signature_version, s3_endpoint_url, verify):
        import boto3
        from botocore.client import Config

        return boto3.client(
            "s3",
            config=Config(signature_version=signature_version),
            endpoint_url=s3_endpoint_url,
            verify=verify,
        )
  5. Adding a time-expiring cache on top of solution 4, so that a new client is generated every minute or similar. There's from cachetools import TTLCache, but that may add an unwanted dependency, so we could instead add an extra time-based hash input to _get_boto3. If something like time.time() / some_num is used as an input, it wouldn't guarantee any one TTL expiry, but it would allow at most two client verifications in any some_num time range. That is still a large performance gain, given that the use case is sequentially downloading one or more models containing many files.
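Proposal 5 can be sketched by adding a coarse time bucket to the cache key (a minimal sketch; _TTL_SECONDS, _cached_client, and get_client are illustrative names, and a plain dict stands in for the boto3 client so the snippet runs without boto3):

```python
import time
from functools import lru_cache

# Cached clients are reused for roughly this long before a fresh one is built.
_TTL_SECONDS = 300

@lru_cache(maxsize=64)
def _cached_client(signature_version, endpoint_url, verify, time_bucket):
    # In MLflow this would call boto3.client("s3", ...); a dict stands in here.
    return {"signature_version": signature_version,
            "endpoint_url": endpoint_url,
            "verify": verify}

def get_client(signature_version="s3v4", endpoint_url=None, verify=None):
    # The bucket value is constant for ~_TTL_SECONDS, so at most two distinct
    # clients are verified in any TTL window (one per bucket boundary crossed).
    bucket = int(time.time() // _TTL_SECONDS)
    return _cached_client(signature_version, endpoint_url, verify, bucket)
```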

Happy to implement any of these if a dev can provide guidance on the desired solution.

@Samreay Samreay added the bug Something isn't working label Aug 6, 2021
@github-actions github-actions bot added the area/artifacts Artifact stores and artifact logging label Aug 6, 2021
@dbczumar
Collaborator

dbczumar commented Aug 11, 2021

Hi @Samreay, thank you for raising this issue and providing suggestions. I think that suggestion #4 sounds great, provided that a boto3 client object does not expire after some time. If there's a risk of client expiration, something in the vein of proposal #5 sounds good, though I'd hesitate to bring in a new dependency for this purpose. Let me know what you think with regard to client expiration risk from an instance-level LRU cache.

@Samreay
Contributor Author

Samreay commented Aug 11, 2021

Fantastic, I'll code up option #4 and submit a PR for this tomorrow, after checking the boto3 docs for expiration.

@Samreay
Contributor Author

Samreay commented Aug 12, 2021

Hey @dbczumar, I've opened #4695. I'm unsure how we should properly test this, and what depth of testing is warranted (as there are no client-facing changes or public methods).
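One lightweight testing angle would be to assert on the cache's identity semantics directly (a hypothetical pytest sketch; the lru_cache-wrapped factory mirrors the proposal above, not MLflow's final implementation):

```python
from functools import lru_cache

@lru_cache()
def _get_boto3(signature_version, s3_endpoint_url, verify):
    # Stand-in for boto3.client("s3", ...); identity is what the tests check.
    return object()

def test_client_is_reused_for_identical_args():
    # Same arguments must yield the very same cached client object.
    first = _get_boto3("s3v4", None, None)
    second = _get_boto3("s3v4", None, None)
    assert first is second

def test_distinct_args_get_distinct_clients():
    # Different endpoint URLs must not share a client.
    a = _get_boto3("s3v4", None, None)
    b = _get_boto3("s3v4", "http://localhost:9000", None)
    assert a is not b
```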

@harupy
Member

harupy commented Dec 2, 2021

Addressed by #4695

@harupy harupy closed this as completed Dec 2, 2021