
[BUG] S3 Artifact Downloading is Slow due to Multiple Client Verifications #4668

Closed
2 of 23 tasks
Samreay opened this issue Aug 6, 2021 · 4 comments
Labels
area/artifacts Artifact stores and artifact logging bug Something isn't working

Comments

@Samreay
Contributor

Samreay commented Aug 6, 2021


Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS, Linux
  • MLflow installed from (source or binary): pip installed
  • MLflow version (run mlflow --version): 1.18.0 (applies to master)
  • Python version: 3.8.5
  • npm version, if running the dev UI:
  • Exact command to reproduce:

Describe the problem

MLflow is very slow when downloading complex models that have many artifacts. Investigation shows that this is because every single file downloaded via S3ArtifactRepository._download_file generates a brand new client using _get_s3_client, which includes verification.

Having verification is good, but if I am downloading 100 files, making 100 clients and verifying the same details 100 times seems like a bug.

Code to reproduce issue

Any code which invokes _download_file, such as using mlflow.pyfunc.load_model

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Other info

To debug this, I added print statements to both the _download_file and _get_s3_client methods in S3ArtifactRepository.

Here is a snippet:

Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/MLmodel
Verifying new client
Downloaded URI model/artifacts/model11/conda.yaml
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/custom_objects.cloudpickle
Verifying new client
Downloaded URI model/artifacts/model11/data/keras_module.txt
Verifying new client
Verifying new client
Downloaded URI model/artifacts/model11/data/model/saved_model.pb

With a simple lru_cache added to the _get_s3_client, the output became the following (snippet):

Verifying new client
Downloaded URI model/MLmodel
Downloaded URI model/artifacts/model0/MLmodel
Downloaded URI model/artifacts/model0/conda.yaml
Downloaded URI model/artifacts/model0/data/custom_objects.cloudpickle
Downloaded URI model/artifacts/model0/data/keras_module.txt
Downloaded URI model/artifacts/model0/data/model/saved_model.pb
Downloaded URI model/artifacts/model0/data/model/variables/variables.data-00000-of-00001
Downloaded URI model/artifacts/model0/data/model/variables/variables.index
Downloaded URI model/artifacts/model0/data/save_format.txt

With the repeated verification removed, downloading the resources was also approximately three times faster.

I can see a few potential solutions:

  1. Adding something like lru_cache to the S3ArtifactRepository._get_s3_client method
  2. Lazy client loading, i.e. saving the client onto self so that _get_s3_client only builds a client when none exists yet. However, the class seems meant to be effectively stateless, so this might not be desired
  3. Instead of one client per file, making it one client per tree. This would require changes to the parent ArtifactRepository.download_artifacts, which is probably not desired
  4. Creating a cache specifically around the boto3.client invocation, such that line 60 is separated out into a separate cacheable function:
    def _get_s3_client(self):
        s3_endpoint_url = os.environ.get("MLFLOW_S3_ENDPOINT_URL")
        ignore_tls = os.environ.get("MLFLOW_S3_IGNORE_TLS")

        do_verify = True
        if ignore_tls:
            do_verify = ignore_tls.lower() not in ["true", "yes", "1"]

        # The valid verify argument value is None/False/path to cert bundle file, See
        # https://github.com/boto/boto3/blob/73865126cad3938ca80a2f567a1c79cb248169a7/
        # boto3/session.py#L212
        verify = None if do_verify else False

        # NOTE: If you need to specify this env variable, please file an issue at
        # https://github.com/mlflow/mlflow/issues so we know your use-case!
        signature_version = os.environ.get("MLFLOW_EXPERIMENTAL_S3_SIGNATURE_VERSION", "s3v4")

        return self._get_boto3(signature_version, s3_endpoint_url, verify)

    # A static method (requires `from functools import lru_cache` at module
    # level), so the cache is keyed on the arguments alone rather than on
    # the repository instance.
    @staticmethod
    @lru_cache()
    def _get_boto3(signature_version, s3_endpoint_url, verify):
        import boto3
        from botocore.client import Config

        return boto3.client(
            "s3",
            config=Config(signature_version=signature_version),
            endpoint_url=s3_endpoint_url,
            verify=verify,
        )
  5. Adding a time-expiring cache on top of solution 4, so that a new client is generated every minute or similar. There's from cachetools import TTLCache, but that may add an unwanted dependency, so we could instead add an extra time-based hash input to _get_boto3. If something like time.time() / some_num is used as an input, it wouldn't guarantee any one TTL expiry, but it would allow at most two client verifications in any some_num time range. That is still a large performance gain, given that the use case is sequentially downloading one or more models containing many files.
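Proposal 5 can be sketched by adding a coarse time bucket to the cache key (a minimal sketch; _TTL_SECONDS, _cached_client, and get_client are illustrative names, and a plain dict stands in for the boto3 client so the snippet runs without boto3):

```python
import time
from functools import lru_cache

# Cached clients are reused for roughly this long before a fresh one is built.
_TTL_SECONDS = 300

@lru_cache(maxsize=64)
def _cached_client(signature_version, endpoint_url, verify, time_bucket):
    # In MLflow this would call boto3.client("s3", ...); a dict stands in here.
    return {"signature_version": signature_version,
            "endpoint_url": endpoint_url,
            "verify": verify}

def get_client(signature_version="s3v4", endpoint_url=None, verify=None):
    # The bucket value is constant for ~_TTL_SECONDS, so at most two distinct
    # clients are verified in any TTL window (one per bucket boundary crossed).
    bucket = int(time.time() // _TTL_SECONDS)
    return _cached_client(signature_version, endpoint_url, verify, bucket)
```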

Happy to implement any of these if a dev can provide guidance on the desired solution.

@Samreay Samreay added the bug Something isn't working label Aug 6, 2021
@github-actions github-actions bot added the area/artifacts Artifact stores and artifact logging label Aug 6, 2021
@dbczumar
Collaborator

dbczumar commented Aug 11, 2021

Hi @Samreay, thank you for raising this issue and providing suggestions. I think that suggestion #4 sounds great, provided that a boto3 client object does not expire after some time. If there's a risk of client expiration, something in the vein of proposal #5 sounds good, though I'd hesitate to bring in a new dependency for this purpose. Let me know what you think with regard to client expiration risk from an instance-level LRU cache.

@Samreay
Contributor Author

Samreay commented Aug 11, 2021

Fantastic, I'll code up option #4 and submit a PR for this tomorrow, after checking the boto3 docs for expiration.

@Samreay
Contributor Author

Samreay commented Aug 12, 2021

Hey @dbczumar, I've opened #4695. I'm unsure how we should properly test this, and what depth of testing is warranted (as there are no client-facing changes or public methods).
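One lightweight testing angle would be to assert on the cache's identity semantics directly (a hypothetical pytest sketch; the lru_cache-wrapped factory mirrors the proposal above, not MLflow's final implementation):

```python
from functools import lru_cache

@lru_cache()
def _get_boto3(signature_version, s3_endpoint_url, verify):
    # Stand-in for boto3.client("s3", ...); identity is what the tests check.
    return object()

def test_client_is_reused_for_identical_args():
    # Same arguments must yield the very same cached client object.
    first = _get_boto3("s3v4", None, None)
    second = _get_boto3("s3v4", None, None)
    assert first is second

def test_distinct_args_get_distinct_clients():
    # Different endpoint URLs must not share a client.
    a = _get_boto3("s3v4", None, None)
    b = _get_boto3("s3v4", "http://localhost:9000", None)
    assert a is not b
```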

@harupy
Member

harupy commented Dec 2, 2021

Addressed by #4695

@harupy harupy closed this as completed Dec 2, 2021