[ML-12132] Fix persistence of Spark models on passthrough-enabled environments #3443


Merged

smurching merged 23 commits into mlflow:master on Jan 4, 2021

Conversation

Collaborator

@smurching smurching commented Sep 20, 2020

What changes are proposed in this pull request?

Fixes the mlflow.spark.log_model and mlflow.spark.load_model APIs on passthrough-enabled environments when used against ACL'd artifact locations.

Previously, we had the following behavior:

  • mlflow.spark.save_model: saves the Spark model to a temporary directory on Spark's default distributed FS (e.g. DBFS on Databricks), then copies the persisted Spark model directory to the desired local path using HDFS APIs called via Py4J. The latter step fails in passthrough environments because the HDFS APIs aren't allowlisted.
  • mlflow.spark.log_model: attempts to write directly to the destination directory, which fails for ACL'd artifact locations. It then falls back to calling save_model & log_artifacts, which fails as described above.
  • mlflow.spark.load_model: attempts to load the model directly from the artifact location via Spark. If this fails, it downloads the model files locally, uploads them to Spark's distributed FS via HDFS APIs (which fails due to the lack of allowlisting), and then loads the Spark model from the distributed FS.

This PR makes the following changes when running on Databricks, for both passthrough-enabled and non-passthrough environments:

  • mlflow.spark.save_model: copies model files to the local filesystem using the DBFS FUSE mount instead of denylisted HDFS APIs. This also fixes log_model.
  • mlflow.spark.load_model: uploads locally-downloaded model files to Spark's distributed FS via the DBFS FUSE mount instead of denylisted HDFS APIs (the DBFS-to-FUSE path mapping both changes rely on is sketched below).
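
For context, a minimal illustration of that DBFS-to-FUSE mapping (the paths below are hypothetical examples, not code from this PR): on a Databricks cluster, the FUSE mount exposes dbfs:/ locations on the local filesystem, so plain Python file operations can replace the denylisted JVM-side HDFS copy APIs.

    import shutil

    # "dbfs:/tmp/my-model" is visible locally at "/dbfs/tmp/my-model" via the FUSE mount,
    # so a persisted Spark model can be copied to a local path without any HDFS APIs:
    shutil.copytree("/dbfs/tmp/my-model", "/local/path/my-model")  # hypothetical paths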

How is this patch tested?

Manually

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes


@edgan8 edgan8 changed the title [WIP] Fix persistence of Spark models on passthrough-enabled environments [WIP] [ML-12132] Fix persistence of Spark models on passthrough-enabled environments Dec 17, 2020
@edgan8 edgan8 changed the title [WIP] [ML-12132] Fix persistence of Spark models on passthrough-enabled environments [ML-12132] Fix persistence of Spark models on passthrough-enabled environments Dec 17, 2020
mlflow/spark.py Outdated
_HadoopFileSystem.copy_to_local_file(tmp_path, sparkml_data_path, remove_src=True)
# If spark DFS is DBFS and we're running on a Databricks cluster, copy to local FS
# via the FUSE mount
is_saving_to_dbfs = is_valid_dbfs_uri(tmp_path) or posixpath.abspath(tmp_path) == tmp_path
Contributor

Why do we need the abspath check?

Collaborator Author

Good question: it's because persisting a Spark model to a path like "/my-directory" actually saves to "dbfs:/my-directory".

I'll add this explanation/clarifying comment to the code, lmk if it makes sense:

    # If we're running on a Databricks cluster, we can detect whether the tempdir we're saving our
    # Spark model to is on DBFS, by checking if it either starts with dbfs:/
    # e.g. "dbfs:/my-directory" or is an absolute path like "/my-directory" (in which case Spark
    # saves to the default distributed FS => DBFS). If we're using a DBFS
    # tempdir, copy to local FS via the FUSE mount. Otherwise, use HDFS APIs to copy the model to
    # the local FS.



def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
# Convert posixpaths (e.g. "/tmp/mlflow") to DBFS URIs by adding "dbfs:/" as a prefix
Contributor

Can you add a docstring to this method so that it's more obvious that this comment just applies to the immediate if statement below and not to the whole method?
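
For illustration, a hedged sketch of how the method could carry a docstring so the inline comment clearly scopes to the if statement below it (the function name and the check come from the diff context; the body and the stand-in helper are assumptions, not the final code):

    import posixpath

    def _is_valid_dbfs_uri(uri):
        # Stand-in for the is_valid_dbfs_uri helper used in the diff, for illustration only
        return uri.startswith("dbfs:/")

    def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
        """
        Convert a DBFS URI (e.g. "dbfs:/tmp/mlflow") or an absolute DBFS path
        (e.g. "/tmp/mlflow") into the corresponding local FUSE path
        (e.g. "/dbfs/tmp/mlflow").
        """
        # Convert posixpath abspaths (e.g. "/tmp/mlflow") to DBFS URIs by
        # prefixing "dbfs:"; this comment applies only to the check below
        if not _is_valid_dbfs_uri(dbfs_uri) and dbfs_uri == posixpath.abspath(dbfs_uri):
            dbfs_uri = "dbfs:" + dbfs_uri
        if not _is_valid_dbfs_uri(dbfs_uri):
            raise ValueError("Expected a DBFS URI or absolute DBFS path, got %r" % dbfs_uri)
        return "/dbfs/" + dbfs_uri[len("dbfs:/"):].lstrip("/")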


def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
# Convert posixpaths (e.g. "/tmp/mlflow") to DBFS URIs by adding "dbfs:/" as a prefix
if not is_valid_dbfs_uri(dbfs_uri) and dbfs_uri == posixpath.abspath(dbfs_uri):
Contributor

If dbfs_uri is a posixpath abspath, there's no way it can be a dbfs:/ URI, right?

is_saving_to_dbfs = is_valid_dbfs_uri(tmp_path) or posixpath.abspath(tmp_path) == tmp_path
if is_saving_to_dbfs and databricks_utils.is_in_cluster():
tmp_path_fuse = dbfs_hdfs_uri_to_fuse_path(tmp_path)
shutil.move(src=tmp_path_fuse, dst=sparkml_data_path)
Contributor

When does the file get written to tmp_path_fuse so we can move it? Are we assuming that the FUSE mount has been synced with HDFS? What if the sync hasn't finished yet?

Collaborator Author

@smurching smurching Dec 18, 2020

In this save_model method, the goal is to save the SparkML model to a local filesystem path

The flow here is:

  • Save SparkML model to a temporary dbfs location (dbfs:/tmp/my-model)
  • Move the persisted Spark model to the local filesystem using FUSE and a file move (i.e. move /dbfs/tmp/my-model to /my/desired/local/path)

We leverage FUSE to copy from DBFS to the local filesystem because there's no allowlisted JVM-side HDFS API for copying from dbfs:/... to the local filesystem; a short sketch of the flow follows below.
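
A minimal sketch of that flow (variable names follow the diff context; the paths are hypothetical):

    import shutil

    dfs_tmpdir = "dbfs:/tmp/my-model"                        # temporary DBFS location
    spark_model.save(dfs_tmpdir)                             # Spark persists the model to DBFS
    tmp_path_fuse = dbfs_hdfs_uri_to_fuse_path(dfs_tmpdir)   # -> "/dbfs/tmp/my-model"
    shutil.move(src=tmp_path_fuse, dst=sparkml_data_path)    # plain file move to the local path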

Collaborator Author

The FUSE access has the same consistency properties as standard HDFS-style access to the same DBFS path. That is, writing to dbfs:/tmp/my-model and reading /dbfs/tmp/my-model is equivalent from a blob-storage-eventual-consistency perspective to writing to dbfs:/tmp/my-model and subsequently reading dbfs:/tmp/my-model

fuse_dfs_tmpdir = dbfs_hdfs_uri_to_fuse_path(dfs_tmpdir)
os.mkdir(fuse_dfs_tmpdir)
# Workaround for inability to use shutil.copytree with DBFS FUSE due to permission-denied
# errors on passthrough-enabled clusters when attempting to copy permission bits for directories
Contributor

Can we factor out this copytree functionality into a more generic helper function?
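
For illustration, a hedged sketch of what such a generic helper could look like (the function name is hypothetical and the final implementation may differ): it creates directories and copies file contents only, skipping the permission-bit copying that triggers permission-denied errors on DBFS FUSE mounts.

    import os
    import shutil

    def copytree_without_file_permissions(src_dir, dst_dir):
        # Hypothetical helper: copy a directory tree by creating directories and
        # copying file contents only, without copying permission bits
        os.makedirs(dst_dir, exist_ok=True)
        for dirpath, dirnames, filenames in os.walk(src_dir):
            for dirname in dirnames:
                relative_dir_path = os.path.relpath(os.path.join(dirpath, dirname), src_dir)
                # Create the corresponding destination directory (no mode bits copied)
                os.mkdir(os.path.join(dst_dir, relative_dir_path))
            for filename in filenames:
                relative_file_path = os.path.relpath(os.path.join(dirpath, filename), src_dir)
                # copyfile copies contents only; copy2/copystat would also copy permissions
                shutil.copyfile(os.path.join(dirpath, filename),
                                os.path.join(dst_dir, relative_file_path))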

mlflow/spark.py Outdated
relative_dir_path = os.path.relpath(os.path.join(dirpath, dirname), src_dir)
# Compute corresponding FUSE path of each local directory and create an equivalent
# FUSE directory
fuse_dir_path = os.path.join(dst_dir, relative_dir_path)
Contributor

This is no longer specific to FUSE; it's just "absolute_dir_path", right?

Collaborator Author

👍 Yep, that's right! I also added a unit test for this method because it's somewhat complex, even though it's an internal helper used only in the mlflow.spark module.
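
For illustration, a hedged sketch of what such a unit test could look like (pytest-style; it exercises the generic copy helper sketched earlier in this thread, and all names here are assumptions):

    def test_copytree_without_file_permissions(tmp_path):
        # tmp_path is pytest's built-in temporary-directory fixture
        src = tmp_path / "src"
        dst = tmp_path / "dst"
        (src / "nested").mkdir(parents=True)
        (src / "top.txt").write_text("a")
        (src / "nested" / "inner.txt").write_text("b")
        copytree_without_file_permissions(str(src), str(dst))
        assert (dst / "top.txt").read_text() == "a"
        assert (dst / "nested" / "inner.txt").read_text() == "b"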

try:
spark_model.save(os.path.join(model_dir, _SPARK_MODEL_PATH_SUB))
spark_model.save(posixpath.join(model_dir, _SPARK_MODEL_PATH_SUB))
Collaborator Author

Note: this should be a posixpath join since HDFS URIs are posixpath-style (e.g. we don't want to include backslashes \ here on platforms like Windows)
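
A quick illustration of the difference (ntpath is the Windows implementation of os.path; the paths are hypothetical):

    import ntpath      # what os.path resolves to on Windows
    import posixpath

    ntpath.join("dbfs:/tmp/model", "sparkml")     # 'dbfs:/tmp/model\\sparkml' -- backslash separator, not a valid HDFS URI
    posixpath.join("dbfs:/tmp/model", "sparkml")  # 'dbfs:/tmp/model/sparkml'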

Contributor

@edgan8 edgan8 left a comment

Nice refactoring! Can you update the PR description to clarify that this changes DBFS access from HDFS to FUSE for all clusters (not just the passthrough-enabled ones that motivated this)?

@smurching smurching merged commit f3af81f into mlflow:master Jan 4, 2021
@smurching smurching added the rn/bug-fix Mention under Bug Fixes in Changelogs. label Jan 4, 2021
harupy pushed a commit to chauhang/mlflow that referenced this pull request Apr 8, 2021
…ironments (mlflow#3443)

Fix persistence of Spark models using mlflow.spark.log/load_model on passthrough-enabled Databricks clusters

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
harupy pushed a commit to wamartin-aml/mlflow that referenced this pull request Jun 7, 2021
…ironments (mlflow#3443)

Fix persistence of Spark models using mlflow.spark.log/load_model on passthrough-enabled Databricks clusters

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Labels
rn/bug-fix Mention under Bug Fixes in Changelogs.
2 participants