[ML-12132] Fix persistence of Spark models on passthrough-enabled environments #3443
Conversation
…tifact ACL-protected DBFS locations Signed-off-by: Sid Murching <sid.murching@databricks.com>
mlflow/spark.py
Outdated
_HadoopFileSystem.copy_to_local_file(tmp_path, sparkml_data_path, remove_src=True)
# If spark DFS is DBFS and we're running on a Databricks cluster, copy to local FS
# via the FUSE mount
is_saving_to_dbfs = is_valid_dbfs_uri(tmp_path) or posixpath.abspath(tmp_path) == tmp_path
Why do we need the abspath check?
Good Q, it's because persisting a spark model to a path like "/my-directory" actually saves to "dbfs:/my-directory".
I'll add this explanation/clarifying comment to the code, lmk if it makes sense:
# If we're running on a Databricks cluster, we can detect whether the tempdir we're saving our
# Spark model to is on DBFS, by checking if it either starts with dbfs:/
# e.g. "dbfs:/my-directory" or is an absolute path like "/my-directory" (in which case Spark
# saves to the default distributed FS => DBFS). If we're using a DBFS
# tempdir, copy to local FS via the FUSE mount. Otherwise, use HDFS APIs to copy the model to
# the local FS.
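To make the two cases concrete, here's a small illustrative check (the tmp_path values are made up; it assumes is_valid_dbfs_uri is importable from mlflow/utils/uri.py, as in the diff below):

```python
import posixpath
from mlflow.utils.uri import is_valid_dbfs_uri  # assumed import path, per the diff below

# Illustrative tmp_path values: a DBFS URI, an absolute path (resolved against the
# default distributed FS, i.e. DBFS on Databricks), and a relative local path.
for tmp_path in ["dbfs:/my-directory", "/my-directory", "relative/dir"]:
    is_saving_to_dbfs = is_valid_dbfs_uri(tmp_path) or posixpath.abspath(tmp_path) == tmp_path
    print(tmp_path, "->", is_saving_to_dbfs)
# Expected: True (DBFS URI), True (absolute path => default DFS => DBFS), False (relative path)
```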
mlflow/utils/uri.py
Outdated
def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
    # Convert posixpaths (e.g. "/tmp/mlflow") to DBFS URIs by adding "dbfs:/" as a prefix
Can you add a docstring to this method so that it's more obvious that this comment just applies to the immediate if statement below and not to the whole method?
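Something like the following could work; this is a hypothetical sketch of the docstring and body (the FUSE translation shown is assumed, not copied from the PR, and is_valid_dbfs_uri is taken to be defined in the same mlflow/utils/uri.py module):

```python
import posixpath

def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
    """Convert a DBFS URI, or an absolute path resolved against DBFS, to its local FUSE path.

    Illustrative examples:
        "dbfs:/tmp/mlflow" -> "/dbfs/tmp/mlflow"
        "/tmp/mlflow"      -> "/dbfs/tmp/mlflow"

    The comment on the `if` below applies only to that normalization step, not to the
    whole function.
    """
    # Convert posixpaths (e.g. "/tmp/mlflow") to DBFS URIs by adding "dbfs:/" as a prefix
    if not is_valid_dbfs_uri(dbfs_uri) and dbfs_uri == posixpath.abspath(dbfs_uri):
        dbfs_uri = "dbfs:" + dbfs_uri
    # Strip the "dbfs:" scheme and prepend the local FUSE mount point "/dbfs"
    return "/dbfs" + dbfs_uri[len("dbfs:"):]
```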
def dbfs_hdfs_uri_to_fuse_path(dbfs_uri):
    # Convert posixpaths (e.g. "/tmp/mlflow") to DBFS URIs by adding "dbfs:/" as a prefix
    if not is_valid_dbfs_uri(dbfs_uri) and dbfs_uri == posixpath.abspath(dbfs_uri):
if dbfs_uri is a posixpath abspath there's no way it can be a dbfs:/ uri right?
is_saving_to_dbfs = is_valid_dbfs_uri(tmp_path) or posixpath.abspath(tmp_path) == tmp_path
if is_saving_to_dbfs and databricks_utils.is_in_cluster():
    tmp_path_fuse = dbfs_hdfs_uri_to_fuse_path(tmp_path)
    shutil.move(src=tmp_path_fuse, dst=sparkml_data_path)
When does the file get written to tmp_path_fuse so we can move it? Are we assuming that the FUSE mount has been synced with HDFS? What if the sync hasn't finished yet?
In this save_model method, the goal is to save the SparkML model to a local filesystem path.
The flow here is:
- Save the SparkML model to a temporary DBFS location (dbfs:/tmp/my-model)
- Move that persisted Spark model to the local filesystem using FUSE and a file move (move /dbfs/tmp/my-model to the desired local path /my/desired/local/path); see the sketch below

We leverage FUSE to copy from DBFS -> local as there's no allowlisted JVM-side HDFS API for copying from dbfs:/... to the local filesystem.
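A condensed sketch of that flow (the helper name and the temp path are illustrative, not the PR's exact code; dbfs_hdfs_uri_to_fuse_path is the utility from the diff above):

```python
import shutil

def _save_sparkml_model_via_fuse(spark_model, sparkml_data_path, tmp_path="dbfs:/tmp/my-model"):
    # 1. Persist the SparkML model to a temporary DBFS location; Spark writes to the
    #    cluster's default distributed filesystem (DBFS on Databricks).
    spark_model.save(tmp_path)
    # 2. Move the persisted model to the desired local path through the /dbfs FUSE mount,
    #    avoiding the JVM-side HDFS copy APIs that aren't allowlisted on passthrough clusters.
    tmp_path_fuse = dbfs_hdfs_uri_to_fuse_path(tmp_path)  # e.g. "/dbfs/tmp/my-model"
    shutil.move(src=tmp_path_fuse, dst=sparkml_data_path)
```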
The FUSE access has the same consistency properties as standard HDFS-style access to the same DBFS path. That is, writing to dbfs:/tmp/my-model and reading /dbfs/tmp/my-model is equivalent, from a blob-storage eventual-consistency perspective, to writing to dbfs:/tmp/my-model and subsequently reading dbfs:/tmp/my-model.
fuse_dfs_tmpdir = dbfs_hdfs_uri_to_fuse_path(dfs_tmpdir)
os.mkdir(fuse_dfs_tmpdir)
# Workaround for inability to use shutil.copytree with DBFS FUSE due to permission-denied
# errors on passthrough-enabled clusters when attempting to copy permission bits for directories
Can we factor out this copytree functionality into a more generic helper function?
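One possible shape for such a helper (hypothetical name and signature): it copies the directory structure and file contents only, deliberately skipping the permission-bit copying that makes shutil.copytree fail on passthrough-enabled FUSE mounts:

```python
import os
import shutil

def copy_tree_without_permissions(src_dir, dst_dir):
    """Recursively copy src_dir into dst_dir, copying file contents but not permission bits."""
    for dirpath, _dirnames, filenames in os.walk(src_dir):
        relative_dir = os.path.relpath(dirpath, src_dir)
        target_dir = dst_dir if relative_dir == "." else os.path.join(dst_dir, relative_dir)
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            # shutil.copyfile copies contents only (no chmod), unlike shutil.copy2 / copytree
            shutil.copyfile(os.path.join(dirpath, filename), os.path.join(target_dir, filename))
```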
mlflow/spark.py
Outdated
relative_dir_path = os.path.relpath(os.path.join(dirpath, dirname), src_dir)
# Compute corresponding FUSE path of each local directory and create an equivalent
# FUSE directory
fuse_dir_path = os.path.join(dst_dir, relative_dir_path)
This is no longer specific to fuse, it's just "absolute_dir_path" right?
👍 yep, that's right! I also added a unit test for this method because it's somewhat complex, even though it's an internal one used only in the mlflow.spark module.
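For illustration, a test along these lines could cover the copy helper using pytest's tmp_path fixture (hypothetical test; the helper name matches the sketch suggested earlier in this thread, not necessarily the PR's actual internal function):

```python
def test_copy_tree_without_permissions(tmp_path):
    # Build a small source tree with a nested directory and two files
    src = tmp_path / "src"
    (src / "nested").mkdir(parents=True)
    (src / "a.txt").write_text("a")
    (src / "nested" / "b.txt").write_text("b")

    dst = tmp_path / "dst"
    dst.mkdir()
    copy_tree_without_permissions(str(src), str(dst))

    # The directory structure and file contents should be reproduced at the destination
    assert (dst / "a.txt").read_text() == "a"
    assert (dst / "nested" / "b.txt").read_text() == "b"
```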
try:
-   spark_model.save(os.path.join(model_dir, _SPARK_MODEL_PATH_SUB))
+   spark_model.save(posixpath.join(model_dir, _SPARK_MODEL_PATH_SUB))
Note: this should be a posixpath join since HDFS URIs are posixpath-style (e.g. we don't want to include backslashes \ here on platforms like Windows).
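For example (illustrative values), the Windows path implementation would introduce a backslash while posixpath keeps forward slashes:

```python
import ntpath      # the Windows implementation behind os.path on Windows
import posixpath

print(ntpath.join("dbfs:/tmp/model", "sparkml"))     # dbfs:/tmp/model\sparkml  (invalid in an HDFS URI)
print(posixpath.join("dbfs:/tmp/model", "sparkml"))  # dbfs:/tmp/model/sparkml
```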
Nice refactoring! Can you update the PR description to clarify that this is changing dbfs access from HDFS to FUSE for all clusters (not just the passthrough ones that motivated this)?
…ironments (mlflow#3443) Fix persistence of Spark models using mlflow.spark.log/load_model on passthrough-enabled Databricks clusters Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
What changes are proposed in this pull request?
Fixes the mlflow.spark.log_model and mlflow.spark.load_model APIs on passthrough-enabled environments against ACL'd artifact locations.

Previously, we had the following behavior:
- mlflow.spark.save_model: saves the spark model to a temporary directory on Spark's default distributed FS (e.g. DBFS on Databricks), then copies the persisted spark model directory to the desired local path using HDFS APIs called via Py4j. The latter step fails in passthrough environments as HDFS APIs aren't allowlisted
- mlflow.spark.log_model: attempts to write directly to the destination directory, which fails for ACL'd artifact locations. Then, falls back to calling save_model & log_artifacts, which fails as described above
- mlflow.spark.load_model: attempts to load the model directly from the artifact location via Spark. If this fails, downloads model files locally, uploads them to Spark's distributed FS via HDFS APIs (fails due to lack of allowlisting), and then loads the Spark model from the distributed FS

This PR makes the following changes when running on Databricks, for both passthrough-enabled & non-passthrough envs:
- mlflow.spark.save_model: copy model files locally using DBFS FUSE instead of denylisted HDFS APIs. This also fixes log_model
- mlflow.spark.load_model: upload locally-downloaded model files to Spark's distributed FS via DBFS FUSE instead of denylisted HDFS APIs

How is this patch tested?
Manually
Release Notes
Is this a user-facing change?
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes