
[pyspark] make the model saved by pyspark compatible #8219

Merged
merged 9 commits into dmlc:master on Sep 20, 2022

Conversation

@wbo4958 (Contributor) commented Sep 2, 2022

Users can't directly load a model trained by PySpark using the xgboost Python package; it requires much effort to do that, see #8186. This PR first saves the model in JSON format and then writes it to a text file, so the user can easily load the model with:

import xgboost as xgb

bst = xgb.Booster()
YOUR_MODEL_PATH = "xxx"
bst.load_model(f"{YOUR_MODEL_PATH}/model/part-00000")
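For context, a minimal sketch of the save side that produces this layout, assuming the xgboost.spark estimator API (the training DataFrame and column names are illustrative):

from xgboost.spark import SparkXGBClassifier

# train a distributed model; `train_df` is an illustrative DataFrame
# with "features" (vector) and "label" columns
model = SparkXGBClassifier(features_col="features", label_col="label").fit(train_df)

# standard PySpark ML persistence; with this PR the booster is written
# to <path>/model/part-00000 as a single line of JSON text
model.write().overwrite().save("YOUR_MODEL_PATH")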

@wbo4958 wbo4958 changed the title [WIP][pyspark] make the model saved by pyspark compatible [pyspark] make the model saved by pyspark compatible Sep 5, 2022
@wbo4958 (Contributor Author) commented Sep 6, 2022

@trivialfis Could you help check a failed Python test?

[2022-09-05T12:59:12.999Z] =================================== FAILURES ===================================

[2022-09-05T12:59:12.999Z] ____________________________ test_gpu_data_iterator ____________________________

[2022-09-05T12:59:12.999Z] 

[2022-09-05T12:59:12.999Z] cls = <class '_pytest.runner.CallInfo'>

@wbo4958 (Contributor Author) commented Sep 8, 2022

@WeichenXu123 @trivialfis Could you help review this PR?

@trivialfis (Member):

Will look into it tomorrow.

@wbo4958 (Contributor Author) commented Sep 13, 2022

@WeichenXu123 @trivialfis Any feedback on this PR?

@wbo4958 (Contributor Author) commented Sep 14, 2022

Hi @WeichenXu123 @trivialfis, could you help review it?

@trivialfis (Member) commented Sep 14, 2022

Can we document the function get_booster(self) and let the user extract the booster? I think it's easier.
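A minimal sketch of that alternative, assuming the xgboost.spark estimator API (the DataFrame, column names, and path are illustrative):

from xgboost.spark import SparkXGBClassifier

model = SparkXGBClassifier(features_col="features", label_col="label").fit(train_df)

# extract the single-node Booster from the distributed model and save it
# in XGBoost's native format, loadable via xgb.Booster().load_model(...)
model.get_booster().save_model("/tmp/xgboost_model.json")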

@wbo4958 (Contributor Author) commented Sep 14, 2022

Yes, we can, but the issue would be the same one the JVM package previously encountered. Most users save the model the Spark way and may not want to do another get_booster().save_model() call. The model may then be moved to another machine (or to another team without any knowledge of Spark) where no Spark cluster is deployed, since users may just want to load the model with the Python package and do some prediction. In that case it's really inconvenient for users. This PR is not supposed to introduce any side effects, so I think it's OK to be merged.

@@ -21,34 +21,28 @@ def _get_or_create_tmp_dir():
     return xgb_tmp_dir


-def serialize_xgb_model(model):
+def dump_model_to_json_file(model) -> str:
Member:

Please use the term save. Dump has a specific meaning in XGBoost's code base.

Contributor Author:

Done

).write.parquet(model_save_path)
model_save_path = os.path.join(path, "model")
xgb_model_file = dump_model_to_json_file(xgb_model)
# The json file written by Spark based on `booster.save_raw("json").decode("utf-8")`
Member:

Why not?

@wbo4958 (Contributor Author), Sep 15, 2022:

There are some escaped quote sequences (\") in the JSON file which can't be loaded by XGBoost. Do you want to check further?

Member:

I will take a look tomorrow

Contributor Author:

@trivialfis No need anymore, I just found another way to do it.

xgb_model_file = save_model_to_json_file(xgb_model)
# The json file written by Spark based on `booster.save_raw("json").decode("utf-8")`
# can't be loaded by XGBoost directly.
_get_spark_session().read.text(xgb_model_file).write.text(model_save_path)
Contributor:

_get_spark_session().read.text(xgb_model_file)

This line is not correct: with spark.read.text(path), the path must be a distributed file system path that all Spark executors can access.

Contributor:

You can use a distributed FS API to copy the local file xgb_model_file into the model save path (a Hadoop FS path).

Contributor Author:

Wow, right, you're correct @WeichenXu123, good finding. Could you point me to what the "distributed FS API" is? Really appreciate it.

Contributor:

You can use this:
https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html

But this does not support DBFS (the Databricks filesystem), and we need to support the Databricks case as well. Databricks mounts dbfs:/xxx/xxx to the local file system at /dbfs/xxx/xxx.
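A minimal sketch of that suggestion using the linked pyarrow API (the namenode host, port, and paths are placeholders):

from pyarrow import fs

# connect to HDFS (placeholder host/port); requires a configured Hadoop client
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# copy the JSON file written locally by the driver into the model save path
fs.copy_files(
    "/tmp/xgb_model.json",            # local file produced on the driver
    "/user/me/model/xgb_model.json",  # destination on the distributed FS
    source_filesystem=fs.LocalFileSystem(),
    destination_filesystem=hdfs,
)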

Contributor:

The example code in the PR description

import xgboost as xgb
bst = xgb.Booster()

# Basically, YOUR_MODEL_PATH should be like "xxxx/model/xxx.txt"
YOUR_MODEL_PATH = "xxx"
bst.load_model(YOUR_MODEL_PATH)

seems like it does not work if the path is a distributed FS path?

Contributor Author:

@WeichenXu123, I use the RDD to save the text file, so it should work with all kinds of Hadoop-compatible FS.

@WeichenXu123 (Contributor):

Do we really need this PR? A user can load the pyspark model and then call pyspark_model.booster to get the raw booster model.

@wbo4958 (Contributor Author) commented Sep 15, 2022

Do we really need this PR? A user can load the pyspark model and then call pyspark_model.booster to get the raw booster model.

Yeah, consider the scenario: a data scientist who does not know Spark gets a model saved by xgboost-spark and wants to load it with the xgboost Python package. What can he/she do?

Although we can document it, trust me, not everyone will read the whole doc carefully. Previously, XGBoost-JVM had the same issue, so I changed that.

@wbo4958 (Contributor Author) commented Sep 16, 2022

Hi @hcho3, what does "Pending" mean for pipelines like xgboost-ci/pr?

Comment on lines +205 to +206
_get_spark_session().sparkContext.parallelize([booster], 1).saveAsTextFile(
    model_save_path
)
Contributor:

Interesting idea, but how do we control the saved file name?

Contributor:

Does the booster string contain a "\n" character? If yes, when loading back (via sparkContext.textFile(model_load_path)), each line will become one RDD element, and these lines might be split into multiple RDD partitions.
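To illustrate the concern, a hypothetical load-back sketch (path illustrative): textFile() yields one element per line, so the model only survives the round trip unchanged if it was saved as a single line:

# collect the saved lines; with one partition and no embedded "\n" in the
# booster string, this list has exactly one element holding the full model
lines = spark.sparkContext.textFile(model_load_path).collect()
booster_json = lines[0] if len(lines) == 1 else "\n".join(lines)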

@wbo4958 (Contributor Author), Sep 16, 2022:

I tested, and it is always part-00000. There seems to be a pattern for the generated file name based on the task id; since we only have 1 partition, the id should be 00000.

Contributor:

Let's document that the file name "part-00000" is the model JSON file.

And please add a test to ensure the model JSON file does not contain the \n character, and document the reason.

Contributor Author:

Just checked the code: the file name is defined in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L225.

  override def initWriter(taskContext: NewTaskAttemptContext, splitId: Int): Unit = {
    val numfmt = NumberFormat.getInstance(Locale.US)
    numfmt.setMinimumIntegerDigits(5)
    numfmt.setGroupingUsed(false)

    val outputName = "part-" + numfmt.format(splitId)
    val path = FileOutputFormat.getOutputPath(getConf)
    val fs: FileSystem = {
      if (path != null) {
        path.getFileSystem(getConf)
      } else {
        // scalastyle:off FileSystemGet
        FileSystem.get(getConf)
        // scalastyle:on FileSystemGet
      }
    }
...

Here the splitId is TaskContext.partitionId(). In our case there is only 1 partition, so the file name is "part-00000".

Contributor:

Yes, I know that. My point is: can we customize the file name to make it more user-friendly? Not a must, though.

Member:

That's the internal behavior of pyspark; not sure if it's a good idea to rely on it.

Contributor Author:

Yeah, if you guys insist, I can use the FileSystem Java API to achieve it via py4j.

Contributor:

Yeah, if you guys insist, I can use the FileSystem Java API to achieve it via py4j.

No need to do that; it makes the code hard to maintain. Your current code is fine.

@hcho3 (Collaborator) commented Sep 16, 2022

what does "Pending" mean for pipelines like xgboost-ci/pr?

The CI pipeline doesn't run until one of the admins (like me) gives approval. We do this to save on CI costs.

bst = xgb.Booster()
path = glob.glob(f"{model_path}/**/model/part-00000", recursive=True)[0]
bst.load_model(path)
self.assertEqual(model.get_booster().save_raw("json"), bst.save_raw("json"))
Contributor:

Add a test to assert the model file does not include the \n char.

@wbo4958 (Contributor Author), Sep 17, 2022:

Yeah, per my understanding, we don't need to do this: if there were a "\n", the assertion self.assertEqual(model.get_booster().save_raw("json"), bst.save_raw("json")) would fail, or bst.load_model(path) would fail.
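For reference, a hypothetical version of the explicit check requested above (`path` is the local part-00000 file from the glob in the test snippet earlier):

# saveAsTextFile terminates each element with "\n", so after stripping that
# trailing newline the model JSON itself must contain no "\n" at all
with open(path, "r") as f:
    content = f.read()
assert "\n" not in content.rstrip("\n")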

@trivialfis (Member):

I will leave the approval to @WeichenXu123. Could you please add documentation as well, about get_booster and your workaround for the model serialization?

@wbo4958 (Contributor Author) commented Sep 18, 2022

Sure, I will add the doc in a follow-up PR, along with how to leverage RAPIDS to accelerate XGBoost PySpark.

@trivialfis @hcho3 could you trigger the CI of this PR?

@trivialfis trivialfis merged commit 4f42aa5 into dmlc:master Sep 20, 2022
@wbo4958 wbo4958 deleted the model-format branch April 23, 2024 09:26