
Xgboost4j-Spark 1.5.2 & 1.6-SNAPSHOT have not yet fixed the JSON error in Spark 3.2.0 #7663

Closed
FelixDuong opened this issue Feb 16, 2022 · 9 comments

Comments

@FelixDuong

I have tested with the stable 1.5.2 release and the 1.6-SNAPSHOT, and found a JSON compatibility error in Spark 3.2.0 when training with a Pipeline model:

```
XGB_Model_0: org.apache.spark.ml.PipelineModel = pipeline_1040eb7c9f37
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$5(Pipeline.scala:257)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
  at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$4(Pipeline.scala:257)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$4$adapted(Pipeline.scala:254)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:254)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:346)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
  at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344)
  ... 74 elided
```

Please help me fix this error so I can save the pipeline model. (I posted the same error about saving a Spark pipeline model in August 2020.)

@wbo4958
Contributor

wbo4958 commented Feb 21, 2022

I just tried it locally and it worked with the code below:

```scala
val xgbClassificationModel = xgbClassifier.fit(xgbInput)
xgbClassificationModel.save("/tmp/abc")
val newModel = XGBoostClassificationModel.load("/tmp/abc")
val df = newModel.transform(xgbInput)
```

Could you double-check whether Spark is actually using the XGBoost 1.5.2 or 1.6-SNAPSHOT jars? The issue was fixed by #7376, which has been included since XGBoost 1.5.1.

@FelixDuong
Author

FelixDuong commented Feb 23, 2022

I tested again recently and got the same JSON error when saving the model:

```
2022-02-23 15:24:38 WARN  XGBoostSpark:185 - train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.241.152, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
[15:24:49] task 0 got new rank 0
XGB_Model_0: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel = xgbr_4d3d9e8e1bea

scala> XGB_Model_0.save("./Model/XGB")
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.util.MLWritable.save(ReadWrite.scala:287)
  at org.apache.spark.ml.util.MLWritable.save$(ReadWrite.scala:287)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel.save(XGBoostRegressor.scala:211)
  ... 50 elided
```


After training succeeds, the model cannot be saved in Spark 3.2.1 because of the JSON error. I used:

```scala
val xbooster = new XGBoostRegressor(Map(
  "eta" -> 0.1, "missing" -> 0, "max_depth" -> 15, "objective" -> "reg:squarederror",
  "eval_metric" -> "mae", "num_round" -> 50, "subsample" -> 0.8, "tree_method" -> "hist",
  "n_estimators" -> 100, "verbosity" -> 1, "early_stopping_rounds" -> 5,
  "min_child_weight" -> 1000))
```

Can you help me? Or does this only work on the old Spark version (3.1)?

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

Could you double-check whether you previously put XGBoost jars into ${SPARK_HOME}/jars?

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

@FelixDuong could you search the driver log for the keywords "XGBoostSpark: Running XGBoost" to check which version you were actually using?
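The check above can be sketched as a small shell helper that pulls the "Running XGBoost &lt;version&gt;" line out of a driver log. The helper name and the example path are ours, not part of XGBoost; the log message itself is the one XGBoostSpark prints at training time.

```shell
# xgb_version_from_log: print every "Running XGBoost <version>" occurrence
# found in the given driver log file (helper name is hypothetical).
xgb_version_from_log() {
  grep -o 'Running XGBoost [0-9][0-9.]*' "$1"
}

# Example usage (log path is hypothetical):
# xgb_version_from_log /path/to/driver.log
```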

@FelixDuong
Author

FelixDuong commented Feb 24, 2022

Oh, I put the two 1.6-SNAPSHOT jar files in SPARK_HOME/jars.
It's weird: when I open the logs, I see "XGBoostSpark:577 - Running XGBoost 1.4.2 with ..."


```
2022-02-23 15:42:20 INFO  SparkContext:57 - Created broadcast 5 from rdd at DataUtils.scala:122
2022-02-23 15:42:20 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-23 15:42:20 INFO  XGBoostSpark:577 - Running XGBoost 1.4.2 with parameters:
alpha -> 0.0
min_child_weight -> 1000.0
sample_type -> uniform
...
```

@FelixDuong
Author

FelixDuong commented Feb 24, 2022

I re-ran the pipeline model with XGBoost4J-Spark 1.5.2 and got the same errors:

```
res18: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
res19: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
res20: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
import org.apache.spark.ml.{Pipeline, PipelineModel}
pipeline: org.apache.spark.ml.Pipeline = pipeline_c40d74bd9b9f
res21: pipeline.type = pipeline_c40d74bd9b9f
2022-02-24 09:01:05 WARN  XGBoostSpark:185 - train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.241.152, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
[09:01:18] task 0 got new rank 0
XGB_Model_0: org.apache.spark.ml.PipelineModel = pipeline_c40d74bd9b9f
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$5(Pipeline.scala:257)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
```

In the logs (with XGBoost4J-Spark 1.5.2):

```
2022-02-24 09:01:05 INFO  SparkContext:57 - Created broadcast 5 from rdd at DataUtils.scala:122
2022-02-24 09:01:05 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-24 09:01:05 INFO  XGBoostSpark:577 - Running XGBoost 1.4.2 with parameters:
alpha -> 0.0
min_child_weight -> 1000.0
sample_type -> uniform
base_score -> 0.5
rabit_timeout -> -1
colsample_bylevel -> 1.0
...
```

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

@FelixDuong yeah, it seems Spark found the old 1.4.2 XGBoost jars. Please remove them and retry.
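A minimal shell sketch of that cleanup step, listing every xgboost4j jar under a Spark install's jars directory so stale builds (here 1.4.2) can be spotted and deleted. The helper name and the /opt/spark default are assumptions, not part of any tool.

```shell
# list_xgb_jars: print all xgboost4j*.jar files under <spark_home>/jars
# (helper name is hypothetical; adjust the path to your installation).
list_xgb_jars() {
  find "$1/jars" -name 'xgboost4j*.jar' 2>/dev/null
}

# Example usage:
# list_xgb_jars "${SPARK_HOME:-/opt/spark}"
```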

@FelixDuong
Author

It works, all fine. Thanks!

```
2022-02-24 09:58:07 INFO  MemoryStore:57 - Block broadcast_9 stored as values in memory (estimated size 191.0 KiB, free 2.8 GiB)
2022-02-24 09:58:07 INFO  MemoryStore:57 - Block broadcast_9_piece0 stored as bytes in memory (estimated size 32.9 KiB, free 2.8 GiB)
2022-02-24 09:58:07 INFO  BlockManagerInfo:57 - Added broadcast_9_piece0 in memory on 172.17.0.1:36595 (size: 32.9 KiB, free: 2.8 GiB)
2022-02-24 09:58:07 INFO  SparkContext:57 - Created broadcast 9 from rdd at DataUtils.scala:122
2022-02-24 09:58:07 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-24 09:58:08 INFO  XGBoostSpark:577 - Running XGBoost 1.5.2 with parameters:
```

@trivialfis
Member

Thank you @wbo4958!
