
Xgboost4j-Spark 1.5.2 & 1.6-SNAPSHOT have not yet fixed the JSON error in Spark 3.2.0 #7663

Closed
FelixDuong opened this issue Feb 16, 2022 · 9 comments

Comments

@FelixDuong

I have tested with the stable 1.5.2 release and the 1.6-SNAPSHOT, and found a JSON compatibility error in Spark 3.2.0 when training with a Pipeline model:

```
XGB_Model_0: org.apache.spark.ml.PipelineModel = pipeline_1040eb7c9f37
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$5(Pipeline.scala:257)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
  at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$4(Pipeline.scala:257)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$4$adapted(Pipeline.scala:254)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:254)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:346)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
  at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344)
  ... 74 elided
```

Please help me fix this error so I can save the pipeline model. (I posted the same error about saving a Spark pipeline model in August 2020.)

@wbo4958
Contributor

wbo4958 commented Feb 21, 2022

I just tried it locally and it worked with the code below:

```scala
val xgbClassificationModel = xgbClassifier.fit(xgbInput)
xgbClassificationModel.save("/tmp/abc")
val newModel = XGBoostClassificationModel.load("/tmp/abc")
val df = newModel.transform(xgbInput)
```

Could you double-check whether Spark is actually using the XGBoost 1.5.2 or 1.6-SNAPSHOT jars? The issue was fixed by #7376, which has been included since XGBoost 1.5.1.

@FelixDuong
Author

FelixDuong commented Feb 23, 2022

I tested again recently and got the same JSON error when saving the model:

```
2022-02-23 15:24:38 WARN  XGBoostSpark:185 - train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.241.152, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
[15:24:49] task 0 got new rank 0
XGB_Model_0: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel = xgbr_4d3d9e8e1bea

scala> XGB_Model_0.save("./Model/XGB")
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.util.MLWritable.save(ReadWrite.scala:287)
  at org.apache.spark.ml.util.MLWritable.save$(ReadWrite.scala:287)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel.save(XGBoostRegressor.scala:211)
  ... 50 elided
```


After training succeeds, the model cannot be saved in Spark 3.2.1 because of the JSON error. I used:

```scala
val xbooster = new XGBoostRegressor(Map(
  "eta" -> 0.1, "missing" -> 0, "max_depth" -> 15, "objective" -> "reg:squarederror",
  "eval_metric" -> "mae", "num_round" -> 50, "subsample" -> 0.8, "tree_method" -> "hist",
  "n_estimators" -> 100, "verbosity" -> 1, "early_stopping_rounds" -> 5,
  "min_child_weight" -> 1000))
```

Can you help me? Or does this only work on the old Spark version (3.1)?

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

Could you double-check whether you previously put XGBoost jars into ${SPARK_HOME}/jars?

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

@FelixDuong could you search the driver log for the keywords "XGBoostSpark: Running XGBoost" to check which version you were actually using?
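The check above can be sketched as a small shell helper that pulls the "Running XGBoost &lt;version&gt;" line out of a driver log. The helper name and the example path are ours, not part of XGBoost; the log message itself is the one XGBoostSpark prints at training time.

```shell
# xgb_version_from_log: print every "Running XGBoost <version>" occurrence
# found in the given driver log file (helper name is hypothetical).
xgb_version_from_log() {
  grep -o 'Running XGBoost [0-9][0-9.]*' "$1"
}

# Example usage (log path is hypothetical):
# xgb_version_from_log /path/to/driver.log
```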

@FelixDuong
Author

FelixDuong commented Feb 24, 2022

Oh, I put the two 1.6-SNAPSHOT jar files in SPARK_HOME/jars.
It's weird: when I open the logs, I see "XGBoostSpark:577 - Running XGBoost 1.4.2 with ..."


```
2022-02-23 15:42:20 INFO  SparkContext:57 - Created broadcast 5 from rdd at DataUtils.scala:122
2022-02-23 15:42:20 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-23 15:42:20 INFO  XGBoostSpark:577 - Running XGBoost 1.4.2 with parameters:
alpha -> 0.0
min_child_weight -> 1000.0
sample_type -> uniform
...
```

@FelixDuong
Author

FelixDuong commented Feb 24, 2022

I re-ran the pipeline model with XGBoost4J-Spark 1.5.2 and got the same errors:

```
res18: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
res19: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
res20: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_fd1aecb58785
import org.apache.spark.ml.{Pipeline, PipelineModel}
pipeline: org.apache.spark.ml.Pipeline = pipeline_c40d74bd9b9f
res21: pipeline.type = pipeline_c40d74bd9b9f
2022-02-24 09:01:05 WARN  XGBoostSpark:185 - train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.241.152, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
[09:01:18] task 0 got new rank 0
XGB_Model_0: org.apache.spark.ml.PipelineModel = pipeline_c40d74bd9b9f
java.lang.NoSuchMethodError: 'org.json4s.JsonDSL$JsonAssoc org.json4s.JsonDSL$.pair2Assoc(scala.Tuple2, scala.Function1)'
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:75)
  at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel$XGBoostRegressionModelWriter.saveImpl(XGBoostRegressor.scala:454)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$5(Pipeline.scala:257)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
  at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
```

In the logs (with XGBoost4J-Spark 1.5.2):

```
2022-02-24 09:01:05 INFO  SparkContext:57 - Created broadcast 5 from rdd at DataUtils.scala:122
2022-02-24 09:01:05 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-24 09:01:05 INFO  XGBoostSpark:577 - Running XGBoost 1.4.2 with parameters:
alpha -> 0.0
min_child_weight -> 1000.0
sample_type -> uniform
base_score -> 0.5
rabit_timeout -> -1
colsample_bylevel -> 1.0
...
```

@wbo4958
Contributor

wbo4958 commented Feb 24, 2022

@FelixDuong yeah, it seems Spark found the old 1.4.2 XGBoost jars. Please remove them and retry.
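A minimal shell sketch of that cleanup step, listing every xgboost4j jar under a Spark install's jars directory so stale builds (here 1.4.2) can be spotted and deleted. The helper name and the /opt/spark default are assumptions, not part of any tool.

```shell
# list_xgb_jars: print all xgboost4j*.jar files under <spark_home>/jars
# (helper name is hypothetical; adjust the path to your installation).
list_xgb_jars() {
  find "$1/jars" -name 'xgboost4j*.jar' 2>/dev/null
}

# Example usage:
# list_xgb_jars "${SPARK_HOME:-/opt/spark}"
```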

@FelixDuong
Author

It works, all fine. Thanks!

```
2022-02-24 09:58:07 INFO  MemoryStore:57 - Block broadcast_9 stored as values in memory (estimated size 191.0 KiB, free 2.8 GiB)
2022-02-24 09:58:07 INFO  MemoryStore:57 - Block broadcast_9_piece0 stored as bytes in memory (estimated size 32.9 KiB, free 2.8 GiB)
2022-02-24 09:58:07 INFO  BlockManagerInfo:57 - Added broadcast_9_piece0 in memory on 172.17.0.1:36595 (size: 32.9 KiB, free: 2.8 GiB)
2022-02-24 09:58:07 INFO  SparkContext:57 - Created broadcast 9 from rdd at DataUtils.scala:122
2022-02-24 09:58:07 INFO  FileSourceScanExec:57 - Planning scan with bin packing, max size: 18963709 bytes, open cost is considered as scanning 4194304 bytes.
2022-02-24 09:58:08 INFO  XGBoostSpark:577 - Running XGBoost 1.5.2 with parameters:
```

@trivialfis
Member

Thank you @wbo4958!
