Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DON'T MERGE] Try to replace all json4s with Jackson #37604

Closed
wants to merge 39 commits into from

Conversation

LuciferYang
Copy link
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@LuciferYang LuciferYang marked this pull request as draft August 22, 2022 02:40
@LuciferYang LuciferYang changed the title [DON'T MERGE] Try to replace all json4s with Jackson [DON'T MERGE] Try to replace all json4s with Jackson Aug 22, 2022
@HyukjinKwon
Copy link
Member

cc @JoshRosen FYI

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Aug 22, 2022

@JoshRosen I am not sure whether this draft is helpful for future work, but I hope it is useful to a certain extent.

This draft pr does not focus on forward compatibility, so I use Jackson JsonNode instead of all Json4s JValue, includes method parameters type and return value type, also includes objects used to serialize and deserialize Json. The test code still using json4s to test compatibility and all test should passed.

The change involves 5 modules core, catalyst, sql, mllib, kafka, except that sql is directly used for Row.jsonValue in catalyst, other modules are relatively independent.

For exporting HTTP APIs(org.apache.spark.deploy.JsonProtocol), JsonNode is also used, and jsonResponderToServlet method in JettyUtils is adapted. Of course, we can also return a custom JsonResult object instead of relying on JsonNode

A problem found in the rewriting process is that Json4s has JNothing, but Jackson does not, I used MissingNode in Jackson instead and made special processing, for example:

if (!jsonNode.isMissingNode) {
  node.set[JsonNode](name, jsonNode)
}

mima checks the following forward incompatibilities:

 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.types.DataType#JSortedObject.unapplySeq"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.expressions.MutableAggregationBuffer.jsonValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeMapToJValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeDoubleToJValue"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueDecode"),
 ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueEncode"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.tree.model.TreeEnsembleModel#SaveLoadV1_0.readMetadata")

This is similar to the problem described in SPARK-39658, JValue is directly used in the API.

@LuciferYang
Copy link
Contributor Author

It seems that the main problem is that JValue is used as an input parameter or a return value, which will cause some forward incompatibility. I'm not sure whether we will change these APIs.

If the above problem is not a problem, I think spark may can replace Json4s with Jackson

@LuciferYang
Copy link
Contributor Author

2ea9dcf add a simple benchmark, will update result later

* Results will be written to "benchmarks/JsonSerDeBenchmark-results.txt".
* }}}
* */
object JsonSerDeBenchmark extends BenchmarkBase {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test to Json with out nested:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Json4s                                          111            170          58          0.9        1105.5       1.0X
Use Scala object                                    175            177           2          0.6        1748.0       0.6X
Use JsonNode                                         60             61           2          1.7         597.7       1.8X
Use Json Generator                                   55             56           1          1.8         552.8       2.0X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test to Json with nested:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Json4s                                         1574           1575           1          0.1       15739.5       1.0X
Use Scala object                                    395            398           2          0.3        3952.5       4.0X
Use JsonNode                                        285            288           2          0.4        2854.5       5.5X
Use Json Generator                                  151            153           2          0.7        1511.1      10.4X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test from Json without nested:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Json4s                                          103            104           2          1.0        1032.5       1.0X
Use Jackson                                          95             96           1          1.0         954.2       1.1X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test from Json with nested:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Json4s                                         1258           1260           2          0.1       12584.5       1.0X
Use Jackson                                         539            541           2          0.2        5394.5       2.3X

From the results of JsonSerDeBenchmark:

  • Jackson is faster than Json4s, and Jackson has greater advantages when the data is nested
  • Using JsonGenerator for serialization is more efficient than building a JsonNode first and then serializing it

@plokhotnyuk
Copy link

plokhotnyuk commented Sep 7, 2022

I'm happy to see this PR!

Jackson is much safer than json4s

@LuciferYang Could you, please, check jsoniter-scala with your benchmarks if it worth to have an official integration with Spark ;)

@LuciferYang
Copy link
Contributor Author

@plokhotnyuk Let me learn about jsoniter-scala first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants