
[pyspark] sort qid for SparkRanker #8497

Merged: 2 commits merged into dmlc:master on Dec 2, 2022

Conversation

@wbo4958 (Contributor) commented Nov 30, 2022

To fix #8491

@wbo4958 (Contributor, Author) commented Nov 30, 2022

@WeichenXu123 @trivialfis, please help review this.

@wbo4958 mentioned this pull request Nov 30, 2022
@hcho3 changed the title from [pyspark] sort qid for SparkRandker to [pyspark] sort qid for SparkRanker on Nov 30, 2022
@@ -729,6 +729,10 @@ def _fit(self, dataset):
else:
dataset = dataset.repartition(num_workers)

if self.isDefined(self.qid_col) and self.getOrDefault(self.qid_col):
# XGBoost requires qid to be sorted for each partition
dataset = dataset.sortWithinPartitions(alias.qid)
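As a plain-Python sketch (data hypothetical, no PySpark required) of what the added sortWithinPartitions call guarantees: each partition's rows are ordered by qid independently, with no shuffle across partitions.

```python
# Hypothetical (qid, payload) rows, grouped by partition, initially unsorted.
partitions = [
    [(9, "row-a"), (0, "row-b"), (2, "row-c")],
    [(5, "row-d"), (1, "row-e")],
]

# Equivalent of DataFrame.sortWithinPartitions("qid"): sort each partition
# independently, ascending by qid; partitions are never merged or reshuffled.
sorted_partitions = [sorted(p, key=lambda r: r[0]) for p in partitions]

for part in sorted_partitions:
    qids = [q for q, _ in part]
    # XGBoost requires qid to be non-decreasing within each partition.
    assert qids == sorted(qids)
```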
Contributor (review comment):

Nit: add ascending=True explicitly.

wbo4958 (Contributor, Author):

Done.

(Vectors.sparse(3, {1: 8.0, 2: 9.5}), 2, 1),
(Vectors.dense(1.0, 2.0, 3.0), 0, 0),
(Vectors.dense(4.0, 5.0, 6.0), 1, 0),
(Vectors.dense(9.0, 4.0, 8.0), 2, 0),
Contributor (review comment):

Nit: do we need to hardcode such a long data list? We could hardcode 4 rows and use [ ... ] * 100 instead.
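The reviewer's suggestion, sketched in plain Python (the row values here are hypothetical, not the PR's actual test data):

```python
# Hardcode a small base list and replicate it, instead of writing out
# hundreds of literal rows in the test fixture.
base_rows = [
    ([1.0, 2.0, 3.0], 0, 0),  # (features, label, qid)
    ([4.0, 5.0, 6.0], 1, 0),
    ([9.0, 4.0, 8.0], 2, 0),
    ([1.0, 1.0, 1.0], 0, 1),
]
data = base_rows * 100  # 400 rows from 4 literals
assert len(data) == 400
```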

wbo4958 (Contributor, Author):

Done.

ranker = SparkXGBRanker(qid_col="qid", num_workers=2)
assert ranker.getOrDefault(ranker.objective) == "rank:pairwise"
model = ranker.fit(self.ranker_df_train_1)
model.transform(self.ranker_df_test).collect()
Member (review comment):

What's the purpose of this test?

wbo4958 (Contributor, Author):

To test whether SparkXGBRanker throws an exception.

)
self.ranker_df_train_1 = self.session.createDataFrame(
[
(Vectors.sparse(3, {1: 1.0, 2: 5.5}), 0, 9),
Member (review comment):

How did you produce this data and the expected result? Please try not to use hardcoded results.

wbo4958 (Contributor, Author):

Yeah, the qid is in descending order. Without the fix, it throws an exception: ../src/data/data.cc:486: Check failed: non_dec: qid must be sorted in non-decreasing order along with data.
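The failing check can be paraphrased in plain Python (a sketch of the condition, not XGBoost's actual C++ implementation in src/data/data.cc):

```python
def check_qid_non_decreasing(qids):
    # Mirrors the failing condition: each qid must be >= the previous one.
    non_dec = all(a <= b for a, b in zip(qids, qids[1:]))
    if not non_dec:
        raise ValueError(
            "qid must be sorted in non-decreasing order along with data"
        )

check_qid_non_decreasing([0, 0, 1, 2])  # sorted: passes

try:
    check_qid_non_decreasing([9, 2, 0])  # descending, like the test data
except ValueError as err:
    print("rejected:", err)
```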

@wbo4958 (Contributor, Author) commented Dec 1, 2022

@hcho3, please help merge this. Thanks.

@hcho3 hcho3 merged commit 8e41ad2 into dmlc:master Dec 2, 2022
@wbo4958 wbo4958 deleted the ranker branch December 2, 2022 01:09
@trivialfis (Member):

@wbo4958 Could you please change the tests to NOT use hardcoded results?

@trivialfis (Member):

#8497 (comment)

@wbo4958 (Contributor, Author) commented Dec 2, 2022

#8497 (comment)

Hi @trivialfis, for this case the test I added checks whether the PySpark application crashes, so I think it is OK to hardcode the data: the data makes the crash scenario straightforward to demonstrate.

pred_result = model.transform(self.ranker_df_test).collect()

for row in pred_result:
assert np.isclose(row.prediction, row.expected_prediction, rtol=1e-3)
Member (review comment):

@wbo4958 This is not only checking for an exception.


Member (review comment):

Ah... that's a headache. I'm blocked by these tests and don't know how to recreate them...

wbo4958 (Contributor, Author):

Yes, we can have a follow-up PR to refactor these tests so they are not hardcoded.

trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Dec 6, 2022
* [pyspark] sort qid for SparkRandker

* resolve comments
trivialfis added a commit that referenced this pull request Dec 6, 2022
* [pyspark] sort qid for SparkRandker

* resolve comments

Co-authored-by: Bobby Wang <wbo4958@gmail.com>
Successfully merging this pull request may close these issues:

[pyspark] SparkXGBRanker does not work on dataframe with multiple partitions
4 participants