
[pyspark] disable repartition_random_shuffle by default #8283

Merged 2 commits into dmlc:master from wbo4958:repartition Sep 29, 2022

Conversation

@wbo4958 (Contributor) commented Sep 28, 2022

This PR resolves the issue where xgboost detects that this parameter is not used on the native side:

Parameters: { "repartition_random_shuffle" } are not used.

This PR disables repartition_random_shuffle by default and prints a prompt when an empty partition is detected. The severe data skew mentioned in #8221 is really a rare case, so I don't think we need to change the default repartition behavior from round-robin partitioning to hash partitioning. Besides, hash partitioning introduces an extra "Project" node into the physical plan; compare the two plans below:

  • repartition_random_shuffle=False
== Physical Plan ==
Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#94]
+- *(1) Project [cast(class#4 as float) AS label#20, cast(sepal_length#0 as float) AS sepal_length#10, cast(sepal_width#1 as float) AS sepal_width#11, cast(petal_length#2 as float) AS petal_length#12, cast(petal_width#3 as float) AS petal_width#13]
   +- *(1) ColumnarToRow
      +- FileScan parquet [sepal_length#0,sepal_width#1,petal_length#2,petal_width#3,class#4] Batched: true, DataFilters: [], Format: Parquet, Location: xxx
  • repartition_random_shuffle=True
== Physical Plan ==
*(2) Project [label#20, sepal_length#10, sepal_width#11, petal_length#12, petal_width#13]
+- Exchange hashpartitioning(_nondeterministic#26, 1), REPARTITION_WITH_NUM, [id=#92]
   +- *(1) Project [cast(class#4 as float) AS label#20, cast(sepal_length#0 as float) AS sepal_length#10, cast(sepal_width#1 as float) AS sepal_width#11, cast(petal_length#2 as float) AS petal_length#12, cast(petal_width#3 as float) AS petal_width#13, rand(1) AS _nondeterministic#26]
      +- *(1) ColumnarToRow
         +- FileScan parquet [sepal_length#0,sepal_width#1,petal_length#2,petal_width#3,class#4] Batched: true, DataFilters: [], Format: Parquet, Location: xxx
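
For reference, here is a minimal sketch of the toggle described above, written against the public PySpark DataFrame API. This is an illustration under assumed helper names, not the PR's actual diff, and the warning text is likewise an assumption.

# A minimal sketch (assumed names, not the PR's diff) of the two
# repartition strategies compared in the plans above, plus the
# empty-partition prompt.
from pyspark.sql import DataFrame
from pyspark.sql.functions import rand

def repartition_dataset(df: DataFrame, num_workers: int,
                        random_shuffle: bool = False) -> DataFrame:
    if random_shuffle:
        # Hash partitioning on a rand() column; as the second plan shows,
        # Spark adds an extra Project node to introduce and later drop
        # the `_nondeterministic` column.
        return df.repartition(num_workers, rand(1))
    # Default after this PR: plain round-robin repartition, with no
    # extra Project node in the plan.
    return df.repartition(num_workers)

def warn_if_empty_partition(num_rows: int, logger) -> None:
    # Illustrative message only; the PR's exact wording may differ.
    if num_rows == 0:
        logger.warning(
            "Detected an empty partition in the training data; consider "
            "repartitioning the input or enabling repartition_random_shuffle."
        )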

@wbo4958 (Contributor, Author) commented Sep 28, 2022

@trivialfis any idea about the error?

Run python tests/ci_build/lint_python.py --format=0 --type-check=1 --pylint=0
xgboost/spark/data.py:12: error: Module "xgboost.spark.utils" has no attribute "get_logger"
Found 1 error in 1 file (checked 24 source files)
Error: Process completed with exit code 255.
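
For context, the offending line in data.py is presumably a plain import of the helper; the snippet below is a reconstruction, not a verified quote from the branch.

# xgboost/spark/data.py, line 12 (reconstructed): mypy cannot resolve
# get_logger because utils.py carries no type annotations.
from .utils import get_logger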

@trivialfis (Member)
Use this instead:

from .utils import get_logger   # type: ignore

utils.py is not properly annotated.

@wbo4958 (Contributor, Author) commented Sep 28, 2022

Interesting. I ran this command locally, and there are some other failures in other files:

$ python tests/ci_build/lint_python.py --format=0 --type-check=1 --pylint=0
xgboost/core.py:451: error: Incompatible types in assignment (expression has type "BaseException", variable has type "Optional[Exception]")
xgboost/plotting.py:112: error: Unsupported operand types for + ("List[float]" and "int")
xgboost/plotting.py:112: note: Left operand is of type "Union[float, List[float]]"
xgboost/plotting.py:121: error: Unsupported operand types for * ("List[float]" and "float")
xgboost/plotting.py:121: note: Left operand is of type "Union[float, List[float]]"
xgboost/dask.py:842: error: Incompatible types in assignment (expression has type "None", variable has type "Tuple[str, int]")
xgboost/dask.py:850: error: Incompatible types in assignment (expression has type "None", variable has type "str")
xgboost/dask.py:1735: error: "sync" gets multiple values for keyword argument "asynchronous"
Found 6 errors in 3 files (checked 24 source files)

And I see that xgboost/spark/core.py also imports get_logger without type: ignore; see https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/spark/core.py#L59-L70

@wbo4958 mentioned this pull request Sep 28, 2022
@trivialfis (Member)

Most of the spark modules have this line:

As a result, the type checker cannot work properly with these modules.

As for the other errors, a likely cause is missing dependencies in your local env.
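
The exact line quoted here did not survive into this transcript. As a labeled assumption, a module-level suppression like the one below would produce exactly this effect: mypy skips the whole file, so errors only surface at the import sites in other modules.

# Hypothetical illustration (assumed; the quoted line is not shown
# above): a file with a module-level ignore comment before any
# docstring or code is skipped entirely by mypy.
# type: ignore
"""xgboost.spark utilities (illustrative module docstring)."""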

@trivialfis added this to In progress in 1.7 Roadmap via automation Sep 29, 2022
@trivialfis merged commit c91fed0 into dmlc:master Sep 29, 2022
1.7 Roadmap automation moved this from In progress to Done Sep 29, 2022
@wbo4958 deleted the repartition branch April 23, 2024