
[pyspark] disable repartition_random_shuffle by default #8283

Merged 2 commits into dmlc:master from wbo4958:repartition Sep 29, 2022

Conversation

@wbo4958 (Contributor) commented Sep 28, 2022

This PR resolves the issue where xgboost detects that this parameter is not used on the native side:

Parameters: { "repartition_random_shuffle" } are not used.

This PR disables repartition_random_shuffle by default and prints a prompt when an empty partition is detected. The severe data skew mentioned in #8221 is really a rare case, so I don't think we need to change the default repartition behavior from round-robin partitioning to hash partitioning. Besides, hash partitioning introduces an extra "Project" node into the physical plan; compare the two plans below:

  • repartition_random_shuffle=False
== Physical Plan ==
Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#94]
+- *(1) Project [cast(class#4 as float) AS label#20, cast(sepal_length#0 as float) AS sepal_length#10, cast(sepal_width#1 as float) AS sepal_width#11, cast(petal_length#2 as float) AS petal_length#12, cast(petal_width#3 as float) AS petal_width#13]
   +- *(1) ColumnarToRow
      +- FileScan parquet [sepal_length#0,sepal_width#1,petal_length#2,petal_width#3,class#4] Batched: true, DataFilters: [], Format: Parquet, Location: xxx
  • repartition_random_shuffle=True
== Physical Plan ==
*(2) Project [label#20, sepal_length#10, sepal_width#11, petal_length#12, petal_width#13]
+- Exchange hashpartitioning(_nondeterministic#26, 1), REPARTITION_WITH_NUM, [id=#92]
   +- *(1) Project [cast(class#4 as float) AS label#20, cast(sepal_length#0 as float) AS sepal_length#10, cast(sepal_width#1 as float) AS sepal_width#11, cast(petal_length#2 as float) AS petal_length#12, cast(petal_width#3 as float) AS petal_width#13, rand(1) AS _nondeterministic#26]
      +- *(1) ColumnarToRow
         +- FileScan parquet [sepal_length#0,sepal_width#1,petal_length#2,petal_width#3,class#4] Batched: true, DataFilters: [], Format: Parquet, Location: xxx
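
For reference, here is a minimal sketch of the toggle described above, written against the public PySpark DataFrame API. This is an illustration under assumed helper names, not the PR's actual diff, and the warning text is likewise an assumption.

# A minimal sketch (assumed names, not the PR's diff) of the two
# repartition strategies compared in the plans above, plus the
# empty-partition prompt.
from pyspark.sql import DataFrame
from pyspark.sql.functions import rand

def repartition_dataset(df: DataFrame, num_workers: int,
                        random_shuffle: bool = False) -> DataFrame:
    if random_shuffle:
        # Hash partitioning on a rand() column; as the second plan shows,
        # Spark adds an extra Project node to introduce and later drop
        # the `_nondeterministic` column.
        return df.repartition(num_workers, rand(1))
    # Default after this PR: plain round-robin repartition, with no
    # extra Project node in the plan.
    return df.repartition(num_workers)

def warn_if_empty_partition(num_rows: int, logger) -> None:
    # Illustrative message only; the PR's exact wording may differ.
    if num_rows == 0:
        logger.warning(
            "Detected an empty partition in the training data; consider "
            "repartitioning the input or enabling repartition_random_shuffle."
        )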

@wbo4958 (Contributor, Author) commented Sep 28, 2022

@trivialfis any idea about the error?

Run python tests/ci_build/lint_python.py --format=0 --type-check=1 --pylint=0
xgboost/spark/data.py:12: error: Module "xgboost.spark.utils" has no attribute "get_logger"
Found 1 error in 1 file (checked 24 source files)
Error: Process completed with exit code 255.
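
For context, the offending line in data.py is presumably a plain import of the helper; the snippet below is a reconstruction, not a verified quote from the branch.

# xgboost/spark/data.py, line 12 (reconstructed): mypy cannot resolve
# get_logger because utils.py carries no type annotations.
from .utils import get_logger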

@trivialfis (Member)
Use this instead:

from .utils import get_logger   # type: ignore

utils.py is not properly annotated.

@wbo4958 (Contributor, Author) commented Sep 28, 2022

Interesting. I ran this command locally, and there are some other failures in other files:

$ python tests/ci_build/lint_python.py --format=0 --type-check=1 --pylint=0
xgboost/core.py:451: error: Incompatible types in assignment (expression has type "BaseException", variable has type "Optional[Exception]")
xgboost/plotting.py:112: error: Unsupported operand types for + ("List[float]" and "int")
xgboost/plotting.py:112: note: Left operand is of type "Union[float, List[float]]"
xgboost/plotting.py:121: error: Unsupported operand types for * ("List[float]" and "float")
xgboost/plotting.py:121: note: Left operand is of type "Union[float, List[float]]"
xgboost/dask.py:842: error: Incompatible types in assignment (expression has type "None", variable has type "Tuple[str, int]")
xgboost/dask.py:850: error: Incompatible types in assignment (expression has type "None", variable has type "str")
xgboost/dask.py:1735: error: "sync" gets multiple values for keyword argument "asynchronous"
Found 6 errors in 3 files (checked 24 source files)

And I see that xgboost/spark/core.py also imports get_logger without type: ignore; see https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/spark/core.py#L59-L70

@wbo4958 mentioned this pull request Sep 28, 2022
@trivialfis (Member)

Most of the spark modules have this line:

As a result, the type checker cannot work properly with these modules.

As for the other errors, a likely cause is missing dependencies in your local env.
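
The exact line quoted here did not survive into this transcript. As a labeled assumption, a module-level suppression like the one below would produce exactly this effect: mypy skips the whole file, so errors only surface at the import sites in other modules.

# Hypothetical illustration (assumed; the quoted line is not shown
# above): a file with a module-level ignore comment before any
# docstring or code is skipped entirely by mypy.
# type: ignore
"""xgboost.spark utilities (illustrative module docstring)."""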

@trivialfis added this to In progress in 1.7 Roadmap via automation Sep 29, 2022
@trivialfis merged commit c91fed0 into dmlc:master Sep 29, 2022
1.7 Roadmap automation moved this from In progress to Done Sep 29, 2022
@wbo4958 deleted the repartition branch April 23, 2024