[pyspark] Use quantile dmatrix. #8284

Merged
merged 21 commits into from Oct 12, 2022

trivialfis
Member

Close #8083.

@trivialfis added this to In progress in 1.7 Roadmap via automation Sep 28, 2022
@trivialfis changed the title from "[WIP] [pyspark] Use quantile dmatrix." to "[pyspark] Use quantile dmatrix." Sep 29, 2022
@trivialfis marked this pull request as ready for review September 29, 2022 14:11
@trivialfis
Member Author

@WeichenXu123 @wbo4958 Please take a look when you are available.

@trivialfis
Member Author

@wbo4958 Could you please take another look?

@wbo4958
Contributor

wbo4958 commented Oct 10, 2022

LGTM

1.7 Roadmap automation moved this from In progress to Reviewer approved Oct 10, 2022
@trivialfis
Member Author

Apologies for the new changes. For some reason, the pytest mark doesn't work with test cases derived from Python's unittest module.


Parameters
----------
iterator :
    Pyspark partition iterator.
feature_cols:
    A sequence of feqture names, used only when rapids plugin is enabled.
Contributor

Typo: feqture -> feature. Also, this parameter can be used even without the rapids plugin.

Member Author

I haven't really tested it, and it will likely trigger an assertion error. We have both DMatrix and QuantileDMatrix to support, so I will leave that to the next release.

@@ -228,6 +260,10 @@ def append_dqm(part: pd.DataFrame, name: str, is_valid: bool) -> None:

def make(values: Dict[str, List[np.ndarray]], kwargs: Dict[str, Any]) -> DMatrix:
    if len(values) == 0:
        get_logger("XGBoostPySpark").warning(
Contributor

Previously, the warning was printed only for empty training data, whereas this PR also prints it for validation data.

Member Author

It's still good to have a warning: the support is "best effort" and is not yet extensively tested. If something goes wrong, users will at least have some clues for debugging.
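
For readers following this thread, here is a minimal sketch of the behavior being discussed. It is not the code in this PR. It uses only the standard library, and the function name `make_or_warn`, its `name` argument, the message text, and the `None` return value are all assumptions made for illustration. The point is only that the warning fires for any empty input, whether it holds training or validation data.

```python
import logging
from typing import Any, Dict, List, Optional

# Logger name taken from the diff above; everything else here is hypothetical.
logger = logging.getLogger("XGBoostPySpark")


def make_or_warn(
    values: Dict[str, List[Any]], name: str
) -> Optional[Dict[str, List[Any]]]:
    """Return `values` unchanged, or warn and return None when it is empty."""
    if len(values) == 0:
        # Emitted for training and validation data alike, so users get a
        # hint in the logs if something goes wrong later.
        logger.warning("Detected an empty partition in the %s data.", name)
        return None  # placeholder; the real code builds a DMatrix instead
    return values
```

With this shape, an empty validation partition produces the same kind of log entry as an empty training partition, which is the change the reviewer pointed out.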

Contributor

@wbo4958 wbo4958 left a comment

Overall, LGTM. Only some minor nits.

@trivialfis merged commit 97a5b08 into dmlc:master Oct 12, 2022
1.7 Roadmap automation moved this from Reviewer approved to Done Oct 12, 2022
@trivialfis deleted the pyspark-qdm branch October 12, 2022 12:38
Development

Successfully merging this pull request may close these issues.

Apply dmatrix iteration iterface in PySpark xgboost and support external memory mode
3 participants