
[pyspark] Cleanup data processing. #8344

Merged
merged 5 commits into master from the pyspark-dmatrix branch on Oct 18, 2022

Conversation

trivialfis
Member

  • Enable additional combinations of ctor parameters.
  • Unify procedures for QuantileDMatrix and DMatrix.

Close #8341
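For context, the two bullets boil down to letting one code path build either matrix type from the collected partitions. A minimal sketch, assuming xgboost >= 1.7 where QuantileDMatrix is public; make_matrix is a hypothetical helper for illustration, not the function touched by this PR:

```python
from typing import List

import pandas as pd
import xgboost as xgb


def make_matrix(parts: List[pd.DataFrame], label: str, use_quantile: bool) -> xgb.DMatrix:
    # Concatenate the pandas partitions gathered on this worker.
    df = pd.concat(parts)
    X, y = df.drop(columns=[label]), df[label]
    if use_quantile:
        # QuantileDMatrix pre-computes histogram cuts for the hist-based tree
        # methods; it accepts the same data/label arguments as DMatrix.
        return xgb.QuantileDMatrix(X, label=y)
    return xgb.DMatrix(X, label=y)
```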

@wbo4958 Could you please share why this is necessary? Is it possible to use the normal features column with gpu_hist, at least in theory (maybe with lower initialization performance)?

if not self.getOrDefault(self.use_gpu):
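The constraint under discussion, as a hedged sketch (hypothetical class and names, not xgboost.spark's real validation code): the multi-column features_cols layout is only accepted when use_gpu is on, while CPU training expects the single assembled features vector column.

```python
class ParamsSketch:
    """Hypothetical stand-in for the estimator's parameter handling."""

    def __init__(self, use_gpu: bool, features_cols: list) -> None:
        self.use_gpu = use_gpu
        self.features_cols = features_cols

    def validate(self) -> None:
        # Mirror of the check quoted above: without use_gpu, multiple raw
        # feature columns are rejected in favour of one vector column.
        if not self.use_gpu and self.features_cols:
            raise ValueError(
                "features_cols is only supported together with use_gpu=True; "
                "use a single vector 'features' column for CPU training"
            )


ParamsSketch(use_gpu=True, features_cols=["f0", "f1"]).validate()  # accepted
ParamsSketch(use_gpu=False, features_cols=[]).validate()           # accepted
```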

@@ -208,10 +208,14 @@ def create_dmatrix_from_partitions( # pylint: disable=too-many-arguments

def append_m(part: pd.DataFrame, name: str, is_valid: bool) -> None:
    nonlocal n_features
    if name in part.columns and part[name].shape[0] > 0:
Contributor

Why get rid of the check "part[name].shape[0] > 0"?

Member Author

The empty data tests are passing.

Member Author

The empty partition bug seems to be fixed in the nightly Spark build. I couldn't reproduce the error.

Contributor

What? Really? I will check it, please hold on.

Contributor

We must add the check part[feature_cols].shape[0] > 0; otherwise stack_series will throw an exception. The latest change is OK.
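To illustrate the failure mode being described, here is a hedged stand-in for stack_series (the real helper lives in xgboost.spark's data handling; this sketch only mimics its stacking behaviour) showing why an empty partition needs the shape[0] > 0 guard:

```python
import numpy as np
import pandas as pd


def stack_series(series: pd.Series) -> np.ndarray:
    # Stack the array-valued entries of a series into one 2-D matrix.
    return np.stack(series.to_numpy())


rows = pd.Series([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(stack_series(rows).shape)  # (2, 2)

empty = pd.Series([], dtype=object)
try:
    stack_series(empty)
except ValueError as err:
    # np.stack refuses an empty sequence, hence the shape[0] > 0 check.
    print(err)  # "need at least one array to stack"
```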

@wbo4958
Contributor

wbo4958 commented Oct 17, 2022

one comment.

@rongou
Contributor

rongou commented Oct 17, 2022

Not sure; maybe the server didn't have time to start? The CI instances may be pretty overloaded. I wonder if we should add a wait in the test SetUp.
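A hedged illustration of the kind of wait being suggested (not the actual fix; PR #8351 mentioned below addressed the failure): poll the server's port in the test setup instead of assuming it is already listening.

```python
import socket
import time


def wait_for_server(host: str, port: int, timeout: float = 30.0) -> None:
    # Retry until the server accepts a TCP connection or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"server at {host}:{port} did not come up within {timeout}s")
```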

@rongou
Contributor

rongou commented Oct 17, 2022

Sent PR #8351 that might help with this failure.

@wbo4958
Contributor

wbo4958 commented Oct 17, 2022

Could you please share why this is necessary? Is it possible to use the normal features column with gpu_hist, at least in theory (maybe with lower initialization performance)?

@trivialfis Databricks once suggested only enabling feature_cols when use_gpu is enabled. I think it's OK to enable it now.

@wbo4958
Contributor

wbo4958 commented Oct 18, 2022

LGTM

@trivialfis
Member Author

trivialfis commented Oct 18, 2022

Databricks once suggested only enabling feature_cols when use_gpu is enabled. I think it's OK to enable it now.

We will have to leave that to the next release. Writing extensive tests for the various combinations of parameters is not trivial.
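As a rough idea of what testing those combinations involves, a hedged pytest sketch (hypothetical test, not part of this PR); SparkXGBClassifier refers to the estimator in xgboost.spark:

```python
import pytest


@pytest.mark.parametrize("use_gpu", [True, False])
@pytest.mark.parametrize("multi_feature_cols", [True, False])
@pytest.mark.parametrize("tree_method", ["hist", "approx"])
def test_param_combinations(use_gpu: bool, multi_feature_cols: bool, tree_method: str) -> None:
    if multi_feature_cols and not use_gpu:
        pytest.skip("features_cols currently requires use_gpu=True")
    # Build a small SparkXGBClassifier with this combination, fit it on a tiny
    # DataFrame, and check that training and prediction succeed (omitted here).
```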

trivialfis merged commit 3901f5d into dmlc:master on Oct 18, 2022
trivialfis deleted the pyspark-dmatrix branch on October 18, 2022 at 06:56
Successfully merging this pull request may close these issues: [pyspark] Document for the use of GPU dependencies.