Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make QuantileDMatrix default to sklearn esitmators. #8220

Merged
merged 11 commits into from Sep 13, 2022

Conversation

trivialfis
Copy link
Member

Stacked on #8215 .

@trivialfis
Copy link
Member Author

trivialfis commented Sep 2, 2022

When the tree method is selected to be hist or gpu_hist, we will use QuantileDMatrix for both training and evaluation in the sklearn interface.

Start working on sparse data.

Fix race.

Remove check.

Merge functions.

Cleanup.

Cleanup.

Start writing tests.

Fix.

comp column.

Python test.

lint.

lint.

Fix.

Cleanup.

Avoid binary search.

Use quantile dmatrix by default in sklearn interface.

dask as well.

Fix max_bin.

Fix empty dmatrix for CPU.

Fix GPU version.

Fix empty DMatrix.

pylint.
@trivialfis trivialfis changed the title [WIP] Make QuantileDMatrix default to sklearn esitmators. Make QuantileDMatrix default to sklearn esitmators. Sep 8, 2022
client: Optional["distributed.Client"],
tree_method: Optional[str],
max_bin: Optional[int],
**kwargs: Any,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that tree_method and max_bin will no longer be in the kwargs dictionary. Can you make sure this will not cause undesirable behavior? For example, is kwargs passed to xgb.train?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Training parameters are obtained via params = self.get_xgb_params(), which you can find in the async def _fit_async function.

return QuantileDMatrix(
**kwargs, ref=ref, nthread=self.n_jobs, max_bin=self.max_bin
)
except TypeError: # `QuantileDMatrix` supports lesser types than DMatrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which cases are not supported out of curiosity? Just wondering if this is going to run into the exception regularly in some cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datatable, CSC, arrow. Also, the DMatrix has a dispatcher to convert unknown types to scipy csr.

}

batch_param_ = BatchParam{d, max_bin};
batch_param_.sparse_thresh = 0.2; // default from TrainParam
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the default, this hardcoded value will be forgotten.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to obtain the value from train param.

* Prepare for improving Windows networking compatibility.

* Include dmlc filesystem indirectly as dmlc/filesystem.h includes windows.h, which
  conflicts with winsock2.h
* Define `NOMINMAX` conditionally.
* Link the winsock library when mysys32 is used.
* Add config file for read the doc.
Start working on sparse data.

Fix race.

Remove check.

Merge functions.

Cleanup.

Cleanup.

Start writing tests.

Fix.

comp column.

Python test.

lint.

lint.

Fix.

Cleanup.

Avoid binary search.

Use quantile dmatrix by default in sklearn interface.

dask as well.

Fix max_bin.

Fix empty dmatrix for CPU.

Fix GPU version.

Fix empty DMatrix.

pylint.
@trivialfis trivialfis merged commit bdf2650 into dmlc:master Sep 13, 2022
@trivialfis trivialfis deleted the dft-quantile-dmatrix branch September 13, 2022 05:52
@trivialfis trivialfis mentioned this pull request Sep 14, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants