Support dataframe data format in native XGBoost. #9828

trivialfis · 2023-11-30T15:11:26Z

Implement a columnar adapter.
Refactor Python pandas handling code to avoid converting into a single numpy array.
Add support in R for transforming columns.
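
As a rough illustration of the columnar adapter idea listed above, the sketch below keeps each column in its own typed buffer and reads values per column instead of first merging everything into one 2-D array. The class `ColumnarAdapter` and its methods are hypothetical names made up for this example; they are not the adapter added to libxgboost in this PR.

```python
from array import array


class ColumnarAdapter:
    """Hypothetical adapter: each column stays in its own typed buffer."""

    def __init__(self, columns):
        lengths = {len(col) for col in columns}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        # Store references only; the column buffers themselves are not copied.
        self._columns = list(columns)

    @property
    def shape(self):
        n_rows = len(self._columns[0]) if self._columns else 0
        return n_rows, len(self._columns)

    def value(self, row, col):
        # Convert on access, one element at a time.
        return float(self._columns[col][row])

    def rows(self):
        n_rows, n_cols = self.shape
        for r in range(n_rows):
            yield [self.value(r, c) for c in range(n_cols)]


ints = array("i", [1, 2, 3])          # 32-bit integer column
floats = array("d", [0.5, 1.5, 2.5])  # double column
adapter = ColumnarAdapter([ints, floats])
first_row = next(adapter.rows())
```

The point of the design is that columns of different types never need a common dtype, which is what forces a copy when a data frame is flattened into a single array.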

This is not yet as efficient as we would like, since we can't handle a separate missing value indicator for each column. I will leave that as future work.

Categorical data should now work with R factors, although it is still unclear how we should advise users to keep the encoding consistent.

related: #9810 .

trivialfis · 2023-12-07T20:08:16Z

@david-cortes Could you please help take a look into the R interface?

R-package/R/xgb.DMatrix.R

david-cortes

Left a few minor comments. I guess I could add int64 support later.

R-package/src/xgboost_R.cc

python-package/xgboost/data.py

R-package/R/xgb.DMatrix.R

trivialfis · 2023-12-08T12:01:45Z

For some reason, even when just creating a DMatrix (the simplest possible task), the np array created from the arrow extension is incredibly inefficient. I have checked that the resulting DMatrices have the same hash. In addition, they run through the exact same code path inside libxgboost.

trivialfis · 2023-12-08T12:29:54Z

Got it, it's caused by the meta info rather than the covariates.

david-cortes

Left a few more comments. I'm curious in particular about how these types interplay with feature types.

R-package/R/xgb.DMatrix.R

david-cortes · 2023-12-08T15:12:22Z

R-package/R/xgb.DMatrix.R

+      }
+    }))
+    ## as.data.frame somehow converts integer/logical into real.
+    data <- as.data.frame(sapply(data, function(x) {


Does the result here need to be a data.frame? Maybe you could just call lapply, as I think it would avoid one list copy. Note that, per the other comments I'm leaving here, if integer columns have any missing values, they might need to be coerced through e.g. as.numeric.

The comment above says that it converts integers to real. Isn't the idea with the types above to handle them in their native type?

I think as.data.frame does the coercing for me.

The comment above says that it converts integers to real. Isn't the idea with the types above to handle them in their native type?

Ideally, we would handle each type of column independently without any coercion, and hence without any data copying. However, at the moment only cuDF input can be consumed this way, because of how missing values are handled. R uses sentinel values to indicate missing/NA, while XGBoost can't have more than one missing value indicator at the moment. As a result, for a DF containing a float column and an integer column with NAs, XGBoost cannot tell which value it should treat as missing. Is it NaN or NA(int)?

cuDF uses the arrow IPC format as its memory layout and exposes it as part of the API. Missing values are represented by a bitmask, so we can handle all the columns without any transformation (except for categorical encoding).
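
To make the two missing-value representations concrete, here is a minimal sketch. It assumes a stand-in integer sentinel (`NA_INT`, chosen for this example only) and an Arrow-style validity bitmask in which bit `i` of the mask is set when row `i` is present. The helper names are hypothetical.

```python
import math

NA_INT = -2**31  # stand-in sentinel for an integer NA (assumption for this sketch)


def sentinel_to_float(column, sentinel=NA_INT):
    """Copy an integer column into floats, mapping the sentinel to NaN."""
    return [math.nan if v == sentinel else float(v) for v in column]


def is_valid(bitmask, i):
    """Validity check: bit i (least-significant bit first) set means present."""
    return (bitmask[i // 8] >> (i % 8)) & 1 == 1


# Sentinel representation: the integer column must be copied and converted so
# that it shares NaN as the single missing marker with the float column.
int_col = [1, NA_INT, 3]
as_float = sentinel_to_float(int_col)
missing_rows = [i for i, v in enumerate(as_float) if math.isnan(v)]

# Bitmask representation: the values stay untouched in their native type and
# the mask alone says which rows are missing (rows 0 and 2 are valid here).
values = [1, 0, 3]
mask = bytes([0b00000101])
present = [values[i] for i in range(len(values)) if is_valid(mask, i)]
```

With a bitmask, every column can carry its own missing-value information regardless of type, which is why no per-column transformation is needed in that case.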

david-cortes · 2023-12-08T15:15:19Z

R-package/R/xgb.DMatrix.R

+      } else if (is.integer(x)) {
+        "int"
+      } else if (is.logical(x)) {
+        "i"


Question there: I see this is also being used in the pandas adapter.

In R, boolean (logical) types are represented as C int, where possible values are FALSE (zero), NA (-INT_MAX), and TRUE (everything else), while python's bool type has only True and False.

I see you mention in a comment later that these get converted to numeric type, but the C++ code still checks for integer/logical-typed columns.

What would happen with these missing values encoded as -INT_MAX if the columns are supplied in their original types?

As the comments in the C++ code suggest, that C++ handling code is not used; it is more or less a reminder that we should try to avoid data transformation in R. I think the previous reply might help with the -INT_MAX part.

I can remove the code if it's hindering readability

I can remove the code if it's hindering readability

I actually was thinking something along the lines that using sapply instead of lapply + unlist would avoid one list copy operation. Haven't checked this hypothesis though. I don't think the code is unreadable or hard to understand.

Thank you for the suggestion, I removed the unlist as suggested in #9828 (comment) .

R-package/src/xgboost_R.cc

david-cortes · 2023-12-08T15:21:12Z

include/xgboost/c_api.h

+ * @brief Create a DMatrix from columnar data. (table)
+ *
+ * @param data   See @ref XGBoosterPredictFromColumnar for details.
+ * @param config See @ref XGDMatrixCreateFromDense for details.


Something I'm wondering here: if this config already conveys the information about whether a column has integer type, is it actually needed to make a distinction between q and int in feature_types?

The columns don't convey the information accurately since we need to do some transformations before passing them into XGBoost. For instance, if a column is integer with missing values, we have to use float with NaN as an approximation.

trivialfis · 2023-12-08T18:16:50Z

I'm curious in particular about how these types interplay with feature types.

Other than the c type (for categorical), the feature types have no practical effect on how the tree is built and are only there for nicer plotting.
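
For illustration, a small sketch of how feature type codes could be derived per column. The codes `"int"` and `"i"` appear in the diff above; `"float"` and `"c"` follow XGBoost's feature type names. The function `feature_type_code` and its use of `None` for missing entries are assumptions made for this example, not the logic used in the R or Python packages.

```python
def feature_type_code(values):
    """Return a feature type code for one column (hypothetical helper).

    None entries are treated as missing and ignored when inferring the type.
    """
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so it has to be checked first.
    if present and all(isinstance(v, bool) for v in present):
        return "i"      # indicator / boolean
    if present and all(
        isinstance(v, int) and not isinstance(v, bool) for v in present
    ):
        return "int"
    if all(isinstance(v, (int, float)) for v in present):
        return "float"
    if all(isinstance(v, str) for v in present):
        return "c"      # categorical: the only code that affects tree building
    raise TypeError("unsupported column contents")


columns = {
    "age": [30, 41],
    "income": [10.5, None],
    "member": [True, False],
    "city": ["a", "b"],
}
codes = {name: feature_type_code(col) for name, col in columns.items()}
```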

trivialfis · 2023-12-08T19:25:45Z

cc @david-cortes DF input for label and base margin is not yet supported. It would be useful for multi-output/multi-label problems, but we can work on that later.

trivialfis · 2023-12-11T00:42:02Z

Added the cnames configuration back for the matrix.

trivialfis · 2023-12-11T00:46:52Z

@david-cortes Could you please help take another look?

david-cortes · 2023-12-11T22:57:12Z

R-package/R/xgb.DMatrix.R

@@ -58,19 +61,28 @@ xgb.DMatrix <- function(
  qid = NULL,
  label_lower_bound = NULL,
  label_upper_bound = NULL,
-  feature_weights = NULL
+  feature_weights = NULL,
+  enable_categorical = FALSE


Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?

If so, how about renaming it to 'autodetect_categorical' or something along those lines? (both in the R and Python interfaces) Would also be ideal to describe a bit more of it in the docs (e.g. that it's only for data frames).

Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?

Correct. It's more or less a guard to prevent surprises: XGBoost didn't accept categorical data before, so suddenly accepting it might silently cause issues.

I don't have a strong preference on the naming. We have an introductory document for categorical data in the tutorials; feel free to add additional explanation there.
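
The guard behaviour described here can be sketched as follows. The function `check_categorical` and the `"categorical"`/`"numeric"` kind labels are hypothetical; the sketch only shows the idea that categorical columns are rejected unless the caller opts in, and that the flag plays no role for inputs without categorical columns.

```python
def check_categorical(column_kinds, enable_categorical=False):
    """Return the names of categorical columns, refusing them unless opted in.

    column_kinds maps a column name to a kind label (hypothetical labels).
    """
    categorical = [name for name, kind in column_kinds.items()
                   if kind == "categorical"]
    if categorical and not enable_categorical:
        # Fail loudly instead of silently encoding data the user may not
        # expect to be accepted.
        raise ValueError(
            f"categorical columns {categorical} found; "
            "set enable_categorical=True to accept them"
        )
    return categorical


kinds = {"age": "numeric", "city": "categorical"}
accepted = check_categorical(kinds, enable_categorical=True)
no_categorical = check_categorical({"age": "numeric"})  # flag is irrelevant here
```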

R-package/R/xgb.DMatrix.R

david-cortes · 2023-12-11T23:00:43Z

LGTM. Left two small comments.

trivialfis force-pushed the data-columnar branch from 4008935 to fa48de4 Compare December 6, 2023 20:31

RAMitchell approved these changes Dec 7, 2023

david-cortes reviewed Dec 7, 2023

david-cortes reviewed Dec 8, 2023

trivialfis mentioned this pull request Dec 11, 2023

[R] Move all DMatrix fields to function arguments #9862

Merged

Support dataframe data format in native XGBoost.

7bf3ccf

- Implement a columnar adapter.
- Refactor Python pandas handling code to avoid converting into numpy.
- Add support in R for transforming columns.

trivialfis force-pushed the data-columnar branch from 50eda8e to 7bf3ccf Compare December 11, 2023 00:24

david-cortes reviewed Dec 11, 2023

Remove cnames.

3684e6c

trivialfis merged commit faf0f2d into dmlc:master Dec 12, 2023
30 checks passed

trivialfis deleted the data-columnar branch December 12, 2023 01:56

This was referenced Dec 12, 2023

Categorical data support (part 2) #7899

Open

Roadmap for new R interface #9810

Open

This was referenced Dec 12, 2023

Clarify effect of enable_categorical #9877

Merged

Correct name of function for setting data frames in proxy dmatrix #9905

Merged
