[pyspark] Cleanup data processing. #8088
Conversation
@wbo4958 Please upstream your work on cuDF support when you are available.
It seems this PR changes the original way of building the DMatrix. Can you merge it first? Then I will add the whole GPU pipeline support. @trivialfis
Let me do this.
Thank you for the quick reviews. ;-) This branch is still a work in progress; I'm sharing it as a sketch to discuss how we might proceed with the data processing procedures alongside GPU support. I will update the branch and remove the WIP tag once it's ready.
@trivialfis |
Apologies for the delay. I was trying to get #8050 ready and mitigate the memory usage surge with it. Will focus on this PR now. |
python-package/xgboost/spark/data.py
```python
else:
    train_data[name].append(array)

cache_partitions(iterator, append)
```
DeviceQuantileDMatrix supports an iterator as input, so we don't need to load the whole partition's data into memory, right?
Writing it down to disk would be quite slow and we need to iterate through it twice to finish construction. If memory usage is prioritized over efficiency, one can choose external memory. (For GPU we still concatenate the data internally to avoid disk IO, but users can use sampling + external memory to reduce size for GPU hist).
The current code (for GPU) is:
1. load the whole partition's data into in-memory arrays
2. construct an iterator over the in-memory arrays from step (1)
3. construct a DMatrix from the iterator of step (2)

What I propose is not to write data to disk, but to merge steps (1) and (2): construct the iterator directly from the Spark Python UDF data iterator. This avoids loading the whole partition into memory in step (1).
We can do this improvement in a follow-up PR.
@WeichenXu123 I think the data is still loaded into memory. We need to iterate through the data twice (notice the reset function). Unless the spark iterator can be reset and start again, we need to cache the data somewhere.
Ah, got it. The Spark UDF data iterator does not support reset. That's bad.
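Conceptually, the caching the PR does makes a one-shot iterator resettable by recording batches on the first pass. Here is a minimal standalone sketch of that idea (the class name and method names are hypothetical, not the PR's `cache_partitions` code):

```python
import numpy as np

class ResettableIter:
    """Wrap a one-shot iterator, caching batches so it can be replayed."""

    def __init__(self, one_shot_iter):
        self._source = one_shot_iter
        self._cache = []        # batches already pulled from the source
        self._pos = 0
        self._exhausted = False

    def next_batch(self):
        if self._pos < len(self._cache):
            batch = self._cache[self._pos]      # replay from the cache
        elif not self._exhausted:
            try:
                batch = next(self._source)       # pull a fresh batch
            except StopIteration:
                self._exhausted = True
                return None
            self._cache.append(batch)
        else:
            return None
        self._pos += 1
        return batch

    def reset(self):
        # XGBoost iterates over the data more than once during
        # construction; a raw Spark UDF iterator cannot do this.
        self._pos = 0

src = iter([np.arange(3), np.arange(3, 6)])
it = ResettableIter(src)
first_pass = [it.next_batch() for _ in range(3)]   # last entry is None
it.reset()
second_pass = [it.next_batch() for _ in range(3)]  # replayed from cache
```

The trade-off is exactly the one discussed above: all batches end up held in memory, just lazily instead of eagerly.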
Previously I thought DMatrix should accept a one-shot iterator: it exhausts the iterator, caches all data in DMatrix format in memory, and then XGBoost training can use the DMatrix data in multiple passes. Why isn't the DMatrix iteration API designed this way?
@WeichenXu123 Apologies for the confusion here. Here's a bit of clarification on the differences between the DMatrix constructors:
- Normal `DMatrix`: `DMatrix(np.array(...))`. It accepts one single batch of data and constructs an internal CSR representation that can be used by all algorithms. It has two issues with distributed training: firstly, it needs to concatenate all partitions, which doubles the memory usage; secondly, it's a CSR, which can triple the memory usage when the input is dense.
- `QuantileDMatrix`: This is designed to be as "inplace" as possible and can only be used by GPU hist. It omits the CSR representation and constructs the internal histogram index directly from the external data (including quantilization). The iterator concept you see there is designed for distributed systems that handle data in the form of partitions; the sole purpose of accepting an iterator instead of a whole blob is to avoid the `concatenate` function on the input data, which doubles the memory (`concat_or_none` in this PR). As a result, we cannot have an internal representation (CSR) and need to iterate through the input partitions multiple times before we can start training, since we are trying not to make any copies.
- External memory: `DMatrix(iterator)`. Yes, using CPU `hist`/`approx`, DMatrix caches the batches on disk and iterates through them. But it's quite slow, since we need to iterate through the data 3-4 times for each layer of the tree; you can estimate the time usage from the throughput of your hard disk.
Got it.
So QuantileDMatrix does not support sparse data input?
I would like to assign all data-iterator-related work to you because you have a deeper understanding of this. :)
> So QuantileDMatrix does not support sparse data input?

It will support CSR once the CPU implementation is merged.

> I would like to assign all data-iterator-related work to you because you have a deeper understanding of this. :)

Got it.
@wbo4958 @WeichenXu123 Could you please take another look?
Overall good. @trivialfis Have you tested whether it causes a performance regression?
```python
)
return training_dmatrix

is_dmatrix = feature_cols is None
```
Why use DeviceQuantileDMatrix only when feature_cols is not None?
Whether to use DeviceQuantileDMatrix should be controlled by the use_gpu param, if I understand correctly?
So far `QuantileDMatrix` is GPU-only until #8050 is merged. `feature_cols` will also be a GPU-only thing for the near future. (Actually, I'm not entirely sure about the `feature_cols` parameter that's available in the JVM packages; I don't know how it works with the Spark ML pipeline. I will leave these questions to @wbo4958.)
We don't need to support `feature_cols` (multiple feature columns) in the XGBoost PySpark estimator. The transformers in a Spark pipeline can assemble multiple feature columns into one vector-type feature column.
I think @wbo4958 wanted to avoid the vector assembler, since XGBoost needs to undo it (the `stack_series`). He wanted the input data to look like other Python libraries, with one column per feature. I think it's a reasonable optimization looking solely at XGBoost, but it might not play well with Spark ML pipelines.
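Undoing the assembled vector column boils down to stacking a series of per-row arrays back into a 2-D matrix. The following is a standalone pandas/numpy sketch in the spirit of the `stack_series` helper mentioned above, not the PR's actual code:

```python
import numpy as np
import pandas as pd

def stack_series(series: pd.Series) -> np.ndarray:
    """Stack a Series whose elements are 1-D arrays into a 2-D matrix."""
    return np.stack(series.values)

# Each row holds one assembled feature vector, as produced by a
# VectorAssembler-style transformer upstream.
s = pd.Series([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
X = stack_series(s)
```

The per-row unpacking is the overhead @wbo4958's one-column-per-feature layout would avoid.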
Got it, I will remove it. cc @wbo4958 .
@WeichenXu123, yeah, it indeed does not play well with some Spark ML pipelines, but we can give users a warning about that. Some users may not use Spark ML pipelines or meta-estimators at all; they may just want to train an XGBoost model. I think we should provide this.
@wbo4958
OK. But I have doubts about "1 column per feature" in a Spark dataframe: passing many columns to the Python UDF might not perform well when there are many features. We'd better benchmark it.
@WeichenXu123, yes, we will do some benchmark testing. BTW, may I ask why passing many columns to a Python UDF might not perform well when there are many features? Does the penalty come from the final data size of the ArrowRecordBatch?
> is the penalty happening on the final data size of ArrowRecordBatch?

Yes, that's my concern. But I am not an expert on this; maybe I am wrong. Let's benchmark.
LGTM
LGTM
```python
else:
    cache_partitions(iterator, append_dqm)
    it = PartIter(train_data, True)
    dtrain = DeviceQuantileDMatrix(it, **kwargs)
```
It seems DeviceQuantileDMatrix needs the extra "max_bin" parameter, while kwargs does not contain it. But it's OK to file a follow-up for this.
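One hypothetical way to close that gap (a sketch, not the PR's code) is to split the estimator's parameters so that `max_bin` and other DMatrix-level settings are routed to the DMatrix constructor rather than the booster; the exact set of names in `DMATRIX_PARAMS` below is an assumption for illustration.

```python
# Hypothetical helper: route DMatrix-level settings (e.g. max_bin) to
# the (Device)QuantileDMatrix constructor, keep the rest for the booster.
DMATRIX_PARAMS = {"max_bin", "missing", "nthread"}

def split_params(params: dict):
    dmatrix_kwargs = {k: v for k, v in params.items() if k in DMATRIX_PARAMS}
    booster_kwargs = {k: v for k, v in params.items() if k not in DMATRIX_PARAMS}
    return dmatrix_kwargs, booster_kwargs

dm, bst = split_params({"max_bin": 256, "eta": 0.3})
```

With a split like this, `DeviceQuantileDMatrix(it, **dm)` would receive `max_bin` instead of it being silently dropped.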
Let's mark a TODO there so we don't forget it.
@wbo4958 @trivialfis Does `DMatrix` need the `max_bin` param?
And could you help me check whether DMatrix needs other param settings? I might have missed some params in my PR.
It doesn't. However, #8087 needs a fix.
- Use numpy stack for handling lists of arrays.
- Reuse the concat function from dask.
- Prepare for `QuantileDMatrix`.
- Remove unused code.
- Use an iterator for prediction to avoid initializing the xgboost model
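The concat helper reused from the dask module can be sketched roughly like this for numpy inputs (a simplified illustration; the real `concat_or_none` also dispatches on other input types such as pandas and scipy):

```python
import numpy as np
from typing import Optional, Sequence

def concat_or_none(seq: Optional[Sequence[np.ndarray]]) -> Optional[np.ndarray]:
    """Concatenate a sequence of arrays, passing None (e.g. an absent
    weight or label column) straight through instead of raising."""
    if seq is None or len(seq) == 0:
        return None
    return np.concatenate(seq)

merged = concat_or_none([np.ones(2), np.zeros(2)])
no_labels = concat_or_none(None)
```

The None passthrough is what lets the same code path handle optional columns without special-casing each one.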
@trivialfis Have you tested whether it causes a performance regression? The code has big changes.
There's no performance change. Actually, I was hoping numpy could deliver something faster than a Python list; it seems not much can be done there. I implemented a very primitive PySpark interface for XGBoost for self-education purposes before. I handled the
I guess it might not increase performance, but it risks degrading performance.
Sure. Actually, I had the follow-up PR up.
I thought Spark might handle a large number of samples faster than Python. Python lists are usually really slow compared to optimized procedures.
This PR does some cleanup of the data processing procedures for the newly gained PySpark interface.
To-dos:
- I'm not entirely sure how to work with sparse data, since it isn't supported yet in the current codebase.
@WeichenXu123