
Troubleshooting XGBoost model performance #128

Closed
mlsquareup opened this issue Jan 27, 2023 · 17 comments
@mlsquareup

Hi,

We're attempting to convert a SparkML Pipeline of [SQLTransformer (simple string replacement for empty/null strings), StringIndexer, OneHotEncoder, VectorAssembler, Sparse2Dense, xgboost classifier] into just [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler (optional), Sparse2Dense (optional)].

The output .pmml file for [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler], when loaded via pypmml and called with .predict(), outputs only a few of the derived string columns, and none of the many numeric features. Also, the derived string columns do not come out as one-hot encodings, or even as indexed values. Is there a way to convert a pipeline of [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler (optional)] so that it outputs exactly what we would get from the original SparkML pipeline, i.e., all numeric features, with the string features string-indexed and then one-hot encoded?

Context:
We noticed severe performance issues for PMML models that had 10k+ features. The PMML model is a converted [SQLTransformer (simple string replacement for empty/null strings), StringIndexer, OneHotEncoder, VectorAssembler, Sparse2Dense, xgboost classifier] Spark pipeline. We wanted to determine the cause of the poor performance, so we separated out the xgboost classifier and are performance-testing just the preprocessing portion of the pipeline.

Does it make sense, with PMML, to do only the preprocessing portion? Or should we do everything in PMML, or nothing at all?

@vruusmann
Member

vruusmann commented Jan 28, 2023

The output .pmml file for [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler] outputs only a few of the derived string columns, not any of the many numeric features.

PMML is a high-level ML workflow representation, and is therefore able to operate with string values as-is. In contrast, an Apache Spark ML pipeline is a much lower-level representation, and needs to transform string values into numeric values first (here: mapping category levels to category indices).

From the PMML perspective, the [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler] pipeline fragment is effectively [SQLTransformer]; the remaining three steps ([StringIndexer, OneHotEncoder, VectorAssembler]) perform Apache Spark ML-internal bookkeeping, and can be safely omitted, without any loss of information or functionality.

Is there a way to convert a pipeline of [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler (optional)] that outputs the exact output we would get from the original SparkML pipeline?

The JPMML-SparkML library provides an API for making redundant features visible by converting them to continuous+double form:

import java.util.List;
import java.util.stream.Collectors;

import org.jpmml.converter.Feature;
import org.jpmml.converter.Schema;

Schema schema = ...;

List<? extends Feature> features = schema.getFeatures();

// THIS: convert every feature to its continuous (numeric) representation
List<Feature> allNumericFeatures = features.stream()
  .map(feature -> feature.toContinuous())
  .collect(Collectors.toList());

Schema allNumericSchema = new Schema(schema.getEncoder(), schema.getLabel(), allNumericFeatures);

The org.jpmml.sparkml.PMMLBuilder class does not have a public method for performing such conversion. You could sub-class PMMLBuilder and add it yourself for testing purposes.

Context: We noticed severe performance issues for PMML models that had 10k+ features...

Do you have 10k features entering the pipeline, or exiting it?

Consider ZIP code as a categorical string feature. A PMML representation would accept one string feature, and return one string feature. An Apache Spark ML pipeline would accept one string feature and return 50k+ binary indicator features (one per ZIP code). Clearly, the latter is not sustainable.
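A toy sketch of that blow-up (plain Python, illustrative values only): one-hot encoding turns a single string column into one indicator column per distinct value, so the column count scales with the cardinality of the feature.

```python
# Toy one-hot encoding sketch: one indicator column per distinct ZIP code
zips = ["10001", "94103", "60601", "10001"]
levels = sorted(set(zips))  # 3 distinct ZIP codes -> 3 indicator columns

# Each input row becomes a binary indicator vector over all levels
encoded = [[1 if z == level else 0 for level in levels] for z in zips]

print(len(levels))   # number of indicator columns equals number of distinct values
print(encoded[0])
```

With real-world ZIP codes, `levels` would have tens of thousands of entries, which is exactly the explosion described above.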

The PMML model is a converted [SQLTransformer (simple string replacement for empty/null strings), StringIndexer, OneHotEncoder, VectorAssembler, Sparse2Dense, xgboost classifier] Spark pipeline.

My educated guess is that the performance loss happens because of the Sparse2Dense transformation step - it expands the number of columns in the dataset many times over.

Do your categorical features contain any missing values or not? If you have both kinds of categorical features, you may split their processing between two sub-pipelines - the first one (with missing values) performs Sparse2Dense, but the second one (without missing values) does not.

Does it make sense, with PMML, to do only the preprocessing portion? Or should we do everything in PMML, or nothing at all?

If you want to fix this ML workflow once and for all, then you should simply upgrade the XGBoost library to some 1.5.X version, so that the one-hot-encoding of categorical features happens "natively" inside the XGBoost library. That is, it will be possible to pass the output of the StringIndexer step right into the xgboost classifier step; there is absolutely no need for the intermediate [OneHotEncoder, VectorAssembler, Sparse2Dense] sequence of steps.

Better yet, upgrade to XGBoost version 1.6.X, and you shall get native multi-category splits on categorical features (as opposed to the primitive one-category-against-all-other-categories splits that you get with OHE).

@vruusmann

Leaving this issue open as a reminder to implement some kind of "transform smart PMML-level feature representation into dumb Apache Spark ML-level feature representation" functionality, as demonstrated in the above code snippet.

This "dumbing down" requirement applies to other ML frameworks as well (e.g. Scikit-Learn). Therefore, it is likely to land in the core JPMML-Converter library.

@vruusmann

@mlsquareup What's your target Apache Spark ML version? Also, what's your current XGBoost version, have you considered upgrading it to 1.5+, 1.6+?

Maybe I can do a small tutorial about this topic...

It's year 2023, and nobody should be doing "external OHE plus legacy XGBoost" anymore. It's "native XGBoost" now!

@eugeneyarovoi

Hi, I collaborate with @mlsquareup. Wanted to respond to a few points here.

Do you have 10k features entering the pipeline, or exiting it?

Entering it. There are a lot of features. We haven't checked exactly how many are exiting, but it's 10k plus a little more. Only a small fraction of the 10k features are string features that will be one-hot encoded.

An Apache Spark ML pipeline would accept one string feature and return 50k+ binary indicator features (one per ZIP code). Clearly, the latter is not sustainable.

We are completely aware of the limitations of one-hot encoding. We don't expect it to work well, or performantly, unless the string column has very few distinct values. This is the only case for which we are using such encodings. We wouldn't attempt to one-hot encode a zip code.

My educated guess is that the performance loss happens because of the Sparse2Dense transformation step

Probably not, because we have a few different datasets, and not all of them have missing features. In any case, what I wanted to check was: if there are 10k features going in, and the XGBoost booster has a few hundred trees that are relatively deep, does it sound reasonable that doing predict() via PMML could take ~600 ms? For reference, a solution where we emulated the logic of the one-hot encoding directly in Python code, and then invoked the XGBoost booster directly, took 10-15 ms. I'm not sure what level of performance we should expect from PMML.

If you want to fix this ML workflow once and for all, then you should simply upgrade the XGBoost library to some 1.5.X version, so that the one-hot-encoding of categorical features happens "natively" inside the XGBoost library.

I believe to do this, we would at least still need the StringIndexer. The categorical values accepted by XGBoost are integers in the range [0, number_of_categories).
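As a toy sketch of what that indexing step computes (Spark's StringIndexer defaults to descending frequency order; the values here are illustrative):

```python
from collections import Counter

values = ["Blue", "Red", "Blue", "Green", "Blue", "Red"]

# Emulate StringIndexer's default "frequencyDesc" ordering:
# the most frequent level gets index 0
freq_order = [level for level, _ in Counter(values).most_common()]
index = {level: i for i, level in enumerate(freq_order)}

print(index)  # all codes fall in [0, number_of_categories)
```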

Better yet, upgrade to XGBoost version 1.6.X, and you shall get native multi-category splits on categorical features

We really want to do this, but the newer versions of XGBoost appear to have performance issues for workflows with high numbers of features, which is why we haven't upgraded to 1.5+. This has been to some extent acknowledged by the XGBoost maintainers, see dmlc/xgboost#7214 (comment) and the mention of the Epsilon dataset there.

We're still investigating how we can get around the problem and upgrade, but until we find a solution that doesn't massively regress our training time, we can't upgrade.

What's your current XGBoost version?

We're on 1.0. I know, I know, that must seem preposterous. It is 8x faster than 1.5 on our high-feature-count datasets! We're looking into what settings could be tweaked to restore some of the performance and upgrade.

What's your target Apache Spark ML version?

We're on Spark 3.2.

@vruusmann

Apparently, upgrading XGBoost won't help any JVM-based applications currently, because the categorical features support won't be around till XGBoost 2.0: dmlc/xgboost#7802

if there are 10k features going in, and the XGBoost booster has a few hundred trees that are relatively deep, does it sound reasonable that doing predict() via PMML could take ~600ms?

Your pipeline doesn't change the total number of features much. But does it transform them? For example, do numeric features undergo scaling, normalization etc.?

The JPMML-SparkML library performs feature usage analysis, and only keeps those features that are actually used by XGBoost split conditions. I'm wondering about the used/unused feature ratio in your pipeline - how many of those 10k features make their way into the PMML document.

You can check this manually - open the PMML file in a text editor, and count the number of MiningField@usageType="active" elements under the /PMML/MiningModel/MiningSchema element. Is it 1k, 2k, 5k, or still close to 10k?
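The same count can be scripted; a rough regex-based sketch (note: this counts active MiningField elements anywhere in the document, not only under the top-level MiningSchema, so treat it as an upper bound):

```python
import re

def count_active_fields(pmml_text):
    # Count MiningField elements whose usageType attribute is "active"
    return len(re.findall(r'<MiningField\b[^>]*usageType="active"', pmml_text))

# Small illustrative fragment
sample = '''<MiningSchema>
  <MiningField name="y" usageType="target"/>
  <MiningField name="x1" usageType="active"/>
  <MiningField name="x2" usageType="active"/>
</MiningSchema>'''

print(count_active_fields(sample))
```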

In any case, a 600 millis prediction time is rather unexpected. Is this the "first version" of a model (i.e. trying to get some new idea working), or did the performance of an existing model regress that much?

Also, what's your PMML engine? Is it PyPMML (as mentioned in the opening comment) or is it JPMML-Evaluator-Python? In single prediction mode (i.e. one data record at a time), I'd expect JPMML-Evaluator to be able to match XGBoost-via-Python performance.

@vruusmann

@eugeneyarovoi Could you please contact me via e-mail so that I could ask more technical details?

@eugeneyarovoi

eugeneyarovoi commented Jan 30, 2023

do numeric features undergo scaling, normalization etc.?

No, numeric features are not transformed in any way, since XGBoost models don't need feature normalization or imputation.

In any case, a 600 millis prediction time is rather unexpected. Is this the "first version" of a model (ie. trying to get some new idea working), or did the performance of an existing model regress that much?

It is the first version. The same model was implemented in Python as follows:

  • After running SparkML, open the resulting files and export the string indexing order to a file. So we have some file that says e.g. {"Red": 0, "Blue": 1, "Green": 2}.
  • When the Python container is initialized, it reads this file and stores that mapping in memory. It also loads the native XGBoost booster in memory.
  • At inference time, when the string feature is supplied, we apply Python code which is equivalent to the SQLTransformer we had in SparkML, which only does very basic things like replace empty strings with a marker value. Then, we use the OHE mapping mentioned above to append the appropriate one-hot encoding to the feature vector.
  • Invoke XGBoost with the full feature vector

This implementation ran in about 15ms vs. 600ms for PMML.
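The encoding portion of that implementation can be sketched roughly as follows (the mapping values and the empty-string marker are illustrative; the real code then appends this vector to the numeric features and calls the native booster's predict):

```python
# Illustrative index mapping, as exported from the fitted StringIndexer
index = {"Red": 0, "Blue": 1, "Green": 2}

def one_hot(value, index, empty_marker="__EMPTY__"):
    # Emulates the SQLTransformer step: replace empty strings with a marker value
    if value == "":
        value = empty_marker
    # Emulates StringIndexer + OneHotEncoder: a single 1.0 at the indexed position;
    # unknown levels (including the marker, if unmapped) yield an all-zeros vector
    vec = [0.0] * len(index)
    if value in index:
        vec[index[value]] = 1.0
    return vec

print(one_hot("Blue", index))
```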

Our PMML engine was PyPMML. I believe my coworker who posted earlier tried JPMML-Evaluator, and it did improve performance, but not nearly to the point of the 15ms solution. If this is of interest, I can get more details. We are indeed testing in single-prediction mode.

Just to clarify, "single-prediction mode" isn't some setting that has to be manually enabled, correct? It is just what happens when you test with one row of data? Our tests test average inference time for 1 row of data, since this emulates the online inference setting.

If you like, maybe we can try an "ablation study" where we remove the SQLTransformer and Sparse2Dense steps, to confirm those have nothing to do with the PMML performance. We may be forced to keep SQLTransformer because Spark errors on empty strings in some cases (I don't recall the details). Removing Sparse2Dense will result in a working model that incorrectly treats 0 as missing, but that's OK if it's just for this test.

@vruusmann

If this is of interest, I can get more details. We are indeed testing in single-prediction mode.

I'm very much interested in analyzing this misbehaviour in more detail.

I'd need a PMML file, and some test data, to run everything locally. Let's co-ordinate via e-mail.

Just to clarify, "single-prediction mode" isn't some setting that has to be manually enabled, correct?

Yes, it's the default - one java.util.Map goes in, and another java.util.Map comes out.

In principle, this Map-oriented API may be sub-optimal here, because instantiating a java.util.HashMap and populating it with 10k key-value pairs also takes time (especially if the map needs to be re-hashed several times).

In the end, I'd love to run the example in transpiled mode using the JPMML-Transpiler library.

@eugeneyarovoi

We can't provide the original PMML file: it was trained on confidential data and may leak confidential information.

However, I can try to run it through a transformation where we replace all data with random values (keeping string column cardinality the same), retrain the model, and check that it still exhibits the performance characteristics we've mentioned.

Have you tested PMML with a very high number of input features (10K+) before? Based on our experiences so far, we tend to think PMML pipelines containing OHE + XGBoost with very large numbers of features generally have these performance characteristics.

@vruusmann

We can't provide the original PMML file, as it may leak confidential info as it was trained on confidential data.

A PMML file plus 1k .. 10k data records would be sufficient (covers most XGBoost branches, plus warms up the JVM sufficiently).

You can obfuscate a PMML file by hashing field names.

There's even a small command-line application for that (first hash the PMML file, then CSV header cells):
https://github.com/jpmml/jpmml-model/blob/1.6.4/pmml-model-example/src/main/java/org/jpmml/model/example/ObfuscationExample.java
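As a sketch of the idea (MD5 is an assumption here; the hash function must match whatever the ObfuscationExample application actually applies to field names, so treat this as illustrative rather than a drop-in replacement):

```python
import hashlib

def hash_name(name):
    # Assumption: whatever hash the PMML obfuscator applies to field names
    # must also be applied to the CSV header cells, so the names keep matching.
    return hashlib.md5(name.encode("utf-8")).hexdigest()

def obfuscate_header(header_row):
    # Replace every header cell with its hashed counterpart
    return [hash_name(cell) for cell in header_row]

print(obfuscate_header(["Sepal_Length", "Species"]))
```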

Have you tested PMML with a very high number of input features (10K+) before?

Nothing very systematic.

I'm mostly testing with smaller datasets, as my main goal is to ensure correctness and reproducibility of predictions.

@vruusmann vruusmann changed the title Converting SparkML Pipeline to PMML without attached Model Troubleshooting XGBoost model performance Jan 31, 2023
@vruusmann

XGBoost differs from regular Apache Spark ML models, because it converts numeric features from double (64-bit) to float (32-bit).

In XGBoost-native environment, this conversion is very cheap - a primitive value cast. In PMML environment, this conversion may be encoded differently - via field data type declarations (good), or an explicit cast using a special-purpose DerivedField element.

Perhaps the model performance is bad because the PMML document contains instructions for casting 10k numeric features using 10k DerivedField elements? Yep, that could easily cost several hundred millis.
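The lossiness of that cast is easy to demonstrate with a pure-Python sketch (a struct round-trip through IEEE 754 single precision, standing in for XGBoost's internal float handling); this is why the cast must be represented somewhere in the PMML document:

```python
import struct

def to_float32(x):
    # Round-trip a Python float (double precision) through
    # IEEE 754 single precision, emulating a double-to-float cast
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_float32(0.1) == 0.1)    # the cast is lossy for most decimal values
print(to_float32(0.25) == 0.25)  # exactly representable values survive intact
```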

@eugeneyarovoi Can you help me identify how casts to the float data type are encoded in your PMML file?

Option one:

  • All numeric features have DataField@dataType="double".
  • For every numeric feature, there is a dedicated DerivedField@name="float(<input_feature>)", which only contains a FieldRef element (referencing the original double-type DataField element).
  • Tree model splits operate with float(<input_feature>) derived field values.

Option two:

  • All numeric features have DataField@dataType="float".
  • Tree model splits operate with <input_feature> data field values directly, there is no intermediary level of DerivedField elements.

Which option describes your PMML file?

@vruusmann

Fixed this issue by introducing a org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT model transformation option, which instructs the converter to replace explicit DerivedField-based double-to-float value casts with implicit model schema-based casts.

Consider the Iris species classification:

val labelIndexer = new StringIndexer().setInputCol("Species").setOutputCol("idx_Species")
val labelIndexerModel = labelIndexer.fit(df)

val assembler = new VectorAssembler().setInputCols(Array("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")).setOutputCol("featureVector")

val classifier = new XGBoostClassifier(Map("objective" -> "multi:softprob", "num_class" -> 3, "num_round" -> 17)).setLabelCol(labelIndexer.getOutputCol).setFeaturesCol(assembler.getOutputCol)

val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, classifier))
val pipelineModel = pipeline.fit(df)

The default behaviour is to accept double values, and perform a double-to-float cast as part of the workflow:

<PMML>
	<DataDictionary>
		<DataField name="Petal_Length" optype="continuous" dataType="double"/>
	</DataDictionary>
	<TransformationDictionary>
		<DerivedField name="float(Petal_Length)" optype="continuous" dataType="float">
			<FieldRef field="Petal_Length"/>
		</DerivedField>
	</TransformationDictionary>
</PMML>

Activating this transformation option:

import org.jpmml.sparkml.PMMLBuilder

new PMMLBuilder(df.schema, pipelineModel).putOption(org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT, true).buildFile(new File("/path/to/XGBoostIris.pmml"))

The new behaviour is to accept float values directly:

<PMML>
	<DataDictionary>
		<DataField name="Petal_Length" optype="continuous" dataType="float"/>
	</DataDictionary>
</PMML>

In the context of the original issue, this transformation eliminates the need to perform 10'000 DerivedField evaluations per prediction.

This transformation is currently "off" by default. It should not have any side effects if the Spark pipeline only contains XGBoost estimators. This claim is backed by actual integration tests: https://github.com/jpmml/jpmml-sparkml/blob/2.0.2/pmml-sparkml-xgboost/src/test/java/org/jpmml/sparkml/xgboost/testing/XGBoostTest.java#L61-L71

@vruusmann

if there are 10k features going in, and the XGBoost booster has a few hundred trees that are relatively deep, does it sound reasonable that doing predict() via PMML could take ~600ms?

Accepting this challenge:

  1. Generate a regression dataset with 10k numeric features.
  2. Train an XGBoost model with n_estimators of 300-500, and max_depth of 5-6.
  3. Convert to PMML, and evaluate using JPMML-Evaluator-Python package.

Expecting to see a <1 ms average prediction time.

Will report back as soon as I have my results (ETA: May 2023).

@vruusmann

vruusmann commented Apr 28, 2023

Accepting this challenge

Created a 1000 x 10'000 dataset using SkLearn:

from pandas import DataFrame, Series
from sklearn.datasets import make_regression

X, y = make_regression(n_samples = 1000, n_features = 10000, n_informative = 5000, random_state = 13)

X = DataFrame(X, columns = ["x" + str(i + 1) for i in range(X.shape[1])])
y = Series(y, name = "y")

Trained an XGBoost regressor using Apache Spark 3.4.0 and XGBoost4J-Spark(_2.12) 1.7.5:

val assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("featureVector")

val trackerConf = TrackerConf(0, "scala")
val regressor = new XGBoostRegressor(Map("objective" -> "reg:squarederror", "num_round" -> 500, "max_depth" -> 5, "tracker_conf" -> trackerConf)).setLabelCol(labelCol).setFeaturesCol(assembler.getOutputCol)

val pipeline = new Pipeline().setStages(Array(assembler, regressor))
val pipelineModel = pipeline.fit(df)

Exported the PipelineModel object to the PMML representation in two configurations - "legacy" and "input_float":

import org.jpmml.sparkml.PMMLBuilder

var pmmlBuilder = new PMMLBuilder(schema, pipelineModel)
pmmlBuilder = pmmlBuilder.putOption(org.jpmml.sparkml.model.HasPredictionModelOptions.OPTION_KEEP_PREDICTIONCOL, false)

pmmlBuilder.buildFile(new File("pipeline.pmml"))

pmmlBuilder = pmmlBuilder.putOption(org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT, true)
pmmlBuilder.buildFile(new File("pipeline-float.pmml"))

Evaluated the PMML file with JPMML-Evaluator-Python 0.9.0 and PyPMML 0.9.17:

import timeit

from jpmml_evaluator import make_evaluator
from pypmml import Model

rounds = 100  # number of timing rounds (illustrative value)

evaluator = make_evaluator(pmml_path, lax = True)
evaluator.verify()

model = Model.fromFile(pmml_path)

# Evaluate in batch mode (JPMML)
print(timeit.Timer("evaluator.predict(df)", globals = globals()).timeit(number = rounds))

# Evaluate in row-by-row mode (JPMML)
def evaluate_row(X):
	return evaluator.evaluate(X.to_dict())["y"]

print(timeit.Timer("df.apply(evaluate_row, axis = 1)", globals = globals()).timeit(number = rounds))

# Evaluate in batch mode (PyPMML)
print(timeit.Timer("model.predict(df)", globals = globals()).timeit(number = rounds))

The dataset was scored twice - first with all 10k input features, and then with only the ~3.5k actually used features. Limiting the dataset using PMML model schema information:

import pandas

df = pandas.read_csv(csv_path)
print(df.shape)

# Drop unused input columns - retains around 3.5k columns out of the initial 10k columns
df = df[[inputField.name for inputField in evaluator.getInputFields()]]
print(df.shape)

Timings for pipeline.pmml ("legacy config"):

| Mode | Full dataset | Limited dataset |
| --- | --- | --- |
| JPMML batch | 23 ms | 10 ms |
| JPMML row-by-row | 38 ms | 15 ms |
| PyPMML batch | 1467 ms | 525 ms |

Timings for pipeline-float.pmml ("input_float" config):

| Mode | Full dataset | Limited dataset |
| --- | --- | --- |
| JPMML batch | 21 ms | 8 ms |
| JPMML row-by-row | 35 ms | 14 ms |
| PyPMML batch | 1441 ms | 486 ms |

@vruusmann vruusmann pinned this issue Apr 28, 2023
@vruusmann

vruusmann commented Apr 28, 2023

Conclusions:

This issue was raised because the OP was using PyPMML in batch mode over the full dataset. By switching from PyPMML to JPMML-Evaluator-Python, it is possible to speed up evaluation (1467 ms / 23 ms) = ~63 times instantly.

Next, the "input_float" transformation doesn't seem to be as effective as hoped. It appears to speed up things by ~20%.

However, what helps evaluation speed considerably is limiting the amount of data transferred between the Python and Java environments (the feature vector). By eliminating roughly 2/3 of the input columns - which are provably not needed by the XGBoost model - it's possible to speed up evaluation (23 ms / 10 ms) = ~2 .. 2.5 times.

Putting everything together, it's possible to go from PyPMML's 1467 ms to JPMML-Evaluator-Python's 8 ms by changing only a couple of lines of Python code in the final predict.py script. That's a solid 150x improvement.

@vruusmann

Expecting to see a <1 ms average prediction time.

The attained 8 ms average prediction time (on my computer) is still far away from the stated <1 ms goal.

The bottleneck appears to be data transfer between the Python and Java environments. In the current case, the solution would be about passing pandas.DataFrame rows as row vectors (e.g. in the numpy.ndarray representation).
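A rough illustration of the difference in payload shape (column names are hypothetical; the point is that a flat float vector avoids the per-key overhead of a 10k-entry dict):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})

# A dict-per-row payload allocates and hashes one key per feature per prediction...
row_as_dict = df.iloc[0].to_dict()

# ...while a contiguous row vector is a single flat buffer
row_as_vector = df.iloc[0].to_numpy()

print(row_as_dict)
print(row_as_vector.shape)
```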

vruusmann added a commit to jpmml/jpmml-evaluator-python that referenced this issue May 5, 2023
@vruusmann

I've just released jpmml_evaluator version 0.10.0 to PyPI, which contains some major optimizations in the area of data exchange between Python and Java environments.

New timings for pipeline.pmml ("legacy config"):

| Mode | Full dataset | Limited dataset |
| --- | --- | --- |
| JPMML batch | 5.1 ms | 4.0 ms |
| JPMML row-by-row | 36.2 ms | 14.8 ms |

New timings for pipeline-float.pmml ("input_float" config):

| Mode | Full dataset | Limited dataset |
| --- | --- | --- |
| JPMML batch | 3.6 ms | 2.3 ms |
| JPMML row-by-row | 32.7 ms | 13.4 ms |

Now that the majority of the data exchange overhead has been eliminated, it's possible to see that the "input_float" transformation option actually makes a difference - the average scoring time falls from 5.1 ms to 3.6 ms for the full dataset (all 10'000 input features), and from 4.0 ms to 2.3 ms for the limited dataset (the 3'638 actually used input features).

Got some ideas for future optimization work. The 1ms goal (on my computer) is not that far off anymore.

vruusmann added a commit to jpmml/jpmml-xgboost that referenced this issue Apr 26, 2024