
POC implementation of deterministic GPU histogram. #5361

Merged
merged 2 commits into dmlc:master on Mar 4, 2020

Conversation

@trivialfis (Member) commented Feb 25, 2020

  • Use a pre-rounding-based method to obtain reproducible floating point
    summation (a minimal sketch of the idea follows below).
  • This makes GPU Hist more stable, but not yet fully reproducible.
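
A minimal sketch of the pre-rounding idea, using illustrative names rather than the functions added in this PR: pick a power-of-two rounding factor no smaller than any possible partial sum, then snap every addend onto that factor's grid before accumulating. Once all addends share the grid and partial sums stay below the factor, every addition is exact, so the total no longer depends on the order in which GPU threads accumulate it.

#include <cmath>

// Smallest power of two that bounds any partial sum of n values whose
// magnitude is at most max_abs (illustrative sketch, not the PR's code).
inline float CreateRoundingFactorSketch(float max_abs, int n) {
  float bound = max_abs * static_cast<float>(n);  // upper bound on |partial sum|
  int exponent = 0;
  std::frexp(bound, &exponent);        // bound = m * 2^exponent, m in [0.5, 1)
  return std::ldexp(1.0f, exponent);   // 2^exponent >= bound
}

// Rounds x to a multiple of ulp(factor); sums of such values stay exact as
// long as they remain below the factor, which makes them order-independent.
inline float TruncateWithRoundingFactorSketch(float factor, float x) {
  return (factor + x) - factor;
}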

@trivialfis (Member Author) commented Feb 25, 2020

Experimenting on the Year Prediction dataset with two consecutive runs:

Non-deterministic histogram build: [screenshot]

Deterministic histogram build: [screenshot]

There are a few things to notice here. First, the result is quite different between pre-rounding and no pre-rounding. Second, the f-score is reproducible over a small number of runs (I haven't tried a large number of trees yet; only 45 trees are built here), while the original code shows variance even over a small number of runs.

There is still a discrepancy in the saved model.

@@ -76,6 +76,20 @@ void TestDeviceSketch(bool use_external_memory) {
ASSERT_LT(fabs(hmat_cpu.Values()[i] - hmat_gpu.Values()[i]), eps * nrows);
}

// Determinstic
@trivialfis (Member Author) commented on the diff:
Note: Typo

* @param max max(sum(positive gradient) + sum(abs(negative gradient)))
* @param n The number of elements for the sum, or an upper bound estimate
*/
DEV_INLINE __host__ float CreateRoundingFactor(float max, int n) {
A Member commented on the diff:
I was expecting you to do this with doubles; this would give an extra 29 bits of precision and might resolve your accuracy issues. We can add a check to only do the truncation if GradientSumT is double precision.
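
A hedged sketch of this suggestion; the helper name and the dispatch on GradientSumT below are illustrative, not code from this PR:

#include <type_traits>

// Apply the pre-rounding only when the histogram is accumulated in double
// precision, where the extra 29 bits of mantissa make the truncation
// essentially free in terms of accuracy (hypothetical helper).
template <typename GradientSumT>
__host__ __device__ GradientSumT TruncateIfDouble(GradientSumT rounding,
                                                  GradientSumT x) {
  if (std::is_same<GradientSumT, double>::value) {
    return (rounding + x) - rounding;  // snap onto the rounding factor's grid
  }
  return x;  // single precision: pass the value through unchanged
}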

@trivialfis (Member Author) commented:
Hmm, I still get some non-determinism somewhere. Needs more investigation.

@trivialfis (Member Author) commented Feb 26, 2020

I was able to obtain a bit-by-bit reproducible result (saved model) on the 1M-row HIGGS dataset, but started to get discrepancies on the ~5M-row YearPrediction.

@trivialfis (Member Author) commented Feb 29, 2020

There's a minor discrepancy on the 5M-row YearPrediction on my laptop but not on my desktop. On the 1M+ row HIGGS both machines show a discrepancy, though arguably a much smaller one than with the non-deterministic hist.

I added a new parameter as a fallback option so that users can switch off the deterministic histogram if the accuracy drops too much on certain datasets. But so far the accuracy loss on YearPrediction and HIGGS is negligible.

@trivialfis (Member Author) commented:
@RAMitchell After getting dh::SumReduction out of the way, I was able to obtain bit-by-bit reproducible results on YearPrediction, HIGGS, and Epsilon for 100 iterations. As for why cub::DeviceReduce produces non-deterministic results, I'm not sure.

@RAMitchell (Member) commented:
Awesome! Maybe uninitialised memory in node_sum_gradients_d or somewhere else?

@trivialfis (Member Author) replied:
@RAMitchell

Maybe uninitialised memory in node_sum_gradients_d

That's not the cause.

@trivialfis (Member Author) commented:
@RAMitchell Training time on Year Prediction is up from 1.46069 to 1.6410 for 100 iterations. I don't have more commits coming unless further optimization is required. Please review when time allows.

@trivialfis (Member Author) commented Mar 2, 2020

Actually, maybe I can pre-compute the rounding.

Update:
Switched to pre-computing the rounding. It is slightly faster, but we can't use double for the truncation anymore. So I also switched to the bound suggested by @RAMitchell, which is tighter than Algorithm 3. Thus far the accuracy loss is negligible.

Update:
The runtime doesn't seem to be any different. Combined with pre-computing the truncation and some other small optimizations, for 1000 rounds HIGGS is 0.4 seconds slower, while Year Prediction is basically the same (0.07 sec faster for the deterministic algorithm).
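
For illustration, a rough sketch of what pre-computing the rounding once per iteration could look like, assuming the tighter bound mentioned above (sum of positive gradients plus sum of absolute negative gradients); names and call sites are hypothetical:

#include <cmath>

// Smallest power of two that is at least the given bound (illustrative).
inline float RoundingFactorFromBound(float bound) {
  int exponent = 0;
  std::frexp(bound, &exponent);       // bound = m * 2^exponent, m in [0.5, 1)
  return std::ldexp(1.0f, exponent);  // 2^exponent >= bound
}

// Hypothetical per-iteration flow:
//   float grad_bound = sum_positive_grad + sum_abs_negative_grad;
//   float hess_bound = sum_positive_hess + sum_abs_negative_hess;
//   float grad_rounding = RoundingFactorFromBound(grad_bound);
//   float hess_rounding = RoundingFactorFromBound(hess_bound);
//   // The histogram kernel then truncates every contribution with these
//   // factors, so no rounding factor has to be recomputed on the fly.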

@mli (Member) commented Mar 2, 2020

Codecov Report

Merging #5361 into master will increase coverage by 0.31%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #5361      +/-   ##
==========================================
+ Coverage   83.75%   84.07%   +0.31%     
==========================================
  Files          11       11              
  Lines        2413     2411       -2     
==========================================
+ Hits         2021     2027       +6     
+ Misses        392      384       -8     
Impacted Files Coverage Δ
python-package/xgboost/compat.py 54.54% <0.00%> (+0.58%) ⬆️
python-package/xgboost/core.py 80.10% <0.00%> (+0.65%) ⬆️


@RAMitchell (Member) left a comment:
I don't feel comfortable doing it with single precision. I think there is a mistake somewhere if the program slows down after doing the truncation in the histogram kernel; an addition and a subtraction are practically nothing for a GPU.

}; // NOLINT

dh::XGBCachingDeviceAllocator<char> alloc;
auto positive_sum = thrust::reduce(
A Member commented on the diff:
Not sure if it matters, but it is possible to fuse these two reductions into a single reduction, e.g. apply a transform iterator to the gradient pair inputs that returns two GradientPairs (pos/neg) and define an operator to sum them. Alternatively, just use the absolute sum, although this is slightly less tight and we could lose up to one bit of precision.
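
A sketch of that fused reduction, under stated assumptions: GradPair is a stand-in for the real GradientPair type, and the functors are illustrative rather than code from this PR.

#include <thrust/device_vector.h>
#include <thrust/pair.h>
#include <thrust/transform_reduce.h>

struct GradPair { float grad; float hess; };       // stand-in for GradientPair
using PosNeg = thrust::pair<GradPair, GradPair>;   // {positive sums, |negative| sums}

// Transform: split one gradient pair into its positive part and the absolute
// value of its negative part.
struct SplitPosNeg {
  __host__ __device__ PosNeg operator()(GradPair g) const {
    GradPair pos{g.grad > 0.f ? g.grad : 0.f, g.hess > 0.f ? g.hess : 0.f};
    GradPair neg{g.grad < 0.f ? -g.grad : 0.f, g.hess < 0.f ? -g.hess : 0.f};
    return PosNeg{pos, neg};
  }
};

// Reduction operator: component-wise sum of the two accumulators.
struct AddPosNeg {
  __host__ __device__ PosNeg operator()(PosNeg a, PosNeg b) const {
    return PosNeg{{a.first.grad + b.first.grad, a.first.hess + b.first.hess},
                  {a.second.grad + b.second.grad, a.second.hess + b.second.hess}};
  }
};

// One pass over the gradients instead of two separate thrust::reduce calls:
//   PosNeg init{GradPair{0.f, 0.f}, GradPair{0.f, 0.f}};
//   PosNeg sums = thrust::transform_reduce(gpair.begin(), gpair.end(),
//                                          SplitPosNeg{}, init, AddPosNeg{});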

@trivialfis (Member Author) replied:
@RAMitchell I tried using cub::DeviceReduce with 2 CUDA streams for this; the end result doesn't seem to be any different. As for the abs sum, it failed a Windows test for GPU ranking when using float.


GradientPair CreateRoundingFactor(common::Span<GradientPair> gpair);

DEV_INLINE float TruncateWithRoundingFactor(float rounding_factor, float x) {
A Member commented on the diff:
I don't understand how this function could be so expensive that it slows down the algorithm if we use it in the histogram kernel.

@trivialfis (Member Author) replied:
@RAMitchell I pushed the change for computing the truncation on the fly. There are other functions involved, like GetHess, GetGrad, and the constructors; I'm not sure whether the compiler can optimize them away.
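
For context, a minimal kernel-level sketch of what truncating on the fly looks like; the kernel, names, and plain-float histogram below are illustrative, not the PR's histogram kernel:

// Pre-round a value onto the rounding factor's grid: one add and one subtract.
__device__ float Truncate(float rounding, float x) {
  return (rounding + x) - rounding;
}

__global__ void BuildHistogramSketch(const float* grad, const int* bin_idx,
                                     int n, float rounding, float* hist) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Truncate the contribution right before the atomic add; truncated values
    // accumulate exactly, so thread scheduling cannot change the final result.
    atomicAdd(&hist[bin_idx[i]], Truncate(rounding, grad[i]));
  }
}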

* Use a pre-rounding-based method to obtain reproducible floating point
  summation.
* GPU Hist for regression and classification is bit-by-bit reproducible.
* Add docs.
* Switch to thrust reduce for `node_sum_gradient`.
@RAMitchell (Member) left a comment:
I think we need a Python-level integration test. Something like: train twice on the same data (with different DMatrix objects) and check that the results are bitwise exact. Just so non-determinism doesn't creep in from other areas.

Other than that, amazing work!

@trivialfis (Member Author) replied:
I think we need a Python level integration test.

Done.

@trivialfis trivialfis merged commit 8d06878 into dmlc:master Mar 4, 2020
@trivialfis trivialfis deleted the determinstic-gpu-hist-max branch March 4, 2020 07:13
@RAMitchell (Member) commented:
@sh1ng @pseudotensor you might find this feature useful.

@trivialfis mentioned this pull request Apr 21, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020