Minimal inclusive scan #6234
Conversation
This reverts commit 9a3e7fa.
This reverts commit 8e2e611.
I've got a conda error on the Windows test that seems unrelated; otherwise I think this is ready to go.
LGTM overall. How do we plan to test that DDQDM works with more than 2^31 elements?
I have this script, but it takes > 20 s, most of which is DMatrix construction time. I don't think that's fast enough for CI.

```python
import numpy as np
import xgboost as xgb
import cupy as cp
import time

# Test for integer overflow or out of memory exceptions
def test_large_input():
    # n * m = 2^31
    n = 1000
    m = (1 << 31) // n
    m //= 2
    X = cp.ones((m, n), dtype=np.float32)
    y = cp.ones(m)
    start = time.time()
    dmat = xgb.DeviceQuantileDMatrix(X, y)
    print(time.time() - start)
    start = time.time()
    xgb.train({"tree_method": "gpu_hist", "max_depth": 1}, dmat, 1)
    print(time.time() - start)
```
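For reference, the size arithmetic in that script works out to just under 2^30 total elements after the halving; a quick sanity check using the same expressions:

```python
# Reproduce the matrix-size arithmetic from the test script above
n = 1000
m = (1 << 31) // n   # rows before halving: 2,147,483
m //= 2              # halved: 1,073,741 rows
total = n * m        # 1,073,741,000 elements, just under 2^30
print(m, total)
```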
I tried it again on a V100 and the test takes 4 s, so I've enabled it, skipping when less than 15 GB of device memory is available.
@RAMitchell Thanks. I've recently changed the CI configuration so that each GPU worker runs only one job at a time, so you'll have access to all 16 GB of memory.
LGTM. Let's see if the large matrix test completes in a reasonable amount of time.
```python
# Test for integer overflow or out of memory exceptions
def test_large_input():
    available_bytes, _ = cp.cuda.runtime.memGetInfo()
```
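The skip condition itself could be factored into a small helper like the hypothetical `has_enough_memory` below (not from the PR; just illustrating the check against the free-bytes figure that `cp.cuda.runtime.memGetInfo()` returns first):

```python
def has_enough_memory(available_bytes, min_gb=15):
    """Return True when at least `min_gb` of device memory is free.

    `available_bytes` would be the first element of the tuple
    returned by cp.cuda.runtime.memGetInfo().
    """
    return available_bytes >= min_gb * (1024 ** 3)
```

A test could then call `pytest.skip(...)` whenever this returns False, so CI workers with smaller cards don't fail spuriously.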
Currently, this test will skip when `--use-rmm-pool` is set, as cuPy is not configured to use the RMM allocator. To make it work with RMM, we'd need to run `cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)`. Do we want to enable this test with RMM?
Don't think this is necessary for now.
Reported duration of
I think we should keep the test.
Attempt at a small inclusive scan implementation to get us around the Thrust 2^31 limit (#6228).
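The usual way around a 32-bit scan size limit is to scan in segments and carry the running total across segment boundaries. A minimal pure-Python sketch of that idea (an illustration only, not the PR's CUDA implementation):

```python
def inclusive_scan(values, segment_size=4):
    """Inclusive prefix sum computed segment by segment.

    Each segment is scanned independently, then offset by the running
    total (carry) of all preceding segments, so no single scan pass
    ever touches more than `segment_size` elements.
    """
    out = []
    carry = 0
    for start in range(0, len(values), segment_size):
        running = carry
        for v in values[start:start + segment_size]:
            running += v
            out.append(running)
        carry = running
    return out

# inclusive_scan([1, 2, 3, 4, 5, 6]) -> [1, 3, 6, 10, 15, 21]
```

On the GPU the same carry trick would be applied between launches of a bounded-size device scan, which is what lets inputs exceed the 2^31-element limit of a single call.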