
Minimal inclusive scan #6234

Merged
merged 10 commits into from Nov 3, 2020

Conversation

RAMitchell (Member)

Attempt at a small inclusive scan implementation to get us around the Thrust 2^31 limit (#6228).
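For readers unfamiliar with the operation: an inclusive scan (prefix sum) produces, at each position, the sum of all elements up to and including that position. The Thrust limit referenced above reportedly stems from 32-bit indexing, which is why a custom implementation is needed for very large inputs. A minimal Python sketch of the operation itself (illustration only, not the CUDA code in this PR):

```python
def inclusive_scan(xs):
    """Inclusive scan (prefix sum): out[i] = xs[0] + ... + xs[i]."""
    out = []
    running = 0
    for x in xs:
        running += x
        out.append(running)
    return out

print(inclusive_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```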

@RAMitchell RAMitchell changed the title [WIP] Minimal inclusive scan Minimal inclusive scan Nov 2, 2020
@RAMitchell RAMitchell marked this pull request as ready for review November 2, 2020 22:59
RAMitchell (Member Author)

I've got a conda error on the Windows test that seems unrelated, otherwise I think this is ready to go.

hcho3 (Collaborator) left a comment

LGTM overall. How do we plan to test that DeviceQuantileDMatrix works with more than 2^31 elements?

RAMitchell (Member Author)

I have this script, but it takes >20 s, most of which is DMatrix construction time. I don't think that's fast enough for CI.

import numpy as np
import xgboost as xgb
import cupy as cp
import time


# Test for integer overflow or out of memory exceptions
def test_large_input():
    # aim for n*m = 2^31 elements, then halve so the matrix fits in memory
    n = 1000
    m = (1 << 31) // n
    m //= 2
    X = cp.ones((m, n), dtype=np.float32)
    y = cp.ones(m)
    start = time.time()
    dmat = xgb.DeviceQuantileDMatrix(X, y)
    print(time.time() - start)
    start = time.time()
    xgb.train({"tree_method": "gpu_hist", "max_depth": 1}, dmat, 1)
    print(time.time() - start)
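As a back-of-the-envelope check on the script above (the numbers below are derived from it, not stated in the PR), the allocation works out to just under 2^30 float32 elements, roughly 4 GiB:

```python
# Reproduce the size arithmetic from the test script above.
n = 1000
m = (1 << 31) // n   # 2147483 rows before halving
m //= 2              # 1073741 rows
elements = n * m     # 1,073,741,000 elements, just under 2**30
bytes_needed = elements * 4  # float32 is 4 bytes per element
print(elements)              # 1073741000
print(bytes_needed / 2**30)  # just under 4.0 GiB for X alone
```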

RAMitchell (Member Author)

I tried it again on a V100 and the test takes 4 s, so I've enabled it, skipping when less than 15 GB of device memory is available.
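The skip guard described here can be sketched as a plain predicate (hypothetical helper, not the PR's exact code; in the actual test the free-byte count would come from `cp.cuda.runtime.memGetInfo()` and a False result would trigger `pytest.skip`):

```python
MIN_FREE_BYTES = 15 * 2**30  # the 15 GB threshold mentioned above

def has_enough_device_memory(available_bytes, minimum=MIN_FREE_BYTES):
    """Return True when the reported free device memory meets the threshold."""
    return available_bytes >= minimum

print(has_enough_device_memory(16 * 2**30))  # True, e.g. a 16 GiB V100
print(has_enough_device_memory(8 * 2**30))   # False, test would be skipped
```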

hcho3 commented Nov 3, 2020

@RAMitchell Thanks. I've recently changed the CI configuration so that each GPU worker runs only one job at a time, so you'll have access to all 16 GB of memory.

hcho3 (Collaborator) left a comment

LGTM. Let's see if the large matrix test completes in a reasonable amount of time.


# Test for integer overflow or out of memory exceptions
def test_large_input():
    available_bytes, _ = cp.cuda.runtime.memGetInfo()
hcho3 (Collaborator) commented on this line, Nov 3, 2020
Currently, this test will skip when --use-rmm-pool is set, because CuPy is not configured to use the RMM allocator. To make it work with RMM, we'll need to run

cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

Do we want to enable this test with RMM?

RAMitchell (Member Author)

Don't think this is necessary for now.

hcho3 commented Nov 3, 2020

Reported duration of test_large_input() in the CI:

  • test-python-gpu-cuda10.2: 7.82s
  • test-python-gpu-cuda11.0: 7.64s
  • test-python-gpu-cuda11.0-cross: 7.85s

I think we should keep the test.

@RAMitchell RAMitchell merged commit 29745c6 into dmlc:master Nov 3, 2020
3 participants