This issue tracks gpu_hist bugs relating to large workloads, uncovered in recent experiments on the mortgage dataset.
Thrust copy_if hits a 32-bit integer overflow when n_rows*n_cols > 2^31. #6201 (Loop over copy_if) implements a workaround by iterating over batches.
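A minimal sketch of the batching idea, in plain Python rather than CUDA. It is not the code from #6201: `wrap_int32` and `batch_ranges` are hypothetical helpers that only show why an element count above 2^31 breaks a signed 32-bit index, and how splitting the range keeps every per-call count representable.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Simulate storing an integer in a signed 32-bit variable (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def batch_ranges(total, max_batch=INT32_MAX):
    """Yield half-open [begin, end) ranges covering [0, total).

    Each range holds at most max_batch elements, so the per-call
    element count always fits in a signed 32-bit integer.
    """
    if total < 0 or max_batch <= 0:
        raise ValueError("total must be >= 0 and max_batch must be > 0")
    begin = 0
    while begin < total:
        end = min(begin + max_batch, total)
        yield begin, end
        begin = end


# Hypothetical workload: 50M rows x 50 columns = 2.5e9 elements, above 2^31.
n_rows, n_cols = 50_000_000, 50
n_elements = n_rows * n_cols

# A single call indexed with 32-bit integers would see a negative count.
overflowed = wrap_int32(n_elements)

# Batching keeps every per-call count within range.
batches = list(batch_ranges(n_elements))
```

In a real implementation each `(begin, end)` pair would correspond to one `copy_if` call over that slice, with the output offset tracked in a 64-bit integer.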
Memory usage may have increased since versions 1.0-1.1, leading to OOM on 32 GB devices even with the above fix. We should analyze peak memory usage across versions on a large synthetic workload to check for regressions.
DaskDeviceQuantileDMatrix has integer overflow bugs related to thrust::inclusive_scan, occurring when dask chunk sizes exceed 2^31.
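One way a scan can overflow, shown purely as an illustration and not as a diagnosis of the actual bug: if the running total of an inclusive scan is stored in a signed 32-bit integer, the prefix sums wrap once they pass 2^31 - 1. The helpers below are hypothetical and simulate this in plain Python; Python's unbounded integers stand in for a wider (e.g. 64-bit) accumulator.

```python
from itertools import accumulate


def wrap_int32(value):
    """Simulate storing an integer in a signed 32-bit variable (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def inclusive_scan_int32(values):
    """Inclusive prefix sum whose running total is truncated to 32 bits."""
    return list(accumulate(values, lambda acc, v: wrap_int32(acc + v)))


def inclusive_scan_wide(values):
    """Inclusive prefix sum with an accumulator wide enough not to overflow."""
    return list(accumulate(values))


# Three hypothetical segments of 2^30 elements each: every input fits in
# 32 bits, but the running total does not.
sizes = [2**30, 2**30, 2**30]

narrow = inclusive_scan_int32(sizes)  # wraps to negative offsets
wide = inclusive_scan_wide(sizes)     # correct offsets
```

Offsets computed this way are typically used as write positions, so a wrapped value tends to show up as an out-of-bounds access rather than an obviously wrong number.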
To prevent this from recurring, we can try adding unit tests on large sizes that check for overflow or memory issues. These tests need to be carefully designed so that they are not flaky (e.g. they only run on a machine with sufficient memory) and run quickly (under 1-2 seconds).
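A rough sketch of how such a test could be gated, using only Python's standard `unittest` module. Everything here is an assumption made for illustration: the helper names, the 64 GiB threshold, the 2-second budget, and the use of host memory as a stand-in for device memory (a real test would need to query the GPU instead). The test body is a placeholder for the actual large workload.

```python
import os
import time
import unittest


def total_host_memory_bytes():
    """Return physical host memory in bytes, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are not available on every platform.
        return None


def requires_memory(n_bytes):
    """Skip the decorated test unless the machine reports at least n_bytes of memory."""
    total = total_host_memory_bytes()
    enough = total is not None and total >= n_bytes
    return unittest.skipUnless(enough, f"needs at least {n_bytes} bytes of memory")


class LargeSizeTest(unittest.TestCase):
    TIME_BUDGET_SECONDS = 2.0  # assumed budget, taken from the 1-2 second target

    @requires_memory(64 * 2**30)  # assumed threshold: 64 GiB
    def test_element_count_exceeds_int32(self):
        start = time.perf_counter()

        # Placeholder for the real workload: only checks that the
        # element count is large enough to exercise the overflow path.
        n_rows, n_cols = 50_000_000, 50
        self.assertGreater(n_rows * n_cols, 2**31 - 1)

        elapsed = time.perf_counter() - start
        self.assertLess(elapsed, self.TIME_BUDGET_SECONDS)
```

Skipping rather than failing on small machines keeps the suite stable, at the cost that the large-size path is only covered where a big enough machine is part of CI.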