Rewrite sparse dmatrix using callbacks. #7092

trivialfis · 2021-07-08T09:03:06Z

Implemented caching with an iterative-DMatrix-style callback and simple, deterministic, lock-free async fetching. Writing to the cache, however, is sequential.
The cache file is named by pointer address, which should avoid most collisions.
The concatenation of ellpack pages happens in the gradient sampler instead of during data loading.
(gpu_)page_size is removed. The size of each binary block is now determined entirely by the batch size provided by the user.
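
To make the callback-driven caching above concrete, here is a minimal sketch, not the code in this PR. The names `NextBatchFn`, `MakeCacheName`, `WritePages`, and `DemoWritePages` are hypothetical and chosen only for illustration. It assumes the user supplies a "next batch" callback, that each batch becomes exactly one page, and that pages are appended to the cache sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical user callback: fills `out` with the next batch and returns
// false once the data is exhausted.
using NextBatchFn = std::function<bool(std::vector<float>* out)>;

// Build a cache name from the owner's address (an assumption about the
// naming scheme). Two objects alive at the same time in one process have
// different addresses, hence different names.
inline std::string MakeCacheName(const void* owner, const std::string& prefix) {
  std::ostringstream os;
  os << prefix << "-" << owner << ".page";
  return os.str();
}

// Pull batches from the callback and append each one to `cache` as one page.
// Returns the byte offset of every page boundary; offsets.back() is the
// total number of bytes written. The page size is whatever the user's batch
// size happens to be.
inline std::vector<std::uint64_t> WritePages(const NextBatchFn& next,
                                             std::ostream& cache) {
  std::vector<std::uint64_t> offsets{0};
  std::vector<float> batch;
  while (next(&batch)) {
    const std::uint64_t n_bytes = batch.size() * sizeof(float);
    cache.write(reinterpret_cast<const char*>(batch.data()),
                static_cast<std::streamsize>(n_bytes));
    assert(cache.good());
    offsets.push_back(offsets.back() + n_bytes);
    batch.clear();
  }
  return offsets;
}

// Usage sketch: three user batches of different sizes become three pages.
inline std::vector<std::uint64_t> DemoWritePages() {
  std::vector<std::vector<float>> batches{{1.f, 2.f}, {3.f, 4.f, 5.f}, {6.f}};
  std::size_t i = 0;
  NextBatchFn next = [&](std::vector<float>* out) {
    if (i == batches.size()) return false;
    *out = batches[i++];
    return true;
  };
  std::ostringstream cache;  // stands in for the on-disk cache file
  return WritePages(next, cache);
}
```

The in-memory stream stands in for the cache file so the sketch stays self-contained; a real cache would write to the file named by something like `MakeCacheName`.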

Part of #7070. This PR handles the internal implementation of external memory; the functionality is not exposed to Python yet. High-level tests are written with custom iterators without the dmlc-core parser, so they remain in the original PR.

trivialfis · 2021-07-08T09:20:49Z

For example usage, there's a C demo in the original PR.

codecov-commenter · 2021-07-08T10:31:55Z

Codecov Report

Merging #7092 (a4dfb02) into master (84d359e) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #7092   +/-   ##
=======================================
  Coverage   81.59%   81.59%           
=======================================
  Files          13       13           
  Lines        3901     3901           
=======================================
  Hits         3183     3183           
  Misses        718      718


trivialfis · 2021-07-10T12:00:30Z

Note:

The rewrite is about the following:

Reduce the dependency on dmlc parsers and provide an interface for users to load data themselves.
Remove the dependency on the threaded iterator and IO queue.
Make sure the number of pages in memory is bounded.
Make sure the cache cannot be violated.
Provide an interface for internal algorithms to process data asynchronously.
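
As a rough illustration of how a bounded number of in-memory pages and asynchronous fetching can coexist, here is a sketch built on `std::async` and a fixed array of futures. It is an assumption about the general technique, not the implementation in `src/data/sparse_page_source.h`; `BoundedPrefetcher`, `LoadPageFn`, and `DemoSumPages` are hypothetical names. Pages are expected to be requested in order, which keeps the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <utility>
#include <vector>

using Page = std::vector<float>;
// Hypothetical loader: reads page `i` from the cache.
using LoadPageFn = std::function<std::shared_ptr<Page>(std::size_t)>;

class BoundedPrefetcher {
 public:
  BoundedPrefetcher(LoadPageFn load, std::size_t n_pages, std::size_t n_prefetch)
      : load_{std::move(load)},
        n_pages_{n_pages},
        n_prefetch_{std::max<std::size_t>(1, n_prefetch)},
        ring_(n_pages) {}

  // Returns page `i`. Pages must be requested in order 0, 1, 2, ...
  std::shared_ptr<Page> Fetch(std::size_t i) {
    assert(i < n_pages_);
    // Launch loads for the window [i, i + n_prefetch_), skipping slots that
    // already hold a pending future.
    const std::size_t end = std::min(i + n_prefetch_, n_pages_);
    for (std::size_t k = i; k < end; ++k) {
      if (!ring_[k].valid()) {
        ring_[k] = std::async(std::launch::async, load_, k);
      }
    }
    assert(InFlight() <= n_prefetch_);  // the bound on prefetched pages
    return ring_[i].get();  // blocks until page i is ready, then frees the slot
  }

  // Number of futures that currently own (or will own) a page.
  std::size_t InFlight() const {
    return static_cast<std::size_t>(std::count_if(
        ring_.begin(), ring_.end(),
        [](const std::future<std::shared_ptr<Page>>& f) { return f.valid(); }));
  }

 private:
  LoadPageFn load_;
  std::size_t n_pages_;
  std::size_t n_prefetch_;
  std::vector<std::future<std::shared_ptr<Page>>> ring_;
};

// Usage sketch: iterate over four pages while prefetching at most two.
inline float DemoSumPages() {
  std::vector<Page> cache{{1.f, 2.f}, {3.f}, {4.f, 5.f}, {6.f}};
  BoundedPrefetcher prefetcher{
      [&cache](std::size_t i) { return std::make_shared<Page>(cache[i]); },
      cache.size(), 2};
  float sum = 0.f;
  for (std::size_t i = 0; i < cache.size(); ++i) {
    std::shared_ptr<Page> page = prefetcher.Fetch(i);
    for (float v : *page) sum += v;
  }
  return sum;
}
```

No mutex is needed here because each slot is written and read only by the calling thread, and each loader task touches only its own page. The bound covers pages owned by pending futures; pages the caller still holds are counted separately.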

src/data/sparse_page_source.h

trivialfis · 2021-07-13T07:13:45Z

@RAMitchell @hcho3 We still need to do more testing to make this useful. For example, we need to test whether the page might go out of scope in the predictor before the possibly async prediction is finished. I will defer those to future PRs.
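
One common way to rule out the lifetime problem described here is to let the asynchronous task co-own the page through a `std::shared_ptr`. The sketch below only illustrates that general idea; it is an assumption and says nothing about how the predictor in this PR actually holds pages. `PredictAsync` and `DemoPageLifetime` are hypothetical names, and the "prediction" is just a sum.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <numeric>
#include <vector>

using Page = std::vector<float>;

// Hypothetical stand-in for an asynchronous predictor. The lambda captures
// the shared_ptr by value, so the page stays alive until the task finishes,
// even if the caller drops its own reference first.
inline std::future<float> PredictAsync(std::shared_ptr<const Page> page) {
  return std::async(std::launch::async, [page] {
    return std::accumulate(page->begin(), page->end(), 0.0f);
  });
}

// Usage sketch: the caller's reference goes out of scope before the result
// is collected, yet the task still reads valid memory.
inline float DemoPageLifetime() {
  std::future<float> result;
  {
    auto page = std::make_shared<const Page>(Page{1.f, 2.f, 3.f});
    result = PredictAsync(page);
  }  // the caller's reference is gone; the task still owns one
  return result.get();
}
```

Capturing a raw pointer or reference instead would reintroduce exactly the out-of-scope risk the comment asks to test for.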

RAMitchell · 2021-07-14T09:25:49Z

Looks good. We will now have a much more flexible external memory implementation that supports data iterators in other languages and makes it easier to extend internal data structures to work with external memory.

Good to see lots of tests also.

hcho3 self-assigned this Jul 13, 2021

trivialfis force-pushed the rewrite-sparse-dmatrix branch from 4aa7d48 to d8bbf8b Compare July 14, 2021 05:33

RAMitchell approved these changes Jul 14, 2021

hcho3 approved these changes Jul 14, 2021

trivialfis added 20 commits July 16, 2021 04:29

Rewrite SparseDMatrix with callbacks.

feb86fa

test.

45ee815

Remove page size.

cfe6866

Initialize sparse page.

7e23d04

remove warning.

19fc649

Don't use parser.

2c6b9c2

Remove assert.

a84a592

Revert "Don't use parser."

222d6d6

This reverts commit dd2c8a9.

Get the flaky back.

045a7cc

Documents.

9e9e48b

ama.

93490c3

Don't handle empty.

7d912e0

Documents.

8e9368c

Test.

2da5cde

Remove decide format.

ba6cbcd

Make everything const.

ad16bbd

Tidy.

0141ff5

Lint.

53bb9ba

Name the cache.

9c3bce6

Make sure it's not written.

784a74a

trivialfis added 10 commits July 16, 2021 04:29

Cleanup tests.

413c06f

Better comments.

f851b1b

Check number of valid futures.

07fd5c9

Covariant return type.

6a6ad78

Return this.

df3fdc2

Reviewers' comments.

6314ccc

Consistent type for n_batches.

cc0be72

Fix doc.

77692d6

Remove raw pointer.

8dc07ca

Revert "Remove raw pointer."

f54b36a

This reverts commit 3d5f319.

trivialfis force-pushed the rewrite-sparse-dmatrix branch from d8bbf8b to f54b36a Compare July 15, 2021 20:30

trivialfis merged commit bd1f3a3 into dmlc:master Jul 16, 2021

trivialfis deleted the rewrite-sparse-dmatrix branch July 16, 2021 04:33
