Rewrite sparse dmatrix using callbacks. #7092

Merged
trivialfis merged 30 commits into dmlc:master from rewrite-sparse-dmatrix on Jul 16, 2021

Conversation

trivialfis
Member

@trivialfis trivialfis commented Jul 8, 2021

  • Implemented caching with an iterative-DMatrix-style callback, with simple, deterministic, and lock-free asynchronous fetching. Writing to the cache, however, is sequential.
  • The cache file is named after a pointer address, which should avoid most name collisions.
  • The concatenation of ELLPACK pages happens in the gradient sampler instead of during data loading.
  • (gpu_)page_size is removed. The size of each binary block is now determined entirely by the batch size provided by the user.
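
To make the callback-plus-cache idea above concrete, here is a minimal Python sketch. It is not the code in this PR: the class `CachedBatchSource`, the helper `replay_twice`, the `next_batch`/`reset` callbacks, and the use of `pickle` as the page format are all hypothetical, chosen only for illustration. It assumes the user supplies one batch per callback call, so each cached page is exactly one user batch, and that pages are written to the cache sequentially on the first pass.

```python
import os
import pickle
import tempfile


class CachedBatchSource:
    """Hypothetical sketch: pull batches from user callbacks once, then replay from disk."""

    def __init__(self, next_batch, reset, cache_dir):
        self._next = next_batch  # user callback: returns a batch, or None when exhausted
        self._reset = reset      # user callback: rewinds the user's iterator
        # In CPython, id() is the object's memory address, so naming the file
        # after it makes collisions between live objects unlikely.
        self._path = os.path.join(cache_dir, f"{id(self):x}.cache")
        self._offsets = []       # byte offset of each page inside the cache file
        self._written = False

    def _write_cache(self):
        # Sequential write: one page per user-provided batch, in callback order.
        self._reset()
        with open(self._path, "wb") as fo:
            while True:
                batch = self._next()
                if batch is None:
                    break
                self._offsets.append(fo.tell())
                pickle.dump(batch, fo)
        self._written = True

    def __iter__(self):
        if not self._written:
            self._write_cache()
        # Later passes read only from the cache; the user callbacks are not called again.
        with open(self._path, "rb") as fi:
            for offset in self._offsets:
                fi.seek(offset)
                yield pickle.load(fi)


def replay_twice(batches):
    """Iterate the source twice and count how often the user callback ran."""
    state = {"pos": 0, "calls": 0}

    def next_batch():
        state["calls"] += 1
        if state["pos"] == len(batches):
            return None
        batch = batches[state["pos"]]
        state["pos"] += 1
        return batch

    def reset():
        state["pos"] = 0

    with tempfile.TemporaryDirectory() as tmp:
        source = CachedBatchSource(next_batch, reset, tmp)
        first = list(source)
        second = list(source)
    return first, second, state["calls"]
```

In this sketch the page size is never configured separately: it falls out of whatever the callback returns, which mirrors the removal of (gpu_)page_size described above.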

Part of #7070. This PR handles the internal implementation of external memory; the functionality is not exposed to Python yet. High-level tests are written with custom iterators rather than the dmlc-core parser, so they remain in the original PR.

@trivialfis
Member Author

For example usage, there's a C demo in the original PR.

@codecov-commenter

codecov-commenter commented Jul 8, 2021

Codecov Report

Merging #7092 (a4dfb02) into master (84d359e) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #7092   +/-   ##
=======================================
  Coverage   81.59%   81.59%           
=======================================
  Files          13       13           
  Lines        3901     3901           
=======================================
  Hits         3183     3183           
  Misses        718      718           

@trivialfis
Member Author

trivialfis commented Jul 10, 2021

Note:

The rewrite aims to:

  • Reduce the dependency on dmlc parsers and provide an interface for users to load data themselves.
  • Remove the dependency on the threaded iterator and IO queue.
  • Make sure the number of pages in memory is bounded.
  • Make sure the cache cannot be violated.
  • Provide an interface for internal algorithms to process data asynchronously.
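
One way to picture bounded, deterministic asynchronous fetching is a fixed-size ring of pending loads that is always drained from the oldest entry. The Python sketch below is only an illustration under that assumption and does not reproduce the PR's C++ code: `iter_pages`, `load_page`, and `n_prefetch` are hypothetical names, and Python futures stand in for whatever mechanism the real implementation uses.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_pages(load_page, n_pages, n_prefetch=2):
    """Yield pages 0..n_pages-1 in order with at most n_prefetch loads pending.

    load_page is a hypothetical callable mapping a page index to a page.
    """
    with ThreadPoolExecutor(max_workers=n_prefetch) as pool:
        ring = deque()  # pending loads, oldest first
        next_index = 0
        # Prime the ring with the first few pages.
        while next_index < min(n_prefetch, n_pages):
            ring.append(pool.submit(load_page, next_index))
            next_index += 1
        while ring:
            # Always wait on the oldest request, so the output order is fixed
            # regardless of which background load happens to finish first.
            page = ring.popleft().result()
            if next_index < n_pages:
                # Refill the slot that was just freed; the ring never grows
                # beyond n_prefetch entries.
                ring.append(pool.submit(load_page, next_index))
                next_index += 1
            yield page
```

With this layout, memory use is bounded by the `n_prefetch` pending pages plus the one page currently held by the consumer, and the consumer code needs no explicit locking because it only ever blocks on the head of the ring.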

@hcho3 hcho3 self-assigned this Jul 13, 2021
@trivialfis
Member Author

@RAMitchell @hcho3 we still need to do some more tests before this is useful. For example, we need to test whether a page might go out of scope in the predictor before a possibly asynchronous prediction has finished. I will defer those to future PRs.

@RAMitchell
Member

Looks good. We will now have a much more flexible external memory implementation that supports data iterators in other languages and makes it easier to extend internal data structures to work with external memory.

Good to see lots of tests also.

@trivialfis trivialfis merged commit bd1f3a3 into dmlc:master Jul 16, 2021
@trivialfis trivialfis deleted the rewrite-sparse-dmatrix branch July 16, 2021 04:33
4 participants