

Training continuation with multiple DMatrix objects #6148

Closed
ldesreumaux opened this issue Sep 22, 2020 · 3 comments

ldesreumaux (Contributor) commented Sep 22, 2020

Issue

If training is started with one DMatrix object and continued with a second DMatrix object built from the same dataset, the resulting model should be identical to one trained throughout with a single DMatrix object.

In the following C++ code, two models are trained. In the first training, UpdateOneIter is called twice with the same DMatrix object. In the second, the two UpdateOneIter calls use two different DMatrix objects loaded from the same dataset. The two model dumps (out_models1 and out_models2) should therefore be identical, but they differ.

#include <xgboost/data.h>
#include <xgboost/learner.h>

#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using namespace xgboost;

void ConfigureBooster(std::shared_ptr<Learner>& booster) {
    std::vector<std::pair<std::string, std::string>> cfg;

    cfg.emplace_back("tree_method", "hist");
    cfg.emplace_back("objective", "binary:logistic");
    cfg.emplace_back("subsample", "0.9");
    cfg.emplace_back("seed", "42");
    cfg.emplace_back("seed_per_iteration", "1");

    booster->SetParams(cfg);
    booster->Configure();
}

int main(int argc, char** argv) {
    const std::string dataset_path = "../data/census.bin";
    FeatureMap fmap;

    /* Training with 1 DMatrix object */

    std::shared_ptr<DMatrix> dtrain1(DMatrix::Load(dataset_path, true, false));
    std::shared_ptr<Learner> booster1(Learner::Create({dtrain1}));
    ConfigureBooster(booster1);
    booster1->UpdateOneIter(0, dtrain1);
    booster1->UpdateOneIter(1, dtrain1);
    std::vector<std::string> out_models1 = booster1->DumpModel(fmap, true, "text");
    for (const std::string& out_model : out_models1)
        std::cout << out_model << std::endl;

    /* Training with 2 DMatrix objects (but same dataset!) */

    std::shared_ptr<DMatrix> dtrain2(DMatrix::Load(dataset_path, true, false));
    std::shared_ptr<Learner> booster2(Learner::Create({dtrain2}));
    ConfigureBooster(booster2);
    booster2->UpdateOneIter(0, dtrain2);
    dtrain2.reset(DMatrix::Load(dataset_path, true, false));
    booster2->UpdateOneIter(1, dtrain2);
    std::vector<std::string> out_models2 = booster2->DumpModel(fmap, true, "text");
    for (const std::string& out_model : out_models2)
        std::cout << out_model << std::endl;

    return 0;
}

XGBoost version: 1.2.0

Fix

I investigated the issue and found that adding the following lines at the start of GHistIndexMatrix::Init resolves it:

void GHistIndexMatrix::Init(DMatrix* p_fmat, int max_bins) {
  cut.cut_ptrs_.HostVector().clear();
  cut.cut_values_.HostVector().clear();
  cut.min_vals_.HostVector().clear();
  cut.cut_ptrs_.HostVector().emplace_back(0);

  hit_count.clear();
  hit_count_tloc_.clear();

  // ... (rest of Init unchanged)
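The underlying problem is that Init appends to the cut vectors and hit counters, so rebuilding the histogram index for a second DMatrix accumulates stale state from the first one. A minimal standalone sketch of that pattern and the clear-before-reuse fix (the class and member names here are hypothetical stand-ins, not the actual XGBoost internals):

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Toy stand-in for GHistIndexMatrix: Init() derives per-feature bin
    // boundaries ("cuts") from the data it is given.
    class HistIndex {
     public:
      void Init(const std::vector<double>& data) {
        // Without these clear() calls, a second Init() appends to the
        // state left over from the first call -- the bug described above.
        cut_ptrs_.clear();
        cut_values_.clear();
        cut_ptrs_.push_back(0);

        for (double v : data) {
          cut_values_.push_back(v);
        }
        cut_ptrs_.push_back(cut_values_.size());
      }

      std::size_t NumCuts() const { return cut_values_.size(); }

     private:
      std::vector<std::size_t> cut_ptrs_;
      std::vector<double> cut_values_;
    };

    int main() {
      HistIndex index;
      std::vector<double> data{0.1, 0.2, 0.3};

      index.Init(data);
      assert(index.NumCuts() == 3);

      // Re-initializing with an identical dataset (a second DMatrix in the
      // real code) must reproduce the same state, not an accumulated one.
      index.Init(data);
      assert(index.NumCuts() == 3);
      return 0;
    }

With the clear() calls removed, the second Init() would double the number of cuts, which is exactly the kind of divergence observed between out_models1 and out_models2.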
hcho3 (Collaborator) commented Sep 24, 2020

@ldesreumaux Now that you have a fix, would you like to submit a pull request?

trivialfis (Member) commented:

I prefer moving it into DMatrix. Let's hold on this a little bit.

trivialfis (Member) commented:

Actually, this has already been fixed in #7064.
