Segmentation fault with Python package for the approximate method only #1133

Closed
imenelk opened this issue Apr 21, 2016 · 12 comments


imenelk commented Apr 21, 2016

Hi everyone,
I'm trying to use xgboost for a classification task with fairly big data (25M rows in the training set; the libsvm file is 2.2 GB on disk), using the Python package.
It works fine when I set tree_method to 'exact', but I get a segmentation fault with tree_method set to 'approx'.
Initially I thought it was related to high RAM usage (I'm using a computer with 32 GB of RAM), so I switched to the external memory version (https://github.com/dmlc/xgboost/blob/master/doc/external_memory.md), which creates the cache files correctly. But I still get a segfault.
I've also tried running my model with xgboost directly (without the Python package), and it works for both the exact and the approx methods (although it's quite slow).

Here is the code that I'm using in Python:
import xgboost as xgb

# The '#<name>.cache' suffix turns on external memory mode for each DMatrix
dtrain = xgb.DMatrix('/path/to/data/data_train_libsvm#dtrain.cache')
dval = xgb.DMatrix('/path/to/data/data_val_libsvm#dval.cache')
param = {'booster': 'gbtree', 'silent': 0, 'nthread': 8,
         'eta': 0.1, 'max_depth': 6, 'subsample': 0.8, 'colsample_bytree': 0.8,
         'scale_pos_weight': 12000,
         'objective': 'binary:logistic', 'eval_metric': 'auc'}
# Evaluate on both sets each round; stop if the eval AUC stalls for 30 rounds
watchlist = [(dtrain, 'train'), (dval, 'eval')]
num_round = 300
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=30)
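
For readers reproducing this, the two settings the report depends on (the external memory file string and the tree_method value) can be sketched as below. This is only an illustration: the helper names external_memory_uri and make_params are hypothetical and not part of xgboost, and the snippet as posted does not show where tree_method was set, so adding it to the parameter dict is an assumption.

```python
# Illustrative helpers; the function names are hypothetical, not xgboost API.

def external_memory_uri(libsvm_path, cache_prefix):
    """Build the 'file#prefix.cache' string used in the report's DMatrix calls."""
    return "{0}#{1}.cache".format(libsvm_path, cache_prefix)


def make_params(tree_method):
    """Return the reported parameters plus an explicit tree_method entry."""
    if tree_method not in ("auto", "exact", "approx"):
        raise ValueError("unexpected tree_method: {0!r}".format(tree_method))
    return {
        "booster": "gbtree", "silent": 0, "nthread": 8,
        "eta": 0.1, "max_depth": 6, "subsample": 0.8,
        "colsample_bytree": 0.8, "scale_pos_weight": 12000,
        "objective": "binary:logistic", "eval_metric": "auc",
        # Assumption: the crash was triggered with this key set to 'approx'.
        "tree_method": tree_method,
    }


if __name__ == "__main__":
    print(external_memory_uri("/path/to/data/data_train_libsvm", "dtrain"))
    print(make_params("approx")["tree_method"])
```

The returned string and dict would then be passed to xgb.DMatrix and xgb.train exactly as in the code above.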

I've tried to track down the error, and it appears during the first booster update: line 750 of core.py, in the call to _LIB.XGBoosterUpdateOneIter.

If anyone has an idea of what could be going on, I would be super grateful!

Thanks,
Cheers,
Imen

Member

tqchen commented Apr 22, 2016

Interesting. If you can run it under gdb and get a backtrace of where the segfault happens, we can take a closer look.
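
Alongside a gdb backtrace of the native frames, Python's standard faulthandler module can record the Python-level traceback when the process receives SIGSEGV. The sketch below assumes Python 3.3 or later (on Python 2.7, as used in this report, the module is only available as a separately installed backport); the helper name enable_crash_traces and the log path are hypothetical.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python tracebacks for all threads to log_path on a fatal signal."""
    log_file = open(log_path, "w")
    # faulthandler writes straight to the file descriptor when the signal
    # arrives, so the file object has to stay open for the process lifetime.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    handle = enable_crash_traces("crash_traceback.log")
    print(faulthandler.is_enabled())
```

Calling the helper before the training call means a crash inside the native library still leaves a record of which Python line made the call.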

Author

imenelk commented Apr 22, 2016

Hi,
Thanks for your quick answer.
Here is the output I get from gdb:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000000108f30cb7 in rabit::engine::AllreduceBase::TryReduceScatterRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
(gdb) bt full
#0 0x0000000108f30cb7 in rabit::engine::AllreduceBase::TryReduceScatterRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#1 0x0000000108f314db in rabit::engine::AllreduceBase::TryAllreduceRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#2 0x0000000108f34ba5 in rabit::engine::AllreduceBase::Allreduce(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&), void (*)(void*), void*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#3 0x0000000108f3dc6f in rabit::engine::Allreduce_(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&), rabit::engine::mpi::DataType, rabit::engine::mpi::OpType, void (*)(void*), void*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#4 0x0000000108ecf483 in xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#5 0x0000000108eccc08 in xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, xgboost::RegTree*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#6 0x0000000108ec7059 in xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#7 0x0000000108e71045 in xgboost::gbm::GBTree::BoostNewTrees(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, long long, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#8 0x0000000108e73b92 in xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, long long, std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> >*) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#9 0x0000000108e13f86 in XGBoosterUpdateOneIter ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#10 0x000000010077c7ef in ffi_call_unix64 ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#11 0x000000010077d024 in ffi_call ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#12 0x000000010077863f in _ctypes_callproc ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#13 0x0000000100772c60 in PyCFuncPtr_call ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#14 0x000000010000eeb0 in PyObject_Call () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#15 0x000000010008ea27 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#16 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#17 0x000000010009269a in fast_function () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#18 0x000000010008eaf3 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#19 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#20 0x000000010009269a in fast_function () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#21 0x000000010008eaf3 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#22 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#23 0x0000000100087abc in PyEval_EvalCode () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#24 0x00000001000abea1 in run_mod () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#25 0x00000001000abf44 in PyRun_FileExFlags () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#26 0x00000001000aba93 in PyRun_SimpleFileExFlags ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#27 0x00000001000bd445 in Py_Main () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#28 0x00007fff99dc95ad in start () from /usr/lib/system/libdyld.dylib
No symbol table info available.
#29 0x0000000000000000 in ?? ()
No symbol table info available.

I don't know much about C++, so I'm not sure I understand this message. It seems related to some library, but I don't really see how I could fix it.
I'm running xgboost on a Mac Pro with El Capitan (32 GB of RAM).
Thanks a lot !
Imen

Member

tqchen commented Apr 22, 2016

Make sure you update to the most recent version; specifically, pull the most recent version of rabit.
Doing a clean clone might be easier.

This should have been solved in the most recent version.

Author

imenelk commented Apr 23, 2016

I installed the library last Monday from a clean clone, using the setup.py file from the package. Has anything changed since then?
Thanks !

Member

tqchen commented Apr 23, 2016

In the latest version of rabit https://github.com/dmlc/rabit/blob/849b20b7c822d194a515cc5587c37764cdf39385/src/allreduce_robust.cc#L82

If you are not in distributed mode, TryReduceScatterRing won't be executed from the allreduce function. I fixed this some time ago, but somehow in your case the code still gets into this function.
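
The guard described here can be pictured with a toy model. This is not rabit's code: the function names are hypothetical, and treating "not in distributed mode" as world_size == 1 is an assumption about what the linked line checks.

```python
def allreduce(values, world_size, ring_allreduce):
    """Toy model: skip the ring exchange entirely when there is one worker."""
    if world_size == 1:
        # Single process: the local buffer already is the reduced result.
        return list(values)
    return ring_allreduce(values)


def failing_ring(values):
    """Stand-in for the ring code path, which must not run with one worker."""
    raise RuntimeError("ring code reached in single-process mode")


if __name__ == "__main__":
    print(allreduce([1.0, 2.0], world_size=1, ring_allreduce=failing_ring))
```

In the reported crash, the backtrace shows the ring path being entered from the base Allreduce, which is the situation this guard is meant to prevent.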

Author

imenelk commented Apr 25, 2016

Hi,
Thanks for the answer. I've checked that I have the correct version of rabit, and it is fine.
But I added some print statements, and it seems that the Python package is not calling the allreduce_robust.cc methods but going directly to allreduce_base.cc.
This could explain why I don't get the segfault when launching xgboost directly (not through Python).
But I don't know how to fix the Python call into the C++ library.
Thanks again for your help!


TELSER1 commented May 6, 2016

I would add that I am also seeing this behavior with the approximate algorithm, although it is pretty inconsistent about how many observations it can handle: I was feeding in 40 million points a couple of days ago, and now it's choking on 7 million from the same dataset. It seems to work fine when set to exact, although that would defeat the purpose of approximation for larger datasets!

A clean clone/install didn't seem to help me, either.

Contributor

Far0n commented May 9, 2016

I can confirm this error.

  • clean install from latest sources
  • py wrapper
  • data fits in memory
  • approx tree method crashes for me if tree depth is set greater than 6

I tried to enforce robust allreduce without any luck.

Contributor

Far0n commented May 10, 2016

Everything works fine if linked against librabit_empty.a
(after changing LIB_RABIT = librabit.a to LIB_RABIT = librabit_empty.a in make/config.mk).
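
For anyone applying the same workaround repeatedly, the one-line config change could be scripted roughly as follows. This is a sketch under assumptions: the helper name use_empty_rabit is hypothetical, and it assumes the config file contains the assignment on a single line in the form quoted above.

```python
import re


def use_empty_rabit(config_text):
    """Point LIB_RABIT at librabit_empty.a, leaving all other lines untouched."""
    return re.sub(
        r"^([ \t]*LIB_RABIT[ \t]*=[ \t]*)librabit\.a[ \t]*$",
        r"\1librabit_empty.a",
        config_text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    sample = "A = 1\nLIB_RABIT = librabit.a\nB = 2\n"
    print(use_empty_rabit(sample))
```

The function works on the file's text, so the caller would read make/config.mk, pass the contents through it, write the result back, and rebuild.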

Member

tqchen commented May 11, 2016

Please check whether the latest change, #1186, fixes this problem.

Contributor

Far0n commented May 11, 2016

seems to work just fine now

Author

imenelk commented May 11, 2016

It seems to work fine now, so I'm closing the issue.
Thanks !

@imenelk imenelk closed this as completed May 11, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Oct 26, 2018