Segmentation fault with Python package for the approximate method only #1133

imenelk · 2016-04-21T14:35:19Z

Hi everyone,
I'm trying to use xgboost for a classification task on fairly large data (25M rows in the training set; the libsvm file is 2.2 GB on disk), using the Python package.
It works fine when I set tree_method to 'exact', but I get a segmentation fault with the 'approx' tree_method.
Initially I thought it was related to high RAM usage (my machine has 32 GB of RAM), so I switched to the external-memory version (https://github.com/dmlc/xgboost/blob/master/doc/external_memory.md), which creates the cache files correctly. But I still get a segfault.
I've also tried launching my model with xgboost directly (without the Python package), and it works for both the exact and the approx methods (although it's quite slow).

Here is the code that I'm using in Python:
import xgboost as xgb

# The '#<name>.cache' suffix turns on external-memory mode with that cache prefix
dtrain = xgb.DMatrix('/path/to/data/data_train_libsvm#dtrain.cache')
dval = xgb.DMatrix('/path/to/data/data_val_libsvm#dval.cache')
param = {'booster': 'gbtree', 'silent': 0, 'nthread': 8,
         'eta': 0.1, 'max_depth': 6, 'subsample': 0.8, 'colsample_bytree': 0.8, 'scale_pos_weight': 12000,
         'objective': 'binary:logistic', 'eval_metric': 'auc'}
# Evaluate on both sets each round; stop if the eval AUC stalls for 30 rounds
watchlist = [(dtrain, 'train'), (dval, 'eval')]
num_round = 300
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=30)
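The posted parameters do not show the tree_method setting that separates the working run from the crashing one. As a minimal sketch, assuming the only difference between the two runs is that one key, the two parameter sets could be built like this (the helper name `params_for` is hypothetical, not part of xgboost):

```python
# Base parameters copied from the snippet above; tree_method is added per run.
BASE_PARAMS = {
    'booster': 'gbtree', 'silent': 0, 'nthread': 8,
    'eta': 0.1, 'max_depth': 6, 'subsample': 0.8, 'colsample_bytree': 0.8,
    'scale_pos_weight': 12000,
    'objective': 'binary:logistic', 'eval_metric': 'auc',
}


def params_for(tree_method):
    """Return a copy of BASE_PARAMS with the given tree_method set.

    Only the two values discussed in this issue are accepted.
    """
    if tree_method not in ('exact', 'approx'):
        raise ValueError("expected 'exact' or 'approx', got %r" % (tree_method,))
    params = dict(BASE_PARAMS)  # copy, so the base dict is never mutated
    params['tree_method'] = tree_method
    return params
```

Passing `params_for('exact')` and then `params_for('approx')` to `xgb.train` with otherwise identical arguments would isolate the setting as the trigger.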

I've tried to track down the error, and it appears during the first booster update: line 750 of core.py, when calling _LIB.XGBoosterUpdateOneIter.

If anyone has an idea of what could be going on, I would be super grateful!

Thanks,
Cheers,
Imen

tqchen · 2016-04-22T04:11:38Z

Interesting. If you can run it under gdb and get the backtrace of where the segfault happens, we can take a more careful look.
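One way to capture such a backtrace non-interactively is to let gdb run the script and print the stack when it stops. The sketch below only builds the command line; the script name `train.py` and the helper name are assumptions for illustration, while `-batch`, `-ex`, and `--args` are standard gdb options.

```python
def gdb_backtrace_command(script, python='python'):
    """Build an argv list that runs `script` under gdb and prints a backtrace.

    gdb executes the -ex commands in order: `run` starts the program, and
    once it stops (for example on SIGSEGV) `bt` prints the call stack.
    """
    return ['gdb', '-batch', '-ex', 'run', '-ex', 'bt', '--args', python, script]
```

The resulting list can be handed to `subprocess.call`, for example `subprocess.call(gdb_backtrace_command('train.py'))`.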

imenelk · 2016-04-22T09:21:16Z

Hi,
Thanks for your quick answer.
Here is the output I get from gdb:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000000108f30cb7 in rabit::engine::AllreduceBase::TryReduceScatterRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
(gdb) bt full
#0 0x0000000108f30cb7 in rabit::engine::AllreduceBase::TryReduceScatterRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#1 0x0000000108f314db in rabit::engine::AllreduceBase::TryAllreduceRing(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&)) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#2 0x0000000108f34ba5 in rabit::engine::AllreduceBase::Allreduce(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&), void (*)(void*), void*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#3 0x0000000108f3dc6f in rabit::engine::Allreduce_(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&), rabit::engine::mpi::DataType, rabit::engine::mpi::OpType, void (*)(void*), void*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#4 0x0000000108ecf483 in xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#5 0x0000000108eccc08 in xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, xgboost::RegTree*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#6 0x0000000108ec7059 in xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#7 0x0000000108e71045 in xgboost::gbm::GBTree::BoostNewTrees(std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> > const&, xgboost::DMatrix*, long long, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#8 0x0000000108e73b92 in xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, long long, std::vector<xgboost::bst_gpair, std::allocator<xgboost::bst_gpair> >*) () from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#9 0x0000000108e13f86 in XGBoosterUpdateOneIter ()
from /usr/local/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboost.so
No symbol table info available.
#10 0x000000010077c7ef in ffi_call_unix64 ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#11 0x000000010077d024 in ffi_call ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#12 0x000000010077863f in _ctypes_callproc ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#13 0x0000000100772c60 in PyCFuncPtr_call ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_ctypes.so
No symbol table info available.
#14 0x000000010000eeb0 in PyObject_Call () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#15 0x000000010008ea27 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#16 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#17 0x000000010009269a in fast_function () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#18 0x000000010008eaf3 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#19 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#20 0x000000010009269a in fast_function () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#21 0x000000010008eaf3 in PyEval_EvalFrameEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#22 0x00000001000880f1 in PyEval_EvalCodeEx () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#23 0x0000000100087abc in PyEval_EvalCode () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#24 0x00000001000abea1 in run_mod () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#25 0x00000001000abf44 in PyRun_FileExFlags () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#26 0x00000001000aba93 in PyRun_SimpleFileExFlags ()
from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#27 0x00000001000bd445 in Py_Main () from /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Python
No symbol table info available.
#28 0x00007fff99dc95ad in start () from /usr/lib/system/libdyld.dylib
No symbol table info available.
#29 0x0000000000000000 in ?? ()
No symbol table info available.

I don't know much about C++, so I'm not sure I understand this message. It seems related to some library, but I don't really see how I could fix it.
I'm running xgboost on a Mac Pro with El Capitan (32 GB RAM).
Thanks a lot!
Imen
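For readers who find the raw backtrace hard to scan, a small sketch like the one below reduces it to frame numbers and function names. It assumes the `#N 0xADDR in symbol (...)` line layout shown above; the regular expression and the helper name `frame_symbols` are illustrative and may need adjusting for other gdb output.

```python
import re

# Matches lines such as "#9 0x0000000108e13f86 in XGBoosterUpdateOneIter ()"
# and captures the frame number plus the symbol up to its first "(".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (.+?)\s*\(")


def frame_symbols(backtrace_text):
    """Return a list of (frame_number, symbol) tuples from gdb `bt` output."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames
```

Applied to the trace above, the first frames would read as a call chain from `XGBoosterUpdateOneIter` down through the tree updater into rabit's `TryReduceScatterRing`, which is where the signal was raised.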

tqchen · 2016-04-22T16:31:44Z

Make sure you update to the most recent version; specifically, pull the most recent version of rabit.
Doing a clean clone might be easier.

This should have been solved in the most recent version.

imenelk · 2016-04-23T06:36:34Z

I installed the library last Monday from a clean clone, using the setup.py file from the package. Have you changed anything since then?
Thanks!

tqchen · 2016-04-23T15:58:19Z

In the latest version of rabit https://github.com/dmlc/rabit/blob/849b20b7c822d194a515cc5587c37764cdf39385/src/allreduce_robust.cc#L82

If you are not in distributed mode, TryReduceScatterRing should not be executed from the Allreduce function. I fixed that some time ago. But somehow, in your case, the code still gets into this function.
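The idea behind that guard can be illustrated with a toy model. This is not rabit's code, only a sketch under the assumption that a single-process run should return the local values untouched and never reach the ring communication path; all names here are hypothetical.

```python
def allreduce_sum(local_values, world_size, ring_reduce=None):
    """Toy allreduce: skip all communication when there is only one worker.

    `ring_reduce` stands in for the distributed ring implementation and is
    only called when more than one worker takes part.
    """
    if world_size <= 1:
        # Single-process mode: the local result is already the global result.
        return list(local_values)
    if ring_reduce is None:
        raise RuntimeError("distributed mode needs a ring_reduce implementation")
    return ring_reduce(local_values)
```

In the backtrace above, the Python run went through the ring path even though only one process was involved, which is the behaviour this kind of early return is meant to rule out.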

imenelk · 2016-04-25T14:07:21Z

Hi,
Thanks for the answer. I've checked, and I do have the correct version of rabit.
But I added some print statements, and it seems that the Python package is not calling the allreduce_robust.cc methods but goes directly to allreduce_base.cc.
This could explain why I don't get the segfault when launching xgboost directly (not through Python).
But I don't know how to fix the Python call into the C++ library.
Thanks again for your help!

TELSER1 · 2016-05-06T00:00:19Z

I would add that I am also experiencing this behavior with the approximate algorithm, although it seems to be pretty inconsistent with how many observations it can handle; I was feeding 40 million points in a couple days ago, and now it's choking on 7 million from the same dataset. It seems to work fine when set to exact, although that would certainly seem to defeat the purpose of approximation for larger datasets!

A clean clone/install didn't seem to help me, either.

Far0n · 2016-05-09T07:23:24Z

I can confirm this error.

clean install from latest sources
py wrapper
data fits in memory
approx tree method crashes for me if tree depth is set greater than 6

I tried to enforce robust allreduce without any luck.

Far0n · 2016-05-10T17:40:38Z

everything works fine if linked against librabit_empty.a
(after changing LIB_RABIT = librabit.a to LIB_RABIT = librabit_empty.a in make/config.mk)
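The same one-line change can be scripted. The sketch below works on the text of make/config.mk and assumes the file contains the line exactly as quoted above; the function name is hypothetical.

```python
def use_empty_rabit(config_text):
    """Swap the rabit library setting for the empty (non-distributed) build."""
    old = "LIB_RABIT = librabit.a"
    new = "LIB_RABIT = librabit_empty.a"
    if old not in config_text:
        raise ValueError("expected setting not found: %r" % old)
    return config_text.replace(old, new)
```

After rewriting the file, the library has to be rebuilt for the change to take effect.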

tqchen · 2016-05-11T03:11:43Z

Please check whether the latest change (#1186) fixes this problem.

Far0n · 2016-05-11T05:21:31Z

seems to work just fine now

imenelk · 2016-05-11T07:49:26Z

It seems to work fine now, so I'm closing the issue.
Thanks!

imenelk closed this as completed May 11, 2016

lock bot locked as resolved and limited conversation to collaborators Oct 26, 2018
