[tests][dask] solve timeouts #5505

Closed
wants to merge 4 commits into from

@jmoralez commented Sep 24, 2022
Collaborator

Sometimes the dask tests get stuck on the _train_part function with the following call stack:

File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/site-packages/distributed/threadpoolexecutor.py", line 57, in _worker
    task.run()
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/site-packages/distributed/_concurrent_futures_thread.py", line 65, in run
    result = self.fn(*self.args, **self.kwargs)
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/site-packages/distributed/worker.py", line 2882, in apply_function
    msg = apply_function_simple(function, args, kwargs, time_delay)
File "/home/jose/mambaforge/envs/lgb-test/lib/python3.9/site-packages/distributed/worker.py", line 2904, in apply_function_simple
    result = function(*args, **kwargs)
File "/hdd/github/LightGBM/python-package/lightgbm/dask.py", line 322, in _train_part
    model.fit(
File "/hdd/github/LightGBM/python-package/lightgbm/sklearn.py", line 1084, in fit
    super().fit(
File "/hdd/github/LightGBM/python-package/lightgbm/sklearn.py", line 797, in fit
    self._Booster = train(
File "/hdd/github/LightGBM/python-package/lightgbm/engine.py", line 223, in train
    booster = Booster(params=params, train_set=train_set)
File "/hdd/github/LightGBM/python-package/lightgbm/basic.py", line 2775, in __init__
    train_set.construct()
File "/hdd/github/LightGBM/python-package/lightgbm/basic.py", line 1923, in construct
    self._lazy_init(self.data, label=self.label,
File "/hdd/github/LightGBM/python-package/lightgbm/basic.py", line 1578, in _lazy_init
    self.__init_from_np2d(data, params_str, ref_dataset)
File "/hdd/github/LightGBM/python-package/lightgbm/basic.py", line 1708, in __init_from_np2d
    _safe_call(_LIB.LGBM_DatasetCreateFromMat(

I'm able to reproduce this locally by training 100 consecutive times.

I haven't been able to reproduce this with the number of threads in the workers equal to 1.
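For anyone who wants to try the "train many times until it hangs" approach, here is a minimal sketch using only the standard library. It is not the code used in the test suite: `first_hanging_run` and `train_once` are hypothetical names, and `train_once` is assumed to be a zero-argument callable that wraps one training call. The helper runs each attempt in a daemon thread, waits up to a timeout, and returns the Python stack of the first attempt that does not finish.

```python
import sys
import threading
import traceback
from typing import Callable, Optional, Tuple


def first_hanging_run(
    train_once: Callable[[], object],
    n_runs: int = 100,
    timeout_s: float = 60.0,
) -> Optional[Tuple[int, str]]:
    """Run train_once() up to n_runs times and stop at the first run that hangs.

    Returns (run_index, python_stack_of_stuck_thread), or None if every run
    finished within timeout_s seconds.
    """
    for run_index in range(n_runs):
        # Daemon thread: a run that never returns will not block interpreter exit.
        worker = threading.Thread(target=train_once, daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            # Snapshot the Python-level stack of the stuck thread.
            frame = sys._current_frames().get(worker.ident)
            stack = "".join(traceback.format_stack(frame)) if frame else ""
            return run_index, stack
    return None
```

Note that this only sees Python frames of threads in the current process. If the workers run in separate processes, or the hang is inside native code, the snapshot stops at the last Python frame (as in the call stack above, which ends at the call into LGBM_DatasetCreateFromMat).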

@Remy-Luciani
Contributor

Suggestion: if it fails in a Linux environment, could we run strace on the test process to see which syscall it is hanging on?

@jameslamb
Collaborator

@Remy-Luciani I don't know what an strace is, but sounds interesting!

If you could describe how to do what you're referring to (even just a link to documentation), we could try it out.

@Remy-Luciani
Contributor

To keep it simple, strace is a CLI tool that traces the system calls a process makes to the Linux kernel. You can attach strace to another process, and it will print the syscalls to stderr (or to a file you specify).

At first the output looks hard to read because system calls have abbreviated C function names like openat, mmap... But the API becomes clear after a while, and most of the time we don't need 100% of the information to find the cause of a bug.

strace is not part of bash and is not preinstalled on every Linux distro, so you may need to install it with your package manager.

So some resources:

For the hanging test process, you could launch the command under strace and specify an output file:

strace -o test_strace.log test-command

Moreover, since there seems to be some parallelism involved, you might need to follow forks and child processes:

strace --follow-forks -o test_strace.log test-command

If you want to print system traces in different files for each sub-process I encourage you to take a look at the -ff flag in the manual.
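If it helps, the variants above can be sketched as a small Python helper that only builds the strace command line. `strace_argv` is a hypothetical name, and the helper assumes the short flags `-f`, `-ff`, `-o` and `-p` (attach to an already running process by PID). It does not run anything by itself.

```python
import shlex
from typing import List, Optional, Sequence


def strace_argv(
    output: str,
    command: Optional[Sequence[str]] = None,
    pid: Optional[int] = None,
    follow: bool = True,
    per_process_files: bool = False,
) -> List[str]:
    """Build an strace command line that either launches command or attaches to pid."""
    if (command is None) == (pid is None):
        raise ValueError("pass exactly one of command or pid")
    argv = ["strace"]
    if per_process_files:
        # -ff: follow children and write one trace file per process (output.PID).
        argv.append("-ff")
    elif follow:
        # -f: follow forks, child processes and threads into a single trace.
        argv.append("-f")
    argv += ["-o", output]
    if pid is not None:
        # Attach to a process that is already running (for example, one that is stuck).
        argv += ["-p", str(pid)]
    else:
        argv += list(command)
    return argv


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(shlex.join(strace_argv("test_strace.log", command=["test-command"])))
```

Attaching by PID is handy when the test is already stuck: the last lines of the trace then show the syscall each thread is blocked in.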

Let me know if you're struggling with using the tool or reading the trace! :)

@jameslamb
Collaborator

Very cool, thank you for taking the time to write that out!! That could be an interesting way to approach debugging this, and I know I'll use it for other things in the future.

@jameslamb
Collaborator

@jmoralez since I see you're pushing commits here, want to be sure.... did you see @shiyu1994 's description of what the root cause might be?

#5510 (comment)

and my suggestion for something to try

#5510 (comment)

@jmoralez
Collaborator Author

Yes, I saw it. I just wanted to test on mac in the CI here because I wasn't able to replicate it on my mac machine and wanted to check whether it was linux specific. I see many failures atm, but they don't seem to be stuck yet.

@jmoralez
Collaborator Author

nvm, seems like 2/3 are going to timeout

@jmoralez jmoralez closed this Sep 30, 2022
@jmoralez jmoralez deleted the dask-tests-timeouts branch October 3, 2022 17:10
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023