
[ci] prefer CPython in Windows test environment and use safer approach for cleaning up network (fixes #5509) #5510

Merged
merged 33 commits into from Oct 7, 2022

Conversation


@jameslamb jameslamb commented Sep 27, 2022

Fixes #5507.

conda sometimes "downgrades" Python from a CPython build to a PyPy build in our Windows CI jobs. This has historically been because of dependencies introduced by either the python-graphviz or matplotlib conda packages.

This PR originally proposed preventing that situation by explicitly passing the --no-update-deps flag.

From the conda docs (link)

--freeze-installed, --no-update-deps
Do not update or change already-installed dependencies.

This PR now proposes instead explicitly installing python={version}[build=*cpython], to prevent environment solves in which conda switches to PyPy-based builds of Python.

I think this should be possible, based on https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#package-match-specifications.
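
For illustration, a conda invocation with that kind of constraint could look roughly like the sketch below (wrapped in Python only to keep the example self-contained and runnable; the Python version and package list are placeholders, not the actual CI script):

import subprocess

python_version = "3.9"  # placeholder; the real CI scripts substitute their own version

subprocess.run(
    [
        "conda", "install", "--yes",
        f"python={python_version}[build=*cpython]",  # only accept CPython builds of Python
        "python-graphviz",
        "matplotlib",
    ],
    check=True,
)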

@jameslamb

This worked! (https://ci.appveyor.com/project/guolinke/lightgbm/builds/44896165/job/xvf4hw3hv45lchyf)

I'm pushing another commit re-enabling all the other CI jobs that I'd skipped to save time, but then I think we should adopt this fix. @shiyu1994 @StrikerRUS

@jameslamb jameslamb marked this pull request as ready for review September 27, 2022 13:41
@jameslamb jameslamb changed the title WIP: [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) Sep 27, 2022
This reverts commit 704aea4.
@jameslamb jameslamb changed the title [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) [ci] prefer CPython in Windows test environment (fixes #5509) Sep 27, 2022
@jmoralez

It seems the Linux jobs are getting stuck in this PR as well. I investigated a bit while working on #5505, but the environments seemed the same; it's strange that this started happening now. They also sometimes pass at random.

@jameslamb

😭 this is getting so complicated

It looks like the two Azure DevOps jobs that are getting stuck (they've been running for more than 50 minutes) are both Linux jobs, which means they use the image built from https://github.com/guolinke/lightgbm-ci-docker.

Upgrading that image is the next thing on my list, so that we can hopefully remove the pin on dask=2022.7.0: #5390 (comment).

I'm going to just try manually rebuilding the timed-out Linux jobs here when they fail, to at least make some forward progress. "We sometimes have to manually re-run builds" is a bad state to be in, but not as bad as "Appveyor is failing on every commit and nothing can be merged to master".

@jameslamb

I've tried rebuilding the timed-out CUDA and Azure jobs a few times today, hoping to get lucky and have the builds not hit the Dask timeouts...so far, I haven't been successful.

I'll be traveling for the next few days, and I'm not sure how much I'll be able to work on LightGBM during that time.

@shiyu1994 @jmoralez @StrikerRUS if you're able to find a workaround for the Dask issues, it's ok with me if you want to push such fixes directly to this branch, so we can have one PR that resolves the CI issues.

@shiyu1994

Thanks. Good to see that the Windows environment issue is resolved. Let's see whether we can fix the Dask issue together in this PR.

@shiyu1994

I can try to manually debug the Dask tests on our self-hosted CUDA CI agent. Hopefully we can find a solution soon.

@jameslamb

We could also try pinning dask and distributed to specific versions, to help isolate whether this new timeout issue is related to Dask itself or to other changes in the CI environment.

@shiyu1994

I think I've found the root cause of the Dask test timeouts.

The Network class in our C++ code is a static class, which is shared by all boosters in the same process.

However, in our test_dask.py a single LocalCluster is created for all the test cases, and it keeps the same worker subprocesses for the entire run of test_dask.py without reinitializing them.

That leads to the following issue. When booster A is created in test case 1, it still exists in the memory of the worker process after that test case finishes. When we move to test case 2, a new booster B is created. At that moment garbage collection happens to be triggered, booster A is recycled, and Booster.__del__ is called:

def __del__(self) -> None:
    try:
        if self.network:
            self.free_network()
    except AttributeError:
        pass
    try:
        if self.handle is not None:
            _safe_call(_LIB.LGBM_BoosterFree(self.handle))
    except AttributeError:
        pass

which will deallocate all the network connections by calling free_network!
def free_network(self) -> "Booster":
    """Free Booster's network.

    Returns
    -------
    self : Booster
        Booster with freed network.
    """
    _safe_call(_LIB.LGBM_NetworkFree())
    self.network = False
    return self

LightGBM/src/c_api.cpp, lines 2511 to 2515 in dc4794b:

int LGBM_NetworkFree() {
  API_BEGIN();
  Network::Dispose();
  API_END();
}

Then booster B (the one currently training) will continue as a single-process program, since num_machines_ has been reset to 1:
void Network::Dispose() {
  num_machines_ = 1;
  rank_ = 0;
  linkers_.reset(new Linkers());
  reduce_scatter_ext_fun_ = nullptr;
  allgather_ext_fun_ = nullptr;
}

However, since we are training in a distributed way, there is another worker process waiting for a response from the process running booster B, and that waiting process gets stuck forever because the response never arrives.

I've pushed a quick workaround to this branch, which enforces using a new LocalCluster for each test case. However, this seems to make the testing much slower because initializing the processes for a cluster in Dask seems to be slow.

Could anybody familiar with Dask provide a better solution? In short, we want to use new processes for distributed training in each test case.
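
For reference, a minimal sketch of what the per-test-case workaround could look like (this is just an illustration using pytest and distributed, not necessarily the exact commit pushed to this branch):

import pytest
from distributed import Client, LocalCluster


@pytest.fixture
def cluster():
    # Start a brand-new LocalCluster (and therefore brand-new worker processes)
    # for every test case, so that no stale Booster from a previous test can
    # free the shared Network while another booster is training.
    with LocalCluster(n_workers=2, threads_per_worker=1) as local_cluster:
        yield local_cluster


def test_distributed_training(cluster):
    with Client(cluster) as client:
        pass  # run a distributed LightGBM training job against `client` here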

@jameslamb

Wow, great investigation!!! I wonder if this is the cause of reports like #4771 or #4942.

I think creating a new Dask cluster every time is not a good solution...LightGBM's users wouldn't be happy with having to do that, since it would mean that, for example, the workflow of "initialize a Dask DataFrame, keep it in distributed memory with .persist(), then run multiple LightGBM training runs over it" would effectively not be supported.

Instead, could we have each distributed training process allocate its own Network and hold a pointer to it, and then have free_network() only free that specific Network? Or would that cause some problems?

@jmoralez

I agree with @jameslamb on this: users may want to run consecutive trainings in the same process when doing things like hyperparameter tuning. Also, this has been the way we run the Dask tests for more than a year (changed in #4159). @shiyu1994, can you think of a recent change that would cause this to start failing now?

@jmoralez

jmoralez commented Sep 30, 2022

I tested on macOS, and in one of the jobs there are many errors that we've previously seen on macOS: lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 54). I guess it has the same root cause, right? The network was freed during training, so the connection to one of the machines was lost, but macOS somehow detects this and raises an error instead of hanging?

Edit:
I also see this one in the logs: lightgbm.basic.LightGBMError: Please initialize the network interface first.

So I believe you're definitely right @shiyu1994.

@jameslamb

I'm going to try implementing a fix here where, at the end of training, the Python package explicitly calls free_network().

I still think the ideal solution is the one described in #5510 (comment), because the Dask-only fix I'm proposing has the following drawbacks:

  • doesn't help solve this issue for non-Dask settings (e.g. SynapseML or lightgbm-ray)
  • means that only one LightGBM training run can be happening in the cluster at a time

If my proposed fix seems to work, I think we should adopt it to unblock CI, and then I'll write up this problem in a separate issue that can be worked on later.
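
As a rough sketch, the idea looks something like this (the helper function below is hypothetical and only illustrates the shape of the fix; the real lightgbm.dask internals differ):

import lightgbm as lgb


def train_on_worker(params, X, y):
    # Hypothetical helper, for illustration only.
    model = lgb.LGBMRegressor(**params)
    model.fit(X, y)
    # Free the process-wide Network as soon as training finishes, instead of
    # leaving that to Booster.__del__ at an unpredictable garbage-collection time.
    model.booster_.free_network()
    return model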

@jameslamb

jameslamb commented Oct 6, 2022

Reducing the time_out for the test test_machines_should_be_used_if_provided() from 120 to 1 allowed that test to succeed in a reasonable time (a few seconds)!

Update: that time_out actually didn't matter.

(link to successful build)

So it seems like the only problematic test might be test_training_succeeds_even_if_some_workers_do_not_have_any_data().

Going to continue investigating that next.

@jameslamb

jameslamb commented Oct 7, 2022

I know it is difficult to follow all the debugging comments and commits here, but CI is passing and this PR is ready for review 🎉

After switching from _LIB.LGBM_NetworkFree() to Booster.free_network() in the Dask module (to avoid a second call to _LIB.LGBM_NetworkFree() at an unpredictable time when the Booster is garbage-collected), only one test is still causing timeout issues.
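
For context, a small sketch of why the wrapper call is safer, based on the __del__ and free_network() code quoted earlier in this thread (the helper name here is just for illustration):

import lightgbm as lgb


def shutdown_network(booster: lgb.Booster) -> None:
    # free_network() calls LGBM_NetworkFree() once and then sets booster.network
    # to False, so the `if self.network:` check in Booster.__del__ becomes a no-op
    # and the network is not freed a second time when the Booster is garbage-collected.
    booster.free_network()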

I think we should

  • merge this PR (which adds a pytest.skip() on that test) to unblock development
  • separately document the work to investigate and remove that pytest.skip()

@shiyu1994 @guolinke @jmoralez @StrikerRUS


@jmoralez jmoralez left a comment


Awesome job! And thanks @shiyu1994 for the insights.

@jameslamb

Thanks to both of you for your help! This was a really difficult one.

It would be interesting in the future to work on changing the strategy for how LGBM_NetworkFree() works, to try to enable multiple concurrent training runs on the same Dask cluster.

I'm going to merge this and start updating / merging some of the other approved PRs, starting with #5506.

@shiyu1994

Sorry for the late response. I just returned from our 1-week national holiday.

@jameslamb

No problem, welcome back! Please look at my comment in #5502 (comment) as soon as possible and respond there...I'm nervous that the R package might be in danger of being archived on CRAN.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Successfully merging this pull request may close these issues.

[ci] Appveyor builds failing